[PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
@ 2024-07-09 21:21 Tejun Heo
  2024-07-09 21:21 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

Hello,

In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue(). This
gap in API forces schedulers into work-arounds which are not straightforward
or optimal such as skipping direct dispatches in such cases.

This patchset implements SCX_DSQ_LOCAL_ON support for direct dispatches.

This patchset contains the following patches:

 0001-sched-Move-struct-balance_callback-definition-upward.patch
 0002-sched_ext-Open-code-task_linked_on_dsq.patch
 0003-sched_ext-Make-rf-optional-for-dispatch_to_local_dsq.patch
 0004-sched_ext-s-SCX_RQ_BALANCING-SCX_RQ_IN_BALANCE-and-a.patch
 0005-sched_ext-Allow-SCX_DSQ_LOCAL_ON-for-direct-dispatch.patch
 0006-sched_ext-scx_qmap-Pick-idle-CPU-for-direct-dispatch.patch

While 0001 is a scheduler core change, all it does is moving the definition
of struct balance_callback so that it's visible for struct scx_rq
definition. If there are no objections, it'd make sense to route it through
the sched_ext tree with the rest of the changes.

and is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-ddsp-local-on

 kernel/sched/ext.c             |  207 +++++++++++++++++++++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h           |   19 +++--
 tools/sched_ext/scx_qmap.bpf.c |   39 +++++++++--
 tools/sched_ext/scx_qmap.c     |    5 -
 4 files changed, 215 insertions(+), 55 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] sched: Move struct balance_callback definition upward
  2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
@ 2024-07-09 21:21 ` Tejun Heo
  2024-07-10 18:53   ` David Vernet
  2024-07-09 21:21 ` [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq() Tejun Heo
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea, Tejun Heo

Move struct balance_callback definition upward so that it's visible to
class-specific rq struct definitions. This will be used to embed a struct
balance_callback in struct scx_rq.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/sched.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04184e87ba7c..86314a17f1c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -615,6 +615,11 @@ do {									\
 # define u64_u32_load(var)		u64_u32_load_copy(var, var##_copy)
 # define u64_u32_store(var, val)	u64_u32_store_copy(var, var##_copy, val)
 
+struct balance_callback {
+	struct balance_callback *next;
+	void (*func)(struct rq *rq);
+};
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
@@ -1054,11 +1059,6 @@ struct uclamp_rq {
 DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
-struct balance_callback {
-	struct balance_callback *next;
-	void (*func)(struct rq *rq);
-};
-
 /*
  * This is the main, per-CPU runqueue data structure.
  *
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq()
  2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
  2024-07-09 21:21 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
@ 2024-07-09 21:21 ` Tejun Heo
  2024-07-10  8:56   ` Andrea Righi
  2024-07-10 18:53   ` David Vernet
  2024-07-09 21:21 ` [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq() Tejun Heo
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea, Tejun Heo

task_linked_on_dsq() exists as a helper becauase it used to test both the
rbtree and list nodes. It now only tests the list node and the list node
will soon be used for something else too. The helper doesn't improve
anything materially and the naming will become confusing. Open-code the list
node testing and remove task_linked_on_dsq()

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
---
 kernel/sched/ext.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f16d72d72635..52340ac8038f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1572,18 +1572,13 @@ static void task_unlink_from_dsq(struct task_struct *p,
 	list_del_init(&p->scx.dsq_list.node);
 }
 
-static bool task_linked_on_dsq(struct task_struct *p)
-{
-	return !list_empty(&p->scx.dsq_list.node);
-}
-
 static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 {
 	struct scx_dispatch_q *dsq = p->scx.dsq;
 	bool is_local = dsq == &rq->scx.local_dsq;
 
 	if (!dsq) {
-		WARN_ON_ONCE(task_linked_on_dsq(p));
+		WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
 		/*
 		 * When dispatching directly from the BPF scheduler to a local
 		 * DSQ, the task isn't associated with any DSQ but
@@ -1604,7 +1599,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 	*/
 	if (p->scx.holding_cpu < 0) {
 		/* @p must still be on @dsq, dequeue */
-		WARN_ON_ONCE(!task_linked_on_dsq(p));
+		WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
 		task_unlink_from_dsq(p, dsq);
 		dsq_mod_nr(dsq, -1);
 	} else {
@@ -1614,7 +1609,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 		 * holding_cpu which tells dispatch_to_local_dsq() that it lost
 		 * the race.
 		 */
-		WARN_ON_ONCE(task_linked_on_dsq(p));
+		WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
 		p->scx.holding_cpu = -1;
 	}
 	p->scx.dsq = NULL;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq()
  2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
  2024-07-09 21:21 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
  2024-07-09 21:21 ` [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq() Tejun Heo
@ 2024-07-09 21:21 ` Tejun Heo
  2024-07-10  8:59   ` Andrea Righi
  2024-07-10 16:10   ` David Vernet
  2024-07-09 21:21 ` [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP Tejun Heo
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea, Tejun Heo

dispatch_to_local_dsq() may unlock the current rq when dispatching to a
local DSQ on a different CPU. This function is currently called from the
balance path where the rq lock is pinned and the associated rq_flags is
available to unpin it.

dispatch_to_local_dsq() will be used to implement direct dispatch to a local
DSQ on a remote CPU from the enqueue path where it will be called with rq
locked but not pinned. Make @rf optional so that the function can be used
from a context which doesn't pin the rq lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
---
 kernel/sched/ext.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 52340ac8038f..e96727460df2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2040,7 +2040,7 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
 /**
  * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked
  * @rq: current rq which is locked
- * @rf: rq_flags to use when unlocking @rq
+ * @rf: optional rq_flags to use when unlocking @rq if its lock is pinned
  * @src_rq: rq to move task from
  * @dst_rq: rq to move task to
  *
@@ -2052,17 +2052,20 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
 static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
 				       struct rq *src_rq, struct rq *dst_rq)
 {
-	rq_unpin_lock(rq, rf);
+	if (rf)
+		rq_unpin_lock(rq, rf);
 
 	if (src_rq == dst_rq) {
 		raw_spin_rq_unlock(rq);
 		raw_spin_rq_lock(dst_rq);
 	} else if (rq == src_rq) {
 		double_lock_balance(rq, dst_rq);
-		rq_repin_lock(rq, rf);
+		if (rf)
+			rq_repin_lock(rq, rf);
 	} else if (rq == dst_rq) {
 		double_lock_balance(rq, src_rq);
-		rq_repin_lock(rq, rf);
+		if (rf)
+			rq_repin_lock(rq, rf);
 	} else {
 		raw_spin_rq_unlock(rq);
 		double_rq_lock(src_rq, dst_rq);
@@ -2072,7 +2075,7 @@ static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
 /**
  * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock()
  * @rq: current rq which is locked
- * @rf: rq_flags to use when unlocking @rq
+ * @rf: optional rq_flags to use when unlocking @rq if its lock is pinned
  * @src_rq: rq to move task from
  * @dst_rq: rq to move task to
  *
@@ -2084,7 +2087,8 @@ static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
 	if (src_rq == dst_rq) {
 		raw_spin_rq_unlock(dst_rq);
 		raw_spin_rq_lock(rq);
-		rq_repin_lock(rq, rf);
+		if (rf)
+			rq_repin_lock(rq, rf);
 	} else if (rq == src_rq) {
 		double_unlock_balance(rq, dst_rq);
 	} else if (rq == dst_rq) {
@@ -2092,7 +2096,8 @@ static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
 	} else {
 		double_rq_unlock(src_rq, dst_rq);
 		raw_spin_rq_lock(rq);
-		rq_repin_lock(rq, rf);
+		if (rf)
+			rq_repin_lock(rq, rf);
 	}
 }
 #endif	/* CONFIG_SMP */
@@ -2214,7 +2219,7 @@ enum dispatch_to_local_dsq_ret {
 /**
  * dispatch_to_local_dsq - Dispatch a task to a local dsq
  * @rq: current rq which is locked
- * @rf: rq_flags to use when unlocking @rq
+ * @rf: optional rq_flags to use when unlocking @rq if its lock is pinned
  * @dsq_id: destination dsq ID
  * @p: task to dispatch
  * @enq_flags: %SCX_ENQ_*
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP
  2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
                   ` (2 preceding siblings ...)
  2024-07-09 21:21 ` [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq() Tejun Heo
@ 2024-07-09 21:21 ` Tejun Heo
  2024-07-10 18:54   ` David Vernet
  2024-07-09 21:21 ` [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea, Tejun Heo

SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
the rq is currently enqueueing for a wakeup. This will be used to implement
direct dispatching to local DSQ of another CPU.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
---
 kernel/sched/ext.c   | 13 +++++++++----
 kernel/sched/sched.h |  6 ++++--
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e96727460df2..a88c51ce63a5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1828,6 +1828,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 {
 	int sticky_cpu = p->scx.sticky_cpu;
 
+	if (enq_flags & ENQUEUE_WAKEUP)
+		rq->scx.flags |= SCX_RQ_IN_WAKEUP;
+
 	enq_flags |= rq->scx.extra_enq_flags;
 
 	if (sticky_cpu >= 0)
@@ -1844,7 +1847,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 
 	if (p->scx.flags & SCX_TASK_QUEUED) {
 		WARN_ON_ONCE(!task_runnable(p));
-		return;
+		goto out;
 	}
 
 	set_task_runnable(rq, p);
@@ -1859,6 +1862,8 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 		touch_core_sched(rq, p);
 
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+out:
+	rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
 }
 
 static void ops_dequeue(struct task_struct *p, u64 deq_flags)
@@ -2443,7 +2448,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev,
 	bool has_tasks = false;
 
 	lockdep_assert_rq_held(rq);
-	rq->scx.flags |= SCX_RQ_BALANCING;
+	rq->scx.flags |= SCX_RQ_IN_BALANCE;
 
 	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
 	    unlikely(rq->scx.cpu_released)) {
@@ -2538,7 +2543,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev,
 has_tasks:
 	has_tasks = true;
 out:
-	rq->scx.flags &= ~SCX_RQ_BALANCING;
+	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
 	return has_tasks;
 }
 
@@ -5093,7 +5098,7 @@ static bool can_skip_idle_kick(struct rq *rq)
 	 * The race window is small and we don't and can't guarantee that @rq is
 	 * only kicked while idle anyway. Skip only when sure.
 	 */
-	return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_BALANCING);
+	return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE);
 }
 
 static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86314a17f1c7..8a0e8052f6b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,8 +737,10 @@ enum scx_rq_flags {
 	 * only while the BPF scheduler considers the CPU to be online.
 	 */
 	SCX_RQ_ONLINE		= 1 << 0,
-	SCX_RQ_BALANCING	= 1 << 1,
-	SCX_RQ_CAN_STOP_TICK	= 1 << 2,
+	SCX_RQ_CAN_STOP_TICK	= 1 << 1,
+
+	SCX_RQ_IN_WAKEUP	= 1 << 16,
+	SCX_RQ_IN_BALANCE	= 1 << 17,
 };
 
 struct scx_rq {
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
  2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
                   ` (3 preceding siblings ...)
  2024-07-09 21:21 ` [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP Tejun Heo
@ 2024-07-09 21:21 ` Tejun Heo
  2024-07-10 18:51   ` David Vernet
  2024-07-09 21:21 ` [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues Tejun Heo
  2024-07-10  9:01 ` [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Andrea Righi
  6 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea, Tejun Heo

In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue().

While waking up a task, as ops.select_cpu() can pick any CPU and both
ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
do whatever it wants to do. However, while a task is being enqueued for a
different reason, e.g. after its slice expiration, only ops.enqueue() is
called and there's no way for the BPF scheduler to directly dispatch to the
local DSQ of a remote CPU. This gap in API forces schedulers into
work-arounds which are not straightforward or optimal such as skipping
direct dispatches in such cases.

Implement deferred enqueueing to allow directly dispatching to the local DSQ
of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
safely released, the tasks are taken off the list and queued on the target
local DSQs using dispatch_to_local_dsq().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
---
 kernel/sched/ext.c   | 164 ++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |   3 +
 2 files changed, 149 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a88c51ce63a5..f4edc37bf89b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -889,6 +889,7 @@ static struct kobject *scx_root_kobj;
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched_ext.h>
 
+static void process_ddsp_deferred_locals(struct rq *rq);
 static void scx_bpf_kick_cpu(s32 cpu, u64 flags);
 static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
 					     s64 exit_code,
@@ -1363,6 +1364,63 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
 	return -EPROTO;
 }
 
+static void run_deferred(struct rq *rq)
+{
+	process_ddsp_deferred_locals(rq);
+}
+
+static void deferred_bal_cb_workfn(struct rq *rq)
+{
+	run_deferred(rq);
+}
+
+static void deferred_irq_workfn(struct irq_work *irq_work)
+{
+	struct rq *rq = container_of(irq_work, struct rq, scx.deferred_irq_work);
+
+	raw_spin_rq_lock(rq);
+	run_deferred(rq);
+	raw_spin_rq_unlock(rq);
+}
+
+/**
+ * schedule_deferred - Schedule execution of deferred actions on an rq
+ * @rq: target rq
+ *
+ * Schedule execution of deferred actions on @rq. Must be called with @rq
+ * locked. Deferred actions are executed with @rq locked but unpinned, and thus
+ * can unlock @rq to e.g. migrate tasks to other rqs.
+ */
+static void schedule_deferred(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+
+#ifdef CONFIG_SMP
+	/*
+	 * If in the middle of waking up a task, task_woken_scx() will be called
+	 * afterwards which will then run the deferred actions, no need to
+	 * schedule anything.
+	 */
+	if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
+		return;
+
+	/*
+	 * If in balance, the balance callbacks will be called before rq lock is
+	 * released. Schedule one.
+	 */
+	if (rq->scx.flags & SCX_RQ_IN_BALANCE)
+		queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
+				       deferred_bal_cb_workfn);
+#endif
+	/*
+	 * No scheduler hooks available. Queue an irq work. They are executed on
+	 * IRQ re-enable which may take a bit longer than the scheduler hooks.
+	 * The above WAKEUP and BALANCE paths should cover most of the cases and
+	 * the time to IRQ re-enable shouldn't be long.
+	 */
+	irq_work_queue(&rq->scx.deferred_irq_work);
+}
+
 /**
  * touch_core_sched - Update timestamp used for core-sched task ordering
  * @rq: rq to read clock from, must be locked
@@ -1578,7 +1636,13 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 	bool is_local = dsq == &rq->scx.local_dsq;
 
 	if (!dsq) {
-		WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
+		/*
+		 * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals.
+		 * Unlinking is all that's needed to cancel.
+		 */
+		if (unlikely(!list_empty(&p->scx.dsq_list.node)))
+			list_del_init(&p->scx.dsq_list.node);
+
 		/*
 		 * When dispatching directly from the BPF scheduler to a local
 		 * DSQ, the task isn't associated with any DSQ but
@@ -1587,6 +1651,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 		 */
 		if (p->scx.holding_cpu >= 0)
 			p->scx.holding_cpu = -1;
+
 		return;
 	}
 
@@ -1674,17 +1739,6 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
 		return;
 	}
 
-	/*
-	 * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
-	 * dispatching to the local DSQ of a different CPU requires unlocking
-	 * the current rq which isn't allowed in the enqueue path. Use
-	 * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
-	 */
-	if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
-		scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
-		return;
-	}
-
 	WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
 	WARN_ON_ONCE(p->scx.ddsp_enq_flags);
 
@@ -1694,13 +1748,58 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
 
 static void direct_dispatch(struct task_struct *p, u64 enq_flags)
 {
+	struct rq *rq = task_rq(p);
 	struct scx_dispatch_q *dsq;
+	u64 dsq_id = p->scx.ddsp_dsq_id;
+
+	touch_core_sched_dispatch(rq, p);
+
+	p->scx.ddsp_enq_flags |= enq_flags;
+
+	/*
+	 * We are in the enqueue path with @rq locked and pinned, and thus can't
+	 * double lock a remote rq and enqueue to its local DSQ. For
+	 * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer
+	 * the enqueue so that it's executed when @rq can be unlocked.
+	 */
+	if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+		unsigned long opss;
+
+		if (cpu == cpu_of(rq)) {
+			dsq_id = SCX_DSQ_LOCAL;
+			goto dispatch;
+		}
+
+		opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK;
+
+		switch (opss & SCX_OPSS_STATE_MASK) {
+		case SCX_OPSS_NONE:
+			break;
+		case SCX_OPSS_QUEUEING:
+			/*
+			 * As @p was never passed to the BPF side, _release is
+			 * not strictly necessary. Still do it for consistency.
+			 */
+			atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+			break;
+		default:
+			WARN_ONCE(true, "sched_ext: %s[%d] has invalid ops state 0x%lx in direct_dispatch()",
+				  p->comm, p->pid, opss);
+			atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+			break;
+		}
 
-	touch_core_sched_dispatch(task_rq(p), p);
+		WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
+		list_add_tail(&p->scx.dsq_list.node,
+			      &rq->scx.ddsp_deferred_locals);
+		schedule_deferred(rq);
+		return;
+	}
 
-	enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
-	dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p);
-	dispatch_enqueue(dsq, p, enq_flags);
+dispatch:
+	dsq = find_dsq_for_dispatch(rq, dsq_id, p);
+	dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
 static bool scx_rq_online(struct rq *rq)
@@ -2631,6 +2730,29 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 	}
 }
 
+static void process_ddsp_deferred_locals(struct rq *rq)
+{
+	struct task_struct *p, *tmp;
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Now that @rq can be unlocked, execute the deferred enqueueing of
+	 * tasks directly dispatched to the local DSQs of other CPUs. See
+	 * direct_dispatch().
+	 */
+	list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals,
+				 scx.dsq_list.node) {
+		s32 ret;
+
+		list_del_init(&p->scx.dsq_list.node);
+
+		ret = dispatch_to_local_dsq(rq, NULL, p->scx.ddsp_dsq_id, p,
+					    p->scx.ddsp_enq_flags);
+		WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
+	}
+}
+
 static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
 {
 #ifndef CONFIG_SMP
@@ -3052,6 +3174,11 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 	}
 }
 
+static void task_woken_scx(struct rq *rq, struct task_struct *p)
+{
+	run_deferred(rq);
+}
+
 static void set_cpus_allowed_scx(struct task_struct *p,
 				 struct affinity_context *ac)
 {
@@ -3568,8 +3695,6 @@ bool scx_can_stop_tick(struct rq *rq)
  *
  * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
  *   their current sched_class. Call them directly from sched core instead.
- *
- * - task_woken: Unnecessary.
  */
 DEFINE_SCHED_CLASS(ext) = {
 	.enqueue_task		= enqueue_task_scx,
@@ -3589,6 +3714,7 @@ DEFINE_SCHED_CLASS(ext) = {
 #ifdef CONFIG_SMP
 	.balance		= balance_scx,
 	.select_task_rq		= select_task_rq_scx,
+	.task_woken		= task_woken_scx,
 	.set_cpus_allowed	= set_cpus_allowed_scx,
 
 	.rq_online		= rq_online_scx,
@@ -5293,11 +5419,13 @@ void __init init_sched_ext_class(void)
 
 		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
 		INIT_LIST_HEAD(&rq->scx.runnable_list);
+		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
 
 		BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
 		BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL));
 		BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
 		BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL));
+		init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn);
 		init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
 
 		if (cpu_online(cpu))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a0e8052f6b0..be7be54484c0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -746,6 +746,7 @@ enum scx_rq_flags {
 struct scx_rq {
 	struct scx_dispatch_q	local_dsq;
 	struct list_head	runnable_list;		/* runnable tasks on this rq */
+	struct list_head	ddsp_deferred_locals;	/* deferred ddsps from enq */
 	unsigned long		ops_qseq;
 	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
 	u32			nr_running;
@@ -757,6 +758,8 @@ struct scx_rq {
 	cpumask_var_t		cpus_to_preempt;
 	cpumask_var_t		cpus_to_wait;
 	unsigned long		pnt_seq;
+	struct balance_callback	deferred_bal_cb;
+	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
 };
 #endif /* CONFIG_SCHED_CLASS_EXT */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues
  2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
                   ` (4 preceding siblings ...)
  2024-07-09 21:21 ` [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
@ 2024-07-09 21:21 ` Tejun Heo
  2024-07-10 19:25   ` David Vernet
  2024-07-10  9:01 ` [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Andrea Righi
  6 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea, Tejun Heo

Because there was no way to directly dispatch to the local DSQ of a remote
CPU from ops.enqueue(), scx_qmap skipped looking for an idle CPU on !wakeup
enqueues. This restriction was removed and schbed_ext now allows
SCX_DSQ_LOCAL_ON verdicts for direct dispatches.

Factor out pick_direct_dispatch_cpu() from ops.select_cpu() and use it to
direct dispatch from ops.enqueue() on !wakeup enqueues.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
---
 tools/sched_ext/scx_qmap.bpf.c | 39 ++++++++++++++++++++++++++--------
 tools/sched_ext/scx_qmap.c     |  5 +++--
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 27e35066a602..892278f12dce 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -120,11 +120,26 @@ struct {
 } cpu_ctx_stor SEC(".maps");
 
 /* Statistics */
-u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
+u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
 u64 nr_core_sched_execed;
 u32 cpuperf_min, cpuperf_avg, cpuperf_max;
 u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
 
+static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
+{
+	s32 cpu;
+
+	if (p->nr_cpus_allowed == 1 ||
+	    scx_bpf_test_and_clear_cpu_idle(prev_cpu))
+		return prev_cpu;
+
+	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+	if (cpu >= 0)
+		return cpu;
+
+	return -1;
+}
+
 s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
@@ -137,17 +152,14 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 		return -ESRCH;
 	}
 
-	if (p->nr_cpus_allowed == 1 ||
-	    scx_bpf_test_and_clear_cpu_idle(prev_cpu)) {
+	cpu = pick_direct_dispatch_cpu(p, prev_cpu);
+
+	if (cpu >= 0) {
 		tctx->force_local = true;
+		return cpu;
+	} else {
 		return prev_cpu;
 	}
-
-	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
-	if (cpu >= 0)
-		return cpu;
-
-	return prev_cpu;
 }
 
 static int weight_to_idx(u32 weight)
@@ -172,6 +184,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	u32 pid = p->pid;
 	int idx = weight_to_idx(p->scx.weight);
 	void *ring;
+	s32 cpu;
 
 	if (p->flags & PF_KTHREAD) {
 		if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
@@ -207,6 +220,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 		return;
 	}
 
+	/* if !WAKEUP, select_cpu() wasn't called, try direct dispatch */
+	if (!(enq_flags & SCX_ENQ_WAKEUP) &&
+	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
+		__sync_fetch_and_add(&nr_ddsp_from_enq, 1);
+		scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
+		return;
+	}
+
 	/*
 	 * If the task was re-enqueued due to the CPU being preempted by a
 	 * higher priority scheduling class, just re-enqueue the task directly
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 304f0488a386..c9ca30d62b2b 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -116,10 +116,11 @@ int main(int argc, char **argv)
 		long nr_enqueued = skel->bss->nr_enqueued;
 		long nr_dispatched = skel->bss->nr_dispatched;
 
-		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64"\n",
+		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
 		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
-		       skel->bss->nr_core_sched_execed);
+		       skel->bss->nr_core_sched_execed,
+		       skel->bss->nr_ddsp_from_enq);
 		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
 			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
 			       skel->bss->cpuperf_min,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq()
  2024-07-09 21:21 ` [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq() Tejun Heo
@ 2024-07-10  8:56   ` Andrea Righi
  2024-07-10 18:53   ` David Vernet
  1 sibling, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2024-07-10  8:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo

On Tue, Jul 09, 2024 at 11:21:08AM -1000, Tejun Heo wrote:
> task_linked_on_dsq() exists as a helper becauase it used to test both the

small nit: becauase -> because.

> rbtree and list nodes. It now only tests the list node and the list node
> will soon be used for something else too. The helper doesn't improve
> anything materially and the naming will become confusing. Open-code the list
> node testing and remove task_linked_on_dsq()
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> ---
>  kernel/sched/ext.c | 11 +++--------
>  1 file changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f16d72d72635..52340ac8038f 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1572,18 +1572,13 @@ static void task_unlink_from_dsq(struct task_struct *p,
>  	list_del_init(&p->scx.dsq_list.node);
>  }
>  
> -static bool task_linked_on_dsq(struct task_struct *p)
> -{
> -	return !list_empty(&p->scx.dsq_list.node);
> -}
> -
>  static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
>  {
>  	struct scx_dispatch_q *dsq = p->scx.dsq;
>  	bool is_local = dsq == &rq->scx.local_dsq;
>  
>  	if (!dsq) {
> -		WARN_ON_ONCE(task_linked_on_dsq(p));
> +		WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
>  		/*
>  		 * When dispatching directly from the BPF scheduler to a local
>  		 * DSQ, the task isn't associated with any DSQ but
> @@ -1604,7 +1599,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
>  	*/
>  	if (p->scx.holding_cpu < 0) {
>  		/* @p must still be on @dsq, dequeue */
> -		WARN_ON_ONCE(!task_linked_on_dsq(p));
> +		WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
>  		task_unlink_from_dsq(p, dsq);
>  		dsq_mod_nr(dsq, -1);
>  	} else {
> @@ -1614,7 +1609,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
>  		 * holding_cpu which tells dispatch_to_local_dsq() that it lost
>  		 * the race.
>  		 */
> -		WARN_ON_ONCE(task_linked_on_dsq(p));
> +		WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
>  		p->scx.holding_cpu = -1;
>  	}
>  	p->scx.dsq = NULL;
> -- 
> 2.45.2

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq()
  2024-07-09 21:21 ` [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq() Tejun Heo
@ 2024-07-10  8:59   ` Andrea Righi
  2024-07-10 11:41     ` Peter Zijlstra
  2024-07-10 16:10   ` David Vernet
  1 sibling, 1 reply; 21+ messages in thread
From: Andrea Righi @ 2024-07-10  8:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo

On Tue, Jul 09, 2024 at 11:21:09AM -1000, Tejun Heo wrote:
...
> @@ -2052,17 +2052,20 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
>  static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
>  				       struct rq *src_rq, struct rq *dst_rq)
>  {
> -	rq_unpin_lock(rq, rf);
> +	if (rf)
> +		rq_unpin_lock(rq, rf);
>  
>  	if (src_rq == dst_rq) {
>  		raw_spin_rq_unlock(rq);
>  		raw_spin_rq_lock(dst_rq);
>  	} else if (rq == src_rq) {
>  		double_lock_balance(rq, dst_rq);
> -		rq_repin_lock(rq, rf);
> +		if (rf)
> +			rq_repin_lock(rq, rf);
>  	} else if (rq == dst_rq) {
>  		double_lock_balance(rq, src_rq);
> -		rq_repin_lock(rq, rf);
> +		if (rf)
> +			rq_repin_lock(rq, rf);
>  	} else {
>  		raw_spin_rq_unlock(rq);
>  		double_rq_lock(src_rq, dst_rq);

Not a blocker, but would it make sense to provide some wrappers for
rq_unpin_lock() / rq_repin_lock() to simply return if rf == NULL?

Maybe it can help to make the code a bit more readable.

-Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
  2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
                   ` (5 preceding siblings ...)
  2024-07-09 21:21 ` [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues Tejun Heo
@ 2024-07-10  9:01 ` Andrea Righi
  6 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2024-07-10  9:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo

On Tue, Jul 09, 2024 at 11:21:06AM -1000, Tejun Heo wrote:
> Hello,
> 
> In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
> local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
> and ops.enqueue(), this isn't allowed. This is because dispatching to the
> local DSQ of a remote CPU requires locking both the task's current and new
> rq's and such double locking can't be done directly from ops.enqueue(). This
> gap in API forces schedulers into work-arounds which are not straightforward
> or optimal such as skipping direct dispatches in such cases.
> 
> This patchset implements SCX_DSQ_LOCAL_ON support for direct dispatches.
> 
> This patchset contains the following patches:
> 
>  0001-sched-Move-struct-balance_callback-definition-upward.patch
>  0002-sched_ext-Open-code-task_linked_on_dsq.patch
>  0003-sched_ext-Make-rf-optional-for-dispatch_to_local_dsq.patch
>  0004-sched_ext-s-SCX_RQ_BALANCING-SCX_RQ_IN_BALANCE-and-a.patch
>  0005-sched_ext-Allow-SCX_DSQ_LOCAL_ON-for-direct-dispatch.patch
>  0006-sched_ext-scx_qmap-Pick-idle-CPU-for-direct-dispatch.patch
> 
> While 0001 is a scheduler core change, all it does is moving the definition
> of struct balance_callback so that it's visible for struct scx_rq
> definition. If there are no objections, it'd make sense to route it through
> the sched_ext tree with the rest of the changes.

Apart for the minor nits, everything looks good to me. I did a quick
test inside virtme-ng and I haven't found any regression. I'll do more
tests with the new way of using SCX_DSQ_LOCAL_ON, but so far so good.
FWIW:

Tested-by: Andrea Righi <righi.andrea@gmail.com>

Thanks,
-Andrea

> 
> and is also available in the following git branch:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-ddsp-local-on
> 
>  kernel/sched/ext.c             |  207 +++++++++++++++++++++++++++++++++++++++++++++++++++------------
>  kernel/sched/sched.h           |   19 +++--
>  tools/sched_ext/scx_qmap.bpf.c |   39 +++++++++--
>  tools/sched_ext/scx_qmap.c     |    5 -
>  4 files changed, 215 insertions(+), 55 deletions(-)
> 
> --
> tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq()
  2024-07-10  8:59   ` Andrea Righi
@ 2024-07-10 11:41     ` Peter Zijlstra
  2024-07-10 16:39       ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2024-07-10 11:41 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, void, linux-kernel, kernel-team, schatzberg.dan, mingo,
	changwoo

On Wed, Jul 10, 2024 at 10:59:29AM +0200, Andrea Righi wrote:
> On Tue, Jul 09, 2024 at 11:21:09AM -1000, Tejun Heo wrote:
> ...
> > @@ -2052,17 +2052,20 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
> >  static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
> >  				       struct rq *src_rq, struct rq *dst_rq)
> >  {
> > -	rq_unpin_lock(rq, rf);
> > +	if (rf)
> > +		rq_unpin_lock(rq, rf);
> >  
> >  	if (src_rq == dst_rq) {
> >  		raw_spin_rq_unlock(rq);
> >  		raw_spin_rq_lock(dst_rq);
> >  	} else if (rq == src_rq) {
> >  		double_lock_balance(rq, dst_rq);
> > -		rq_repin_lock(rq, rf);
> > +		if (rf)
> > +			rq_repin_lock(rq, rf);
> >  	} else if (rq == dst_rq) {
> >  		double_lock_balance(rq, src_rq);
> > -		rq_repin_lock(rq, rf);
> > +		if (rf)
> > +			rq_repin_lock(rq, rf);
> >  	} else {
> >  		raw_spin_rq_unlock(rq);
> >  		double_rq_lock(src_rq, dst_rq);
> 
> Not a blocker, but would it make sense to provide some wrappers for
> rq_unpin_lock() / rq_repin_lock() to simply return if rf == NULL?

There are very limited contexts where unpin is sound. I have no idea if
this is one of them or not.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq()
  2024-07-09 21:21 ` [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq() Tejun Heo
  2024-07-10  8:59   ` Andrea Righi
@ 2024-07-10 16:10   ` David Vernet
  2024-07-10 16:46     ` Tejun Heo
  1 sibling, 1 reply; 21+ messages in thread
From: David Vernet @ 2024-07-10 16:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

[-- Attachment #1: Type: text/plain, Size: 4167 bytes --]

On Tue, Jul 09, 2024 at 11:21:09AM -1000, Tejun Heo wrote:
> dispatch_to_local_dsq() may unlock the current rq when dispatching to a
> local DSQ on a different CPU. This function is currently called from the
> balance path where the rq lock is pinned and the associated rq_flags is
> available to unpin it.
> 
> dispatch_to_local_dsq() will be used to implement direct dispatch to a local
> DSQ on a remote CPU from the enqueue path where it will be called with rq
> locked but not pinned. Make @rf optional so that the function can be used
> from a context which doesn't pin the rq lock.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> ---
>  kernel/sched/ext.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 52340ac8038f..e96727460df2 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2040,7 +2040,7 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
>  /**
>   * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked
>   * @rq: current rq which is locked
> - * @rf: rq_flags to use when unlocking @rq
> + * @rf: optional rq_flags to use when unlocking @rq if its lock is pinned
>   * @src_rq: rq to move task from
>   * @dst_rq: rq to move task to
>   *
> @@ -2052,17 +2052,20 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
>  static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
>  				       struct rq *src_rq, struct rq *dst_rq)
>  {
> -	rq_unpin_lock(rq, rf);
> +	if (rf)
> +		rq_unpin_lock(rq, rf);
>  
>  	if (src_rq == dst_rq) {
>  		raw_spin_rq_unlock(rq);
>  		raw_spin_rq_lock(dst_rq);
>  	} else if (rq == src_rq) {
>  		double_lock_balance(rq, dst_rq);
> -		rq_repin_lock(rq, rf);
> +		if (rf)
> +			rq_repin_lock(rq, rf);
>  	} else if (rq == dst_rq) {
>  		double_lock_balance(rq, src_rq);
> -		rq_repin_lock(rq, rf);
> +		if (rf)
> +			rq_repin_lock(rq, rf);

It feels kind of weird to have the callee need to know about pinning
requirements in the caller instead of vice versa. I mean, I guess it's inherent
to doing an unpin / repin inside of a locked region, but it'd be nice if we
could minimize the amount of variance in that codepath regardless. I think what
you have is correct, but maybe it'd simpler if we just pinned in the caller on
the enqueue path? That way the semantics of when locks can be dropped is
consistent in dispatch_to_local_dsq().

>  	} else {
>  		raw_spin_rq_unlock(rq);
>  		double_rq_lock(src_rq, dst_rq);
> @@ -2072,7 +2075,7 @@ static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
>  /**
>   * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock()
>   * @rq: current rq which is locked
> - * @rf: rq_flags to use when unlocking @rq
> + * @rf: optional rq_flags to use when unlocking @rq if its lock is pinned
>   * @src_rq: rq to move task from
>   * @dst_rq: rq to move task to
>   *
> @@ -2084,7 +2087,8 @@ static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
>  	if (src_rq == dst_rq) {
>  		raw_spin_rq_unlock(dst_rq);
>  		raw_spin_rq_lock(rq);
> -		rq_repin_lock(rq, rf);
> +		if (rf)
> +			rq_repin_lock(rq, rf);
>  	} else if (rq == src_rq) {
>  		double_unlock_balance(rq, dst_rq);
>  	} else if (rq == dst_rq) {
> @@ -2092,7 +2096,8 @@ static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
>  	} else {
>  		double_rq_unlock(src_rq, dst_rq);
>  		raw_spin_rq_lock(rq);
> -		rq_repin_lock(rq, rf);
> +		if (rf)
> +			rq_repin_lock(rq, rf);
>  	}
>  }
>  #endif	/* CONFIG_SMP */
> @@ -2214,7 +2219,7 @@ enum dispatch_to_local_dsq_ret {
>  /**
>   * dispatch_to_local_dsq - Dispatch a task to a local dsq
>   * @rq: current rq which is locked
> - * @rf: rq_flags to use when unlocking @rq
> + * @rf: optional rq_flags to use when unlocking @rq if its lock is pinned
>   * @dsq_id: destination dsq ID
>   * @p: task to dispatch
>   * @enq_flags: %SCX_ENQ_*
> -- 
> 2.45.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq()
  2024-07-10 11:41     ` Peter Zijlstra
@ 2024-07-10 16:39       ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2024-07-10 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, void, linux-kernel, kernel-team, schatzberg.dan,
	mingo, changwoo

Hello,

On Wed, Jul 10, 2024 at 01:41:42PM +0200, Peter Zijlstra wrote:
...
> > Not a blocker, but would it make sense to provide some wrappers for
> > rq_unpin_lock() / rq_repin_lock() to simply return if rf == NULL?

Will address this part later.

> There are very limited contexts where unpin is sound. I have no idea if
> this is one of them or not.

This is called from balance to migrate tasks across rq's, which is where the
@rf for the unpin/repin is coming from. This patchset makes the same code
path called from other places, e.g. ->task_woken(), where the rq lock is
already unpinned. It could be that the better thing to do here is just
unpinning from the balance()'s context so that the inner functions don't
have to worry about lock pinning. They're always called in a context where
the rq lock can be dropped after all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq()
  2024-07-10 16:10   ` David Vernet
@ 2024-07-10 16:46     ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2024-07-10 16:46 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

Hello,

On Wed, Jul 10, 2024 at 11:10:25AM -0500, David Vernet wrote:
> >  static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
> >  				       struct rq *src_rq, struct rq *dst_rq)
> >  {
> > -	rq_unpin_lock(rq, rf);
> > +	if (rf)
> > +		rq_unpin_lock(rq, rf);
> >  
> >  	if (src_rq == dst_rq) {
> >  		raw_spin_rq_unlock(rq);
> >  		raw_spin_rq_lock(dst_rq);
> >  	} else if (rq == src_rq) {
> >  		double_lock_balance(rq, dst_rq);
> > -		rq_repin_lock(rq, rf);
> > +		if (rf)
> > +			rq_repin_lock(rq, rf);
> >  	} else if (rq == dst_rq) {
> >  		double_lock_balance(rq, src_rq);
> > -		rq_repin_lock(rq, rf);
> > +		if (rf)
> > +			rq_repin_lock(rq, rf);
> 
> It feels kind of weird to have the callee need to know about pinning
> requirements in the caller instead of vice versa. I mean, I guess it's inherent
> to doing an unpin / repin inside of a locked region, but it'd be nice if we
> could minimize the amount of variance in that codepath regardless. I think what
> you have is correct, but maybe it'd simpler if we just pinned in the caller on
> the enqueue path? That way the semantics of when locks can be dropped is
> consistent in dispatch_to_local_dsq().

Yeah, it's kinda nasty. I think going the other direction probalby is
better. The reason why we're doing unpin/repin down in the callstack is
because this is being called from sched_class->balance() which is invoked
with rq locked and pinned, but also with @rf so that the lock can be
unpinned. There isn't much that can benefit from extending rq lock pinning
deeper into balance implementation, so it probably makes more sense to do so
in the outer balance function so that the inner functions don't have to
worry about it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
  2024-07-09 21:21 ` [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
@ 2024-07-10 18:51   ` David Vernet
  2024-07-10 23:43     ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: David Vernet @ 2024-07-10 18:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

[-- Attachment #1: Type: text/plain, Size: 10875 bytes --]

On Tue, Jul 09, 2024 at 11:21:11AM -1000, Tejun Heo wrote:
> In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
> local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
> and ops.enqueue(), this isn't allowed. This is because dispatching to the
> local DSQ of a remote CPU requires locking both the task's current and new
> rq's and such double locking can't be done directly from ops.enqueue().
> 
> While waking up a task, as ops.select_cpu() can pick any CPU and both
> ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
> target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
> do whatever it wants to do. However, while a task is being enqueued for a
> different reason, e.g. after its slice expiration, only ops.enqueue() is
> called and there's no way for the BPF scheduler to directly dispatch to the
> local DSQ of a remote CPU. This gap in API forces schedulers into
> work-arounds which are not straightforward or optimal such as skipping
> direct dispatches in such cases.
> 
> Implement deferred enqueueing to allow directly dispatching to the local DSQ
> of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
> temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
> safely released, the tasks are taken off the list and queued on the target
> local DSQs using dispatch_to_local_dsq().
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Changwoo Min <changwoo@igalia.com>
> Cc: Andrea Righi <righi.andrea@gmail.com>

Hi Tejun,

Overall this looks great. It's really not very complicated at all which is
great to see, but will be a big usability improvement for schedulers so it's a
clear win. Just left a couple comments below.

> ---
>  kernel/sched/ext.c   | 164 ++++++++++++++++++++++++++++++++++++++-----
>  kernel/sched/sched.h |   3 +
>  2 files changed, 149 insertions(+), 18 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a88c51ce63a5..f4edc37bf89b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -889,6 +889,7 @@ static struct kobject *scx_root_kobj;
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/sched_ext.h>
>  
> +static void process_ddsp_deferred_locals(struct rq *rq);
>  static void scx_bpf_kick_cpu(s32 cpu, u64 flags);
>  static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
>  					     s64 exit_code,
> @@ -1363,6 +1364,63 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
>  	return -EPROTO;
>  }
>  
> +static void run_deferred(struct rq *rq)
> +{
> +	process_ddsp_deferred_locals(rq);
> +}
> +
> +static void deferred_bal_cb_workfn(struct rq *rq)
> +{
> +	run_deferred(rq);
> +}
> +
> +static void deferred_irq_workfn(struct irq_work *irq_work)
> +{
> +	struct rq *rq = container_of(irq_work, struct rq, scx.deferred_irq_work);
> +
> +	raw_spin_rq_lock(rq);
> +	run_deferred(rq);
> +	raw_spin_rq_unlock(rq);
> +}
> +
> +/**
> + * schedule_deferred - Schedule execution of deferred actions on an rq
> + * @rq: target rq
> + *
> + * Schedule execution of deferred actions on @rq. Must be called with @rq
> + * locked. Deferred actions are executed with @rq locked but unpinned, and thus
> + * can unlock @rq to e.g. migrate tasks to other rqs.
> + */
> +static void schedule_deferred(struct rq *rq)
> +{
> +	lockdep_assert_rq_held(rq);
> +
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If in the middle of waking up a task, task_woken_scx() will be called
> +	 * afterwards which will then run the deferred actions, no need to
> +	 * schedule anything.
> +	 */
> +	if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
> +		return;
> +
> +	/*
> +	 * If in balance, the balance callbacks will be called before rq lock is
> +	 * released. Schedule one.
> +	 */
> +	if (rq->scx.flags & SCX_RQ_IN_BALANCE)
> +		queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> +				       deferred_bal_cb_workfn);

Should we be returning above if we're able to use a balance cb?

Also, should we maybe add a WARN_ON_ONCE(rq->balance_callback ==
&balance_push_callback)? I could see that being unnecessary given that we
should never be hitting this path when the CPU is deactivated anyways, but the
push callback thing is a kindn extraneous implementation detail so might be
worth guarding against it just in case.

> +#endif
> +	/*
> +	 * No scheduler hooks available. Queue an irq work. They are executed on
> +	 * IRQ re-enable which may take a bit longer than the scheduler hooks.
> +	 * The above WAKEUP and BALANCE paths should cover most of the cases and
> +	 * the time to IRQ re-enable shouldn't be long.
> +	 */
> +	irq_work_queue(&rq->scx.deferred_irq_work);
> +}
> +
>  /**
>   * touch_core_sched - Update timestamp used for core-sched task ordering
>   * @rq: rq to read clock from, must be locked
> @@ -1578,7 +1636,13 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
>  	bool is_local = dsq == &rq->scx.local_dsq;
>  
>  	if (!dsq) {
> -		WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
> +		/*
> +		 * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals.
> +		 * Unlinking is all that's needed to cancel.
> +		 */
> +		if (unlikely(!list_empty(&p->scx.dsq_list.node)))
> +			list_del_init(&p->scx.dsq_list.node);
> +
>  		/*
>  		 * When dispatching directly from the BPF scheduler to a local
>  		 * DSQ, the task isn't associated with any DSQ but
> @@ -1587,6 +1651,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
>  		 */
>  		if (p->scx.holding_cpu >= 0)
>  			p->scx.holding_cpu = -1;
> +
>  		return;
>  	}
>  
> @@ -1674,17 +1739,6 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
>  		return;
>  	}
>  
> -	/*
> -	 * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
> -	 * dispatching to the local DSQ of a different CPU requires unlocking
> -	 * the current rq which isn't allowed in the enqueue path. Use
> -	 * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
> -	 */
> -	if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
> -		scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
> -		return;
> -	}
> -
>  	WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
>  	WARN_ON_ONCE(p->scx.ddsp_enq_flags);
>  
> @@ -1694,13 +1748,58 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
>  
>  static void direct_dispatch(struct task_struct *p, u64 enq_flags)
>  {
> +	struct rq *rq = task_rq(p);
>  	struct scx_dispatch_q *dsq;
> +	u64 dsq_id = p->scx.ddsp_dsq_id;
> +
> +	touch_core_sched_dispatch(rq, p);
> +
> +	p->scx.ddsp_enq_flags |= enq_flags;
> +
> +	/*
> +	 * We are in the enqueue path with @rq locked and pinned, and thus can't
> +	 * double lock a remote rq and enqueue to its local DSQ. For
> +	 * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer
> +	 * the enqueue so that it's executed when @rq can be unlocked.
> +	 */
> +	if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
> +		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
> +		unsigned long opss;
> +
> +		if (cpu == cpu_of(rq)) {
> +			dsq_id = SCX_DSQ_LOCAL;
> +			goto dispatch;
> +		}
> +
> +		opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK;
> +
> +		switch (opss & SCX_OPSS_STATE_MASK) {
> +		case SCX_OPSS_NONE:
> +			break;
> +		case SCX_OPSS_QUEUEING:
> +			/*
> +			 * As @p was never passed to the BPF side, _release is
> +			 * not strictly necessary. Still do it for consistency.
> +			 */
> +			atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
> +			break;
> +		default:
> +			WARN_ONCE(true, "sched_ext: %s[%d] has invalid ops state 0x%lx in direct_dispatch()",
> +				  p->comm, p->pid, opss);
> +			atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
> +			break;
> +		}
>  
> -	touch_core_sched_dispatch(task_rq(p), p);
> +		WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
> +		list_add_tail(&p->scx.dsq_list.node,
> +			      &rq->scx.ddsp_deferred_locals);
> +		schedule_deferred(rq);
> +		return;
> +	}
>  
> -	enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
> -	dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p);
> -	dispatch_enqueue(dsq, p, enq_flags);
> +dispatch:
> +	dsq = find_dsq_for_dispatch(rq, dsq_id, p);
> +	dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
>  }
>  
>  static bool scx_rq_online(struct rq *rq)
> @@ -2631,6 +2730,29 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
>  	}
>  }
>  
> +static void process_ddsp_deferred_locals(struct rq *rq)
> +{
> +	struct task_struct *p, *tmp;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Now that @rq can be unlocked, execute the deferred enqueueing of
> +	 * tasks directly dispatched to the local DSQs of other CPUs. See
> +	 * direct_dispatch().
> +	 */
> +	list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals,
> +				 scx.dsq_list.node) {
> +		s32 ret;
> +
> +		list_del_init(&p->scx.dsq_list.node);
> +
> +		ret = dispatch_to_local_dsq(rq, NULL, p->scx.ddsp_dsq_id, p,
> +					    p->scx.ddsp_enq_flags);
> +		WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
> +	}

As mentioned in the other thread, it might be simplest to just pin and unpin
around this loop here to keep the logic simpler in the callee.

> +}
> +
>  static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
>  {
>  #ifndef CONFIG_SMP
> @@ -3052,6 +3174,11 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>  	}
>  }
>  
> +static void task_woken_scx(struct rq *rq, struct task_struct *p)
> +{
> +	run_deferred(rq);
> +}
> +
>  static void set_cpus_allowed_scx(struct task_struct *p,
>  				 struct affinity_context *ac)
>  {
> @@ -3568,8 +3695,6 @@ bool scx_can_stop_tick(struct rq *rq)
>   *
>   * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
>   *   their current sched_class. Call them directly from sched core instead.
> - *
> - * - task_woken: Unnecessary.
>   */
>  DEFINE_SCHED_CLASS(ext) = {
>  	.enqueue_task		= enqueue_task_scx,
> @@ -3589,6 +3714,7 @@ DEFINE_SCHED_CLASS(ext) = {
>  #ifdef CONFIG_SMP
>  	.balance		= balance_scx,
>  	.select_task_rq		= select_task_rq_scx,
> +	.task_woken		= task_woken_scx,

Should we update the comment in the caller in core.c given that rq is no longer
only used for statistics tracking?

>  	.set_cpus_allowed	= set_cpus_allowed_scx,
>  
>  	.rq_online		= rq_online_scx,

[...]

Thanks,
David

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] sched: Move struct balance_callback definition upward
  2024-07-09 21:21 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
@ 2024-07-10 18:53   ` David Vernet
  0 siblings, 0 replies; 21+ messages in thread
From: David Vernet @ 2024-07-10 18:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

[-- Attachment #1: Type: text/plain, Size: 498 bytes --]

On Tue, Jul 09, 2024 at 11:21:07AM -1000, Tejun Heo wrote:
> Move struct balance_callback definition upward so that it's visible to
> class-specific rq struct definitions. This will be used to embed a struct
> balance_callback in struct scx_rq.
> 
> No functional changes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>

Acked-by: David Vernet <void@manifault.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq()
  2024-07-09 21:21 ` [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq() Tejun Heo
  2024-07-10  8:56   ` Andrea Righi
@ 2024-07-10 18:53   ` David Vernet
  1 sibling, 0 replies; 21+ messages in thread
From: David Vernet @ 2024-07-10 18:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

[-- Attachment #1: Type: text/plain, Size: 549 bytes --]

On Tue, Jul 09, 2024 at 11:21:08AM -1000, Tejun Heo wrote:
> task_linked_on_dsq() exists as a helper becauase it used to test both the
> rbtree and list nodes. It now only tests the list node and the list node
> will soon be used for something else too. The helper doesn't improve
> anything materially and the naming will become confusing. Open-code the list
> node testing and remove task_linked_on_dsq()
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>

Acked-by: David Vernet <void@manifault.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP
  2024-07-09 21:21 ` [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP Tejun Heo
@ 2024-07-10 18:54   ` David Vernet
  0 siblings, 0 replies; 21+ messages in thread
From: David Vernet @ 2024-07-10 18:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

On Tue, Jul 09, 2024 at 11:21:10AM -1000, Tejun Heo wrote:
> SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
> Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
> the rq is currently enqueueing for a wakeup. This will be used to implement
> direct dispatching to local DSQ of another CPU.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>

Acked-by: David Vernet <void@manifault.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues
  2024-07-09 21:21 ` [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues Tejun Heo
@ 2024-07-10 19:25   ` David Vernet
  2024-07-11  0:05     ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: David Vernet @ 2024-07-10 19:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

[-- Attachment #1: Type: text/plain, Size: 5282 bytes --]

On Tue, Jul 09, 2024 at 11:21:12AM -1000, Tejun Heo wrote:
> Because there was no way to directly dispatch to the local DSQ of a remote
> CPU from ops.enqueue(), scx_qmap skipped looking for an idle CPU on !wakeup
> enqueues. This restriction was removed and schbed_ext now allows

s/schbed_ext/sched_ext

> SCX_DSQ_LOCAL_ON verdicts for direct dispatches.
> 
> Factor out pick_direct_dispatch_cpu() from ops.select_cpu() and use it to
> direct dispatch from ops.enqueue() on !wakeup enqueues.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Changwoo Min <changwoo@igalia.com>
> Cc: Andrea Righi <righi.andrea@gmail.com>

Hi Tejun,

This LG as is, but I also left a comment below in case we want to tweak. Feel
free to just apply the tag if you'd rather not iterate given that this is just
an example scheduler.

Acked-by: David Vernet <void@manifault.com>

> ---
>  tools/sched_ext/scx_qmap.bpf.c | 39 ++++++++++++++++++++++++++--------
>  tools/sched_ext/scx_qmap.c     |  5 +++--
>  2 files changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index 27e35066a602..892278f12dce 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -120,11 +120,26 @@ struct {
>  } cpu_ctx_stor SEC(".maps");
>  
>  /* Statistics */
> -u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
> +u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
>  u64 nr_core_sched_execed;
>  u32 cpuperf_min, cpuperf_avg, cpuperf_max;
>  u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
>  
> +static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
> +{
> +	s32 cpu;
> +
> +	if (p->nr_cpus_allowed == 1 ||
> +	    scx_bpf_test_and_clear_cpu_idle(prev_cpu))
> +		return prev_cpu;
> +
> +	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
> +	if (cpu >= 0)
> +		return cpu;
> +
> +	return -1;
> +}
> +
>  s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
>  		   s32 prev_cpu, u64 wake_flags)
>  {
> @@ -137,17 +152,14 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
>  		return -ESRCH;
>  	}
>  
> -	if (p->nr_cpus_allowed == 1 ||
> -	    scx_bpf_test_and_clear_cpu_idle(prev_cpu)) {
> +	cpu = pick_direct_dispatch_cpu(p, prev_cpu);
> +
> +	if (cpu >= 0) {
>  		tctx->force_local = true;
> +		return cpu;
> +	} else {
>  		return prev_cpu;
>  	}
> -
> -	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
> -	if (cpu >= 0)
> -		return cpu;
> -
> -	return prev_cpu;
>  }
>  
>  static int weight_to_idx(u32 weight)
> @@ -172,6 +184,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
>  	u32 pid = p->pid;
>  	int idx = weight_to_idx(p->scx.weight);
>  	void *ring;
> +	s32 cpu;
>  
>  	if (p->flags & PF_KTHREAD) {
>  		if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
> @@ -207,6 +220,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
>  		return;
>  	}
>  
> +	/* if !WAKEUP, select_cpu() wasn't called, try direct dispatch */
> +	if (!(enq_flags & SCX_ENQ_WAKEUP) &&
> +	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
> +		__sync_fetch_and_add(&nr_ddsp_from_enq, 1);
> +		scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
> +		return;
> +	}

Hmm, will this be a typical pattern for how this is used? I'd expect
ops.select_cpu() and ops.enqueue() to quite often be nearly the same
implementation. Meaning you would e.g. try to find an idle core in both, and do
SCX_DSQ_LOCAL_ON, with the difference being that you'd just return the cpu and
save the extra lock juggling if you did it on the ops.select_cpu() path. Not a
huge deal given that it's just an example scheduler, but it might be a good
idea to try and mirror typical use cases for that reason as well so readers get
an idea of what a typical pattern would look like.

> +
>  	/*
>  	 * If the task was re-enqueued due to the CPU being preempted by a
>  	 * higher priority scheduling class, just re-enqueue the task directly
> diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
> index 304f0488a386..c9ca30d62b2b 100644
> --- a/tools/sched_ext/scx_qmap.c
> +++ b/tools/sched_ext/scx_qmap.c
> @@ -116,10 +116,11 @@ int main(int argc, char **argv)
>  		long nr_enqueued = skel->bss->nr_enqueued;
>  		long nr_dispatched = skel->bss->nr_dispatched;
>  
> -		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64"\n",
> +		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
>  		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
>  		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
> -		       skel->bss->nr_core_sched_execed);
> +		       skel->bss->nr_core_sched_execed,
> +		       skel->bss->nr_ddsp_from_enq);
>  		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
>  			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
>  			       skel->bss->cpuperf_min,
> -- 
> 2.45.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
  2024-07-10 18:51   ` David Vernet
@ 2024-07-10 23:43     ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2024-07-10 23:43 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

Hello,

On Wed, Jul 10, 2024 at 01:51:35PM -0500, David Vernet wrote:
...
> > +	/*
> > +	 * If in balance, the balance callbacks will be called before rq lock is
> > +	 * released. Schedule one.
> > +	 */
> > +	if (rq->scx.flags & SCX_RQ_IN_BALANCE)
> > +		queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> > +				       deferred_bal_cb_workfn);
> 
> Should we be returning above if we're able to use a balance cb?

Oh yeah, it definitely should.

> Also, should we maybe add a WARN_ON_ONCE(rq->balance_callback ==
> &balance_push_callback)? I could see that being unnecessary given that we
> should never be hitting this path when the CPU is deactivated anyways, but the
> push callback thing is a kindn extraneous implementation detail so might be
> worth guarding against it just in case.

I'm not sure about that. It feels like a sched core detail which is better
left within sched core code rather than pushing to queue_balance_callback()
users. The deferred execution mechanism itself doesn't really care about how
the callback is run or the state of the CPU.

...
> > +static void process_ddsp_deferred_locals(struct rq *rq)
> > +{
> > +	struct task_struct *p, *tmp;
> > +
> > +	lockdep_assert_rq_held(rq);
> > +
> > +	/*
> > +	 * Now that @rq can be unlocked, execute the deferred enqueueing of
> > +	 * tasks directly dispatched to the local DSQs of other CPUs. See
> > +	 * direct_dispatch().
> > +	 */
> > +	list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals,
> > +				 scx.dsq_list.node) {
> > +		s32 ret;
> > +
> > +		list_del_init(&p->scx.dsq_list.node);
> > +
> > +		ret = dispatch_to_local_dsq(rq, NULL, p->scx.ddsp_dsq_id, p,
> > +					    p->scx.ddsp_enq_flags);
> > +		WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
> > +	}
> 
> As mentioned in the other thread, it might be simplest to just pin and unpin
> around this loop here to keep the logic simpler in the callee.

Yeah, I'm moving unpinning/repinning to outer balance function.

> > @@ -3589,6 +3714,7 @@ DEFINE_SCHED_CLASS(ext) = {
> >  #ifdef CONFIG_SMP
> >  	.balance		= balance_scx,
> >  	.select_task_rq		= select_task_rq_scx,
> > +	.task_woken		= task_woken_scx,
> 
> Should we update the comment in the caller in core.c given that rq is no longer
> only used for statistics tracking?

task_woken_rt() is already doing push_rt_task() which is kinda similar to
what scx is doing, so the comment was already stale. Yeah, please go ahead
and send a patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues
  2024-07-10 19:25   ` David Vernet
@ 2024-07-11  0:05     ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2024-07-11  0:05 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
	changwoo, righi.andrea

Hello,

On Wed, Jul 10, 2024 at 02:25:23PM -0500, David Vernet wrote:
...
> > +	/* if !WAKEUP, select_cpu() wasn't called, try direct dispatch */
> > +	if (!(enq_flags & SCX_ENQ_WAKEUP) &&
> > +	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
> > +		__sync_fetch_and_add(&nr_ddsp_from_enq, 1);
> > +		scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
> > +		return;
> > +	}
> 
> Hmm, will this be a typical pattern for how this is used? I'd expect
> ops.select_cpu() and ops.enqueue() to quite often be nearly the same
> implementation. Meaning you would e.g. try to find an idle core in both, and do
> SCX_DSQ_LOCAL_ON, with the difference being that you'd just return the cpu and
> save the extra lock juggling if you did it on the ops.select_cpu() path. Not a
> huge deal given that it's just an example scheduler, but it might be a good
> idea to try and mirror typical use cases for that reason as well so readers get
> an idea of what a typical pattern would look like.

scx_qmap is a bit special in that it wants to be able to skip n'th tasks to
test the stall detection mechanism. We can't express that in select_cpu() as
not dispatching in select_cpu() just means that enqueue() will be invoked,
so it just funnels all tasks, even direct dispatch ones, to enqueue(), so it
looks a bit different from other schedulers. scx_qmap is mostly used to
exercise different code paths and is rather messy for anyone to use it as a
template. It'd probably be useful to note that in the comment.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-07-11  0:05 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-09 21:21 [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
2024-07-09 21:21 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
2024-07-10 18:53   ` David Vernet
2024-07-09 21:21 ` [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq() Tejun Heo
2024-07-10  8:56   ` Andrea Righi
2024-07-10 18:53   ` David Vernet
2024-07-09 21:21 ` [PATCH 3/6] sched_ext: Make @rf optional for dispatch_to_local_dsq() Tejun Heo
2024-07-10  8:59   ` Andrea Righi
2024-07-10 11:41     ` Peter Zijlstra
2024-07-10 16:39       ` Tejun Heo
2024-07-10 16:10   ` David Vernet
2024-07-10 16:46     ` Tejun Heo
2024-07-09 21:21 ` [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP Tejun Heo
2024-07-10 18:54   ` David Vernet
2024-07-09 21:21 ` [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
2024-07-10 18:51   ` David Vernet
2024-07-10 23:43     ` Tejun Heo
2024-07-09 21:21 ` [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues Tejun Heo
2024-07-10 19:25   ` David Vernet
2024-07-11  0:05     ` Tejun Heo
2024-07-10  9:01 ` [PATCHSET sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox