* [PATCH 1/6] sched: Move struct balance_callback definition upward
2024-07-11 1:13 [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
@ 2024-07-11 1:13 ` Tejun Heo
2024-07-11 21:06 ` Tejun Heo
2024-07-11 1:13 ` [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq() Tejun Heo
` (5 subsequent siblings)
6 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2024-07-11 1:13 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea, Tejun Heo
Move struct balance_callback definition upward so that it's visible to
class-specific rq struct definitions. This will be used to embed a struct
balance_callback in struct scx_rq.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
kernel/sched/sched.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04184e87ba7c..86314a17f1c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -615,6 +615,11 @@ do { \
# define u64_u32_load(var) u64_u32_load_copy(var, var##_copy)
# define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val)
+struct balance_callback {
+ struct balance_callback *next;
+ void (*func)(struct rq *rq);
+};
+
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
@@ -1054,11 +1059,6 @@ struct uclamp_rq {
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */
-struct balance_callback {
- struct balance_callback *next;
- void (*func)(struct rq *rq);
-};
-
/*
* This is the main, per-CPU runqueue data structure.
*
--
2.45.2
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/6] sched: Move struct balance_callback definition upward
2024-07-11 1:13 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
@ 2024-07-11 21:06 ` Tejun Heo
0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-07-11 21:06 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea
Hello,
On Wed, Jul 10, 2024 at 03:13:58PM -1000, Tejun Heo wrote:
> Move struct balance_callback definition upward so that it's visible to
> class-specific rq struct definitions. This will be used to embed a struct
> balance_callback in struct scx_rq.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: David Vernet <void@manifault.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
Unless there are objections, as this is really trivial, I'll apply it
together with other patches. Please holler if there are any concerns.
Thanks.
--
tejun
* [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq()
2024-07-11 1:13 [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
2024-07-11 1:13 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
@ 2024-07-11 1:13 ` Tejun Heo
2024-07-11 1:14 ` [PATCH 3/6] sched_ext: Unpin and repin rq lock from balance_scx() Tejun Heo
` (4 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-07-11 1:13 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea, Tejun Heo
task_linked_on_dsq() exists as a helper because it used to test both the
rbtree and list nodes. It now only tests the list node and the list node
will soon be used for something else too. The helper doesn't improve
anything materially and the naming will become confusing. Open-code the list
node testing and remove task_linked_on_dsq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f4a7abcec793..62574d980409 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1572,18 +1572,13 @@ static void task_unlink_from_dsq(struct task_struct *p,
list_del_init(&p->scx.dsq_list.node);
}
-static bool task_linked_on_dsq(struct task_struct *p)
-{
- return !list_empty(&p->scx.dsq_list.node);
-}
-
static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
{
struct scx_dispatch_q *dsq = p->scx.dsq;
bool is_local = dsq == &rq->scx.local_dsq;
if (!dsq) {
- WARN_ON_ONCE(task_linked_on_dsq(p));
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
/*
* When dispatching directly from the BPF scheduler to a local
* DSQ, the task isn't associated with any DSQ but
@@ -1604,7 +1599,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
*/
if (p->scx.holding_cpu < 0) {
/* @p must still be on @dsq, dequeue */
- WARN_ON_ONCE(!task_linked_on_dsq(p));
+ WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
task_unlink_from_dsq(p, dsq);
dsq_mod_nr(dsq, -1);
} else {
@@ -1614,7 +1609,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
* holding_cpu which tells dispatch_to_local_dsq() that it lost
* the race.
*/
- WARN_ON_ONCE(task_linked_on_dsq(p));
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
p->scx.holding_cpu = -1;
}
p->scx.dsq = NULL;
--
2.45.2
* [PATCH 3/6] sched_ext: Unpin and repin rq lock from balance_scx()
2024-07-11 1:13 [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
2024-07-11 1:13 ` [PATCH 1/6] sched: Move struct balance_callback definition upward Tejun Heo
2024-07-11 1:13 ` [PATCH 2/6] sched_ext: Open-code task_linked_on_dsq() Tejun Heo
@ 2024-07-11 1:14 ` Tejun Heo
2024-07-11 2:05 ` David Vernet
2024-07-11 1:14 ` [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP Tejun Heo
` (3 subsequent siblings)
6 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2024-07-11 1:14 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea, Tejun Heo
sched_ext often needs to migrate tasks across CPUs right before execution
and thus uses the balance path to dispatch tasks from the BPF scheduler.
balance_scx() is called with rq locked and pinned but is passed @rf and thus
allowed to unpin and unlock. Currently, @rf is passed down the call stack so
the rq lock is unpinned just when double locking is needed.
This creates unnecessary complications such as having to explicitly
manipulate lock pinning for core scheduling. We also want to use
dispatch_to_local_dsq_lock() from other paths which are called with rq
locked but unpinned.
rq lock handling in the dispatch path is straightforward outside the
migration implementation and extending the pinning protection down the
callstack doesn't add enough meaningful extra protection to justify the
extra complexity.
Unpin and repin rq lock from the outer balance_scx() and drop @rf passing
and lock pinning handling from the inner functions. UP is updated to call
balance_one() instead of balance_scx() to avoid adding NULL @rf handling to
balance_scx(). As this makes balance_scx() unused in UP, it's put inside a
CONFIG_SMP block.
No functional changes intended outside of lock annotation updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 93 +++++++++++++++++-----------------------------
1 file changed, 34 insertions(+), 59 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 62574d980409..d4f801cd2548 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -851,7 +851,6 @@ static u32 scx_dsp_max_batch;
struct scx_dsp_ctx {
struct rq *rq;
- struct rq_flags *rf;
u32 cursor;
u32 nr_tasks;
struct scx_dsp_buf_ent buf[];
@@ -2040,7 +2039,6 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
/**
* dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked
* @rq: current rq which is locked
- * @rf: rq_flags to use when unlocking @rq
* @src_rq: rq to move task from
* @dst_rq: rq to move task to
*
@@ -2049,20 +2047,16 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
* @rq stays locked isn't important as long as the state is restored after
* dispatch_to_local_dsq_unlock().
*/
-static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
- struct rq *src_rq, struct rq *dst_rq)
+static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq *src_rq,
+ struct rq *dst_rq)
{
- rq_unpin_lock(rq, rf);
-
if (src_rq == dst_rq) {
raw_spin_rq_unlock(rq);
raw_spin_rq_lock(dst_rq);
} else if (rq == src_rq) {
double_lock_balance(rq, dst_rq);
- rq_repin_lock(rq, rf);
} else if (rq == dst_rq) {
double_lock_balance(rq, src_rq);
- rq_repin_lock(rq, rf);
} else {
raw_spin_rq_unlock(rq);
double_rq_lock(src_rq, dst_rq);
@@ -2072,19 +2066,17 @@ static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
/**
* dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock()
* @rq: current rq which is locked
- * @rf: rq_flags to use when unlocking @rq
* @src_rq: rq to move task from
* @dst_rq: rq to move task to
*
* Unlock @src_rq and @dst_rq and ensure that @rq is locked on return.
*/
-static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
- struct rq *src_rq, struct rq *dst_rq)
+static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq *src_rq,
+ struct rq *dst_rq)
{
if (src_rq == dst_rq) {
raw_spin_rq_unlock(dst_rq);
raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
} else if (rq == src_rq) {
double_unlock_balance(rq, dst_rq);
} else if (rq == dst_rq) {
@@ -2092,7 +2084,6 @@ static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
} else {
double_rq_unlock(src_rq, dst_rq);
raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
}
}
#endif /* CONFIG_SMP */
@@ -2132,8 +2123,7 @@ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq)
return true;
}
-static bool consume_remote_task(struct rq *rq, struct rq_flags *rf,
- struct scx_dispatch_q *dsq,
+static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
struct task_struct *p, struct rq *task_rq)
{
bool moved = false;
@@ -2153,9 +2143,7 @@ static bool consume_remote_task(struct rq *rq, struct rq_flags *rf,
p->scx.holding_cpu = raw_smp_processor_id();
raw_spin_unlock(&dsq->lock);
- rq_unpin_lock(rq, rf);
double_lock_balance(rq, task_rq);
- rq_repin_lock(rq, rf);
moved = move_task_to_local_dsq(rq, p, 0);
@@ -2165,13 +2153,11 @@ static bool consume_remote_task(struct rq *rq, struct rq_flags *rf,
}
#else /* CONFIG_SMP */
static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) { return false; }
-static bool consume_remote_task(struct rq *rq, struct rq_flags *rf,
- struct scx_dispatch_q *dsq,
+static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
struct task_struct *p, struct rq *task_rq) { return false; }
#endif /* CONFIG_SMP */
-static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,
- struct scx_dispatch_q *dsq)
+static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
{
struct task_struct *p;
retry:
@@ -2194,7 +2180,7 @@ static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,
}
if (task_can_run_on_remote_rq(p, rq)) {
- if (likely(consume_remote_task(rq, rf, dsq, p, task_rq)))
+ if (likely(consume_remote_task(rq, dsq, p, task_rq)))
return true;
goto retry;
}
@@ -2214,7 +2200,6 @@ enum dispatch_to_local_dsq_ret {
/**
* dispatch_to_local_dsq - Dispatch a task to a local dsq
* @rq: current rq which is locked
- * @rf: rq_flags to use when unlocking @rq
* @dsq_id: destination dsq ID
* @p: task to dispatch
* @enq_flags: %SCX_ENQ_*
@@ -2227,8 +2212,8 @@ enum dispatch_to_local_dsq_ret {
* %SCX_OPSS_DISPATCHING).
*/
static enum dispatch_to_local_dsq_ret
-dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
- struct task_struct *p, u64 enq_flags)
+dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
+ u64 enq_flags)
{
struct rq *src_rq = task_rq(p);
struct rq *dst_rq;
@@ -2277,7 +2262,7 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
/* store_release ensures that dequeue sees the above */
atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
- dispatch_to_local_dsq_lock(rq, rf, src_rq, locked_dst_rq);
+ dispatch_to_local_dsq_lock(rq, src_rq, locked_dst_rq);
/*
* We don't require the BPF scheduler to avoid dispatching to
@@ -2312,7 +2297,7 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
dst_rq->curr->sched_class))
resched_curr(dst_rq);
- dispatch_to_local_dsq_unlock(rq, rf, src_rq, locked_dst_rq);
+ dispatch_to_local_dsq_unlock(rq, src_rq, locked_dst_rq);
return dsp ? DTL_DISPATCHED : DTL_LOST;
}
@@ -2326,7 +2311,6 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
/**
* finish_dispatch - Asynchronously finish dispatching a task
* @rq: current rq which is locked
- * @rf: rq_flags to use when unlocking @rq
* @p: task to finish dispatching
* @qseq_at_dispatch: qseq when @p started getting dispatched
* @dsq_id: destination DSQ ID
@@ -2343,8 +2327,7 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
* was valid in the first place. Make sure that the task is still owned by the
* BPF scheduler and claim the ownership before dispatching.
*/
-static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
- struct task_struct *p,
+static void finish_dispatch(struct rq *rq, struct task_struct *p,
unsigned long qseq_at_dispatch,
u64 dsq_id, u64 enq_flags)
{
@@ -2397,7 +2380,7 @@ static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
- switch (dispatch_to_local_dsq(rq, rf, dsq_id, p, enq_flags)) {
+ switch (dispatch_to_local_dsq(rq, dsq_id, p, enq_flags)) {
case DTL_DISPATCHED:
break;
case DTL_LOST:
@@ -2413,7 +2396,7 @@ static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
}
}
-static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
+static void flush_dispatch_buf(struct rq *rq)
{
struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
u32 u;
@@ -2421,7 +2404,7 @@ static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
for (u = 0; u < dspc->cursor; u++) {
struct scx_dsp_buf_ent *ent = &dspc->buf[u];
- finish_dispatch(rq, rf, ent->task, ent->qseq, ent->dsq_id,
+ finish_dispatch(rq, ent->task, ent->qseq, ent->dsq_id,
ent->enq_flags);
}
@@ -2429,8 +2412,7 @@ static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
dspc->cursor = 0;
}
-static int balance_one(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf, bool local)
+static int balance_one(struct rq *rq, struct task_struct *prev, bool local)
{
struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
bool prev_on_scx = prev->sched_class == &ext_sched_class;
@@ -2484,14 +2466,13 @@ static int balance_one(struct rq *rq, struct task_struct *prev,
if (rq->scx.local_dsq.nr)
goto has_tasks;
- if (consume_dispatch_q(rq, rf, &scx_dsq_global))
+ if (consume_dispatch_q(rq, &scx_dsq_global))
goto has_tasks;
if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq))
goto out;
dspc->rq = rq;
- dspc->rf = rf;
/*
* The dispatch loop. Because flush_dispatch_buf() may drop the rq lock,
@@ -2506,11 +2487,11 @@ static int balance_one(struct rq *rq, struct task_struct *prev,
SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq),
prev_on_scx ? prev : NULL);
- flush_dispatch_buf(rq, rf);
+ flush_dispatch_buf(rq);
if (rq->scx.local_dsq.nr)
goto has_tasks;
- if (consume_dispatch_q(rq, rf, &scx_dsq_global))
+ if (consume_dispatch_q(rq, &scx_dsq_global))
goto has_tasks;
/*
@@ -2537,12 +2518,15 @@ static int balance_one(struct rq *rq, struct task_struct *prev,
return has_tasks;
}
+#ifdef CONFIG_SMP
static int balance_scx(struct rq *rq, struct task_struct *prev,
struct rq_flags *rf)
{
int ret;
- ret = balance_one(rq, prev, rf, true);
+ rq_unpin_lock(rq, rf);
+
+ ret = balance_one(rq, prev, true);
#ifdef CONFIG_SCHED_SMT
/*
@@ -2556,28 +2540,19 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) {
struct rq *srq = cpu_rq(scpu);
- struct rq_flags srf;
struct task_struct *sprev = srq->curr;
- /*
- * While core-scheduling, rq lock is shared among
- * siblings but the debug annotations and rq clock
- * aren't. Do pinning dance to transfer the ownership.
- */
WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
- rq_unpin_lock(rq, rf);
- rq_pin_lock(srq, &srf);
-
update_rq_clock(srq);
- balance_one(srq, sprev, &srf, false);
-
- rq_unpin_lock(srq, &srf);
- rq_repin_lock(rq, rf);
+ balance_one(srq, sprev, false);
}
}
#endif
+ rq_repin_lock(rq, rf);
+
return ret;
}
+#endif
static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
{
@@ -2647,11 +2622,11 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
* balance_scx() must be called before the previous SCX task goes
* through put_prev_task_scx().
*
- * As UP doesn't transfer tasks around, balance_scx() doesn't need @rf.
- * Pass in %NULL.
+ * @rq is pinned and can't be unlocked. As UP doesn't transfer tasks
+ * around, balance_one() doesn't need to.
*/
if (p->scx.flags & (SCX_TASK_QUEUED | SCX_TASK_DEQD_FOR_SLEEP))
- balance_scx(rq, p, NULL);
+ balance_one(rq, p, true);
#endif
update_curr_scx(rq);
@@ -2714,7 +2689,7 @@ static struct task_struct *pick_next_task_scx(struct rq *rq)
#ifndef CONFIG_SMP
/* UP workaround - see the comment at the head of put_prev_task_scx() */
if (unlikely(rq->curr->sched_class != &ext_sched_class))
- balance_scx(rq, rq->curr, NULL);
+ balance_one(rq, rq->curr, true);
#endif
p = first_local_task(rq);
@@ -5577,7 +5552,7 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
if (!scx_kf_allowed(SCX_KF_DISPATCH))
return false;
- flush_dispatch_buf(dspc->rq, dspc->rf);
+ flush_dispatch_buf(dspc->rq);
dsq = find_non_local_dsq(dsq_id);
if (unlikely(!dsq)) {
@@ -5585,7 +5560,7 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
return false;
}
- if (consume_dispatch_q(dspc->rq, dspc->rf, dsq)) {
+ if (consume_dispatch_q(dspc->rq, dsq)) {
/*
* A successfully consumed task can be dequeued before it starts
* running while the CPU is trying to migrate other dispatched
--
2.45.2
* Re: [PATCH 3/6] sched_ext: Unpin and repin rq lock from balance_scx()
2024-07-11 1:14 ` [PATCH 3/6] sched_ext: Unpin and repin rq lock from balance_scx() Tejun Heo
@ 2024-07-11 2:05 ` David Vernet
0 siblings, 0 replies; 14+ messages in thread
From: David Vernet @ 2024-07-11 2:05 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea
On Wed, Jul 10, 2024 at 03:14:00PM -1000, Tejun Heo wrote:
> sched_ext often needs to migrate tasks across CPUs right before execution
> and thus uses the balance path to dispatch tasks from the BPF scheduler.
> balance_scx() is called with rq locked and pinned but is passed @rf and thus
> allowed to unpin and unlock. Currently, @rf is passed down the call stack so
> the rq lock is unpinned just when double locking is needed.
>
> This creates unnecessary complications such as having to explicitly
> manipulate lock pinning for core scheduling. We also want to use
> dispatch_to_local_dsq_lock() from other paths which are called with rq
> locked but unpinned.
>
> rq lock handling in the dispatch path is straightforward outside the
> migration implementation and extending the pinning protection down the
> callstack doesn't add enough meaningful extra protection to justify the
> extra complexity.
>
> Unpin and repin rq lock from the outer balance_scx() and drop @rf passing
> and lock pinning handling from the inner functions. UP is updated to call
> balance_one() instead of balance_scx() to avoid adding NULL @rf handling to
> balance_scx(). AS this makes balance_scx() unused in UP, it's put inside a
> CONFIG_SMP block.
>
> No functional changes intended outside of lock annotation updates.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrea Righi <righi.andrea@gmail.com>
> Cc: David Vernet <void@manifault.com>
Acked-by: David Vernet <void@manifault.com>
* [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP
2024-07-11 1:13 [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
` (2 preceding siblings ...)
2024-07-11 1:14 ` [PATCH 3/6] sched_ext: Unpin and repin rq lock from balance_scx() Tejun Heo
@ 2024-07-11 1:14 ` Tejun Heo
2024-07-11 1:14 ` [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
` (2 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-07-11 1:14 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea, Tejun Heo
SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP, which marks whether
the rq is currently enqueueing for a wakeup. This will be used to implement
direct dispatching to the local DSQ of another CPU.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 13 +++++++++----
kernel/sched/sched.h | 6 ++++--
2 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d4f801cd2548..57d6ea65f857 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1827,6 +1827,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
{
int sticky_cpu = p->scx.sticky_cpu;
+ if (enq_flags & ENQUEUE_WAKEUP)
+ rq->scx.flags |= SCX_RQ_IN_WAKEUP;
+
enq_flags |= rq->scx.extra_enq_flags;
if (sticky_cpu >= 0)
@@ -1843,7 +1846,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
if (p->scx.flags & SCX_TASK_QUEUED) {
WARN_ON_ONCE(!task_runnable(p));
- return;
+ goto out;
}
set_task_runnable(rq, p);
@@ -1858,6 +1861,8 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
touch_core_sched(rq, p);
do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+out:
+ rq->scx.flags &= ~SCX_RQ_IN_WAKEUP;
}
static void ops_dequeue(struct task_struct *p, u64 deq_flags)
@@ -2420,7 +2425,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local)
bool has_tasks = false;
lockdep_assert_rq_held(rq);
- rq->scx.flags |= SCX_RQ_BALANCING;
+ rq->scx.flags |= SCX_RQ_IN_BALANCE;
if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
unlikely(rq->scx.cpu_released)) {
@@ -2514,7 +2519,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local)
has_tasks:
has_tasks = true;
out:
- rq->scx.flags &= ~SCX_RQ_BALANCING;
+ rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
return has_tasks;
}
@@ -5063,7 +5068,7 @@ static bool can_skip_idle_kick(struct rq *rq)
* The race window is small and we don't and can't guarantee that @rq is
* only kicked while idle anyway. Skip only when sure.
*/
- return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_BALANCING);
+ return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE);
}
static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86314a17f1c7..8a0e8052f6b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,8 +737,10 @@ enum scx_rq_flags {
* only while the BPF scheduler considers the CPU to be online.
*/
SCX_RQ_ONLINE = 1 << 0,
- SCX_RQ_BALANCING = 1 << 1,
- SCX_RQ_CAN_STOP_TICK = 1 << 2,
+ SCX_RQ_CAN_STOP_TICK = 1 << 1,
+
+ SCX_RQ_IN_WAKEUP = 1 << 16,
+ SCX_RQ_IN_BALANCE = 1 << 17,
};
struct scx_rq {
--
2.45.2
* [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
2024-07-11 1:13 [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
` (3 preceding siblings ...)
2024-07-11 1:14 ` [PATCH 4/6] sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP Tejun Heo
@ 2024-07-11 1:14 ` Tejun Heo
2024-07-11 2:05 ` David Vernet
2024-07-11 1:14 ` [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues Tejun Heo
2024-07-12 18:22 ` [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
6 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2024-07-11 1:14 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea, Tejun Heo
In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue().
While waking up a task, as ops.select_cpu() can pick any CPU and both
ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
do whatever it wants to do. However, while a task is being enqueued for a
different reason, e.g. after its slice expiration, only ops.enqueue() is
called and there's no way for the BPF scheduler to directly dispatch to the
local DSQ of a remote CPU. This gap in the API forces schedulers into
workarounds which are neither straightforward nor optimal, such as skipping
direct dispatches in such cases.
Implement deferred enqueueing to allow directly dispatching to the local DSQ
of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
safely released, the tasks are taken off the list and queued on the target
local DSQs using dispatch_to_local_dsq().
v2: - Add missing return after queue_balance_callback() in
schedule_deferred(). (David).
- dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
and thus no longer takes @rf. Updated accordingly.
- UP build warning fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
---
kernel/sched/ext.c | 168 ++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 3 +
2 files changed, 153 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 57d6ea65f857..03da2cecb547 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -888,6 +888,7 @@ static struct kobject *scx_root_kobj;
#define CREATE_TRACE_POINTS
#include <trace/events/sched_ext.h>
+static void process_ddsp_deferred_locals(struct rq *rq);
static void scx_bpf_kick_cpu(s32 cpu, u64 flags);
static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
s64 exit_code,
@@ -1362,6 +1363,67 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
return -EPROTO;
}
+static void run_deferred(struct rq *rq)
+{
+ process_ddsp_deferred_locals(rq);
+}
+
+#ifdef CONFIG_SMP
+static void deferred_bal_cb_workfn(struct rq *rq)
+{
+ run_deferred(rq);
+}
+#endif
+
+static void deferred_irq_workfn(struct irq_work *irq_work)
+{
+ struct rq *rq = container_of(irq_work, struct rq, scx.deferred_irq_work);
+
+ raw_spin_rq_lock(rq);
+ run_deferred(rq);
+ raw_spin_rq_unlock(rq);
+}
+
+/**
+ * schedule_deferred - Schedule execution of deferred actions on an rq
+ * @rq: target rq
+ *
+ * Schedule execution of deferred actions on @rq. Must be called with @rq
+ * locked. Deferred actions are executed with @rq locked but unpinned, and thus
+ * can unlock @rq to e.g. migrate tasks to other rqs.
+ */
+static void schedule_deferred(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+
+#ifdef CONFIG_SMP
+ /*
+ * If in the middle of waking up a task, task_woken_scx() will be called
+ * afterwards which will then run the deferred actions, no need to
+ * schedule anything.
+ */
+ if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
+ return;
+
+ /*
+ * If in balance, the balance callbacks will be called before rq lock is
+ * released. Schedule one.
+ */
+ if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
+ queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
+ deferred_bal_cb_workfn);
+ return;
+ }
+#endif
+ /*
+ * No scheduler hooks available. Queue an irq work. They are executed on
+ * IRQ re-enable which may take a bit longer than the scheduler hooks.
+ * The above WAKEUP and BALANCE paths should cover most of the cases and
+ * the time to IRQ re-enable shouldn't be long.
+ */
+ irq_work_queue(&rq->scx.deferred_irq_work);
+}
+
/**
* touch_core_sched - Update timestamp used for core-sched task ordering
* @rq: rq to read clock from, must be locked
@@ -1577,7 +1639,13 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
bool is_local = dsq == &rq->scx.local_dsq;
if (!dsq) {
- WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
+ /*
+ * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals.
+ * Unlinking is all that's needed to cancel.
+ */
+ if (unlikely(!list_empty(&p->scx.dsq_list.node)))
+ list_del_init(&p->scx.dsq_list.node);
+
/*
* When dispatching directly from the BPF scheduler to a local
* DSQ, the task isn't associated with any DSQ but
@@ -1586,6 +1654,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
*/
if (p->scx.holding_cpu >= 0)
p->scx.holding_cpu = -1;
+
return;
}
@@ -1673,17 +1742,6 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
return;
}
- /*
- * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
- * dispatching to the local DSQ of a different CPU requires unlocking
- * the current rq which isn't allowed in the enqueue path. Use
- * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
- */
- if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
- scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
- return;
- }
-
WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
WARN_ON_ONCE(p->scx.ddsp_enq_flags);
@@ -1693,13 +1751,58 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
static void direct_dispatch(struct task_struct *p, u64 enq_flags)
{
+ struct rq *rq = task_rq(p);
struct scx_dispatch_q *dsq;
+ u64 dsq_id = p->scx.ddsp_dsq_id;
+
+ touch_core_sched_dispatch(rq, p);
+
+ p->scx.ddsp_enq_flags |= enq_flags;
+
+ /*
+ * We are in the enqueue path with @rq locked and pinned, and thus can't
+ * double lock a remote rq and enqueue to its local DSQ. For
+ * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer
+ * the enqueue so that it's executed when @rq can be unlocked.
+ */
+ if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+ unsigned long opss;
- touch_core_sched_dispatch(task_rq(p), p);
+ if (cpu == cpu_of(rq)) {
+ dsq_id = SCX_DSQ_LOCAL;
+ goto dispatch;
+ }
+
+ opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK;
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_NONE:
+ break;
+ case SCX_OPSS_QUEUEING:
+ /*
+ * As @p was never passed to the BPF side, _release is
+ * not strictly necessary. Still do it for consistency.
+ */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+ break;
+ default:
+ WARN_ONCE(true, "sched_ext: %s[%d] has invalid ops state 0x%lx in direct_dispatch()",
+ p->comm, p->pid, opss);
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+ break;
+ }
- enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
- dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p);
- dispatch_enqueue(dsq, p, enq_flags);
+ WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
+ list_add_tail(&p->scx.dsq_list.node,
+ &rq->scx.ddsp_deferred_locals);
+ schedule_deferred(rq);
+ return;
+ }
+
+dispatch:
+ dsq = find_dsq_for_dispatch(rq, dsq_id, p);
+ dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}
static bool scx_rq_online(struct rq *rq)
@@ -2601,6 +2704,29 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
}
}
+static void process_ddsp_deferred_locals(struct rq *rq)
+{
+ struct task_struct *p, *tmp;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * Now that @rq can be unlocked, execute the deferred enqueueing of
+ * tasks directly dispatched to the local DSQs of other CPUs. See
+ * direct_dispatch().
+ */
+ list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals,
+ scx.dsq_list.node) {
+ s32 ret;
+
+ list_del_init(&p->scx.dsq_list.node);
+
+ ret = dispatch_to_local_dsq(rq, p->scx.ddsp_dsq_id, p,
+ p->scx.ddsp_enq_flags);
+ WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
+ }
+}
+
static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
{
#ifndef CONFIG_SMP
@@ -3022,6 +3148,11 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
}
}
+static void task_woken_scx(struct rq *rq, struct task_struct *p)
+{
+ run_deferred(rq);
+}
+
static void set_cpus_allowed_scx(struct task_struct *p,
struct affinity_context *ac)
{
@@ -3538,8 +3669,6 @@ bool scx_can_stop_tick(struct rq *rq)
*
* - task_fork/dead: We need fork/dead notifications for all tasks regardless of
* their current sched_class. Call them directly from sched core instead.
- *
- * - task_woken: Unnecessary.
*/
DEFINE_SCHED_CLASS(ext) = {
.enqueue_task = enqueue_task_scx,
@@ -3559,6 +3688,7 @@ DEFINE_SCHED_CLASS(ext) = {
#ifdef CONFIG_SMP
.balance = balance_scx,
.select_task_rq = select_task_rq_scx,
+ .task_woken = task_woken_scx,
.set_cpus_allowed = set_cpus_allowed_scx,
.rq_online = rq_online_scx,
@@ -5263,11 +5393,13 @@ void __init init_sched_ext_class(void)
init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
INIT_LIST_HEAD(&rq->scx.runnable_list);
+ INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL));
+ init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn);
init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
if (cpu_online(cpu))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a0e8052f6b0..be7be54484c0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -746,6 +746,7 @@ enum scx_rq_flags {
struct scx_rq {
struct scx_dispatch_q local_dsq;
struct list_head runnable_list; /* runnable tasks on this rq */
+ struct list_head ddsp_deferred_locals; /* deferred ddsps from enq */
unsigned long ops_qseq;
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
@@ -757,6 +758,8 @@ struct scx_rq {
cpumask_var_t cpus_to_preempt;
cpumask_var_t cpus_to_wait;
unsigned long pnt_seq;
+ struct balance_callback deferred_bal_cb;
+ struct irq_work deferred_irq_work;
struct irq_work kick_cpus_irq_work;
};
#endif /* CONFIG_SCHED_CLASS_EXT */
--
2.45.2
^ permalink raw reply related [flat|nested] 14+ messages in thread

* Re: [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
2024-07-11 1:14 ` [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
@ 2024-07-11 2:05 ` David Vernet
0 siblings, 0 replies; 14+ messages in thread
From: David Vernet @ 2024-07-11 2:05 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea
On Wed, Jul 10, 2024 at 03:14:02PM -1000, Tejun Heo wrote:
> In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
> local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
> and ops.enqueue(), this isn't allowed. This is because dispatching to the
> local DSQ of a remote CPU requires locking both the task's current and new
> rq's and such double locking can't be done directly from ops.enqueue().
>
> While waking up a task, as ops.select_cpu() can pick any CPU and both
> ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
> target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
> do whatever it wants to do. However, while a task is being enqueued for a
> different reason, e.g. after its slice expiration, only ops.enqueue() is
> called and there's no way for the BPF scheduler to directly dispatch to the
> local DSQ of a remote CPU. This gap in API forces schedulers into
> work-arounds which are not straightforward or optimal such as skipping
> direct dispatches in such cases.
>
> Implement deferred enqueueing to allow directly dispatching to the local DSQ
> of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
> temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
> safely released, the tasks are taken off the list and queued on the target
> local DSQs using dispatch_to_local_dsq().
>
> v2: - Add missing return after queue_balance_callback() in
> schedule_deferred(). (David).
>
> - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
> and thus no longer takes @rf. Updated accordingly.
>
> - UP build warning fix.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Tested-by: Andrea Righi <righi.andrea@gmail.com>
> Cc: David Vernet <void@manifault.com>
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Changwoo Min <changwoo@igalia.com>
Acked-by: David Vernet <void@manifault.com>
* [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues
2024-07-11 1:13 [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
` (4 preceding siblings ...)
2024-07-11 1:14 ` [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
@ 2024-07-11 1:14 ` Tejun Heo
2024-07-12 18:22 ` [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
6 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-07-11 1:14 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea, Tejun Heo
Because there was no way to directly dispatch to the local DSQ of a remote
CPU from ops.enqueue(), scx_qmap skipped looking for an idle CPU on !wakeup
enqueues. This restriction was removed and sched_ext now allows
SCX_DSQ_LOCAL_ON verdicts for direct dispatches.
Factor out pick_direct_dispatch_cpu() from ops.select_cpu() and use it to
direct dispatch from ops.enqueue() on !wakeup enqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
---
tools/sched_ext/scx_qmap.bpf.c | 39 ++++++++++++++++++++++++++--------
tools/sched_ext/scx_qmap.c | 5 +++--
2 files changed, 33 insertions(+), 11 deletions(-)
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 27e35066a602..892278f12dce 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -120,11 +120,26 @@ struct {
} cpu_ctx_stor SEC(".maps");
/* Statistics */
-u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
+u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
u64 nr_core_sched_execed;
u32 cpuperf_min, cpuperf_avg, cpuperf_max;
u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
+static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
+{
+ s32 cpu;
+
+ if (p->nr_cpus_allowed == 1 ||
+ scx_bpf_test_and_clear_cpu_idle(prev_cpu))
+ return prev_cpu;
+
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+ if (cpu >= 0)
+ return cpu;
+
+ return -1;
+}
+
s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
{
@@ -137,17 +152,14 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
return -ESRCH;
}
- if (p->nr_cpus_allowed == 1 ||
- scx_bpf_test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = pick_direct_dispatch_cpu(p, prev_cpu);
+
+ if (cpu >= 0) {
tctx->force_local = true;
+ return cpu;
+ } else {
return prev_cpu;
}
-
- cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
- if (cpu >= 0)
- return cpu;
-
- return prev_cpu;
}
static int weight_to_idx(u32 weight)
@@ -172,6 +184,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
u32 pid = p->pid;
int idx = weight_to_idx(p->scx.weight);
void *ring;
+ s32 cpu;
if (p->flags & PF_KTHREAD) {
if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
@@ -207,6 +220,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}
+ /* if !WAKEUP, select_cpu() wasn't called, try direct dispatch */
+ if (!(enq_flags & SCX_ENQ_WAKEUP) &&
+ (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
+ __sync_fetch_and_add(&nr_ddsp_from_enq, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
+ return;
+ }
+
/*
* If the task was re-enqueued due to the CPU being preempted by a
* higher priority scheduling class, just re-enqueue the task directly
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 304f0488a386..c9ca30d62b2b 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -116,10 +116,11 @@ int main(int argc, char **argv)
long nr_enqueued = skel->bss->nr_enqueued;
long nr_dispatched = skel->bss->nr_dispatched;
- printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64"\n",
+ printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
- skel->bss->nr_core_sched_execed);
+ skel->bss->nr_core_sched_execed,
+ skel->bss->nr_ddsp_from_enq);
if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
skel->bss->cpuperf_min,
--
2.45.2
* Re: [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
2024-07-11 1:13 [PATCHSET v2 sched_ext/for-6.11] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches Tejun Heo
` (5 preceding siblings ...)
2024-07-11 1:14 ` [PATCH 6/6] sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues Tejun Heo
@ 2024-07-12 18:22 ` Tejun Heo
6 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-07-12 18:22 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea
On Wed, Jul 10, 2024 at 03:13:57PM -1000, Tejun Heo wrote:
> This is v2. Changes from v1:
> (http://lkml.kernel.org/r/20240709212137.1199269-1-tj@kernel.org)
>
> - 0003-sched_ext-Make-rf-optional-for-dispatch_to_local_dsq.patch is
> replaced with
> 0003-sched_ext-Unpin-and-repin-rq-lock-from-balance_scx.patch.
> balance_scx() now unpins the rq lock and the inner functions don't handle
> rq pinning. This simplifies code quite a bit.
>
> - schedule_deferred() was missing a return after balance_work scheduling.
> Fixed.
>
> - Other misc changes including UP build warning fix.
>
> In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
> local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
> and ops.enqueue(), this isn't allowed. This is because dispatching to the
> local DSQ of a remote CPU requires locking both the task's current and new
> rq's and such double locking can't be done directly from ops.enqueue(). This
> gap in API forces schedulers into work-arounds which are not straightforward
> or optimal such as skipping direct dispatches in such cases.
>
> This patchset implements SCX_DSQ_LOCAL_ON support for direct dispatches.
>
> This patchset contains the following patches:
>
> 0001-sched-Move-struct-balance_callback-definition-upward.patch
> 0002-sched_ext-Open-code-task_linked_on_dsq.patch
> 0003-sched_ext-Unpin-and-repin-rq-lock-from-balance_scx.patch
> 0004-sched_ext-s-SCX_RQ_BALANCING-SCX_RQ_IN_BALANCE-and-a.patch
> 0005-sched_ext-Allow-SCX_DSQ_LOCAL_ON-for-direct-dispatch.patch
> 0006-sched_ext-scx_qmap-Pick-idle-CPU-for-direct-dispatch.patch
Applied to sched_ext/for-6.11.
Thanks.
--
tejun
* [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
2024-07-09 21:21 [PATCHSET " Tejun Heo
@ 2024-07-09 21:21 ` Tejun Heo
2024-07-10 18:51 ` David Vernet
0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2024-07-09 21:21 UTC (permalink / raw)
To: void
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea, Tejun Heo
In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue().
While waking up a task, as ops.select_cpu() can pick any CPU and both
ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
do whatever it wants to do. However, while a task is being enqueued for a
different reason, e.g. after its slice expiration, only ops.enqueue() is
called and there's no way for the BPF scheduler to directly dispatch to the
local DSQ of a remote CPU. This gap in API forces schedulers into
work-arounds which are not straightforward or optimal such as skipping
direct dispatches in such cases.
Implement deferred enqueueing to allow directly dispatching to the local DSQ
of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
safely released, the tasks are taken off the list and queued on the target
local DSQs using dispatch_to_local_dsq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
---
kernel/sched/ext.c | 164 ++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 3 +
2 files changed, 149 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a88c51ce63a5..f4edc37bf89b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -889,6 +889,7 @@ static struct kobject *scx_root_kobj;
#define CREATE_TRACE_POINTS
#include <trace/events/sched_ext.h>
+static void process_ddsp_deferred_locals(struct rq *rq);
static void scx_bpf_kick_cpu(s32 cpu, u64 flags);
static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
s64 exit_code,
@@ -1363,6 +1364,63 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
return -EPROTO;
}
+static void run_deferred(struct rq *rq)
+{
+ process_ddsp_deferred_locals(rq);
+}
+
+static void deferred_bal_cb_workfn(struct rq *rq)
+{
+ run_deferred(rq);
+}
+
+static void deferred_irq_workfn(struct irq_work *irq_work)
+{
+ struct rq *rq = container_of(irq_work, struct rq, scx.deferred_irq_work);
+
+ raw_spin_rq_lock(rq);
+ run_deferred(rq);
+ raw_spin_rq_unlock(rq);
+}
+
+/**
+ * schedule_deferred - Schedule execution of deferred actions on an rq
+ * @rq: target rq
+ *
+ * Schedule execution of deferred actions on @rq. Must be called with @rq
+ * locked. Deferred actions are executed with @rq locked but unpinned, and thus
+ * can unlock @rq to e.g. migrate tasks to other rqs.
+ */
+static void schedule_deferred(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+
+#ifdef CONFIG_SMP
+ /*
+ * If in the middle of waking up a task, task_woken_scx() will be called
+ * afterwards which will then run the deferred actions, no need to
+ * schedule anything.
+ */
+ if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
+ return;
+
+ /*
+ * If in balance, the balance callbacks will be called before rq lock is
+ * released. Schedule one.
+ */
+ if (rq->scx.flags & SCX_RQ_IN_BALANCE)
+ queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
+ deferred_bal_cb_workfn);
+#endif
+ /*
+ * No scheduler hooks available. Queue an irq work. They are executed on
+ * IRQ re-enable which may take a bit longer than the scheduler hooks.
+ * The above WAKEUP and BALANCE paths should cover most of the cases and
+ * the time to IRQ re-enable shouldn't be long.
+ */
+ irq_work_queue(&rq->scx.deferred_irq_work);
+}
+
/**
* touch_core_sched - Update timestamp used for core-sched task ordering
* @rq: rq to read clock from, must be locked
@@ -1578,7 +1636,13 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
bool is_local = dsq == &rq->scx.local_dsq;
if (!dsq) {
- WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
+ /*
+ * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals.
+ * Unlinking is all that's needed to cancel.
+ */
+ if (unlikely(!list_empty(&p->scx.dsq_list.node)))
+ list_del_init(&p->scx.dsq_list.node);
+
/*
* When dispatching directly from the BPF scheduler to a local
* DSQ, the task isn't associated with any DSQ but
@@ -1587,6 +1651,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
*/
if (p->scx.holding_cpu >= 0)
p->scx.holding_cpu = -1;
+
return;
}
@@ -1674,17 +1739,6 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
return;
}
- /*
- * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
- * dispatching to the local DSQ of a different CPU requires unlocking
- * the current rq which isn't allowed in the enqueue path. Use
- * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
- */
- if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
- scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
- return;
- }
-
WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
WARN_ON_ONCE(p->scx.ddsp_enq_flags);
@@ -1694,13 +1748,58 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
static void direct_dispatch(struct task_struct *p, u64 enq_flags)
{
+ struct rq *rq = task_rq(p);
struct scx_dispatch_q *dsq;
+ u64 dsq_id = p->scx.ddsp_dsq_id;
+
+ touch_core_sched_dispatch(rq, p);
+
+ p->scx.ddsp_enq_flags |= enq_flags;
+
+ /*
+ * We are in the enqueue path with @rq locked and pinned, and thus can't
+ * double lock a remote rq and enqueue to its local DSQ. For
+ * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer
+ * the enqueue so that it's executed when @rq can be unlocked.
+ */
+ if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+ unsigned long opss;
+
+ if (cpu == cpu_of(rq)) {
+ dsq_id = SCX_DSQ_LOCAL;
+ goto dispatch;
+ }
+
+ opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK;
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_NONE:
+ break;
+ case SCX_OPSS_QUEUEING:
+ /*
+ * As @p was never passed to the BPF side, _release is
+ * not strictly necessary. Still do it for consistency.
+ */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+ break;
+ default:
+ WARN_ONCE(true, "sched_ext: %s[%d] has invalid ops state 0x%lx in direct_dispatch()",
+ p->comm, p->pid, opss);
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+ break;
+ }
- touch_core_sched_dispatch(task_rq(p), p);
+ WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
+ list_add_tail(&p->scx.dsq_list.node,
+ &rq->scx.ddsp_deferred_locals);
+ schedule_deferred(rq);
+ return;
+ }
- enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
- dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p);
- dispatch_enqueue(dsq, p, enq_flags);
+dispatch:
+ dsq = find_dsq_for_dispatch(rq, dsq_id, p);
+ dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}
static bool scx_rq_online(struct rq *rq)
@@ -2631,6 +2730,29 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
}
}
+static void process_ddsp_deferred_locals(struct rq *rq)
+{
+ struct task_struct *p, *tmp;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * Now that @rq can be unlocked, execute the deferred enqueueing of
+ * tasks directly dispatched to the local DSQs of other CPUs. See
+ * direct_dispatch().
+ */
+ list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals,
+ scx.dsq_list.node) {
+ s32 ret;
+
+ list_del_init(&p->scx.dsq_list.node);
+
+ ret = dispatch_to_local_dsq(rq, NULL, p->scx.ddsp_dsq_id, p,
+ p->scx.ddsp_enq_flags);
+ WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
+ }
+}
+
static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
{
#ifndef CONFIG_SMP
@@ -3052,6 +3174,11 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
}
}
+static void task_woken_scx(struct rq *rq, struct task_struct *p)
+{
+ run_deferred(rq);
+}
+
static void set_cpus_allowed_scx(struct task_struct *p,
struct affinity_context *ac)
{
@@ -3568,8 +3695,6 @@ bool scx_can_stop_tick(struct rq *rq)
*
* - task_fork/dead: We need fork/dead notifications for all tasks regardless of
* their current sched_class. Call them directly from sched core instead.
- *
- * - task_woken: Unnecessary.
*/
DEFINE_SCHED_CLASS(ext) = {
.enqueue_task = enqueue_task_scx,
@@ -3589,6 +3714,7 @@ DEFINE_SCHED_CLASS(ext) = {
#ifdef CONFIG_SMP
.balance = balance_scx,
.select_task_rq = select_task_rq_scx,
+ .task_woken = task_woken_scx,
.set_cpus_allowed = set_cpus_allowed_scx,
.rq_online = rq_online_scx,
@@ -5293,11 +5419,13 @@ void __init init_sched_ext_class(void)
init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
INIT_LIST_HEAD(&rq->scx.runnable_list);
+ INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL));
+ init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn);
init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
if (cpu_online(cpu))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a0e8052f6b0..be7be54484c0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -746,6 +746,7 @@ enum scx_rq_flags {
struct scx_rq {
struct scx_dispatch_q local_dsq;
struct list_head runnable_list; /* runnable tasks on this rq */
+ struct list_head ddsp_deferred_locals; /* deferred ddsps from enq */
unsigned long ops_qseq;
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
@@ -757,6 +758,8 @@ struct scx_rq {
cpumask_var_t cpus_to_preempt;
cpumask_var_t cpus_to_wait;
unsigned long pnt_seq;
+ struct balance_callback deferred_bal_cb;
+ struct irq_work deferred_irq_work;
struct irq_work kick_cpus_irq_work;
};
#endif /* CONFIG_SCHED_CLASS_EXT */
--
2.45.2
* Re: [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
2024-07-09 21:21 ` [PATCH 5/6] " Tejun Heo
@ 2024-07-10 18:51 ` David Vernet
2024-07-10 23:43 ` Tejun Heo
0 siblings, 1 reply; 14+ messages in thread
From: David Vernet @ 2024-07-10 18:51 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea
On Tue, Jul 09, 2024 at 11:21:11AM -1000, Tejun Heo wrote:
> In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
> local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
> and ops.enqueue(), this isn't allowed. This is because dispatching to the
> local DSQ of a remote CPU requires locking both the task's current and new
> rq's and such double locking can't be done directly from ops.enqueue().
>
> While waking up a task, as ops.select_cpu() can pick any CPU and both
> ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
> target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
> do whatever it wants to do. However, while a task is being enqueued for a
> different reason, e.g. after its slice expiration, only ops.enqueue() is
> called and there's no way for the BPF scheduler to directly dispatch to the
> local DSQ of a remote CPU. This gap in API forces schedulers into
> work-arounds which are not straightforward or optimal such as skipping
> direct dispatches in such cases.
>
> Implement deferred enqueueing to allow directly dispatching to the local DSQ
> of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
> temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
> safely released, the tasks are taken off the list and queued on the target
> local DSQs using dispatch_to_local_dsq().
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Changwoo Min <changwoo@igalia.com>
> Cc: Andrea Righi <righi.andrea@gmail.com>
Hi Tejun,
Overall this looks great. It's really not very complicated at all which is
great to see, but will be a big usability improvement for schedulers so it's a
clear win. Just left a couple comments below.
> ---
> kernel/sched/ext.c | 164 ++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/sched.h | 3 +
> 2 files changed, 149 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a88c51ce63a5..f4edc37bf89b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -889,6 +889,7 @@ static struct kobject *scx_root_kobj;
> #define CREATE_TRACE_POINTS
> #include <trace/events/sched_ext.h>
>
> +static void process_ddsp_deferred_locals(struct rq *rq);
> static void scx_bpf_kick_cpu(s32 cpu, u64 flags);
> static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
> s64 exit_code,
> @@ -1363,6 +1364,63 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
> return -EPROTO;
> }
>
> +static void run_deferred(struct rq *rq)
> +{
> + process_ddsp_deferred_locals(rq);
> +}
> +
> +static void deferred_bal_cb_workfn(struct rq *rq)
> +{
> + run_deferred(rq);
> +}
> +
> +static void deferred_irq_workfn(struct irq_work *irq_work)
> +{
> + struct rq *rq = container_of(irq_work, struct rq, scx.deferred_irq_work);
> +
> + raw_spin_rq_lock(rq);
> + run_deferred(rq);
> + raw_spin_rq_unlock(rq);
> +}
> +
> +/**
> + * schedule_deferred - Schedule execution of deferred actions on an rq
> + * @rq: target rq
> + *
> + * Schedule execution of deferred actions on @rq. Must be called with @rq
> + * locked. Deferred actions are executed with @rq locked but unpinned, and thus
> + * can unlock @rq to e.g. migrate tasks to other rqs.
> + */
> +static void schedule_deferred(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> +
> +#ifdef CONFIG_SMP
> + /*
> + * If in the middle of waking up a task, task_woken_scx() will be called
> + * afterwards which will then run the deferred actions, no need to
> + * schedule anything.
> + */
> + if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
> + return;
> +
> + /*
> + * If in balance, the balance callbacks will be called before rq lock is
> + * released. Schedule one.
> + */
> + if (rq->scx.flags & SCX_RQ_IN_BALANCE)
> + queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> + deferred_bal_cb_workfn);
Should we be returning above if we're able to use a balance cb?
Also, should we maybe add a WARN_ON_ONCE(rq->balance_callback ==
&balance_push_callback)? I could see that being unnecessary given that we
should never be hitting this path when the CPU is deactivated anyway, but the
push callback thing is a kind of extraneous implementation detail so it might be
worth guarding against it just in case.
> +#endif
> + /*
> + * No scheduler hooks available. Queue an irq work. They are executed on
> + * IRQ re-enable which may take a bit longer than the scheduler hooks.
> + * The above WAKEUP and BALANCE paths should cover most of the cases and
> + * the time to IRQ re-enable shouldn't be long.
> + */
> + irq_work_queue(&rq->scx.deferred_irq_work);
> +}
> +
> /**
> * touch_core_sched - Update timestamp used for core-sched task ordering
> * @rq: rq to read clock from, must be locked
> @@ -1578,7 +1636,13 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
> bool is_local = dsq == &rq->scx.local_dsq;
>
> if (!dsq) {
> - WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
> + /*
> + * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals.
> + * Unlinking is all that's needed to cancel.
> + */
> + if (unlikely(!list_empty(&p->scx.dsq_list.node)))
> + list_del_init(&p->scx.dsq_list.node);
> +
> /*
> * When dispatching directly from the BPF scheduler to a local
> * DSQ, the task isn't associated with any DSQ but
> @@ -1587,6 +1651,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
> */
> if (p->scx.holding_cpu >= 0)
> p->scx.holding_cpu = -1;
> +
> return;
> }
>
> @@ -1674,17 +1739,6 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
> return;
> }
>
> - /*
> - * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
> - * dispatching to the local DSQ of a different CPU requires unlocking
> - * the current rq which isn't allowed in the enqueue path. Use
> - * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
> - */
> - if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
> - scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
> - return;
> - }
> -
> WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
> WARN_ON_ONCE(p->scx.ddsp_enq_flags);
>
> @@ -1694,13 +1748,58 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
>
> static void direct_dispatch(struct task_struct *p, u64 enq_flags)
> {
> + struct rq *rq = task_rq(p);
> struct scx_dispatch_q *dsq;
> + u64 dsq_id = p->scx.ddsp_dsq_id;
> +
> + touch_core_sched_dispatch(rq, p);
> +
> + p->scx.ddsp_enq_flags |= enq_flags;
> +
> + /*
> + * We are in the enqueue path with @rq locked and pinned, and thus can't
> + * double lock a remote rq and enqueue to its local DSQ. For
> + * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer
> + * the enqueue so that it's executed when @rq can be unlocked.
> + */
> + if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
> + s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
> + unsigned long opss;
> +
> + if (cpu == cpu_of(rq)) {
> + dsq_id = SCX_DSQ_LOCAL;
> + goto dispatch;
> + }
> +
> + opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK;
> +
> + switch (opss & SCX_OPSS_STATE_MASK) {
> + case SCX_OPSS_NONE:
> + break;
> + case SCX_OPSS_QUEUEING:
> + /*
> + * As @p was never passed to the BPF side, _release is
> + * not strictly necessary. Still do it for consistency.
> + */
> + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
> + break;
> + default:
> + WARN_ONCE(true, "sched_ext: %s[%d] has invalid ops state 0x%lx in direct_dispatch()",
> + p->comm, p->pid, opss);
> + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
> + break;
> + }
>
> - touch_core_sched_dispatch(task_rq(p), p);
> + WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
> + list_add_tail(&p->scx.dsq_list.node,
> + &rq->scx.ddsp_deferred_locals);
> + schedule_deferred(rq);
> + return;
> + }
>
> - enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
> - dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p);
> - dispatch_enqueue(dsq, p, enq_flags);
> +dispatch:
> + dsq = find_dsq_for_dispatch(rq, dsq_id, p);
> + dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
> }
>
> static bool scx_rq_online(struct rq *rq)
> @@ -2631,6 +2730,29 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
> }
> }
>
> +static void process_ddsp_deferred_locals(struct rq *rq)
> +{
> + struct task_struct *p, *tmp;
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * Now that @rq can be unlocked, execute the deferred enqueueing of
> + * tasks directly dispatched to the local DSQs of other CPUs. See
> + * direct_dispatch().
> + */
> + list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals,
> + scx.dsq_list.node) {
> + s32 ret;
> +
> + list_del_init(&p->scx.dsq_list.node);
> +
> + ret = dispatch_to_local_dsq(rq, NULL, p->scx.ddsp_dsq_id, p,
> + p->scx.ddsp_enq_flags);
> + WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
> + }
As mentioned in the other thread, it might be simplest to just pin and unpin
around this loop here to keep the logic simpler in the callee.
> +}
> +
> static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
> {
> #ifndef CONFIG_SMP
> @@ -3052,6 +3174,11 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
> }
> }
>
> +static void task_woken_scx(struct rq *rq, struct task_struct *p)
> +{
> + run_deferred(rq);
> +}
> +
> static void set_cpus_allowed_scx(struct task_struct *p,
> struct affinity_context *ac)
> {
> @@ -3568,8 +3695,6 @@ bool scx_can_stop_tick(struct rq *rq)
> *
> * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
> * their current sched_class. Call them directly from sched core instead.
> - *
> - * - task_woken: Unnecessary.
> */
> DEFINE_SCHED_CLASS(ext) = {
> .enqueue_task = enqueue_task_scx,
> @@ -3589,6 +3714,7 @@ DEFINE_SCHED_CLASS(ext) = {
> #ifdef CONFIG_SMP
> .balance = balance_scx,
> .select_task_rq = select_task_rq_scx,
> + .task_woken = task_woken_scx,
Should we update the comment in the caller in core.c given that rq is no longer
only used for statistics tracking?
> .set_cpus_allowed = set_cpus_allowed_scx,
>
> .rq_online = rq_online_scx,
[...]
Thanks,
David
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 5/6] sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
2024-07-10 18:51 ` David Vernet
@ 2024-07-10 23:43 ` Tejun Heo
0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-07-10 23:43 UTC (permalink / raw)
To: David Vernet
Cc: linux-kernel, kernel-team, schatzberg.dan, mingo, peterz,
changwoo, righi.andrea
Hello,
On Wed, Jul 10, 2024 at 01:51:35PM -0500, David Vernet wrote:
...
> > + /*
> > + * If in balance, the balance callbacks will be called before rq lock is
> > + * released. Schedule one.
> > + */
> > + if (rq->scx.flags & SCX_RQ_IN_BALANCE)
> > + queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> > + deferred_bal_cb_workfn);
>
> Should we be returning above if we're able to use a balance cb?
Oh yeah, it definitely should.
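Something like the following, i.e. (untested, just to illustrate the intent --
the early return keeps the direct write to the local DSQ from also running):

```c
	/*
	 * If in balance, the balance callbacks will be called before the rq
	 * lock is released. Schedule one and punt the rest to it.
	 */
	if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
		queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
				       deferred_bal_cb_workfn);
		return;
	}
```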
> Also, should we maybe add a WARN_ON_ONCE(rq->balance_callback ==
> &balance_push_callback)? I could see that being unnecessary given that we
> should never be hitting this path when the CPU is deactivated anyways, but the
> push callback thing is kind of an extraneous implementation detail, so it
> might be worth guarding against it just in case.
I'm not sure about that. It feels like a sched core detail which is better
left within sched core code rather than pushing to queue_balance_callback()
users. The deferred execution mechanism itself doesn't really care about how
the callback is run or the state of the CPU.
...
> > +static void process_ddsp_deferred_locals(struct rq *rq)
> > +{
> > + struct task_struct *p, *tmp;
> > +
> > + lockdep_assert_rq_held(rq);
> > +
> > + /*
> > + * Now that @rq can be unlocked, execute the deferred enqueueing of
> > + * tasks directly dispatched to the local DSQs of other CPUs. See
> > + * direct_dispatch().
> > + */
> > + list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals,
> > + scx.dsq_list.node) {
> > + s32 ret;
> > +
> > + list_del_init(&p->scx.dsq_list.node);
> > +
> > + ret = dispatch_to_local_dsq(rq, NULL, p->scx.ddsp_dsq_id, p,
> > + p->scx.ddsp_enq_flags);
> > + WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
> > + }
>
> As mentioned in the other thread, it might be simplest to just pin and unpin
> around this loop here to keep the logic simpler in the callee.
Yeah, I'm moving the unpinning/repinning to the outer balance function.
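So, roughly along these lines in the balance path (untested sketch; assumes the
rq_flags are available there as @rf, as they are for the ->balance() callback):

```c
	/*
	 * The deferred enqueues may drop and re-acquire the rq lock, which
	 * isn't allowed while @rq is pinned. Unpin around the whole loop so
	 * that process_ddsp_deferred_locals() doesn't have to care.
	 */
	rq_unpin_lock(rq, rf);
	process_ddsp_deferred_locals(rq);
	rq_repin_lock(rq, rf);
```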
> > @@ -3589,6 +3714,7 @@ DEFINE_SCHED_CLASS(ext) = {
> > #ifdef CONFIG_SMP
> > .balance = balance_scx,
> > .select_task_rq = select_task_rq_scx,
> > + .task_woken = task_woken_scx,
>
> Should we update the comment in the caller in core.c given that rq is no longer
> only used for statistics tracking?
task_woken_rt() is already doing push_rt_task(), which is kinda similar to
what scx is doing, so the comment was already stale. Yeah, please go ahead
and send a patch.
Thanks.
--
tejun