* [PATCH 01/12] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Sleepable kfuncs don't need their own kfunc set as each one is tagged with
KF_SLEEPABLE. Rename the set to scx_kfunc_set_unlocked, indicating that the rq
lock is not held, and relocate it right above the "any" set. This will be used
to add kfuncs that are allowed to be called from SYSCALL but not TRACING.
No functional changes intended.
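The set is registered for BPF_PROG_TYPE_SYSCALL below, so, for illustration, a
minimal sketch of creating a DSQ from a SEC("syscall") program follows. This is
not part of the patch; it assumes scx/common.bpf.h declares scx_bpf_create_dsq()
and MY_DSQ_ID is a made-up user DSQ ID:

  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  #define MY_DSQ_ID 0     /* hypothetical user DSQ ID */

  SEC("syscall")
  int create_my_dsq(void *ctx)
  {
          /* sleepable SYSCALL prog context, so the KF_SLEEPABLE kfunc is allowed */
          return scx_bpf_create_dsq(MY_DSQ_ID, -1 /* NUMA_NO_NODE */);
  }
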
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 66 +++++++++++++++++++++++-----------------------
1 file changed, 33 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 450389c2073d..e7c6e824f875 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5393,35 +5393,6 @@ void __init init_sched_ext_class(void)
__bpf_kfunc_start_defs();
-/**
- * scx_bpf_create_dsq - Create a custom DSQ
- * @dsq_id: DSQ to create
- * @node: NUMA node to allocate from
- *
- * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable
- * scx callback, and any BPF_PROG_TYPE_SYSCALL prog.
- */
-__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
-{
- if (unlikely(node >= (int)nr_node_ids ||
- (node < 0 && node != NUMA_NO_NODE)))
- return -EINVAL;
- return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
-}
-
-__bpf_kfunc_end_defs();
-
-BTF_KFUNCS_START(scx_kfunc_ids_sleepable)
-BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
-BTF_KFUNCS_END(scx_kfunc_ids_sleepable)
-
-static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = {
- .owner = THIS_MODULE,
- .set = &scx_kfunc_ids_sleepable,
-};
-
-__bpf_kfunc_start_defs();
-
/**
* scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
* @p: task_struct to select a CPU for
@@ -5764,6 +5735,35 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = {
__bpf_kfunc_start_defs();
+/**
+ * scx_bpf_create_dsq - Create a custom DSQ
+ * @dsq_id: DSQ to create
+ * @node: NUMA node to allocate from
+ *
+ * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable
+ * scx callback, and any BPF_PROG_TYPE_SYSCALL prog.
+ */
+__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
+{
+ if (unlikely(node >= (int)nr_node_ids ||
+ (node < 0 && node != NUMA_NO_NODE)))
+ return -EINVAL;
+ return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
+BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
+BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_unlocked,
+};
+
+__bpf_kfunc_start_defs();
+
/**
* scx_bpf_kick_cpu - Trigger reschedule on a CPU
* @cpu: cpu to kick
@@ -6460,10 +6460,6 @@ static int __init scx_init(void)
* check using scx_kf_allowed().
*/
if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
- &scx_kfunc_set_sleepable)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
- &scx_kfunc_set_sleepable)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_select_cpu)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_enqueue_dispatch)) ||
@@ -6471,6 +6467,10 @@ static int __init scx_init(void)
&scx_kfunc_set_dispatch)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_cpu_release)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_unlocked)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
+ &scx_kfunc_set_unlocked)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_any)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
--
2.46.0

* [PATCH 02/12] sched_ext: Refactor consume_remote_task()
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
The tricky p->scx.holding_cpu handling was split across the
consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:
- All the tricky parts are now in the new unlink_dsq_and_lock_src_rq() with
  consolidated documentation.
- move_task_to_local_dsq() now implements straightforward task migration,
  making it easier to use in other places.
- dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). Its
  usage is updated accordingly. This makes the local and remote cases more
  symmetric.
No functional changes intended.
v2: s/task_rq/src_rq/ for consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 145 ++++++++++++++++++++++++---------------------
1 file changed, 76 insertions(+), 69 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e7c6e824f875..e148c7c5341d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2107,49 +2107,13 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
* @src_rq: rq to move the task from, locked on entry, released on return
* @dst_rq: rq to move the task into, locked on return
*
- * Move @p which is currently on @src_rq to @dst_rq's local DSQ. The caller
- * must:
- *
- * 1. Start with exclusive access to @p either through its DSQ lock or
- * %SCX_OPSS_DISPATCHING flag.
- *
- * 2. Set @p->scx.holding_cpu to raw_smp_processor_id().
- *
- * 3. Remember task_rq(@p) as @src_rq. Release the exclusive access so that we
- * don't deadlock with dequeue.
- *
- * 4. Lock @src_rq from #3.
- *
- * 5. Call this function.
- *
- * Returns %true if @p was successfully moved. %false after racing dequeue and
- * losing. On return, @src_rq is unlocked and @dst_rq is locked.
+ * Move @p which is currently on @src_rq to @dst_rq's local DSQ.
*/
-static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
+static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
struct rq *src_rq, struct rq *dst_rq)
{
lockdep_assert_rq_held(src_rq);
- /*
- * If dequeue got to @p while we were trying to lock @src_rq, it'd have
- * cleared @p->scx.holding_cpu to -1. While other cpus may have updated
- * it to different values afterwards, as this operation can't be
- * preempted or recurse, @p->scx.holding_cpu can never become
- * raw_smp_processor_id() again before we're done. Thus, we can tell
- * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
- * still raw_smp_processor_id().
- *
- * @p->rq couldn't have changed if we're still the holding cpu.
- *
- * See dispatch_dequeue() for the counterpart.
- */
- if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()) ||
- WARN_ON_ONCE(src_rq != task_rq(p))) {
- raw_spin_rq_unlock(src_rq);
- raw_spin_rq_lock(dst_rq);
- return false;
- }
-
/* the following marks @p MIGRATING which excludes dequeue */
deactivate_task(src_rq, p, 0);
set_task_cpu(p, cpu_of(dst_rq));
@@ -2168,8 +2132,6 @@ static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
dst_rq->scx.extra_enq_flags = enq_flags;
activate_task(dst_rq, p, 0);
dst_rq->scx.extra_enq_flags = 0;
-
- return true;
}
#endif /* CONFIG_SMP */
@@ -2234,28 +2196,69 @@ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq,
return true;
}
-static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
- struct task_struct *p, struct rq *task_rq)
+/**
+ * unlink_dsq_and_lock_src_rq() - Unlink task from its DSQ and lock its task_rq
+ * @p: target task
+ * @dsq: locked DSQ @p is currently on
+ * @src_rq: rq @p is currently on, stable with @dsq locked
+ *
+ * Called with @dsq locked but no rq's locked. We want to move @p to a different
+ * DSQ, including any local DSQ, but are not locking @src_rq. Locking @src_rq is
+ * required when transferring into a local DSQ. Even when transferring into a
+ * non-local DSQ, it's better to use the same mechanism to protect against
+ * dequeues and maintain the invariant that @p->scx.dsq can only change while
+ * @src_rq is locked, which e.g. scx_dump_task() depends on.
+ *
+ * We want to grab @src_rq but that can deadlock if we try while locking @dsq,
+ * so we want to unlink @p from @dsq, drop its lock and then lock @src_rq. As
+ * this may race with dequeue, which can't drop the rq lock or fail, do a little
+ * dancing from our side.
+ *
+ * @p->scx.holding_cpu is set to this CPU before @dsq is unlocked. If @p gets
+ * dequeued after we unlock @dsq but before locking @src_rq, the holding_cpu
+ * would be cleared to -1. While other cpus may have updated it to different
+ * values afterwards, as this operation can't be preempted or recurse, the
+ * holding_cpu can never become this CPU again before we're done. Thus, we can
+ * tell whether we lost to dequeue by testing whether the holding_cpu still
+ * points to this CPU. See dispatch_dequeue() for the counterpart.
+ *
+ * On return, @dsq is unlocked and @src_rq is locked. Returns %true if @p is
+ * still valid. %false if lost to dequeue.
+ */
+static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
+ struct scx_dispatch_q *dsq,
+ struct rq *src_rq)
{
- lockdep_assert_held(&dsq->lock); /* released on return */
+ s32 cpu = raw_smp_processor_id();
+
+ lockdep_assert_held(&dsq->lock);
- /*
- * @dsq is locked and @p is on a remote rq. @p is currently protected by
- * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
- * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
- * rq lock or fail, do a little dancing from our side. See
- * move_task_to_local_dsq().
- */
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
task_unlink_from_dsq(p, dsq);
dsq_mod_nr(dsq, -1);
- p->scx.holding_cpu = raw_smp_processor_id();
+ p->scx.holding_cpu = cpu;
+
raw_spin_unlock(&dsq->lock);
+ raw_spin_rq_lock(src_rq);
- raw_spin_rq_unlock(rq);
- raw_spin_rq_lock(task_rq);
+ /* task_rq couldn't have changed if we're still the holding cpu */
+ return likely(p->scx.holding_cpu == cpu) &&
+ !WARN_ON_ONCE(src_rq != task_rq(p));
+}
- return move_task_to_local_dsq(p, 0, task_rq, rq);
+static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
+ struct task_struct *p, struct rq *src_rq)
+{
+ raw_spin_rq_unlock(this_rq);
+
+ if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
+ move_task_to_local_dsq(p, 0, src_rq, this_rq);
+ return true;
+ } else {
+ raw_spin_rq_unlock(src_rq);
+ raw_spin_rq_lock(this_rq);
+ return false;
+ }
}
#else /* CONFIG_SMP */
static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; }
@@ -2359,7 +2362,8 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
* As DISPATCHING guarantees that @p is wholly ours, we can
* pretend that we're moving from a DSQ and use the same
* mechanism - mark the task under transfer with holding_cpu,
- * release DISPATCHING and then follow the same protocol.
+ * release DISPATCHING and then follow the same protocol. See
+ * unlink_dsq_and_lock_src_rq().
*/
p->scx.holding_cpu = raw_smp_processor_id();
@@ -2372,28 +2376,31 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
raw_spin_rq_lock(src_rq);
}
- if (src_rq == dst_rq) {
+ /* task_rq couldn't have changed if we're still the holding cpu */
+ dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
+ !WARN_ON_ONCE(src_rq != task_rq(p));
+
+ if (likely(dsp)) {
/*
- * As @p is staying on the same rq, there's no need to
+ * If @p is staying on the same rq, there's no need to
* go through the full deactivate/activate cycle.
* Optimize by abbreviating the operations in
* move_task_to_local_dsq().
*/
- dsp = p->scx.holding_cpu == raw_smp_processor_id();
- if (likely(dsp)) {
+ if (src_rq == dst_rq) {
p->scx.holding_cpu = -1;
- dispatch_enqueue(&dst_rq->scx.local_dsq, p,
- enq_flags);
+ dispatch_enqueue(&dst_rq->scx.local_dsq,
+ p, enq_flags);
+ } else {
+ move_task_to_local_dsq(p, enq_flags,
+ src_rq, dst_rq);
}
- } else {
- dsp = move_task_to_local_dsq(p, enq_flags,
- src_rq, dst_rq);
- }
- /* if the destination CPU is idle, wake it up */
- if (dsp && sched_class_above(p->sched_class,
- dst_rq->curr->sched_class))
- resched_curr(dst_rq);
+ /* if the destination CPU is idle, wake it up */
+ if (sched_class_above(p->sched_class,
+ dst_rq->curr->sched_class))
+ resched_curr(dst_rq);
+ }
/* switch back to @rq lock */
if (rq != dst_rq) {
--
2.46.0

* [PATCH 03/12] sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
Instead, each caller handles SCX_DSQ_LOCAL_ON before calling it. Move the
SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
code in direct_dispatch() and dispatch_to_local_dsq().
No functional changes intended.
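For context, the SCX_DSQ_LOCAL_ON verdicts in question come from a BPF scheduler
targeting a specific CPU's local DSQ. A hedged sketch of such a verdict from
ops.enqueue() (assumes scx/common.bpf.h; example_enqueue and SLICE_NS are made-up
names, not part of this patch):

  #include <scx/common.bpf.h>

  #define SLICE_NS 5000000ULL     /* hypothetical 5ms slice */

  void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
  {
          s32 cpu = scx_bpf_task_cpu(p);

          /*
           * SCX_DSQ_LOCAL_ON | cpu names @cpu's local DSQ. With this change,
           * find_dsq_for_dispatch() resolves it to that CPU's local DSQ
           * instead of each caller special-casing it.
           */
          scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SLICE_NS, enq_flags);
  }
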
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 90 +++++++++++++++++++++-------------------------
1 file changed, 40 insertions(+), 50 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e148c7c5341d..1d35298ee561 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1724,6 +1724,15 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
if (dsq_id == SCX_DSQ_LOCAL)
return &rq->scx.local_dsq;
+ if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
+ return &scx_dsq_global;
+
+ return &cpu_rq(cpu)->scx.local_dsq;
+ }
+
dsq = find_non_local_dsq(dsq_id);
if (unlikely(!dsq)) {
scx_ops_error("non-existent DSQ 0x%llx for %s[%d]",
@@ -1767,8 +1776,8 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
static void direct_dispatch(struct task_struct *p, u64 enq_flags)
{
struct rq *rq = task_rq(p);
- struct scx_dispatch_q *dsq;
- u64 dsq_id = p->scx.ddsp_dsq_id;
+ struct scx_dispatch_q *dsq =
+ find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p);
touch_core_sched_dispatch(rq, p);
@@ -1780,15 +1789,9 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags)
* DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer
* the enqueue so that it's executed when @rq can be unlocked.
*/
- if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
- s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+ if (dsq->id == SCX_DSQ_LOCAL && dsq != &rq->scx.local_dsq) {
unsigned long opss;
- if (cpu == cpu_of(rq)) {
- dsq_id = SCX_DSQ_LOCAL;
- goto dispatch;
- }
-
opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK;
switch (opss & SCX_OPSS_STATE_MASK) {
@@ -1815,8 +1818,6 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags)
return;
}
-dispatch:
- dsq = find_dsq_for_dispatch(rq, dsq_id, p);
dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}
@@ -2301,51 +2302,38 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
enum dispatch_to_local_dsq_ret {
DTL_DISPATCHED, /* successfully dispatched */
DTL_LOST, /* lost race to dequeue */
- DTL_NOT_LOCAL, /* destination is not a local DSQ */
DTL_INVALID, /* invalid local dsq_id */
};
/**
* dispatch_to_local_dsq - Dispatch a task to a local dsq
* @rq: current rq which is locked
- * @dsq_id: destination dsq ID
+ * @dst_dsq: destination DSQ
* @p: task to dispatch
* @enq_flags: %SCX_ENQ_*
*
- * We're holding @rq lock and want to dispatch @p to the local DSQ identified by
- * @dsq_id. This function performs all the synchronization dancing needed
- * because local DSQs are protected with rq locks.
+ * We're holding @rq lock and want to dispatch @p to @dst_dsq which is a local
+ * DSQ. This function performs all the synchronization dancing needed because
+ * local DSQs are protected with rq locks.
*
* The caller must have exclusive ownership of @p (e.g. through
* %SCX_OPSS_DISPATCHING).
*/
static enum dispatch_to_local_dsq_ret
-dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
- u64 enq_flags)
+dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
+ struct task_struct *p, u64 enq_flags)
{
struct rq *src_rq = task_rq(p);
- struct rq *dst_rq;
+ struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
/*
* We're synchronized against dequeue through DISPATCHING. As @p can't
* be dequeued, its task_rq and cpus_allowed are stable too.
+ *
+ * If dispatching to @rq that @p is already on, no lock dancing needed.
*/
- if (dsq_id == SCX_DSQ_LOCAL) {
- dst_rq = rq;
- } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
- s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
-
- if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
- return DTL_INVALID;
- dst_rq = cpu_rq(cpu);
- } else {
- return DTL_NOT_LOCAL;
- }
-
- /* if dispatching to @rq that @p is already on, no lock dancing needed */
if (rq == src_rq && rq == dst_rq) {
- dispatch_enqueue(&dst_rq->scx.local_dsq, p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
+ dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
return DTL_DISPATCHED;
}
@@ -2487,19 +2475,21 @@ static void finish_dispatch(struct rq *rq, struct task_struct *p,
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
- switch (dispatch_to_local_dsq(rq, dsq_id, p, enq_flags)) {
- case DTL_DISPATCHED:
- break;
- case DTL_LOST:
- break;
- case DTL_INVALID:
- dsq_id = SCX_DSQ_GLOBAL;
- fallthrough;
- case DTL_NOT_LOCAL:
- dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()),
- dsq_id, p);
+ dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p);
+
+ if (dsq->id == SCX_DSQ_LOCAL) {
+ switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) {
+ case DTL_DISPATCHED:
+ break;
+ case DTL_LOST:
+ break;
+ case DTL_INVALID:
+ dispatch_enqueue(&scx_dsq_global, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
+ break;
+ }
+ } else {
dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- break;
}
}
@@ -2716,13 +2706,13 @@ static void process_ddsp_deferred_locals(struct rq *rq)
*/
while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals,
struct task_struct, scx.dsq_list.node))) {
- s32 ret;
+ struct scx_dispatch_q *dsq;
list_del_init(&p->scx.dsq_list.node);
- ret = dispatch_to_local_dsq(rq, p->scx.ddsp_dsq_id, p,
- p->scx.ddsp_enq_flags);
- WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
+ dsq = find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p);
+ if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
+ dispatch_to_local_dsq(rq, dsq, p, p->scx.ddsp_enq_flags);
}
}
--
2.46.0

* [PATCH 04/12] sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
With the preceding update, the only return value which makes a meaningful
difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
back to the global DSQ while the other, process_ddsp_deferred_locals(),
doesn't do anything.
It should always fall back to the global DSQ. Move the global DSQ fallback
into dispatch_to_local_dsq() and remove the return value.
v2: Patch title and description updated to reflect the behavior fix for
process_ddsp_deferred_locals().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 41 ++++++++++-------------------------------
1 file changed, 10 insertions(+), 31 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1d35298ee561..ec61ab676517 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2299,12 +2299,6 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
return false;
}
-enum dispatch_to_local_dsq_ret {
- DTL_DISPATCHED, /* successfully dispatched */
- DTL_LOST, /* lost race to dequeue */
- DTL_INVALID, /* invalid local dsq_id */
-};
-
/**
* dispatch_to_local_dsq - Dispatch a task to a local dsq
* @rq: current rq which is locked
@@ -2319,9 +2313,8 @@ enum dispatch_to_local_dsq_ret {
* The caller must have exclusive ownership of @p (e.g. through
* %SCX_OPSS_DISPATCHING).
*/
-static enum dispatch_to_local_dsq_ret
-dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
- struct task_struct *p, u64 enq_flags)
+static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
+ struct task_struct *p, u64 enq_flags)
{
struct rq *src_rq = task_rq(p);
struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
@@ -2334,13 +2327,11 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
*/
if (rq == src_rq && rq == dst_rq) {
dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- return DTL_DISPATCHED;
+ return;
}
#ifdef CONFIG_SMP
if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
- bool dsp;
-
/*
* @p is on a possibly remote @src_rq which we need to lock to
* move the task. If dequeue is in progress, it'd be locking
@@ -2365,10 +2356,8 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
}
/* task_rq couldn't have changed if we're still the holding cpu */
- dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
- !WARN_ON_ONCE(src_rq != task_rq(p));
-
- if (likely(dsp)) {
+ if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
+ !WARN_ON_ONCE(src_rq != task_rq(p))) {
/*
* If @p is staying on the same rq, there's no need to
* go through the full deactivate/activate cycle.
@@ -2396,11 +2385,11 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
raw_spin_rq_lock(rq);
}
- return dsp ? DTL_DISPATCHED : DTL_LOST;
+ return;
}
#endif /* CONFIG_SMP */
- return DTL_INVALID;
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
/**
@@ -2477,20 +2466,10 @@ static void finish_dispatch(struct rq *rq, struct task_struct *p,
dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p);
- if (dsq->id == SCX_DSQ_LOCAL) {
- switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) {
- case DTL_DISPATCHED:
- break;
- case DTL_LOST:
- break;
- case DTL_INVALID:
- dispatch_enqueue(&scx_dsq_global, p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
- break;
- }
- } else {
+ if (dsq->id == SCX_DSQ_LOCAL)
+ dispatch_to_local_dsq(rq, dsq, p, enq_flags);
+ else
dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- }
}
static void flush_dispatch_buf(struct rq *rq)
--
2.46.0

* [PATCH 05/12] sched_ext: Restructure dispatch_to_local_dsq()
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Now that there's nothing left after the big if block, flip the if condition
and unindent the body.
No functional changes intended.
v2: Add BUG() to clarify that control can't reach the end of
dispatch_to_local_dsq() in UP kernels, per David.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 96 ++++++++++++++++++++++------------------------
1 file changed, 46 insertions(+), 50 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ec61ab676517..89a393f183dd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2331,65 +2331,61 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
}
#ifdef CONFIG_SMP
- if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
- /*
- * @p is on a possibly remote @src_rq which we need to lock to
- * move the task. If dequeue is in progress, it'd be locking
- * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq
- * lock while holding DISPATCHING.
- *
- * As DISPATCHING guarantees that @p is wholly ours, we can
- * pretend that we're moving from a DSQ and use the same
- * mechanism - mark the task under transfer with holding_cpu,
- * release DISPATCHING and then follow the same protocol. See
- * unlink_dsq_and_lock_src_rq().
- */
- p->scx.holding_cpu = raw_smp_processor_id();
+ if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) {
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ return;
+ }
- /* store_release ensures that dequeue sees the above */
- atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+ /*
+ * @p is on a possibly remote @src_rq which we need to lock to move the
+ * task. If dequeue is in progress, it'd be locking @src_rq and waiting
+ * on DISPATCHING, so we can't grab @src_rq lock while holding
+ * DISPATCHING.
+ *
+ * As DISPATCHING guarantees that @p is wholly ours, we can pretend that
+ * we're moving from a DSQ and use the same mechanism - mark the task
+ * under transfer with holding_cpu, release DISPATCHING and then follow
+ * the same protocol. See unlink_dsq_and_lock_src_rq().
+ */
+ p->scx.holding_cpu = raw_smp_processor_id();
- /* switch to @src_rq lock */
- if (rq != src_rq) {
- raw_spin_rq_unlock(rq);
- raw_spin_rq_lock(src_rq);
- }
+ /* store_release ensures that dequeue sees the above */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
- /* task_rq couldn't have changed if we're still the holding cpu */
- if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
- !WARN_ON_ONCE(src_rq != task_rq(p))) {
- /*
- * If @p is staying on the same rq, there's no need to
- * go through the full deactivate/activate cycle.
- * Optimize by abbreviating the operations in
- * move_task_to_local_dsq().
- */
- if (src_rq == dst_rq) {
- p->scx.holding_cpu = -1;
- dispatch_enqueue(&dst_rq->scx.local_dsq,
- p, enq_flags);
- } else {
- move_task_to_local_dsq(p, enq_flags,
- src_rq, dst_rq);
- }
+ /* switch to @src_rq lock */
+ if (rq != src_rq) {
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(src_rq);
+ }
- /* if the destination CPU is idle, wake it up */
- if (sched_class_above(p->sched_class,
- dst_rq->curr->sched_class))
- resched_curr(dst_rq);
+ /* task_rq couldn't have changed if we're still the holding cpu */
+ if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
+ !WARN_ON_ONCE(src_rq != task_rq(p))) {
+ /*
+ * If @p is staying on the same rq, there's no need to go
+ * through the full deactivate/activate cycle. Optimize by
+ * abbreviating the operations in move_task_to_local_dsq().
+ */
+ if (src_rq == dst_rq) {
+ p->scx.holding_cpu = -1;
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags);
+ } else {
+ move_task_to_local_dsq(p, enq_flags, src_rq, dst_rq);
}
- /* switch back to @rq lock */
- if (rq != dst_rq) {
- raw_spin_rq_unlock(dst_rq);
- raw_spin_rq_lock(rq);
- }
+ /* if the destination CPU is idle, wake it up */
+ if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+ resched_curr(dst_rq);
+ }
- return;
+ /* switch back to @rq lock */
+ if (rq != dst_rq) {
+ raw_spin_rq_unlock(dst_rq);
+ raw_spin_rq_lock(rq);
}
+#else /* CONFIG_SMP */
+ BUG(); /* control can not reach here on UP */
#endif /* CONFIG_SMP */
-
- dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
/**
--
2.46.0

* [PATCH 06/12] sched_ext: Reorder args for consume_local/remote_task()
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Reorder args for consistency in the order of:
current_rq, p, src_[rq|dsq], dst_[rq|dsq].
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 89a393f183dd..0829b7637c52 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2137,8 +2137,8 @@ static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
#endif /* CONFIG_SMP */
-static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq,
- struct task_struct *p)
+static void consume_local_task(struct task_struct *p,
+ struct scx_dispatch_q *dsq, struct rq *rq)
{
lockdep_assert_held(&dsq->lock); /* released on return */
@@ -2247,8 +2247,8 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
!WARN_ON_ONCE(src_rq != task_rq(p));
}
-static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
- struct task_struct *p, struct rq *src_rq)
+static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
+ struct scx_dispatch_q *dsq, struct rq *src_rq)
{
raw_spin_rq_unlock(this_rq);
@@ -2263,7 +2263,7 @@ static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
}
#else /* CONFIG_SMP */
static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; }
-static inline bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { return false; }
+static inline bool consume_remote_task(struct rq *this_rq, struct task_struct *p, struct scx_dispatch_q *dsq, struct rq *task_rq) { return false; }
#endif /* CONFIG_SMP */
static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
@@ -2284,12 +2284,12 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
struct rq *task_rq = task_rq(p);
if (rq == task_rq) {
- consume_local_task(rq, dsq, p);
+ consume_local_task(p, dsq, rq);
return true;
}
if (task_can_run_on_remote_rq(p, rq, false)) {
- if (likely(consume_remote_task(rq, dsq, p, task_rq)))
+ if (likely(consume_remote_task(rq, p, dsq, task_rq)))
return true;
goto retry;
}
--
2.46.0

* [PATCH 07/12] sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
task_unlink_from_dsq(). Also move the sanity check into it.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0829b7637c52..4ec0a4f7f3ee 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1639,6 +1639,8 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
static void task_unlink_from_dsq(struct task_struct *p,
struct scx_dispatch_q *dsq)
{
+ WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
+
if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) {
rb_erase(&p->scx.dsq_priq, &dsq->priq);
RB_CLEAR_NODE(&p->scx.dsq_priq);
@@ -1646,6 +1648,7 @@ static void task_unlink_from_dsq(struct task_struct *p,
}
list_del_init(&p->scx.dsq_list.node);
+ dsq_mod_nr(dsq, -1);
}
static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
@@ -1682,9 +1685,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
*/
if (p->scx.holding_cpu < 0) {
/* @p must still be on @dsq, dequeue */
- WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
task_unlink_from_dsq(p, dsq);
- dsq_mod_nr(dsq, -1);
} else {
/*
* We're racing against dispatch_to_local_dsq() which already
@@ -2146,7 +2147,6 @@ static void consume_local_task(struct task_struct *p,
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
task_unlink_from_dsq(p, dsq);
list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
- dsq_mod_nr(dsq, -1);
dsq_mod_nr(&rq->scx.local_dsq, 1);
p->scx.dsq = &rq->scx.local_dsq;
raw_spin_unlock(&dsq->lock);
@@ -2236,7 +2236,6 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
task_unlink_from_dsq(p, dsq);
- dsq_mod_nr(dsq, -1);
p->scx.holding_cpu = cpu;
raw_spin_unlock(&dsq->lock);
--
2.46.0

* [PATCH 08/12] sched_ext: Move consume_local_task() upward
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Move consume_local_task() upward so that the local case comes first and the two
CONFIG_SMP blocks can be merged.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4ec0a4f7f3ee..e0bc4851b6e2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2101,6 +2101,20 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
return false;
}
+static void consume_local_task(struct task_struct *p,
+ struct scx_dispatch_q *dsq, struct rq *rq)
+{
+ lockdep_assert_held(&dsq->lock); /* released on return */
+
+ /* @dsq is locked and @p is on this rq */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ task_unlink_from_dsq(p, dsq);
+ list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
+ dsq_mod_nr(&rq->scx.local_dsq, 1);
+ p->scx.dsq = &rq->scx.local_dsq;
+ raw_spin_unlock(&dsq->lock);
+}
+
#ifdef CONFIG_SMP
/**
* move_task_to_local_dsq - Move a task from a different rq to a local DSQ
@@ -2136,23 +2150,6 @@ static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
dst_rq->scx.extra_enq_flags = 0;
}
-#endif /* CONFIG_SMP */
-
-static void consume_local_task(struct task_struct *p,
- struct scx_dispatch_q *dsq, struct rq *rq)
-{
- lockdep_assert_held(&dsq->lock); /* released on return */
-
- /* @dsq is locked and @p is on this rq */
- WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- task_unlink_from_dsq(p, dsq);
- list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
- dsq_mod_nr(&rq->scx.local_dsq, 1);
- p->scx.dsq = &rq->scx.local_dsq;
- raw_spin_unlock(&dsq->lock);
-}
-
-#ifdef CONFIG_SMP
/*
* Similar to kernel/sched/core.c::is_cpu_allowed(). However, there are two
* differences:
--
2.46.0

* [PATCH 09/12] sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().
- Rename consume_local_task() to move_local_task_to_local_dsq() and remove
task_unlink_from_dsq() and source DSQ unlocking from it.
This is to make the migration code easier to reuse.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 42 ++++++++++++++++++++++++++----------------
1 file changed, 26 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e0bc4851b6e2..d50166a2651a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2101,23 +2101,30 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
return false;
}
-static void consume_local_task(struct task_struct *p,
- struct scx_dispatch_q *dsq, struct rq *rq)
+static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
+ struct scx_dispatch_q *src_dsq,
+ struct rq *dst_rq)
{
- lockdep_assert_held(&dsq->lock); /* released on return */
+ struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
+
+ /* @dsq is locked and @p is on @dst_rq */
+ lockdep_assert_held(&src_dsq->lock);
+ lockdep_assert_rq_held(dst_rq);
- /* @dsq is locked and @p is on this rq */
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- task_unlink_from_dsq(p, dsq);
- list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
- dsq_mod_nr(&rq->scx.local_dsq, 1);
- p->scx.dsq = &rq->scx.local_dsq;
- raw_spin_unlock(&dsq->lock);
+
+ if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
+ list_add(&p->scx.dsq_list.node, &dst_dsq->list);
+ else
+ list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list);
+
+ dsq_mod_nr(dst_dsq, 1);
+ p->scx.dsq = dst_dsq;
}
#ifdef CONFIG_SMP
/**
- * move_task_to_local_dsq - Move a task from a different rq to a local DSQ
+ * move_remote_task_to_local_dsq - Move a task from a foreign rq to a local DSQ
* @p: task to move
* @enq_flags: %SCX_ENQ_*
* @src_rq: rq to move the task from, locked on entry, released on return
@@ -2125,8 +2132,8 @@ static void consume_local_task(struct task_struct *p,
*
* Move @p which is currently on @src_rq to @dst_rq's local DSQ.
*/
-static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
- struct rq *src_rq, struct rq *dst_rq)
+static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
+ struct rq *src_rq, struct rq *dst_rq)
{
lockdep_assert_rq_held(src_rq);
@@ -2249,7 +2256,7 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
raw_spin_rq_unlock(this_rq);
if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
- move_task_to_local_dsq(p, 0, src_rq, this_rq);
+ move_remote_task_to_local_dsq(p, 0, src_rq, this_rq);
return true;
} else {
raw_spin_rq_unlock(src_rq);
@@ -2280,7 +2287,9 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
struct rq *task_rq = task_rq(p);
if (rq == task_rq) {
- consume_local_task(p, dsq, rq);
+ task_unlink_from_dsq(p, dsq);
+ move_local_task_to_local_dsq(p, 0, dsq, rq);
+ raw_spin_unlock(&dsq->lock);
return true;
}
@@ -2360,13 +2369,14 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
/*
* If @p is staying on the same rq, there's no need to go
* through the full deactivate/activate cycle. Optimize by
- * abbreviating the operations in move_task_to_local_dsq().
+ * abbreviating move_remote_task_to_local_dsq().
*/
if (src_rq == dst_rq) {
p->scx.holding_cpu = -1;
dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags);
} else {
- move_task_to_local_dsq(p, enq_flags, src_rq, dst_rq);
+ move_remote_task_to_local_dsq(p, enq_flags,
+ src_rq, dst_rq);
}
/* if the destination CPU is idle, wake it up */
--
2.46.0

* [PATCH 10/12] sched_ext: Compact struct bpf_iter_scx_dsq_kern
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
so while staying within bpf_iter_scx_dsq to maintain binary compatibility.
The way bpf_iter_scx_dsq_kern is laid out is rather inefficient - the node
field takes up three u64's but only one bit of the last u64 is used. Turn the
bool into u32 flags and only use the lower 16 bits, freeing up 48 bits -
16 bits for flags, 32 bits for a u32 - for use by struct
bpf_iter_scx_dsq_kern.
This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
into the cursor field, reducing the struct size by a full u64.
No behavior changes intended.
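As a rough illustration of what this buys (sizes assume 64-bit pointers; the
snippets are a layout sketch, not the verbatim definitions):

  /* before: the bool burns a full u64 of the three-u64 node field */
  struct scx_dsq_list_node {
          struct list_head node;                  /* 16 bytes */
          bool is_bpf_iter_cursor;                /*  1 byte + 7 bytes padding */
  };                                              /* 24 bytes */

  struct bpf_iter_scx_dsq_kern {
          struct scx_dsq_list_node cursor;        /* 24 bytes */
          struct scx_dispatch_q *dsq;             /*  8 bytes */
          u32 dsq_seq;
          u32 flags;
  };                                              /* 40 bytes */

  /* after: flags and priv reuse the padded space, absorbing dsq_seq and the
   * iterator flags, and freeing one u64 of the six-u64 bpf_iter_scx_dsq
   */
  struct scx_dsq_list_node {
          struct list_head node;                  /* 16 bytes */
          u32 flags;              /* low 16 bits: lnode, high 16 bits: iter */
          u32 priv;               /* holds the DSQ seq for the iterator */
  };                                              /* 24 bytes */

  struct bpf_iter_scx_dsq_kern {
          struct scx_dsq_list_node cursor;        /* 24 bytes */
          struct scx_dispatch_q *dsq;             /*  8 bytes */
  };                                              /* 32 bytes */
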
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched/ext.h | 10 +++++++++-
kernel/sched/ext.c | 23 ++++++++++++-----------
2 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 69f68e2121a8..9a75534cfac6 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -120,9 +120,17 @@ enum scx_kf_mask {
__SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
};
+enum scx_dsq_lnode_flags {
+ SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0,
+
+ /* high 16 bits can be for iter cursor flags */
+ __SCX_DSQ_LNODE_PRIV_SHIFT = 16,
+};
+
struct scx_dsq_list_node {
struct list_head node;
- bool is_bpf_iter_cursor;
+ u32 flags;
+ u32 priv; /* can be used by iter cursor */
};
/*
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d50166a2651a..28f421227b5b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1100,7 +1100,7 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq,
dsq_lnode = container_of(list_node, struct scx_dsq_list_node,
node);
- } while (dsq_lnode->is_bpf_iter_cursor);
+ } while (dsq_lnode->flags & SCX_DSQ_LNODE_ITER_CURSOR);
return container_of(dsq_lnode, struct task_struct, scx.dsq_list);
}
@@ -1118,16 +1118,15 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq,
*/
enum scx_dsq_iter_flags {
/* iterate in the reverse dispatch order */
- SCX_DSQ_ITER_REV = 1U << 0,
+ SCX_DSQ_ITER_REV = 1U << 16,
- __SCX_DSQ_ITER_ALL_FLAGS = SCX_DSQ_ITER_REV,
+ __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV,
+ __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS,
};
struct bpf_iter_scx_dsq_kern {
struct scx_dsq_list_node cursor;
struct scx_dispatch_q *dsq;
- u32 dsq_seq;
- u32 flags;
} __attribute__((aligned(8)));
struct bpf_iter_scx_dsq {
@@ -1165,6 +1164,9 @@ static void scx_task_iter_init(struct scx_task_iter *iter)
{
lockdep_assert_held(&scx_tasks_lock);
+ BUILD_BUG_ON(__SCX_DSQ_ITER_ALL_FLAGS &
+ ((1U << __SCX_DSQ_LNODE_PRIV_SHIFT) - 1));
+
iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
list_add(&iter->cursor.tasks_node, &scx_tasks);
iter->locked = NULL;
@@ -5876,7 +5878,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
BUILD_BUG_ON(__alignof__(struct bpf_iter_scx_dsq_kern) !=
__alignof__(struct bpf_iter_scx_dsq));
- if (flags & ~__SCX_DSQ_ITER_ALL_FLAGS)
+ if (flags & ~__SCX_DSQ_ITER_USER_FLAGS)
return -EINVAL;
kit->dsq = find_non_local_dsq(dsq_id);
@@ -5884,9 +5886,8 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
return -ENOENT;
INIT_LIST_HEAD(&kit->cursor.node);
- kit->cursor.is_bpf_iter_cursor = true;
- kit->dsq_seq = READ_ONCE(kit->dsq->seq);
- kit->flags = flags;
+ kit->cursor.flags |= SCX_DSQ_LNODE_ITER_CURSOR | flags;
+ kit->cursor.priv = READ_ONCE(kit->dsq->seq);
return 0;
}
@@ -5900,7 +5901,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
__bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it)
{
struct bpf_iter_scx_dsq_kern *kit = (void *)it;
- bool rev = kit->flags & SCX_DSQ_ITER_REV;
+ bool rev = kit->cursor.flags & SCX_DSQ_ITER_REV;
struct task_struct *p;
unsigned long flags;
@@ -5921,7 +5922,7 @@ __bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *i
*/
do {
p = nldsq_next_task(kit->dsq, p, rev);
- } while (p && unlikely(u32_before(kit->dsq_seq, p->scx.dsq_seq)));
+ } while (p && unlikely(u32_before(kit->cursor.priv, p->scx.dsq_seq)));
if (p) {
if (rev)
--
2.46.0

* [PATCH 11/12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
From: Tejun Heo @ 2024-09-01 16:43 UTC
To: void
Cc: kernel-team, linux-kernel, Tejun Heo, Daniel Hodges, Changwoo Min,
Andrea Righi, Dan Schatzberg
Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.
Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which doesn't hold a rq lock, including BPF timers and SYSCALL programs.
This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:
http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org
v2: Remove @slice and @vtime from scx_bpf_dispatch[_vtime]_from_dsq() as
they push scx_bpf_dispatch_vtime_from_dsq() over the kfunc argument
count limit and often won't be needed anyway. Instead, provide
scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
only when needed and override the specified parameter for the subsequent
dispatch.
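A usage sketch of the new kfuncs (not part of this patch; assumes the
bpf_for_each() DSQ iterator from common.bpf.h, and SHARED_DSQ / slice_ns are
made-up scheduler definitions). For example, moving one queued task from a user
DSQ onto its current CPU's local DSQ, from ops.dispatch() or a BPF timer:

  struct task_struct *p;

  bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
          s32 cpu = scx_bpf_task_cpu(p);

          /* optional override, consumed by the next transfer from this iterator */
          scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER, slice_ns);

          /* returns false if @p was consumed or dequeued in the meantime */
          if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                        SCX_DSQ_LOCAL_ON | cpu, 0))
                  break;
  }
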
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
---
kernel/sched/ext.c | 232 ++++++++++++++++++++++-
tools/sched_ext/include/scx/common.bpf.h | 10 +
2 files changed, 239 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 28f421227b5b..cb048e264866 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1067,6 +1067,11 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
return true;
}
+static bool scx_kf_allowed_if_unlocked(void)
+{
+ return !current->scx.kf_mask;
+}
+
/**
* nldsq_next_task - Iterate to the next task in a non-local DSQ
* @dsq: user dsq being interated
@@ -1120,13 +1125,20 @@ enum scx_dsq_iter_flags {
/* iterate in the reverse dispatch order */
SCX_DSQ_ITER_REV = 1U << 16,
+ __SCX_DSQ_ITER_HAS_SLICE = 1U << 30,
+ __SCX_DSQ_ITER_HAS_VTIME = 1U << 31,
+
__SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV,
- __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS,
+ __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS |
+ __SCX_DSQ_ITER_HAS_SLICE |
+ __SCX_DSQ_ITER_HAS_VTIME,
};
struct bpf_iter_scx_dsq_kern {
struct scx_dsq_list_node cursor;
struct scx_dispatch_q *dsq;
+ u64 slice;
+ u64 vtime;
} __attribute__((aligned(8)));
struct bpf_iter_scx_dsq {
@@ -5463,7 +5475,7 @@ __bpf_kfunc_start_defs();
* scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ
* @p: task_struct to dispatch
* @dsq_id: DSQ to dispatch to
- * @slice: duration @p can run for in nsecs
+ * @slice: duration @p can run for in nsecs, 0 to keep the current value
* @enq_flags: SCX_ENQ_*
*
* Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe
@@ -5513,7 +5525,7 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
* scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ
* @p: task_struct to dispatch
* @dsq_id: DSQ to dispatch to
- * @slice: duration @p can run for in nsecs
+ * @slice: duration @p can run for in nsecs, 0 to keep the current value
* @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
* @enq_flags: SCX_ENQ_*
*
@@ -5554,6 +5566,118 @@ static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
.set = &scx_kfunc_ids_enqueue_dispatch,
};
+static bool scx_dispatch_from_dsq(struct bpf_iter_scx_dsq_kern *kit,
+ struct task_struct *p, u64 dsq_id,
+ u64 enq_flags)
+{
+ struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq;
+ struct rq *this_rq, *src_rq, *dst_rq, *locked_rq;
+ bool dispatched = false;
+ bool in_balance;
+ unsigned long flags;
+
+ if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH))
+ return false;
+
+ /*
+ * Can be called from either ops.dispatch() locking this_rq() or any
+ * context where no rq lock is held. If latter, lock @p's task_rq which
+ * we'll likely need anyway.
+ */
+ src_rq = task_rq(p);
+
+ local_irq_save(flags);
+ this_rq = this_rq();
+ in_balance = this_rq->scx.flags & SCX_RQ_IN_BALANCE;
+
+ if (in_balance) {
+ if (this_rq != src_rq) {
+ raw_spin_rq_unlock(this_rq);
+ raw_spin_rq_lock(src_rq);
+ }
+ } else {
+ raw_spin_rq_lock(src_rq);
+ }
+
+ locked_rq = src_rq;
+ raw_spin_lock(&src_dsq->lock);
+
+ /*
+ * Did someone else get to it? @p could have already left $src_dsq, got
+ * re-enqueud, or be in the process of being consumed by someone else.
+ */
+ if (unlikely(p->scx.dsq != src_dsq ||
+ u32_before(kit->cursor.priv, p->scx.dsq_seq) ||
+ p->scx.holding_cpu >= 0) ||
+ WARN_ON_ONCE(src_rq != task_rq(p))) {
+ raw_spin_unlock(&src_dsq->lock);
+ goto out;
+ }
+
+ /* @p is still on $src_dsq and stable, determine the destination */
+ dst_dsq = find_dsq_for_dispatch(this_rq, dsq_id, p);
+
+ if (dst_dsq->id == SCX_DSQ_LOCAL) {
+ dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
+ if (!task_can_run_on_remote_rq(p, dst_rq, true)) {
+ dst_dsq = &scx_dsq_global;
+ dst_rq = src_rq;
+ }
+ } else {
+ /* no need to migrate if destination is a non-local DSQ */
+ dst_rq = src_rq;
+ }
+
+ /*
+ * Move @p into $dst_dsq. If $dst_dsq is the local DSQ of a different
+ * CPU, @p will be migrated.
+ */
+ if (dst_dsq->id == SCX_DSQ_LOCAL) {
+ /* @p is going from a non-local DSQ to a local DSQ */
+ if (src_rq == dst_rq) {
+ task_unlink_from_dsq(p, src_dsq);
+ move_local_task_to_local_dsq(p, enq_flags,
+ src_dsq, dst_rq);
+ raw_spin_unlock(&src_dsq->lock);
+ } else {
+ raw_spin_unlock(&src_dsq->lock);
+ move_remote_task_to_local_dsq(p, enq_flags,
+ src_rq, dst_rq);
+ locked_rq = dst_rq;
+ }
+ } else {
+ /*
+ * @p is going from a non-local DSQ to a non-local DSQ. As
+ * $src_dsq is already locked, do an abbreviated dequeue.
+ */
+ task_unlink_from_dsq(p, src_dsq);
+ p->scx.dsq = NULL;
+ raw_spin_unlock(&src_dsq->lock);
+
+ if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_VTIME)
+ p->scx.dsq_vtime = kit->vtime;
+ dispatch_enqueue(dst_dsq, p, enq_flags);
+ }
+
+ if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_SLICE)
+ p->scx.slice = kit->slice;
+
+ dispatched = true;
+out:
+ if (in_balance) {
+ if (this_rq != locked_rq) {
+ raw_spin_rq_unlock(locked_rq);
+ raw_spin_rq_lock(this_rq);
+ }
+ } else {
+ raw_spin_rq_unlock_irqrestore(locked_rq, flags);
+ }
+
+ kit->cursor.flags &= ~(__SCX_DSQ_ITER_HAS_SLICE |
+ __SCX_DSQ_ITER_HAS_VTIME);
+ return dispatched;
+}
+
__bpf_kfunc_start_defs();
/**
@@ -5633,12 +5757,112 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
}
}
+/**
+ * scx_bpf_dispatch_from_dsq_set_slice - Override slice when dispatching from DSQ
+ * @it__iter: DSQ iterator in progress
+ * @slice: duration the dispatched task can run for in nsecs
+ *
+ * Override the slice of the next task that will be dispatched from @it__iter
+ * using scx_bpf_dispatch_from_dsq[_vtime](). If this function is not called,
+ * the previous slice duration is kept.
+ */
+__bpf_kfunc void scx_bpf_dispatch_from_dsq_set_slice(
+ struct bpf_iter_scx_dsq *it__iter, u64 slice)
+{
+ struct bpf_iter_scx_dsq_kern *kit = (void *)it__iter;
+
+ kit->slice = slice;
+ kit->cursor.flags |= __SCX_DSQ_ITER_HAS_SLICE;
+}
+
+/**
+ * scx_bpf_dispatch_from_dsq_set_vtime - Override vtime when dispatching from DSQ
+ * @it__iter: DSQ iterator in progress
+ * @vtime: task's ordering inside the vtime-sorted queue of the target DSQ
+ *
+ * Override the vtime of the next task that will be dispatched from @it__iter
+ * using scx_bpf_dispatch_from_dsq_vtime(). If this function is not called, the
+ * previous slice vtime is kept. If scx_bpf_dispatch_from_dsq() is used to
+ * dispatch the next task, the override is ignored and cleared.
+ */
+__bpf_kfunc void scx_bpf_dispatch_from_dsq_set_vtime(
+ struct bpf_iter_scx_dsq *it__iter, u64 vtime)
+{
+ struct bpf_iter_scx_dsq_kern *kit = (void *)it__iter;
+
+ kit->vtime = vtime;
+ kit->cursor.flags |= __SCX_DSQ_ITER_HAS_VTIME;
+}
+
+/**
+ * scx_bpf_dispatch_from_dsq - Move a task from DSQ iteration to a DSQ
+ * @it__iter: DSQ iterator in progress
+ * @p: task to transfer
+ * @dsq_id: DSQ to move @p to
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Transfer @p which is on the DSQ currently iterated by @it__iter to the DSQ
+ * specified by @dsq_id. All DSQs - local DSQs, global DSQ and user DSQs - can
+ * be the destination.
+ *
+ * For the transfer to be successful, @p must still be on the DSQ and have been
+ * queued before the DSQ iteration started. This function doesn't care whether
+ * @p was obtained from the DSQ iteration. @p just has to be on the DSQ and have
+ * been queued before the iteration started.
+ *
+ * @p's slice is kept by default. Use scx_bpf_dispatch_from_dsq_set_slice() to
+ * update.
+ *
+ * Can be called from ops.dispatch() or any BPF context which doesn't hold a rq
+ * lock (e.g. BPF timers or SYSCALL programs).
+ *
+ * Returns %true if @p has been consumed, %false if @p had already been consumed
+ * or dequeued.
+ */
+__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter,
+ struct task_struct *p, u64 dsq_id,
+ u64 enq_flags)
+{
+ return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
+ p, dsq_id, enq_flags);
+}
+
+/**
+ * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ
+ * @it__iter: DSQ iterator in progress
+ * @p: task to transfer
+ * @dsq_id: DSQ to move @p to
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Transfer @p which is on the DSQ currently iterated by @it__iter to the
+ * priority queue of the DSQ specified by @dsq_id. The destination must be a
+ * user DSQ as only user DSQs support priority queues.
+ *
+ * @p's slice and vtime are kept by default. Use
+ * scx_bpf_dispatch_from_dsq_set_slice() and
+ * scx_bpf_dispatch_from_dsq_set_vtime() to update.
+ *
+ * All other aspects are identical to scx_bpf_dispatch_from_dsq(). See
+ * scx_bpf_dispatch_vtime() for more information on @vtime.
+ */
+__bpf_kfunc bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter,
+ struct task_struct *p, u64 dsq_id,
+ u64 enq_flags)
+{
+ return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
+ p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
BTF_ID_FLAGS(func, scx_bpf_consume)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_slice)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_vtime)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_dispatch)
static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
@@ -5735,6 +5959,8 @@ __bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 20280df62857..c8fe21f65d14 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -35,6 +35,10 @@ void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vt
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch_cancel(void) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
+void scx_bpf_dispatch_from_dsq_set_slice(struct bpf_iter_scx_dsq *it__iter, u64 slice) __ksym;
+void scx_bpf_dispatch_from_dsq_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym;
+bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
+bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
u32 scx_bpf_reenqueue_local(void) __ksym;
void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
@@ -62,6 +66,12 @@ bool scx_bpf_task_running(const struct task_struct *p) __ksym;
s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
+/*
+ * Use the following as @it__iter when calling
+ * scx_bpf_dispatch[_vtime]_from_dsq() from within bpf_for_each() loops.
+ */
+#define BPF_FOR_EACH_ITER (&___it)
+
static inline __attribute__((format(printf, 1, 2)))
void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {}
--
2.46.0
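A minimal usage sketch (not part of the patch): from ops.dispatch(), a scheduler
could walk one DSQ and move selected tasks to a vtime-ordered DSQ roughly as
below. MY_SRC_DSQ, MY_DST_DSQ, the boost test and the slice value are made-up
placeholders; only the kfuncs and BPF_FOR_EACH_ITER come from this series.

	/* sketch only; assumes scx/common.bpf.h from this series is included */
	static void move_boosted(void)
	{
		static u64 seq;		/* monotonic vtime for MY_DST_DSQ */
		struct task_struct *p;

		bpf_for_each(scx_dsq, p, MY_SRC_DSQ, 0) {
			if (p->scx.weight < 10000)	/* made-up boost test */
				continue;

			/* overrides apply only to the next transfer */
			scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER,
							    5 * 1000 * 1000);
			scx_bpf_dispatch_from_dsq_set_vtime(BPF_FOR_EACH_ITER,
							    seq++);

			/* returns false if @p was already consumed or dequeued */
			scx_bpf_dispatch_vtime_from_dsq(BPF_FOR_EACH_ITER, p,
							MY_DST_DSQ, 0);
		}
	}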
^ permalink raw reply related [flat|nested] 14+ messages in thread

* [PATCH 12/12] scx_qmap: Implement highpri boosting
2024-09-01 16:43 [PATCHSET v2 sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
` (10 preceding siblings ...)
2024-09-01 16:43 ` [PATCH 11/12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
@ 2024-09-01 16:43 ` Tejun Heo
2024-09-04 20:27 ` [PATCHSET v2 sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
12 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-09-01 16:43 UTC (permalink / raw)
To: void
Cc: kernel-team, linux-kernel, Tejun Heo, Daniel Hodges, Changwoo Min,
Andrea Righi, Dan Schatzberg
Implement a silly boosting mechanism for nice -20 tasks. The only purpose is
demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only
works within SHARED_DSQ and makes only a minor difference, which shows up
only when the dispatch batch is increased (-b).
This exercises moving tasks to a user DSQ and all local DSQs from
ops.dispatch() and BPF timerfn.
v2: - Updated to use scx_bpf_dispatch_from_dsq_set_{slice|vtime}().
- Drop the workaround for the iterated tasks not being trusted by the
verifier. The issue has been fixed on the BPF side.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
---
tools/sched_ext/scx_qmap.bpf.c | 133 +++++++++++++++++++++++++++++----
tools/sched_ext/scx_qmap.c | 11 ++-
2 files changed, 130 insertions(+), 14 deletions(-)
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 892278f12dce..391d80b4ac8e 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -27,6 +27,8 @@
enum consts {
ONE_SEC_IN_NS = 1000000000,
SHARED_DSQ = 0,
+ HIGHPRI_DSQ = 1,
+ HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */
};
char _license[] SEC("license") = "GPL";
@@ -36,10 +38,12 @@ const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
const volatile u32 dsp_inf_loop_after;
const volatile u32 dsp_batch;
+const volatile bool highpri_boosting;
const volatile bool print_shared_dsq;
const volatile s32 disallow_tgid;
const volatile bool suppress_dump;
+u64 nr_highpri_queued;
u32 test_error_cnt;
UEI_DEFINE(uei);
@@ -95,6 +99,7 @@ static u64 core_sched_tail_seqs[5];
/* Per-task scheduling context */
struct task_ctx {
bool force_local; /* Dispatch directly to local_dsq */
+ bool highpri;
u64 core_sched_seq;
};
@@ -122,6 +127,7 @@ struct {
/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
u64 nr_core_sched_execed;
+u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer;
u32 cpuperf_min, cpuperf_avg, cpuperf_max;
u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
@@ -140,17 +146,25 @@ static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
return -1;
}
+static struct task_ctx *lookup_task_ctx(struct task_struct *p)
+{
+ struct task_ctx *tctx;
+
+ if (!(tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) {
+ scx_bpf_error("task_ctx lookup failed");
+ return NULL;
+ }
+ return tctx;
+}
+
s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
{
struct task_ctx *tctx;
s32 cpu;
- tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
- if (!tctx) {
- scx_bpf_error("task_ctx lookup failed");
+ if (!(tctx = lookup_task_ctx(p)))
return -ESRCH;
- }
cpu = pick_direct_dispatch_cpu(p, prev_cpu);
@@ -197,11 +211,8 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
if (test_error_cnt && !--test_error_cnt)
scx_bpf_error("test triggering error");
- tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
- if (!tctx) {
- scx_bpf_error("task_ctx lookup failed");
+ if (!(tctx = lookup_task_ctx(p)))
return;
- }
/*
* All enqueued tasks must have their core_sched_seq updated for correct
@@ -256,6 +267,10 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}
+ if (highpri_boosting && p->scx.weight >= HIGHPRI_WEIGHT) {
+ tctx->highpri = true;
+ __sync_fetch_and_add(&nr_highpri_queued, 1);
+ }
__sync_fetch_and_add(&nr_enqueued, 1);
}
@@ -272,13 +287,80 @@ void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
static void update_core_sched_head_seq(struct task_struct *p)
{
- struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
int idx = weight_to_idx(p->scx.weight);
+ struct task_ctx *tctx;
- if (tctx)
+ if ((tctx = lookup_task_ctx(p)))
core_sched_head_seqs[idx] = tctx->core_sched_seq;
- else
- scx_bpf_error("task_ctx lookup failed");
+}
+
+/*
+ * To demonstrate the use of scx_bpf_dispatch_from_dsq(), implement a silly
+ * selective priority boosting mechanism by scanning SHARED_DSQ looking for
+ * highpri tasks, moving them to HIGHPRI_DSQ and then consuming them first.
+ * This makes a minor difference only when dsp_batch is larger than 1.
+ *
+ * scx_bpf_dispatch[_vtime]_from_dsq() may be called both from ops.dispatch()
+ * and from BPF programs that don't hold an rq lock. As a demonstration, this
+ * function is called from both qmap_dispatch() and monitor_timerfn().
+ */
+static bool dispatch_highpri(bool from_timer)
+{
+ struct task_struct *p;
+ s32 this_cpu = bpf_get_smp_processor_id();
+
+ /* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */
+ bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
+ static u64 highpri_seq;
+ struct task_ctx *tctx;
+
+ if (!(tctx = lookup_task_ctx(p)))
+ return false;
+
+ if (tctx->highpri) {
+ /* exercise the set_*() and vtime interface too */
+ scx_bpf_dispatch_from_dsq_set_slice(
+ BPF_FOR_EACH_ITER, slice_ns * 2);
+ scx_bpf_dispatch_from_dsq_set_vtime(
+ BPF_FOR_EACH_ITER, highpri_seq++);
+ scx_bpf_dispatch_vtime_from_dsq(
+ BPF_FOR_EACH_ITER, p, HIGHPRI_DSQ, 0);
+ }
+ }
+
+ /*
+ * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU
+ * is found.
+ */
+ bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) {
+ bool dispatched = false;
+ s32 cpu;
+
+ if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
+ cpu = this_cpu;
+ else
+ cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
+
+ if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
+ SCX_DSQ_LOCAL_ON | cpu,
+ SCX_ENQ_PREEMPT)) {
+ if (cpu == this_cpu) {
+ dispatched = true;
+ __sync_fetch_and_add(&nr_expedited_local, 1);
+ } else {
+ __sync_fetch_and_add(&nr_expedited_remote, 1);
+ }
+ if (from_timer)
+ __sync_fetch_and_add(&nr_expedited_from_timer, 1);
+ } else {
+ __sync_fetch_and_add(&nr_expedited_lost, 1);
+ }
+
+ if (dispatched)
+ return true;
+ }
+
+ return false;
}
void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
@@ -289,7 +371,10 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
void *fifo;
s32 i, pid;
- if (scx_bpf_consume(SHARED_DSQ))
+ if (dispatch_highpri(false))
+ return;
+
+ if (!nr_highpri_queued && scx_bpf_consume(SHARED_DSQ))
return;
if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
@@ -326,6 +411,8 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
/* Dispatch or advance. */
bpf_repeat(BPF_MAX_LOOPS) {
+ struct task_ctx *tctx;
+
if (bpf_map_pop_elem(fifo, &pid))
break;
@@ -333,13 +420,25 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
if (!p)
continue;
+ if (!(tctx = lookup_task_ctx(p))) {
+ bpf_task_release(p);
+ return;
+ }
+
+ if (tctx->highpri)
+ __sync_fetch_and_sub(&nr_highpri_queued, 1);
+
update_core_sched_head_seq(p);
__sync_fetch_and_add(&nr_dispatched, 1);
+
scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
bpf_task_release(p);
+
batch--;
cpuc->dsp_cnt--;
if (!batch || !scx_bpf_dispatch_nr_slots()) {
+ if (dispatch_highpri(false))
+ return;
scx_bpf_consume(SHARED_DSQ);
return;
}
@@ -649,6 +748,10 @@ static void dump_shared_dsq(void)
static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
{
+ bpf_rcu_read_lock();
+ dispatch_highpri(true);
+ bpf_rcu_read_unlock();
+
monitor_cpuperf();
if (print_shared_dsq)
@@ -670,6 +773,10 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
if (ret)
return ret;
+ ret = scx_bpf_create_dsq(HIGHPRI_DSQ, -1);
+ if (ret)
+ return ret;
+
timer = bpf_map_lookup_elem(&monitor_timer, &key);
if (!timer)
return -ESRCH;
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index c9ca30d62b2b..ac45a02b4055 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -29,6 +29,7 @@ const char help_fmt[] =
" -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
" -P Print out DSQ content to trace_pipe every second, use with -b\n"
+" -H Boost nice -20 tasks in SHARED_DSQ, use with -b\n"
" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -D LEN Set scx_exit_info.dump buffer length\n"
" -S Suppress qmap-specific debug dump\n"
@@ -63,7 +64,7 @@ int main(int argc, char **argv)
skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
- while ((opt = getopt(argc, argv, "s:e:t:T:l:b:Pd:D:Spvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -86,6 +87,9 @@ int main(int argc, char **argv)
case 'P':
skel->rodata->print_shared_dsq = true;
break;
+ case 'H':
+ skel->rodata->highpri_boosting = true;
+ break;
case 'd':
skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
if (skel->rodata->disallow_tgid < 0)
@@ -121,6 +125,11 @@ int main(int argc, char **argv)
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
skel->bss->nr_core_sched_execed,
skel->bss->nr_ddsp_from_enq);
+ printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
+ skel->bss->nr_expedited_local,
+ skel->bss->nr_expedited_remote,
+ skel->bss->nr_expedited_from_timer,
+ skel->bss->nr_expedited_lost);
if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
skel->bss->cpuperf_min,
--
2.46.0
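For reference, a quick way to exercise the new kfuncs with this demo is to
enable the boosting together with a larger dispatch batch (the batch size here
is an arbitrary example):

	$ scx_qmap -H -b 8

and watch the exp_local/exp_remote/exp_timer/exp_lost counters in the periodic
stats output.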
^ permalink raw reply related [flat|nested] 14+ messages in thread

* Re: [PATCHSET v2 sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-09-01 16:43 [PATCHSET v2 sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
` (11 preceding siblings ...)
2024-09-01 16:43 ` [PATCH 12/12] scx_qmap: Implement highpri boosting Tejun Heo
@ 2024-09-04 20:27 ` Tejun Heo
12 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-09-04 20:27 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel
On Sun, Sep 01, 2024 at 06:43:37AM -1000, Tejun Heo wrote:
> 0001-sched_ext-Rename-scx_kfunc_set_sleepable-to-unlocked.patch
> 0002-sched_ext-Refactor-consume_remote_task.patch
> 0003-sched_ext-Make-find_dsq_for_dispatch-handle-SCX_DSQ_.patch
> 0004-sched_ext-Fix-processs_ddsp_deferred_locals-by-unify.patch
> 0005-sched_ext-Restructure-dispatch_to_local_dsq.patch
> 0006-sched_ext-Reorder-args-for-consume_local-remote_task.patch
> 0007-sched_ext-Move-sanity-check-and-dsq_mod_nr-into-task.patch
> 0008-sched_ext-Move-consume_local_task-upward.patch
> 0009-sched_ext-Replace-consume_local_task-with-move_local.patch
> 0010-sched_ext-Compact-struct-bpf_iter_scx_dsq_kern.patch
> 0011-sched_ext-Implement-scx_bpf_dispatch-_vtime-_from_ds.patch
> 0012-scx_qmap-Implement-highpri-boosting.patch
Applied to sched_ext/for-6.12.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread