* [PATCH v3] Implement SCX_OPS_TRACK_MIGRATION
@ 2025-06-23 6:30 Henry Huang
2025-06-23 6:30 ` [PATCH v3] sched_ext: " Henry Huang
0 siblings, 1 reply; 7+ messages in thread
From: Henry Huang @ 2025-06-23 6:30 UTC (permalink / raw)
To: changwoo, arighi, tj, void
Cc: 谈鉴锋, Yan Yan(cailing), linux-kernel,
sched-ext, Henry Huang
In our environment, we need to track task migrations in order to update a
per-CPU map.
Attaching fentry programs to enqueue_task_scx & dequeue_task_scx is a
feasible solution, but it has some limitations:
1. They can't modify p->scx.xxx.
2. enqueue_task_scx & dequeue_task_scx can no longer be subject to certain
compiler optimizations.
3. fentry has more overhead than struct_ops.
So we introduce SCX_OPS_TRACK_MIGRATION to support tracking task
migrations.
If SCX_OPS_TRACK_MIGRATION is set, runnable/quiescent are called regardless
of whether the task is migrating.
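A minimal sketch, not part of the patch, of how a BPF scheduler could use
this (names such as cpu_nr_runnable, MAX_CPUS and track_migration_ops are
invented for illustration, and it assumes a kernel with this patch applied
so that SCX_OPS_TRACK_MIGRATION, SCX_ENQ_MIGRATING and SCX_DEQ_MIGRATING
exist): keep a per-CPU runnable count that stays consistent across
migrations.

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

#define MAX_CPUS 512	/* illustrative upper bound on CPU ids */

/* hypothetical per-CPU runnable counts, keyed by CPU id */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_CPUS);
	__type(key, u32);
	__type(value, u64);
} cpu_nr_runnable SEC(".maps");

static void adjust_cpu_count(struct task_struct *p, s64 delta)
{
	u32 cpu = scx_bpf_task_cpu(p);
	u64 *cnt = bpf_map_lookup_elem(&cpu_nr_runnable, &cpu);

	if (cnt)
		__sync_fetch_and_add(cnt, delta);
}

void BPF_STRUCT_OPS(track_runnable, struct task_struct *p, u64 enq_flags)
{
	/* with SCX_OPS_TRACK_MIGRATION this also fires when the task is
	 * migrated in; enq_flags & SCX_ENQ_MIGRATING tells the two apart */
	adjust_cpu_count(p, 1);
}

void BPF_STRUCT_OPS(track_quiescent, struct task_struct *p, u64 deq_flags)
{
	/* also fires when the task is migrated away (SCX_DEQ_MIGRATING),
	 * so the source CPU's count is decremented as well */
	adjust_cpu_count(p, -1);
}

SEC(".struct_ops.link")
struct sched_ext_ops track_migration_ops = {
	.runnable	= (void *)track_runnable,
	.quiescent	= (void *)track_quiescent,
	.flags		= SCX_OPS_TRACK_MIGRATION,
	.name		= "track_migration",
};

Without the flag, neither callback fires on a migration, so the runnable
count stays attributed to the old CPU and the map drifts out of sync.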
For v2:
1. If task_on_rq_migrating(p) is true:
   set DEQUEUE_MIGRATING in deq_flags in dequeue_task_scx
   set ENQUEUE_MIGRATING in enq_flags in enqueue_task_scx
For v3:
1. Introduce SCX_ENQ_MIGRATING (= ENQUEUE_MIGRATING) and
   SCX_DEQ_MIGRATING (= DEQUEUE_MIGRATING).
2. Change the patch title:
   "include SCX_OPS_TRACK_MIGRATION" --> "Implement SCX_OPS_TRACK_MIGRATION"
Henry Huang (1):
sched_ext: Implement SCX_OPS_TRACK_MIGRATION
kernel/sched/ext.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
--
Henry
* [PATCH v3] sched_ext: Implement SCX_OPS_TRACK_MIGRATION
2025-06-23 6:30 [PATCH v3] Implement SCX_OPS_TRACK_MIGRATION Henry Huang
@ 2025-06-23 6:30 ` Henry Huang
2025-06-23 6:42 ` Andrea Righi
2025-06-23 17:59 ` Tejun Heo
0 siblings, 2 replies; 7+ messages in thread
From: Henry Huang @ 2025-06-23 6:30 UTC (permalink / raw)
To: changwoo, arighi, tj, void
Cc: 谈鉴锋, Yan Yan(cailing), linux-kernel,
sched-ext, Henry Huang
Some BPF schedulers need to do something when a task migrates, such as
updating a per-CPU map.
If SCX_OPS_TRACK_MIGRATION is set, runnable/quiescent are called regardless
of whether the task is migrating.
Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b498d86..376e028 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -161,6 +161,12 @@ enum scx_ops_flags {
SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 6,
/*
+ * If set, runnable/quiescent ops would be called whether the task is
+ * doing migration or not.
+ */
+ SCX_OPS_TRACK_MIGRATION = 1LLU << 7,
+
+ /*
* CPU cgroup support flags
*/
SCX_OPS_HAS_CGROUP_WEIGHT = 1LLU << 16, /* DEPRECATED, will be removed on 6.18 */
@@ -172,6 +178,7 @@ enum scx_ops_flags {
SCX_OPS_ALLOW_QUEUED_WAKEUP |
SCX_OPS_SWITCH_PARTIAL |
SCX_OPS_BUILTIN_IDLE_PER_NODE |
+ SCX_OPS_TRACK_MIGRATION |
SCX_OPS_HAS_CGROUP_WEIGHT,
/* high 8 bits are internal, don't include in SCX_OPS_ALL_FLAGS */
@@ -870,6 +877,7 @@ enum scx_enq_flags {
SCX_ENQ_WAKEUP = ENQUEUE_WAKEUP,
SCX_ENQ_HEAD = ENQUEUE_HEAD,
SCX_ENQ_CPU_SELECTED = ENQUEUE_RQ_SELECTED,
+ SCX_ENQ_MIGRATING = ENQUEUE_MIGRATING,
/* high 32bits are SCX specific */
@@ -913,6 +921,7 @@ enum scx_enq_flags {
enum scx_deq_flags {
/* expose select DEQUEUE_* flags as enums */
SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
+ SCX_DEQ_MIGRATING = DEQUEUE_MIGRATING,
/* high 32bits are SCX specific */
@@ -2390,7 +2399,11 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
rq->scx.nr_running++;
add_nr_running(rq, 1);
- if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p))
+ if (task_on_rq_migrating(p))
+ enq_flags |= SCX_ENQ_MIGRATING;
+
+ if (SCX_HAS_OP(sch, runnable) &&
+ ((sch->ops.flags & SCX_OPS_TRACK_MIGRATION) || !(enq_flags & SCX_ENQ_MIGRATING)))
SCX_CALL_OP_TASK(sch, SCX_KF_REST, runnable, rq, p, enq_flags);
if (enq_flags & SCX_ENQ_WAKEUP)
@@ -2463,6 +2476,9 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
return true;
}
+ if (task_on_rq_migrating(p))
+ deq_flags |= SCX_DEQ_MIGRATING;
+
ops_dequeue(rq, p, deq_flags);
/*
@@ -2482,7 +2498,8 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, false);
}
- if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p))
+ if (SCX_HAS_OP(sch, quiescent) &&
+ ((sch->ops.flags & SCX_OPS_TRACK_MIGRATION) || !(deq_flags & SCX_DEQ_MIGRATING)))
SCX_CALL_OP_TASK(sch, SCX_KF_REST, quiescent, rq, p, deq_flags);
if (deq_flags & SCX_DEQ_SLEEP)
@@ -5495,6 +5512,11 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
return -EINVAL;
}
+ if ((ops->flags & SCX_OPS_TRACK_MIGRATION) && (!ops->runnable || !ops->quiescent)) {
+ scx_error(sch, "SCX_OPS_TRACK_MIGRATION requires ops.runnable() and ops.quiescent() to be implemented");
+ return -EINVAL;
+ }
+
if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT)
pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n");
--
Henry
* Re: [PATCH v3] sched_ext: Implement SCX_OPS_TRACK_MIGRATION
2025-06-23 6:30 ` [PATCH v3] sched_ext: " Henry Huang
@ 2025-06-23 6:42 ` Andrea Righi
2025-06-23 17:59 ` Tejun Heo
1 sibling, 0 replies; 7+ messages in thread
From: Andrea Righi @ 2025-06-23 6:42 UTC (permalink / raw)
To: Henry Huang
Cc: changwoo, tj, void, 谈鉴锋, Yan Yan(cailing),
linux-kernel, sched-ext
On Mon, Jun 23, 2025 at 02:30:33PM +0800, Henry Huang wrote:
> Some BPF schedulers need to do something when a task migrates, such as
> updating a per-CPU map.
> If SCX_OPS_TRACK_MIGRATION is set, runnable/quiescent are called regardless
> of whether the task is migrating.
Looks good, thanks for moving SCX_ENQ_MIGRATING to the proper place. :)
You already added my reviewed-by line, but just in case:
Reviewed-by: Andrea Righi <arighi@nvidia.com>
-Andrea
>
> Signed-off-by: Henry Huang <henry.hj@antgroup.com>
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index b498d86..376e028 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -161,6 +161,12 @@ enum scx_ops_flags {
> SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 6,
>
> /*
> + * If set, runnable/quiescent ops would be called whether the task is
> + * doing migration or not.
> + */
> + SCX_OPS_TRACK_MIGRATION = 1LLU << 7,
> +
> + /*
> * CPU cgroup support flags
> */
> SCX_OPS_HAS_CGROUP_WEIGHT = 1LLU << 16, /* DEPRECATED, will be removed on 6.18 */
> @@ -172,6 +178,7 @@ enum scx_ops_flags {
> SCX_OPS_ALLOW_QUEUED_WAKEUP |
> SCX_OPS_SWITCH_PARTIAL |
> SCX_OPS_BUILTIN_IDLE_PER_NODE |
> + SCX_OPS_TRACK_MIGRATION |
> SCX_OPS_HAS_CGROUP_WEIGHT,
>
> /* high 8 bits are internal, don't include in SCX_OPS_ALL_FLAGS */
> @@ -870,6 +877,7 @@ enum scx_enq_flags {
> SCX_ENQ_WAKEUP = ENQUEUE_WAKEUP,
> SCX_ENQ_HEAD = ENQUEUE_HEAD,
> SCX_ENQ_CPU_SELECTED = ENQUEUE_RQ_SELECTED,
> + SCX_ENQ_MIGRATING = ENQUEUE_MIGRATING,
>
> /* high 32bits are SCX specific */
>
> @@ -913,6 +921,7 @@ enum scx_enq_flags {
> enum scx_deq_flags {
> /* expose select DEQUEUE_* flags as enums */
> SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
> + SCX_DEQ_MIGRATING = DEQUEUE_MIGRATING,
>
> /* high 32bits are SCX specific */
>
> @@ -2390,7 +2399,11 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
> rq->scx.nr_running++;
> add_nr_running(rq, 1);
>
> - if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p))
> + if (task_on_rq_migrating(p))
> + enq_flags |= SCX_ENQ_MIGRATING;
> +
> + if (SCX_HAS_OP(sch, runnable) &&
> + ((sch->ops.flags & SCX_OPS_TRACK_MIGRATION) || !(enq_flags & SCX_ENQ_MIGRATING)))
> SCX_CALL_OP_TASK(sch, SCX_KF_REST, runnable, rq, p, enq_flags);
>
> if (enq_flags & SCX_ENQ_WAKEUP)
> @@ -2463,6 +2476,9 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
> return true;
> }
>
> + if (task_on_rq_migrating(p))
> + deq_flags |= SCX_DEQ_MIGRATING;
> +
> ops_dequeue(rq, p, deq_flags);
>
> /*
> @@ -2482,7 +2498,8 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
> SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, false);
> }
>
> - if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p))
> + if (SCX_HAS_OP(sch, quiescent) &&
> + ((sch->ops.flags & SCX_OPS_TRACK_MIGRATION) || !(deq_flags & SCX_DEQ_MIGRATING)))
> SCX_CALL_OP_TASK(sch, SCX_KF_REST, quiescent, rq, p, deq_flags);
>
> if (deq_flags & SCX_DEQ_SLEEP)
> @@ -5495,6 +5512,11 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
> return -EINVAL;
> }
>
> + if ((ops->flags & SCX_OPS_TRACK_MIGRATION) && (!ops->runnable || !ops->quiescent)) {
> + scx_error(sch, "SCX_OPS_TRACK_MIGRATION requires ops.runnable() and ops.quiescent() to be implemented");
> + return -EINVAL;
> + }
> +
> if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT)
> pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n");
>
> --
> Henry
>
* Re: [PATCH v3] sched_ext: Implement SCX_OPS_TRACK_MIGRATION
2025-06-23 6:30 ` [PATCH v3] sched_ext: " Henry Huang
2025-06-23 6:42 ` Andrea Righi
@ 2025-06-23 17:59 ` Tejun Heo
2025-06-24 3:03 ` [PATCH v3] sched_ext: include SCX_OPS_TRACK_MIGRATION Henry Huang
1 sibling, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2025-06-23 17:59 UTC (permalink / raw)
To: Henry Huang
Cc: changwoo, arighi, void, 谈鉴锋,
Yan Yan(cailing), linux-kernel, sched-ext
Hello,
On Mon, Jun 23, 2025 at 02:30:33PM +0800, Henry Huang wrote:
> Some BPF schedulers need to do something when a task migrates, such as
> updating a per-CPU map.
> If SCX_OPS_TRACK_MIGRATION is set, runnable/quiescent are called regardless
> of whether the task is migrating.
It's rather odd to invoke runnable/quiescent on these transitions as the
runnable state isn't actually changing and the events end up triggering for
all the migration operations that SCX does internally.
In the head message (BTW, if it's just a single patch, it'd be better to
include all the context in the patch description), you said that this is
needed to update percpu data structures when tasks migrate. Wouldn't you be
able to do that by tracking whether the current CPU is different from the
previous one from ops.running()?
Thanks.
--
tejun
* Re: [PATCH v3] sched_ext: include SCX_OPS_TRACK_MIGRATION
2025-06-23 17:59 ` Tejun Heo
@ 2025-06-24 3:03 ` Henry Huang
2025-06-25 23:07 ` Tejun Heo
0 siblings, 1 reply; 7+ messages in thread
From: Henry Huang @ 2025-06-24 3:03 UTC (permalink / raw)
To: tj
Cc: arighi, changwoo, Henry Huang, 谈鉴锋,
linux-kernel, sched-ext, void, Yan Yan(cailing)
On Mon, 23 Jun 2025 07:59:54 -1000, Tejun Heo wrote:
> It's rather odd to invoke runnable/quiescent on these transitions as the
> runnable state isn't actually changing and the events end up triggering for
> all the migration operations that SCX does internally.
>
> In the head message (BTW, if it's just a single patch, it'd be better to
> include all the context in the patch description), you said that this is
> needed to update percpu data structures when tasks migrate. Wouldn't you be
> able to do that by tracking whether the current CPU is different from the
> previous one from ops.running()?
We traverse the per-CPU map information in ops.select_cpu() to select the
appropriate CPU. To reduce contention on the rq spinlock, tasks are likely
to run on the CPU selected by ops.select_cpu().
However, I can think of two scenarios where passive migration may occur:
1. set_task_allowed_cpus
2. cpu_stop
There may also be some passive migration scenarios that we haven't thought of.
This could lead to incorrect information in the per-CPU map. Therefore,
we hope to track enqueue_task_scx and dequeue_task_scx to ensure that the
information in the per-CPU map is accurate.
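A rough sketch of the kind of ops.select_cpu() this refers to (hypothetical;
it reuses the invented cpu_nr_runnable map and MAX_CPUS bound from the sketch
in the cover letter and simply picks the allowed CPU with the fewest tracked
runnable tasks):

s32 BPF_STRUCT_OPS(track_select_cpu, struct task_struct *p, s32 prev_cpu,
		   u64 wake_flags)
{
	s32 cpu, best = prev_cpu;
	u64 best_cnt = (u64)-1;

	bpf_for(cpu, 0, MAX_CPUS) {
		u32 key = cpu;
		u64 *cnt = bpf_map_lookup_elem(&cpu_nr_runnable, &key);

		/* skip CPUs with no tracked count or not allowed for p */
		if (!cnt || !bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;
		if (*cnt < best_cnt) {
			best_cnt = *cnt;
			best = cpu;
		}
	}
	return best;
}

If a migration goes untracked, these per-CPU counts end up attributed to the
wrong CPUs and the selection above degrades, which is the accuracy concern
described here.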
--
Henry
* Re: [PATCH v3] sched_ext: include SCX_OPS_TRACK_MIGRATION
2025-06-24 3:03 ` [PATCH v3] sched_ext: include SCX_OPS_TRACK_MIGRATION Henry Huang
@ 2025-06-25 23:07 ` Tejun Heo
2025-06-27 3:43 ` [PATCH v3] sched_ext: Implement SCX_OPS_TRACK_MIGRATION Henry Huang
0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2025-06-25 23:07 UTC (permalink / raw)
To: Henry Huang
Cc: arighi, changwoo, 谈鉴锋, linux-kernel,
sched-ext, void, Yan Yan(cailing)
Hello,
On Tue, Jun 24, 2025 at 11:03:14AM +0800, Henry Huang wrote:
...
> We will traverse the per-CPU map information in ops.select_cpu() to select
> the appropriate CPU. To reduce the competition for the rq spinlock, tasks are
> likely to run on the CPU selected by ops.select_cpu().
>
> However, I can think of two scenarios where passive migration may occur:
> 1. set_task_allowed_cpus
> 2. cpu_stop
What do you mean by "passive migration"? The above two cases would still
go through ops.enqueue(). There are cases where ops.select_cpu()'s return value
or the local DSQ that ops.dispatch() targeted are overridden, mostly when
the CPU goes down in between. Are you referring to those cases?
> There may also be some passive migration scenarios that we haven't thought of.
> This could lead to incorrect information in the per-CPU map. Therefore,
> we hope to track enqueue_task_scx and dequeue_task_scx to ensure that the
> information in the per-CPU map is accurate.
Even in such cases, wouldn't something like the following work?
void my_running(struct task_struct *p)
{
	struct my_task_ctx *taskc;

	if (!(taskc = lookup_task_ctx(p)))
		return;
	if (taskc->cpu != scx_bpf_task_cpu(p)) {
		/* update other stuff */
		taskc->cpu = scx_bpf_task_cpu(p);
	}
}
Thanks.
--
tejun
* Re: [PATCH v3] sched_ext: Implement SCX_OPS_TRACK_MIGRATION
2025-06-25 23:07 ` Tejun Heo
@ 2025-06-27 3:43 ` Henry Huang
0 siblings, 0 replies; 7+ messages in thread
From: Henry Huang @ 2025-06-27 3:43 UTC (permalink / raw)
To: tj
Cc: arighi, changwoo, Henry Huang, 谈鉴锋,
linux-kernel, sched-ext, void, Yan Yan(cailing)
On Wed, 25 Jun 2025 13:07:04 -1000, Tejun Heo wrote:
> Even in such cases, wouldn't something like the following work?
>
> void my_running(struct task_struct *p)
> {
> 	struct my_task_ctx *taskc;
>
> 	if (!(taskc = lookup_task_ctx(p)))
> 		return;
> 	if (taskc->cpu != scx_bpf_task_cpu(p)) {
> 		/* update other stuff */
> 		taskc->cpu = scx_bpf_task_cpu(p);
> 	}
> }
Thank you for your reply. I will try it again and will reach out via
email if I have any other questions.
--
Henry