* [RFC][PATCH 1/3] sched: Detect per-class runqueue changes
2025-10-06 10:46 [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Peter Zijlstra
@ 2025-10-06 10:46 ` Peter Zijlstra
2025-10-07 10:08 ` Juri Lelli
2025-10-16 9:33 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-10-06 10:46 ` [RFC][PATCH 2/3] sched: Add support to pick functions to take rf Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-06 10:46 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
enables easy tracking of which runqueues are modified over a
lock-break.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 2 ++
kernel/sched/deadline.c | 2 ++
kernel/sched/ext.c | 2 ++
kernel/sched/fair.c | 7 +++++--
kernel/sched/idle.c | 2 ++
kernel/sched/rt.c | 2 ++
kernel/sched/sched.h | 10 ++++++++++
kernel/sched/stop_task.c | 2 ++
8 files changed, 27 insertions(+), 2 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2089,6 +2089,7 @@ void enqueue_task(struct rq *rq, struct
*/
uclamp_rq_inc(rq, p, flags);
+ rq->queue_mask |= p->sched_class->queue_mask;
p->sched_class->enqueue_task(rq, p, flags);
psi_enqueue(p, flags);
@@ -2121,6 +2122,7 @@ inline bool dequeue_task(struct rq *rq,
* and mark the task ->sched_delayed.
*/
uclamp_rq_dec(rq, p);
+ rq->queue_mask |= p->sched_class->queue_mask;
return p->sched_class->dequeue_task(rq, p, flags);
}
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3094,6 +3094,8 @@ static int task_is_throttled_dl(struct t
DEFINE_SCHED_CLASS(dl) = {
+ .queue_mask = 8,
+
.enqueue_task = enqueue_task_dl,
.dequeue_task = dequeue_task_dl,
.yield_task = yield_task_dl,
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3234,6 +3234,8 @@ static void scx_cgroup_unlock(void) {}
* their current sched_class. Call them directly from sched core instead.
*/
DEFINE_SCHED_CLASS(ext) = {
+ .queue_mask = 1,
+
.enqueue_task = enqueue_task_scx,
.dequeue_task = dequeue_task_scx,
.yield_task = yield_task_scx,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12830,6 +12830,7 @@ static int sched_balance_newidle(struct
}
rcu_read_unlock();
+ this_rq->queue_mask = 0;
raw_spin_rq_unlock(this_rq);
t0 = sched_clock_cpu(this_cpu);
@@ -12887,8 +12888,8 @@ static int sched_balance_newidle(struct
if (this_rq->cfs.h_nr_queued && !pulled_task)
pulled_task = 1;
- /* Is there a task of a high priority class? */
- if (this_rq->nr_running != this_rq->cfs.h_nr_queued)
+ /* If a higher prio class was modified, restart the pick */
+ if (this_rq->queue_mask & ~((fair_sched_class.queue_mask << 1)-1))
pulled_task = -1;
out:
@@ -13623,6 +13624,8 @@ static unsigned int get_rr_interval_fair
*/
DEFINE_SCHED_CLASS(fair) = {
+ .queue_mask = 2,
+
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -522,6 +522,8 @@ static void update_curr_idle(struct rq *
*/
DEFINE_SCHED_CLASS(idle) = {
+ .queue_mask = 0,
+
/* no enqueue/yield_task for idle tasks */
/* dequeue is not valid, we print a debug message there: */
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2571,6 +2571,8 @@ static int task_is_throttled_rt(struct t
DEFINE_SCHED_CLASS(rt) = {
+ .queue_mask = 4,
+
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1118,6 +1118,7 @@ struct rq {
/* runqueue lock: */
raw_spinlock_t __lock;
+ unsigned int queue_mask;
unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -2414,6 +2415,15 @@ struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled;
#endif
+ /*
+ * idle: 0
+ * ext: 1
+ * fair: 2
+ * rt: 4
+ * dl: 8
+ * stop: 16
+ */
+ unsigned int queue_mask;
/*
* move_queued_task/activate_task/enqueue_task: rq->lock
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -99,6 +99,8 @@ static void update_curr_stop(struct rq *
*/
DEFINE_SCHED_CLASS(stop) = {
+ .queue_mask = 16,
+
.enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop,
.yield_task = yield_task_stop,
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Detect per-class runqueue changes
2025-10-06 10:46 ` [RFC][PATCH 1/3] sched: Detect per-class runqueue changes Peter Zijlstra
@ 2025-10-07 10:08 ` Juri Lelli
2025-10-07 10:16 ` Peter Zijlstra
2025-10-16 9:33 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2025-10-07 10:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx
Hi Peter,
On 06/10/25 12:46, Peter Zijlstra wrote:
> Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
> enables easy tracking of which runqueues are modified over a
> lock-break.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
Nice.
> @@ -12887,8 +12888,8 @@ static int sched_balance_newidle(struct
> if (this_rq->cfs.h_nr_queued && !pulled_task)
> pulled_task = 1;
>
> - /* Is there a task of a high priority class? */
> - if (this_rq->nr_running != this_rq->cfs.h_nr_queued)
> + /* If a higher prio class was modified, restart the pick */
> + if (this_rq->queue_mask & ~((fair_sched_class.queue_mask << 1)-1))
> pulled_task = -1;
Does this however want a self-documenting inline helper or macro to make
it even more clear? If this is always going to be the only caller maybe
not so much.
Thanks,
Juri
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Detect per-class runqueue changes
2025-10-07 10:08 ` Juri Lelli
@ 2025-10-07 10:16 ` Peter Zijlstra
2025-10-07 10:26 ` Juri Lelli
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-07 10:16 UTC (permalink / raw)
To: Juri Lelli
Cc: tj, linux-kernel, mingo, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx
On Tue, Oct 07, 2025 at 12:08:03PM +0200, Juri Lelli wrote:
> Hi Peter,
>
> On 06/10/25 12:46, Peter Zijlstra wrote:
> > Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
> > enables easy tracking of which runqueues are modified over a
> > lock-break.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
>
> Nice.
>
> > @@ -12887,8 +12888,8 @@ static int sched_balance_newidle(struct
> > if (this_rq->cfs.h_nr_queued && !pulled_task)
> > pulled_task = 1;
> >
> > - /* Is there a task of a high priority class? */
> > - if (this_rq->nr_running != this_rq->cfs.h_nr_queued)
> > + /* If a higher prio class was modified, restart the pick */
> > + if (this_rq->queue_mask & ~((fair_sched_class.queue_mask << 1)-1))
> > pulled_task = -1;
>
> Does this however want a self-documenting inline helper or macro to make
> it even more clear? If this is always going to be the only caller maybe
> not so much.
There's another one in patch 3. I suppose we can do that. Maybe
something like:
static inline bool rq_modified_above(struct rq *rq, struct sched_class *class)
{
unsigned int mask = class->queue_mask;
return rq->queue_mask & ~((mask << 1) - 1);
}
This then writes the above like:
if (rq_modified_above(this_rq, &fair_sched_class))
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Detect per-class runqueue changes
2025-10-07 10:16 ` Peter Zijlstra
@ 2025-10-07 10:26 ` Juri Lelli
0 siblings, 0 replies; 23+ messages in thread
From: Juri Lelli @ 2025-10-07 10:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx
On 07/10/25 12:16, Peter Zijlstra wrote:
> On Tue, Oct 07, 2025 at 12:08:03PM +0200, Juri Lelli wrote:
> > Hi Peter,
> >
> > On 06/10/25 12:46, Peter Zijlstra wrote:
> > > Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
> > > enables easy tracking of which runqueues are modified over a
> > > lock-break.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> >
> > Nice.
> >
> > > @@ -12887,8 +12888,8 @@ static int sched_balance_newidle(struct
> > > if (this_rq->cfs.h_nr_queued && !pulled_task)
> > > pulled_task = 1;
> > >
> > > - /* Is there a task of a high priority class? */
> > > - if (this_rq->nr_running != this_rq->cfs.h_nr_queued)
> > > + /* If a higher prio class was modified, restart the pick */
> > > + if (this_rq->queue_mask & ~((fair_sched_class.queue_mask << 1)-1))
> > > pulled_task = -1;
> >
> > Does this however want a self-documenting inline helper or macro to make
> > it even more clear? If this is always going to be the only caller maybe
> > not so much.
>
> There's another one in patch 3. I suppose we can do that. Maybe
> something like:
>
> static inline bool rq_modified_above(struct rq *rq, struct sched_class *class)
> {
> unsigned int mask = class->queue_mask;
> return rq->queue_mask & ~((mask << 1) - 1);
> }
>
> This then writes the above like:
>
> if (rq_modified_above(this_rq, &fair_sched_class))
>
Yeah. Maybe also add a "check rq::queue_mask comment for additional
details" or something like this.
Thanks!
Juri
^ permalink raw reply [flat|nested] 23+ messages in thread
* [tip: sched/core] sched: Detect per-class runqueue changes
2025-10-06 10:46 ` [RFC][PATCH 1/3] sched: Detect per-class runqueue changes Peter Zijlstra
2025-10-07 10:08 ` Juri Lelli
@ 2025-10-16 9:33 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 23+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-10-16 9:33 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Tejun Heo, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 1e900f415c6082cd4bcdae4c92515d21fb389473
Gitweb: https://git.kernel.org/tip/1e900f415c6082cd4bcdae4c92515d21fb389473
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 01 Oct 2025 15:50:15 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 16 Oct 2025 11:13:55 +02:00
sched: Detect per-class runqueue changes
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
enables easy tracking of which runqueues are modified over a
lock-break.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/core.c | 2 ++
kernel/sched/deadline.c | 2 ++
kernel/sched/ext.c | 2 ++
kernel/sched/fair.c | 7 +++++--
kernel/sched/idle.c | 2 ++
kernel/sched/rt.c | 2 ++
kernel/sched/sched.h | 25 +++++++++++++++++++++++++
kernel/sched/stop_task.c | 2 ++
8 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e2199e4..9fc990f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2089,6 +2089,7 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
*/
uclamp_rq_inc(rq, p, flags);
+ rq->queue_mask |= p->sched_class->queue_mask;
p->sched_class->enqueue_task(rq, p, flags);
psi_enqueue(p, flags);
@@ -2121,6 +2122,7 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
* and mark the task ->sched_delayed.
*/
uclamp_rq_dec(rq, p);
+ rq->queue_mask |= p->sched_class->queue_mask;
return p->sched_class->dequeue_task(rq, p, flags);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1f94994..83e6175 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3092,6 +3092,8 @@ static int task_is_throttled_dl(struct task_struct *p, int cpu)
DEFINE_SCHED_CLASS(dl) = {
+ .queue_mask = 8,
+
.enqueue_task = enqueue_task_dl,
.dequeue_task = dequeue_task_dl,
.yield_task = yield_task_dl,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5717042..949c3a6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3234,6 +3234,8 @@ static void scx_cgroup_unlock(void) {}
* their current sched_class. Call them directly from sched core instead.
*/
DEFINE_SCHED_CLASS(ext) = {
+ .queue_mask = 1,
+
.enqueue_task = enqueue_task_scx,
.dequeue_task = dequeue_task_scx,
.yield_task = yield_task_scx,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77a713e..23ac05c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12841,6 +12841,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
}
rcu_read_unlock();
+ rq_modified_clear(this_rq);
raw_spin_rq_unlock(this_rq);
t0 = sched_clock_cpu(this_cpu);
@@ -12898,8 +12899,8 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
if (this_rq->cfs.h_nr_queued && !pulled_task)
pulled_task = 1;
- /* Is there a task of a high priority class? */
- if (this_rq->nr_running != this_rq->cfs.h_nr_queued)
+ /* If a higher prio class was modified, restart the pick */
+ if (rq_modified_above(this_rq, &fair_sched_class))
pulled_task = -1;
out:
@@ -13633,6 +13634,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
*/
DEFINE_SCHED_CLASS(fair) = {
+ .queue_mask = 2,
+
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index dee6e01..055b0dd 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -521,6 +521,8 @@ static void update_curr_idle(struct rq *rq)
*/
DEFINE_SCHED_CLASS(idle) = {
+ .queue_mask = 0,
+
/* no enqueue/yield_task for idle tasks */
/* dequeue is not valid, we print a debug message there: */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c2347e4..9bc828d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2569,6 +2569,8 @@ static int task_is_throttled_rt(struct task_struct *p, int cpu)
DEFINE_SCHED_CLASS(rt) = {
+ .queue_mask = 4,
+
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e3d2710..f4a3230 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1118,6 +1118,8 @@ struct rq {
/* runqueue lock: */
raw_spinlock_t __lock;
+ /* Per class runqueue modification mask; bits in class order. */
+ unsigned int queue_mask;
unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -2414,6 +2416,15 @@ struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled;
#endif
+ /*
+ * idle: 0
+ * ext: 1
+ * fair: 2
+ * rt: 4
+ * dl: 8
+ * stop: 16
+ */
+ unsigned int queue_mask;
/*
* move_queued_task/activate_task/enqueue_task: rq->lock
@@ -2571,6 +2582,20 @@ struct sched_class {
#endif
};
+/*
+ * Does not nest; only used around sched_class::pick_task() rq-lock-breaks.
+ */
+static inline void rq_modified_clear(struct rq *rq)
+{
+ rq->queue_mask = 0;
+}
+
+static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class)
+{
+ unsigned int mask = class->queue_mask;
+ return rq->queue_mask & ~((mask << 1) - 1);
+}
+
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->donor != prev);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 73aa8de..d98c453 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -98,6 +98,8 @@ static void update_curr_stop(struct rq *rq)
*/
DEFINE_SCHED_CLASS(stop) = {
+ .queue_mask = 16,
+
.enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop,
.yield_task = yield_task_stop,
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-06 10:46 [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Peter Zijlstra
2025-10-06 10:46 ` [RFC][PATCH 1/3] sched: Detect per-class runqueue changes Peter Zijlstra
@ 2025-10-06 10:46 ` Peter Zijlstra
2025-10-08 13:16 ` Vincent Guittot
2025-10-16 9:33 ` [tip: sched/core] " tip-bot2 for Joel Fernandes
2025-10-06 10:46 ` [RFC][PATCH 3/3] sched/ext: Fold balance_scx() into pick_task_scx() Peter Zijlstra
` (2 subsequent siblings)
4 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-06 10:46 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx, Joel Fernandes
From: Joel Fernandes <joelagnelf@nvidia.com>
Some pick functions like the internal pick_next_task_fair() already take
rf but some others don't. We need this for scx's server pick function.
Prepare for this by having pick functions accept it.
[peterz: - added RETRY_TASK handling
- removed pick_next_task_fair indirection]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
SubmissionLink: https://lkml.kernel.org/r/20250809184800.129831-6-joelagnelf@nvidia.com
---
include/linux/sched.h | 7 ++-----
kernel/sched/core.c | 35 ++++++++++++++++++++++++++---------
kernel/sched/deadline.c | 8 ++++----
kernel/sched/ext.c | 2 +-
kernel/sched/fair.c | 16 ++++++----------
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 10 ++++++----
kernel/sched/stop_task.c | 2 +-
9 files changed, 48 insertions(+), 36 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -637,8 +637,8 @@ struct sched_rt_entity {
#endif
} __randomize_layout;
-typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
-typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+struct rq_flags;
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *, struct rq_flags *rf);
struct sched_dl_entity {
struct rb_node rb_node;
@@ -730,9 +730,6 @@ struct sched_dl_entity {
* dl_server_update().
*
* @rq the runqueue this server is for
- *
- * @server_has_tasks() returns true if @server_pick return a
- * runnable task.
*/
struct rq *rq;
dl_server_pick_f server_pick_task;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5901,7 +5901,7 @@ __pick_next_task(struct rq *rq, struct t
/* Assume the next prioritized class is idle_sched_class */
if (!p) {
- p = pick_task_idle(rq);
+ p = pick_task_idle(rq, rf);
put_prev_set_next_task(rq, prev, p);
}
@@ -5913,11 +5913,15 @@ __pick_next_task(struct rq *rq, struct t
for_each_active_class(class) {
if (class->pick_next_task) {
- p = class->pick_next_task(rq, prev);
+ p = class->pick_next_task(rq, prev, rf);
+ if (unlikely(p == RETRY_TASK))
+ goto restart;
if (p)
return p;
} else {
- p = class->pick_task(rq);
+ p = class->pick_task(rq, rf);
+ if (unlikely(p == RETRY_TASK))
+ goto restart;
if (p) {
put_prev_set_next_task(rq, prev, p);
return p;
@@ -5947,7 +5951,11 @@ static inline bool cookie_match(struct t
return a->core_cookie == b->core_cookie;
}
-static inline struct task_struct *pick_task(struct rq *rq)
+/*
+ * Careful; this can return RETRY_TASK, it does not include the retry-loop
+ * itself due to the whole SMT pick retry thing below.
+ */
+static inline struct task_struct *pick_task(struct rq *rq, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
@@ -5955,7 +5963,7 @@ static inline struct task_struct *pick_t
rq->dl_server = NULL;
for_each_active_class(class) {
- p = class->pick_task(rq);
+ p = class->pick_task(rq, rf);
if (p)
return p;
}
@@ -5970,7 +5978,7 @@ static void queue_core_balance(struct rq
static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- struct task_struct *next, *p, *max = NULL;
+ struct task_struct *next, *p, *max;
const struct cpumask *smt_mask;
bool fi_before = false;
bool core_clock_updated = (rq == rq->core);
@@ -6055,7 +6063,10 @@ pick_next_task(struct rq *rq, struct tas
* and there are no cookied tasks running on siblings.
*/
if (!need_sync) {
- next = pick_task(rq);
+restart_single:
+ next = pick_task(rq, rf);
+ if (unlikely(next == RETRY_TASK))
+ goto restart_single;
if (!next->core_cookie) {
rq->core_pick = NULL;
rq->core_dl_server = NULL;
@@ -6075,6 +6086,8 @@ pick_next_task(struct rq *rq, struct tas
*
* Tie-break prio towards the current CPU
*/
+restart_multi:
+ max = NULL;
for_each_cpu_wrap(i, smt_mask, cpu) {
rq_i = cpu_rq(i);
@@ -6086,7 +6099,11 @@ pick_next_task(struct rq *rq, struct tas
if (i != cpu && (rq_i != rq->core || !core_clock_updated))
update_rq_clock(rq_i);
- rq_i->core_pick = p = pick_task(rq_i);
+ p = pick_task(rq_i, rf);
+ if (unlikely(p == RETRY_TASK))
+ goto restart_multi;
+
+ rq_i->core_pick = p;
rq_i->core_dl_server = rq_i->dl_server;
if (!max || prio_less(max, p, fi_before))
@@ -6108,7 +6125,7 @@ pick_next_task(struct rq *rq, struct tas
if (cookie)
p = sched_core_find(rq_i, cookie);
if (!p)
- p = idle_sched_class.pick_task(rq_i);
+ p = idle_sched_class.pick_task(rq_i, rf);
}
rq_i->core_pick = p;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2352,7 +2352,7 @@ static struct sched_dl_entity *pick_next
* __pick_next_task_dl - Helper to pick the next -deadline task to run.
* @rq: The runqueue to pick the next task from.
*/
-static struct task_struct *__pick_task_dl(struct rq *rq)
+static struct task_struct *__pick_task_dl(struct rq *rq, struct rq_flags *rf)
{
struct sched_dl_entity *dl_se;
struct dl_rq *dl_rq = &rq->dl;
@@ -2366,7 +2366,7 @@ static struct task_struct *__pick_task_d
WARN_ON_ONCE(!dl_se);
if (dl_server(dl_se)) {
- p = dl_se->server_pick_task(dl_se);
+ p = dl_se->server_pick_task(dl_se, rf);
if (!p) {
dl_server_stop(dl_se);
goto again;
@@ -2379,9 +2379,9 @@ static struct task_struct *__pick_task_d
return p;
}
-static struct task_struct *pick_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq, struct rq_flags *rf)
{
- return __pick_task_dl(rq);
+ return __pick_task_dl(rq, rf);
}
static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_struct *next)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2332,7 +2332,7 @@ static struct task_struct *first_local_t
struct task_struct, scx.dsq_list.node);
}
-static struct task_struct *pick_task_scx(struct rq *rq)
+static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
{
struct task_struct *prev = rq->curr;
struct task_struct *p;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8822,7 +8822,7 @@ static void check_preempt_wakeup_fair(st
resched_curr_lazy(rq);
}
-static struct task_struct *pick_task_fair(struct rq *rq)
+static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
@@ -8866,7 +8866,7 @@ pick_next_task_fair(struct rq *rq, struc
int new_tasks;
again:
- p = pick_task_fair(rq);
+ p = pick_task_fair(rq, rf);
if (!p)
goto idle;
se = &p->se;
@@ -8945,14 +8945,10 @@ pick_next_task_fair(struct rq *rq, struc
return NULL;
}
-static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev)
+static struct task_struct *
+fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
{
- return pick_next_task_fair(rq, prev, NULL);
-}
-
-static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
-{
- return pick_task_fair(dl_se->rq);
+ return pick_task_fair(dl_se->rq, rf);
}
void fair_server_init(struct rq *rq)
@@ -13632,7 +13628,7 @@ DEFINE_SCHED_CLASS(fair) = {
.wakeup_preempt = check_preempt_wakeup_fair,
.pick_task = pick_task_fair,
- .pick_next_task = __pick_next_task_fair,
+ .pick_next_task = pick_next_task_fair,
.put_prev_task = put_prev_task_fair,
.set_next_task = set_next_task_fair,
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -466,7 +466,7 @@ static void set_next_task_idle(struct rq
next->se.exec_start = rq_clock_task(rq);
}
-struct task_struct *pick_task_idle(struct rq *rq)
+struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf)
{
scx_update_idle(rq, true, false);
return rq->idle;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1695,7 +1695,7 @@ static struct task_struct *_pick_next_ta
return rt_task_of(rt_se);
}
-static struct task_struct *pick_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq, struct rq_flags *rf)
{
struct task_struct *p;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2467,7 +2467,7 @@ struct sched_class {
/*
* schedule/pick_next_task: rq->lock
*/
- struct task_struct *(*pick_task)(struct rq *rq);
+ struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf);
/*
* Optional! When implemented pick_next_task() should be equivalent to:
*
@@ -2477,7 +2477,8 @@ struct sched_class {
* set_next_task_first(next);
* }
*/
- struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
+ struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf);
/*
* sched_change:
@@ -2690,8 +2691,9 @@ static inline bool sched_fair_runnable(s
return rq->cfs.nr_queued > 0;
}
-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
-extern struct task_struct *pick_task_idle(struct rq *rq);
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf);
+extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
#define SCA_CHECK 0x01
#define SCA_MIGRATE_DISABLE 0x02
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -32,7 +32,7 @@ static void set_next_task_stop(struct rq
stop->se.exec_start = rq_clock_task(rq);
}
-static struct task_struct *pick_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq, struct rq_flags *rf)
{
if (!sched_stop_runnable(rq))
return NULL;
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-06 10:46 ` [RFC][PATCH 2/3] sched: Add support to pick functions to take rf Peter Zijlstra
@ 2025-10-08 13:16 ` Vincent Guittot
2025-10-08 13:58 ` Peter Zijlstra
2025-10-16 9:33 ` [tip: sched/core] " tip-bot2 for Joel Fernandes
1 sibling, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-10-08 13:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, longman, hannes, mkoutny, void,
arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Mon, 6 Oct 2025 at 12:57, Peter Zijlstra <peterz@infradead.org> wrote:
>
> From: Joel Fernandes <joelagnelf@nvidia.com>
>
> Some pick functions like the internal pick_next_task_fair() already take
> rf but some others don't. We need this for scx's server pick function.
> Prepare for this by having pick functions accept it.
>
> [peterz: - added RETRY_TASK handling
> - removed pick_next_task_fair indirection]
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> SubmissionLink: https://lkml.kernel.org/r/20250809184800.129831-6-joelagnelf@nvidia.com
> ---
> include/linux/sched.h | 7 ++-----
> kernel/sched/core.c | 35 ++++++++++++++++++++++++++---------
> kernel/sched/deadline.c | 8 ++++----
> kernel/sched/ext.c | 2 +-
> kernel/sched/fair.c | 16 ++++++----------
> kernel/sched/idle.c | 2 +-
> kernel/sched/rt.c | 2 +-
> kernel/sched/sched.h | 10 ++++++----
> kernel/sched/stop_task.c | 2 +-
> 9 files changed, 48 insertions(+), 36 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -637,8 +637,8 @@ struct sched_rt_entity {
> #endif
> } __randomize_layout;
>
> -typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
> -typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
> +struct rq_flags;
> +typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *, struct rq_flags *rf);
>
> struct sched_dl_entity {
> struct rb_node rb_node;
> @@ -730,9 +730,6 @@ struct sched_dl_entity {
> * dl_server_update().
> *
> * @rq the runqueue this server is for
> - *
> - * @server_has_tasks() returns true if @server_pick return a
> - * runnable task.
> */
> struct rq *rq;
> dl_server_pick_f server_pick_task;
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5901,7 +5901,7 @@ __pick_next_task(struct rq *rq, struct t
>
> /* Assume the next prioritized class is idle_sched_class */
> if (!p) {
> - p = pick_task_idle(rq);
> + p = pick_task_idle(rq, rf);
> put_prev_set_next_task(rq, prev, p);
> }
>
> @@ -5913,11 +5913,15 @@ __pick_next_task(struct rq *rq, struct t
>
> for_each_active_class(class) {
> if (class->pick_next_task) {
> - p = class->pick_next_task(rq, prev);
> + p = class->pick_next_task(rq, prev, rf);
> + if (unlikely(p == RETRY_TASK))
> + goto restart;
> if (p)
> return p;
> } else {
> - p = class->pick_task(rq);
> + p = class->pick_task(rq, rf);
> + if (unlikely(p == RETRY_TASK))
> + goto restart;
> if (p) {
> put_prev_set_next_task(rq, prev, p);
> return p;
> @@ -5947,7 +5951,11 @@ static inline bool cookie_match(struct t
> return a->core_cookie == b->core_cookie;
> }
>
> -static inline struct task_struct *pick_task(struct rq *rq)
> +/*
> + * Careful; this can return RETRY_TASK, it does not include the retry-loop
> + * itself due to the whole SMT pick retry thing below.
> + */
> +static inline struct task_struct *pick_task(struct rq *rq, struct rq_flags *rf)
> {
> const struct sched_class *class;
> struct task_struct *p;
> @@ -5955,7 +5963,7 @@ static inline struct task_struct *pick_t
> rq->dl_server = NULL;
>
> for_each_active_class(class) {
> - p = class->pick_task(rq);
> + p = class->pick_task(rq, rf);
> if (p)
> return p;
> }
> @@ -5970,7 +5978,7 @@ static void queue_core_balance(struct rq
> static struct task_struct *
> pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> - struct task_struct *next, *p, *max = NULL;
> + struct task_struct *next, *p, *max;
> const struct cpumask *smt_mask;
> bool fi_before = false;
> bool core_clock_updated = (rq == rq->core);
> @@ -6055,7 +6063,10 @@ pick_next_task(struct rq *rq, struct tas
> * and there are no cookied tasks running on siblings.
> */
> if (!need_sync) {
> - next = pick_task(rq);
> +restart_single:
> + next = pick_task(rq, rf);
> + if (unlikely(next == RETRY_TASK))
> + goto restart_single;
> if (!next->core_cookie) {
> rq->core_pick = NULL;
> rq->core_dl_server = NULL;
> @@ -6075,6 +6086,8 @@ pick_next_task(struct rq *rq, struct tas
> *
> * Tie-break prio towards the current CPU
> */
> +restart_multi:
> + max = NULL;
> for_each_cpu_wrap(i, smt_mask, cpu) {
> rq_i = cpu_rq(i);
>
> @@ -6086,7 +6099,11 @@ pick_next_task(struct rq *rq, struct tas
> if (i != cpu && (rq_i != rq->core || !core_clock_updated))
> update_rq_clock(rq_i);
>
> - rq_i->core_pick = p = pick_task(rq_i);
> + p = pick_task(rq_i, rf);
> + if (unlikely(p == RETRY_TASK))
> + goto restart_multi;
> +
> + rq_i->core_pick = p;
> rq_i->core_dl_server = rq_i->dl_server;
>
> if (!max || prio_less(max, p, fi_before))
> @@ -6108,7 +6125,7 @@ pick_next_task(struct rq *rq, struct tas
> if (cookie)
> p = sched_core_find(rq_i, cookie);
> if (!p)
> - p = idle_sched_class.pick_task(rq_i);
> + p = idle_sched_class.pick_task(rq_i, rf);
> }
>
> rq_i->core_pick = p;
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2352,7 +2352,7 @@ static struct sched_dl_entity *pick_next
> * __pick_next_task_dl - Helper to pick the next -deadline task to run.
> * @rq: The runqueue to pick the next task from.
> */
> -static struct task_struct *__pick_task_dl(struct rq *rq)
> +static struct task_struct *__pick_task_dl(struct rq *rq, struct rq_flags *rf)
> {
> struct sched_dl_entity *dl_se;
> struct dl_rq *dl_rq = &rq->dl;
> @@ -2366,7 +2366,7 @@ static struct task_struct *__pick_task_d
> WARN_ON_ONCE(!dl_se);
>
> if (dl_server(dl_se)) {
> - p = dl_se->server_pick_task(dl_se);
> + p = dl_se->server_pick_task(dl_se, rf);
> if (!p) {
> dl_server_stop(dl_se);
> goto again;
> @@ -2379,9 +2379,9 @@ static struct task_struct *__pick_task_d
> return p;
> }
>
> -static struct task_struct *pick_task_dl(struct rq *rq)
> +static struct task_struct *pick_task_dl(struct rq *rq, struct rq_flags *rf)
> {
> - return __pick_task_dl(rq);
> + return __pick_task_dl(rq, rf);
> }
>
> static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_struct *next)
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2332,7 +2332,7 @@ static struct task_struct *first_local_t
> struct task_struct, scx.dsq_list.node);
> }
>
> -static struct task_struct *pick_task_scx(struct rq *rq)
> +static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
> {
> struct task_struct *prev = rq->curr;
> struct task_struct *p;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8822,7 +8822,7 @@ static void check_preempt_wakeup_fair(st
> resched_curr_lazy(rq);
> }
>
> -static struct task_struct *pick_task_fair(struct rq *rq)
> +static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
> {
> struct sched_entity *se;
> struct cfs_rq *cfs_rq;
> @@ -8866,7 +8866,7 @@ pick_next_task_fair(struct rq *rq, struc
> int new_tasks;
>
> again:
> - p = pick_task_fair(rq);
> + p = pick_task_fair(rq, rf);
> if (!p)
> goto idle;
> se = &p->se;
> @@ -8945,14 +8945,10 @@ pick_next_task_fair(struct rq *rq, struc
> return NULL;
> }
>
> -static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> +static struct task_struct *
> +fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> {
> - return pick_next_task_fair(rq, prev, NULL);
The special case of a NULL rf pointer is used to skip
sched_balance_newidle() at the end of pick_next_task_fair() in the
pick_next_task() slow path, when prev_balance() has already done it.
This means that it will be called twice if prev is not a fair task.
While reviewing this series, I also noticed an older issue that we
have with the check for PELT lost idle time [1].
[1] https://lore.kernel.org/all/20251008131214.3759798-1-vincent.guittot@linaro.org/
We can't use rf in pick_next_task_fair() to skip
sched_balance_newidle(); we would need a dedicated parameter.
> -}
> -
> -static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
> -{
> - return pick_task_fair(dl_se->rq);
> + return pick_task_fair(dl_se->rq, rf);
> }
>
> void fair_server_init(struct rq *rq)
> @@ -13632,7 +13628,7 @@ DEFINE_SCHED_CLASS(fair) = {
> .wakeup_preempt = check_preempt_wakeup_fair,
>
> .pick_task = pick_task_fair,
> - .pick_next_task = __pick_next_task_fair,
> + .pick_next_task = pick_next_task_fair,
> .put_prev_task = put_prev_task_fair,
> .set_next_task = set_next_task_fair,
>
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -466,7 +466,7 @@ static void set_next_task_idle(struct rq
> next->se.exec_start = rq_clock_task(rq);
> }
>
> -struct task_struct *pick_task_idle(struct rq *rq)
> +struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf)
> {
> scx_update_idle(rq, true, false);
> return rq->idle;
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1695,7 +1695,7 @@ static struct task_struct *_pick_next_ta
> return rt_task_of(rt_se);
> }
>
> -static struct task_struct *pick_task_rt(struct rq *rq)
> +static struct task_struct *pick_task_rt(struct rq *rq, struct rq_flags *rf)
> {
> struct task_struct *p;
>
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2467,7 +2467,7 @@ struct sched_class {
> /*
> * schedule/pick_next_task: rq->lock
> */
> - struct task_struct *(*pick_task)(struct rq *rq);
> + struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf);
> /*
> * Optional! When implemented pick_next_task() should be equivalent to:
> *
> @@ -2477,7 +2477,8 @@ struct sched_class {
> * set_next_task_first(next);
> * }
> */
> - struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
> + struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
> + struct rq_flags *rf);
>
> /*
> * sched_change:
> @@ -2690,8 +2691,9 @@ static inline bool sched_fair_runnable(s
> return rq->cfs.nr_queued > 0;
> }
>
> -extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
> -extern struct task_struct *pick_task_idle(struct rq *rq);
> +extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
> + struct rq_flags *rf);
> +extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
>
> #define SCA_CHECK 0x01
> #define SCA_MIGRATE_DISABLE 0x02
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -32,7 +32,7 @@ static void set_next_task_stop(struct rq
> stop->se.exec_start = rq_clock_task(rq);
> }
>
> -static struct task_struct *pick_task_stop(struct rq *rq)
> +static struct task_struct *pick_task_stop(struct rq *rq, struct rq_flags *rf)
> {
> if (!sched_stop_runnable(rq))
> return NULL;
>
>
^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-08 13:16 ` Vincent Guittot
@ 2025-10-08 13:58 ` Peter Zijlstra
2025-10-08 15:22 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-08 13:58 UTC (permalink / raw)
To: Vincent Guittot
Cc: tj, linux-kernel, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, longman, hannes, mkoutny, void,
arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Wed, Oct 08, 2025 at 03:16:58PM +0200, Vincent Guittot wrote:
> > +static struct task_struct *
> > +fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> > {
> > - return pick_next_task_fair(rq, prev, NULL);
>
> The special case of a NULL rf pointer is used to skip
> sched_balance_newidle() at the end of pick_next_task_fair() in the
> pick_next_task() slow path when prev_balance() has already done it. This means
> that it will be called twice if prev is not a fair task.
Oh right. I suppose we can simply remove balance_fair.
> While reviewing this series, I also noticed an older issue that we
> have with the check for PELT lost idle time [1].
> [1] https://lore.kernel.org/all/20251008131214.3759798-1-vincent.guittot@linaro.org/
Let me go have a look.
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-08 13:58 ` Peter Zijlstra
@ 2025-10-08 15:22 ` Vincent Guittot
2025-10-08 20:34 ` Tejun Heo
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-10-08 15:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, longman, hannes, mkoutny, void,
arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Wed, 8 Oct 2025 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 08, 2025 at 03:16:58PM +0200, Vincent Guittot wrote:
>
> > > +static struct task_struct *
> > > +fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> > > {
> > > - return pick_next_task_fair(rq, prev, NULL);
> >
> > The special case of a NULL rf pointer is used to skip
> > sched_balance_newidle() at the end of pick_next_task_fair() in the
> > pick_next_task() slow path when prev_balance() has already done it. This means
> > that it will be called twice if prev is not a fair task.
>
> Oh right. I suppose we can simply remove balance_fair.
That was the option I also had in mind, but it changes the current
behavior and I'm afraid the sched_ext people will complain.
Currently, if prev is sched_ext, we don't call the higher classes'
.balance(), which includes the fair class path
balance_fair() -> sched_balance_newidle(). If we now always call
sched_balance_newidle() at the end of pick_next_task_fair(), we will
try to pull a fair task at each schedule between sched_ext tasks.
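Vincent's concern can be modeled in a few lines. The sketch below is plain userspace C with illustrative names (old_runs_newidle, new_runs_newidle, CLASS_*) that are assumptions for this example, not kernel APIs: it contrasts the old prev_balance()-gated path with an unconditional newidle attempt at the end of pick_next_task_fair().

```c
/*
 * Hedged sketch, NOT kernel code: models when a fair newidle balance
 * runs for a CPU whose previous task was a sched_ext task. All names
 * here are illustrative assumptions, not real kernel APIs.
 */
#include <assert.h>
#include <stdbool.h>

enum class_id { CLASS_DL, CLASS_FAIR, CLASS_EXT };

/*
 * Old scheme: prev_balance() walks ->balance() from prev's class down,
 * so with a sched_ext prev the fair callback (balance_fair ->
 * sched_balance_newidle) is never reached.
 */
static bool old_runs_newidle(enum class_id prev_class, int nr_fair_queued)
{
	if (prev_class == CLASS_EXT)
		return false;		/* balance walk never reaches fair */
	return nr_fair_queued == 0;	/* no fair task: try newidle */
}

/*
 * Proposed scheme: pick_next_task_fair() always tries a newidle balance
 * when it finds no fair task, regardless of prev's class, hence the
 * concern about extra balancing between sched_ext tasks.
 */
static bool new_runs_newidle(enum class_id prev_class, int nr_fair_queued)
{
	(void)prev_class;
	return nr_fair_queued == 0;
}
```

With a sched_ext prev and an empty cfs runqueue, the old scheme stays put while the new one tries to pull fair work on every schedule.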
>
> > While reviewing this series, I also noticed an older issue that we
> > have with the check for PELT lost idle time [1].
> > [1] https://lore.kernel.org/all/20251008131214.3759798-1-vincent.guittot@linaro.org/
>
> Let me go have a look.
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-08 15:22 ` Vincent Guittot
@ 2025-10-08 20:34 ` Tejun Heo
2025-10-09 7:17 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-10-08 20:34 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, linux-kernel, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
Hello,
On Wed, Oct 08, 2025 at 05:22:42PM +0200, Vincent Guittot wrote:
> On Wed, 8 Oct 2025 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Oct 08, 2025 at 03:16:58PM +0200, Vincent Guittot wrote:
> >
> > > > +static struct task_struct *
> > > > +fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> > > > {
> > > > - return pick_next_task_fair(rq, prev, NULL);
> > >
> > > The special case of a NULL rf pointer is used to skip
> > > sched_balance_newidle() at the end of pick_next_task_fair() in the
> > > pick_next_task() slow path when prev_balance() has already done it. This means
> > > that it will be called twice if prev is not a fair task.
> >
> > Oh right. I suppose we can simply remove balance_fair.
>
> That was the option that I also had in mind but this will change from
> current behavior and I'm afraid that sched_ext people will complain.
> Currently, if prev is sched_ext, we don't call higher class.balance()
> which includes the fair class balance_fair->sched_balance_newidle. If
> we now always call sched_balance_newidle() at the end of
> pick_next_task_fair(), we will try to pull a fair task at each
> schedule between sched_ext tasks
If we pass in @prev into pick(), can't pick() decide whether to newidle
balance or not based on that?
Thanks.
--
tejun
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-08 20:34 ` Tejun Heo
@ 2025-10-09 7:17 ` Vincent Guittot
2025-10-13 11:04 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-10-09 7:17 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, linux-kernel, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Wed, 8 Oct 2025 at 22:34, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 08, 2025 at 05:22:42PM +0200, Vincent Guittot wrote:
> > On Wed, 8 Oct 2025 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, Oct 08, 2025 at 03:16:58PM +0200, Vincent Guittot wrote:
> > >
> > > > > +static struct task_struct *
> > > > > +fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> > > > > {
> > > > > - return pick_next_task_fair(rq, prev, NULL);
> > > >
> > > > The special case of a NULL rf pointer is used to skip
> > > > sched_balance_newidle() at the end of pick_next_task_fair() in the
> > > > pick_next_task() slow path when prev_balance() has already done it. This means
> > > > that it will be called twice if prev is not a fair task.
> > >
> > > Oh right. I suppose we can simply remove balance_fair.
> >
> > That was the option that I also had in mind but this will change from
> > current behavior and I'm afraid that sched_ext people will complain.
> > Currently, if prev is sched_ext, we don't call higher class.balance()
> > which includes the fair class balance_fair->sched_balance_newidle. If
> > we now always call sched_balance_newidle() at the end of
> > pick_next_task_fair(), we will try to pull a fair task at each
> > schedule between sched_ext tasks
>
> If we pass in @prev into pick(), can't pick() decide whether to newidle
> balance or not based on that?
The problem is that with dl_server, you can have a prev of a lower prio
but still want to run a newidle balance:
- a cfs task is preempted by the dl server
- the cfs task migrates to another cpu
- we want to run a newidle balance when the cpu becomes idle
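As a minimal sketch (userspace C, hypothetical names, assuming a simple enum of classes): a heuristic that decides purely from prev's class misses exactly this dl_server scenario, where prev is in the deadline class but the CPU still wants a fair newidle balance.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins, not kernel types. */
enum class_id { CLASS_DL, CLASS_FAIR, CLASS_EXT };

/*
 * Hypothetical prev-based heuristic: only run a newidle balance when
 * the previous task was a fair task. This is the idea being discussed,
 * not anything in the kernel.
 */
static bool prev_based_newidle(enum class_id prev_class)
{
	return prev_class == CLASS_FAIR;
}
```

In the scenario above prev is the dl server (CLASS_DL), so prev_based_newidle() returns false even though a cfs task just left the CPU and a newidle balance is wanted.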
>
> Thanks.
>
> --
> tejun
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-09 7:17 ` Vincent Guittot
@ 2025-10-13 11:04 ` Peter Zijlstra
2025-10-13 11:09 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-13 11:04 UTC (permalink / raw)
To: Vincent Guittot
Cc: Tejun Heo, linux-kernel, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Thu, Oct 09, 2025 at 09:17:44AM +0200, Vincent Guittot wrote:
> On Wed, 8 Oct 2025 at 22:34, Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Oct 08, 2025 at 05:22:42PM +0200, Vincent Guittot wrote:
> > > On Wed, 8 Oct 2025 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > On Wed, Oct 08, 2025 at 03:16:58PM +0200, Vincent Guittot wrote:
> > > >
> > > > > > +static struct task_struct *
> > > > > > +fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> > > > > > {
> > > > > > - return pick_next_task_fair(rq, prev, NULL);
> > > > >
> > > > > The special case of a NULL rf pointer is used to skip
> > > > > sched_balance_newidle() at the end of pick_next_task_fair() in the
> > > > > pick_next_task() slow path when prev_balance() has already done it. This means
> > > > > that it will be called twice if prev is not a fair task.
> > > >
> > > > Oh right. I suppose we can simply remove balance_fair.
> > >
> > > That was the option that I also had in mind but this will change from
> > > current behavior and I'm afraid that sched_ext people will complain.
> > > Currently, if prev is sched_ext, we don't call higher class.balance()
> > > which includes the fair class balance_fair->sched_balance_newidle. If
> > > we now always call sched_balance_newidle() at the end of
> > > pick_next_task_fair(), we will try to pull a fair task at each
> > > schedule between sched_ext tasks
> >
> > If we pass in @prev into pick(), can't pick() decide whether to newidle
> > balance or not based on that?
>
> The problem is that with dl_server, you can have a prev of a lower prio
> but still want to run a newidle balance:
> - a cfs task is preempted by the dl server
> - the cfs task migrates to another cpu
> - we want to run a newidle balance when the cpu becomes idle
Bah; so yeah, this new behaviour is better for indeed always calling
newidle when it is needed, but you're also right that in case of ext
this might not be ideal.
So I have a pile of newidle hacks here:
https://lkml.kernel.org/r/20251010170937.GG4067720@noisy.programming.kicks-ass.net
and while I don't particularly like NI_SPARE (the has_spare_tasks thing
is fickle); the idea seems to have some merit for this situation --
where we know we'll not be having fair tasks at all.
I mean, we can always do something like this to sched_balance_newidle():
if (scx_switched_all())
return 0;
Not pretty, but should do the job.
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-13 11:04 ` Peter Zijlstra
@ 2025-10-13 11:09 ` Peter Zijlstra
2025-10-13 13:06 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-13 11:09 UTC (permalink / raw)
To: Vincent Guittot
Cc: Tejun Heo, linux-kernel, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Mon, Oct 13, 2025 at 01:04:49PM +0200, Peter Zijlstra wrote:
> Bah; so yeah, this new behaviour is better for indeed always calling
> newidle when it is needed, but you're also right that in case of ext
> this might not be ideal.
>
> So I have a pile of newidle hacks here:
>
> https://lkml.kernel.org/r/20251010170937.GG4067720@noisy.programming.kicks-ass.net
>
> and while I don't particularly like NI_SPARE (the has_spare_tasks thing
> is fickle); the idea seems to have some merit for this situation --
> where we know we'll not be having fair tasks at all.
>
> I mean, we can always do something like this to sched_balance_newidle():
>
> if (scx_switched_all())
> return 0;
>
> Not pretty, but should do the job.
Oh, never mind, none of this is needed.
__pick_next_task()
if (scx_enabled())
goto restart;
...
restart:
for_each_active_class(class) {
...
}
And then we have next_active_class() skip fair_sched_class entirely when
scx_switched_all().
So in the common ext case, we'll not hit pick_next_task_fair() at all.
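A hedged sketch of that iterator idea (userspace C; the enum, flag, and function here are illustrative assumptions, not the real kernel implementation, which builds the iteration differently): when every task has been switched to sched_ext, the active-class walk steps straight from rt to ext, so pick_next_task_fair() is never reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative class order, highest priority first; not kernel types. */
enum class_id {
	CLASS_STOP, CLASS_DL, CLASS_RT, CLASS_FAIR, CLASS_EXT, CLASS_IDLE
};

/* Stand-in for scx_switched_all(): all tasks run under sched_ext. */
static bool switched_all;

/* Next class in priority order, skipping fair when it cannot have tasks. */
static enum class_id next_active_class(enum class_id c)
{
	do {
		c = (enum class_id)(c + 1);
	} while (c == CLASS_FAIR && switched_all);
	return c;
}
```

So stepping from CLASS_RT yields CLASS_EXT when switched_all is set, and CLASS_FAIR otherwise, matching the behavior described above.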
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-13 11:09 ` Peter Zijlstra
@ 2025-10-13 13:06 ` Vincent Guittot
2025-10-13 17:20 ` Tejun Heo
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-10-13 13:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, linux-kernel, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Mon, 13 Oct 2025 at 13:09, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Oct 13, 2025 at 01:04:49PM +0200, Peter Zijlstra wrote:
>
> > Bah; so yeah, this new behaviour is better for indeed always calling
> > newidle when it is needed, but you're also right that in case of ext
> > this might not be ideal.
> >
> > So I have a pile of newidle hacks here:
> >
> > https://lkml.kernel.org/r/20251010170937.GG4067720@noisy.programming.kicks-ass.net
> >
> > and while I don't particularly like NI_SPARE (the has_spare_tasks thing
> > is fickle); the idea seems to have some merit for this situation --
> > where we know we'll not be having fair tasks at all.
> >
> > I mean, we can always do something like this to sched_balance_newidle():
> >
> > if (scx_switched_all())
> > return 0;
> >
> > Not pretty, but should do the job.
>
> Oh, never mind, none of this is needed.
>
> __pick_next_task()
>
> if (scx_enabled())
> goto restart;
>
> ...
> restart:
> for_each_active_class(class) {
> ...
> }
>
>
> And then we have next_active_class() skip fair_sched_class entirely when
> scx_switched_all().
Ah yes, you're right. fair is not called in the case of scx_switched_all().
>
> So in the common ext case, we'll not hit pick_next_task_fair() at all.
* Re: [RFC][PATCH 2/3] sched: Add support to pick functions to take rf
2025-10-13 13:06 ` Vincent Guittot
@ 2025-10-13 17:20 ` Tejun Heo
0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2025-10-13 17:20 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, linux-kernel, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, longman, hannes, mkoutny,
void, arighi, changwoo, cgroups, sched-ext, liuwenfang, tglx,
Joel Fernandes
On Mon, Oct 13, 2025 at 03:06:33PM +0200, Vincent Guittot wrote:
...
> > And then we have next_active_class() skip fair_sched_class entirely when
> > scx_switched_all().
>
> Ah yes you're right. fair is not called in case of scx_switched_all()
Yeah, when switched_all this isn't an issue at all. In partial mode, the extra
newidle balance may be noticeable, but I don't think many people are using partial
mode, so we can worry about it when it becomes a problem.
Thanks.
--
tejun
* [tip: sched/core] sched: Add support to pick functions to take rf
2025-10-06 10:46 ` [RFC][PATCH 2/3] sched: Add support to pick functions to take rf Peter Zijlstra
2025-10-08 13:16 ` Vincent Guittot
@ 2025-10-16 9:33 ` tip-bot2 for Joel Fernandes
1 sibling, 0 replies; 23+ messages in thread
From: tip-bot2 for Joel Fernandes @ 2025-10-16 9:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Joel Fernandes, Peter Zijlstra (Intel), Tejun Heo, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 50653216e4ff7a74c95b2ee9ec439916875556ec
Gitweb: https://git.kernel.org/tip/50653216e4ff7a74c95b2ee9ec439916875556ec
Author: Joel Fernandes <joelagnelf@nvidia.com>
AuthorDate: Sat, 09 Aug 2025 14:47:50 -04:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 16 Oct 2025 11:13:55 +02:00
sched: Add support to pick functions to take rf
Some pick functions like the internal pick_next_task_fair() already take
rf but some others don't. We need this for scx's server pick function.
Prepare for this by having pick functions accept it.
[peterz: - added RETRY_TASK handling
- removed pick_next_task_fair indirection]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched.h | 7 ++-----
kernel/sched/core.c | 35 ++++++++++++++++++++++++++---------
kernel/sched/deadline.c | 8 ++++----
kernel/sched/ext.c | 2 +-
kernel/sched/fair.c | 26 ++++++--------------------
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 10 ++++++----
kernel/sched/stop_task.c | 2 +-
9 files changed, 48 insertions(+), 46 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77426c3..0757647 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -637,8 +637,8 @@ struct sched_rt_entity {
#endif
} __randomize_layout;
-typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
-typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+struct rq_flags;
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *, struct rq_flags *rf);
struct sched_dl_entity {
struct rb_node rb_node;
@@ -730,9 +730,6 @@ struct sched_dl_entity {
* dl_server_update().
*
* @rq the runqueue this server is for
- *
- * @server_has_tasks() returns true if @server_pick return a
- * runnable task.
*/
struct rq *rq;
dl_server_pick_f server_pick_task;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9fc990f..a75d456 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5901,7 +5901,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
/* Assume the next prioritized class is idle_sched_class */
if (!p) {
- p = pick_task_idle(rq);
+ p = pick_task_idle(rq, rf);
put_prev_set_next_task(rq, prev, p);
}
@@ -5913,11 +5913,15 @@ restart:
for_each_active_class(class) {
if (class->pick_next_task) {
- p = class->pick_next_task(rq, prev);
+ p = class->pick_next_task(rq, prev, rf);
+ if (unlikely(p == RETRY_TASK))
+ goto restart;
if (p)
return p;
} else {
- p = class->pick_task(rq);
+ p = class->pick_task(rq, rf);
+ if (unlikely(p == RETRY_TASK))
+ goto restart;
if (p) {
put_prev_set_next_task(rq, prev, p);
return p;
@@ -5947,7 +5951,11 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}
-static inline struct task_struct *pick_task(struct rq *rq)
+/*
+ * Careful; this can return RETRY_TASK, it does not include the retry-loop
+ * itself due to the whole SMT pick retry thing below.
+ */
+static inline struct task_struct *pick_task(struct rq *rq, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
@@ -5955,7 +5963,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
rq->dl_server = NULL;
for_each_active_class(class) {
- p = class->pick_task(rq);
+ p = class->pick_task(rq, rf);
if (p)
return p;
}
@@ -5970,7 +5978,7 @@ static void queue_core_balance(struct rq *rq);
static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- struct task_struct *next, *p, *max = NULL;
+ struct task_struct *next, *p, *max;
const struct cpumask *smt_mask;
bool fi_before = false;
bool core_clock_updated = (rq == rq->core);
@@ -6055,7 +6063,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* and there are no cookied tasks running on siblings.
*/
if (!need_sync) {
- next = pick_task(rq);
+restart_single:
+ next = pick_task(rq, rf);
+ if (unlikely(next == RETRY_TASK))
+ goto restart_single;
if (!next->core_cookie) {
rq->core_pick = NULL;
rq->core_dl_server = NULL;
@@ -6075,6 +6086,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*
* Tie-break prio towards the current CPU
*/
+restart_multi:
+ max = NULL;
for_each_cpu_wrap(i, smt_mask, cpu) {
rq_i = cpu_rq(i);
@@ -6086,7 +6099,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (i != cpu && (rq_i != rq->core || !core_clock_updated))
update_rq_clock(rq_i);
- rq_i->core_pick = p = pick_task(rq_i);
+ p = pick_task(rq_i, rf);
+ if (unlikely(p == RETRY_TASK))
+ goto restart_multi;
+
+ rq_i->core_pick = p;
rq_i->core_dl_server = rq_i->dl_server;
if (!max || prio_less(max, p, fi_before))
@@ -6108,7 +6125,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (cookie)
p = sched_core_find(rq_i, cookie);
if (!p)
- p = idle_sched_class.pick_task(rq_i);
+ p = idle_sched_class.pick_task(rq_i, rf);
}
rq_i->core_pick = p;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 83e6175..48357d4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2352,7 +2352,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct dl_rq *dl_rq)
* __pick_next_task_dl - Helper to pick the next -deadline task to run.
* @rq: The runqueue to pick the next task from.
*/
-static struct task_struct *__pick_task_dl(struct rq *rq)
+static struct task_struct *__pick_task_dl(struct rq *rq, struct rq_flags *rf)
{
struct sched_dl_entity *dl_se;
struct dl_rq *dl_rq = &rq->dl;
@@ -2366,7 +2366,7 @@ again:
WARN_ON_ONCE(!dl_se);
if (dl_server(dl_se)) {
- p = dl_se->server_pick_task(dl_se);
+ p = dl_se->server_pick_task(dl_se, rf);
if (!p) {
dl_server_stop(dl_se);
goto again;
@@ -2379,9 +2379,9 @@ again:
return p;
}
-static struct task_struct *pick_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq, struct rq_flags *rf)
{
- return __pick_task_dl(rq);
+ return __pick_task_dl(rq, rf);
}
static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_struct *next)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 949c3a6..dc743ca 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2332,7 +2332,7 @@ static struct task_struct *first_local_task(struct rq *rq)
struct task_struct, scx.dsq_list.node);
}
-static struct task_struct *pick_task_scx(struct rq *rq)
+static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
{
struct task_struct *prev = rq->curr;
struct task_struct *p;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ac05c..2554055 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8705,15 +8705,6 @@ static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context
set_task_max_allowed_capacity(p);
}
-static int
-balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
-{
- if (sched_fair_runnable(rq))
- return 1;
-
- return sched_balance_newidle(rq, rf) != 0;
-}
-
static void set_next_buddy(struct sched_entity *se)
{
for_each_sched_entity(se) {
@@ -8822,7 +8813,7 @@ preempt:
resched_curr_lazy(rq);
}
-static struct task_struct *pick_task_fair(struct rq *rq)
+static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
@@ -8866,7 +8857,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
int new_tasks;
again:
- p = pick_task_fair(rq);
+ p = pick_task_fair(rq, rf);
if (!p)
goto idle;
se = &p->se;
@@ -8945,14 +8936,10 @@ idle:
return NULL;
}
-static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev)
-{
- return pick_next_task_fair(rq, prev, NULL);
-}
-
-static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
+static struct task_struct *
+fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
{
- return pick_task_fair(dl_se->rq);
+ return pick_task_fair(dl_se->rq, rf);
}
void fair_server_init(struct rq *rq)
@@ -13644,11 +13631,10 @@ DEFINE_SCHED_CLASS(fair) = {
.wakeup_preempt = check_preempt_wakeup_fair,
.pick_task = pick_task_fair,
- .pick_next_task = __pick_next_task_fair,
+ .pick_next_task = pick_next_task_fair,
.put_prev_task = put_prev_task_fair,
.set_next_task = set_next_task_fair,
- .balance = balance_fair,
.select_task_rq = select_task_rq_fair,
.migrate_task_rq = migrate_task_rq_fair,
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 055b0dd..7fa0b59 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -466,7 +466,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
next->se.exec_start = rq_clock_task(rq);
}
-struct task_struct *pick_task_idle(struct rq *rq)
+struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf)
{
scx_update_idle(rq, true, false);
return rq->idle;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9bc828d..1fd97f2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1695,7 +1695,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
return rt_task_of(rt_se);
}
-static struct task_struct *pick_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq, struct rq_flags *rf)
{
struct task_struct *p;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f4a3230..8946294 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2470,7 +2470,7 @@ struct sched_class {
/*
* schedule/pick_next_task: rq->lock
*/
- struct task_struct *(*pick_task)(struct rq *rq);
+ struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf);
/*
* Optional! When implemented pick_next_task() should be equivalent to:
*
@@ -2480,7 +2480,8 @@ struct sched_class {
* set_next_task_first(next);
* }
*/
- struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
+ struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf);
/*
* sched_change:
@@ -2707,8 +2708,9 @@ static inline bool sched_fair_runnable(struct rq *rq)
return rq->cfs.nr_queued > 0;
}
-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
-extern struct task_struct *pick_task_idle(struct rq *rq);
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf);
+extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
#define SCA_CHECK 0x01
#define SCA_MIGRATE_DISABLE 0x02
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index d98c453..4f9192b 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -32,7 +32,7 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
stop->se.exec_start = rq_clock_task(rq);
}
-static struct task_struct *pick_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq, struct rq_flags *rf)
{
if (!sched_stop_runnable(rq))
return NULL;
* [RFC][PATCH 3/3] sched/ext: Fold balance_scx() into pick_task_scx()
2025-10-06 10:46 [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Peter Zijlstra
2025-10-06 10:46 ` [RFC][PATCH 1/3] sched: Detect per-class runqueue changes Peter Zijlstra
2025-10-06 10:46 ` [RFC][PATCH 2/3] sched: Add support to pick functions to take rf Peter Zijlstra
@ 2025-10-06 10:46 ` Peter Zijlstra
2025-10-16 9:33 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-10-07 12:40 ` [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Christian Loehle
2025-10-07 21:48 ` Tejun Heo
4 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-06 10:46 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
With pick_task() having an rf argument, it is possible to do the
lock-break there and get rid of the weird balance/pick_task hack.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 13 --------
kernel/sched/ext.c | 78 +++++++--------------------------------------------
kernel/sched/sched.h | 1 -
3 files changed, 12 insertions(+), 80 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5845,19 +5845,6 @@ static void prev_balance(struct rq *rq,
const struct sched_class *start_class = prev->sched_class;
const struct sched_class *class;
-#ifdef CONFIG_SCHED_CLASS_EXT
- /*
- * SCX requires a balance() call before every pick_task() including when
- * waking up from SCHED_IDLE. If @start_class is below SCX, start from
- * SCX instead. Also, set a flag to detect missing balance() call.
- */
- if (scx_enabled()) {
- rq->scx.flags |= SCX_RQ_BAL_PENDING;
- if (sched_class_above(&ext_sched_class, start_class))
- start_class = &ext_sched_class;
- }
-#endif
-
/*
* We must do the balancing pass before put_prev_task(), such
* that when we release the rq->lock the task is in the same
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2013,7 +2013,7 @@ static int balance_one(struct rq *rq, st
lockdep_assert_rq_held(rq);
rq->scx.flags |= SCX_RQ_IN_BALANCE;
- rq->scx.flags &= ~(SCX_RQ_BAL_PENDING | SCX_RQ_BAL_KEEP);
+ rq->scx.flags &= ~SCX_RQ_BAL_KEEP;
if ((sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT) &&
unlikely(rq->scx.cpu_released)) {
@@ -2119,40 +2119,6 @@ static int balance_one(struct rq *rq, st
return true;
}
-static int balance_scx(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf)
-{
- int ret;
-
- rq_unpin_lock(rq, rf);
-
- ret = balance_one(rq, prev);
-
-#ifdef CONFIG_SCHED_SMT
- /*
- * When core-sched is enabled, this ops.balance() call will be followed
- * by pick_task_scx() on this CPU and the SMT siblings. Balance the
- * siblings too.
- */
- if (sched_core_enabled(rq)) {
- const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
- int scpu;
-
- for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) {
- struct rq *srq = cpu_rq(scpu);
- struct task_struct *sprev = srq->curr;
-
- WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
- update_rq_clock(srq);
- balance_one(srq, sprev);
- }
- }
-#endif
- rq_repin_lock(rq, rf);
-
- return ret;
-}
-
static void process_ddsp_deferred_locals(struct rq *rq)
{
struct task_struct *p;
@@ -2335,38 +2301,19 @@ static struct task_struct *first_local_t
static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
{
struct task_struct *prev = rq->curr;
+ bool keep_prev, kick_idle = false;
struct task_struct *p;
- bool keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
- bool kick_idle = false;
- /*
- * WORKAROUND:
- *
- * %SCX_RQ_BAL_KEEP should be set iff $prev is on SCX as it must just
- * have gone through balance_scx(). Unfortunately, there currently is a
- * bug where fair could say yes on balance() but no on pick_task(),
- * which then ends up calling pick_task_scx() without preceding
- * balance_scx().
- *
- * Keep running @prev if possible and avoid stalling from entering idle
- * without balancing.
- *
- * Once fair is fixed, remove the workaround and trigger WARN_ON_ONCE()
- * if pick_task_scx() is called without preceding balance_scx().
- */
- if (unlikely(rq->scx.flags & SCX_RQ_BAL_PENDING)) {
- if (prev->scx.flags & SCX_TASK_QUEUED) {
- keep_prev = true;
- } else {
- keep_prev = false;
- kick_idle = true;
- }
- } else if (unlikely(keep_prev &&
- prev->sched_class != &ext_sched_class)) {
- /*
- * Can happen while enabling as SCX_RQ_BAL_PENDING assertion is
- * conditional on scx_enabled() and may have been skipped.
- */
+ rq->queue_mask = 0;
+ rq_unpin_lock(rq, rf);
+ balance_one(rq, prev);
+ rq_repin_lock(rq, rf);
+ if (rq->queue_mask & ~((ext_sched_class.queue_mask << 1)-1))
+ return RETRY_TASK;
+
+ keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
+ if (unlikely(keep_prev &&
+ prev->sched_class != &ext_sched_class)) {
WARN_ON_ONCE(scx_enable_state() == SCX_ENABLED);
keep_prev = false;
}
@@ -3243,7 +3190,6 @@ DEFINE_SCHED_CLASS(ext) = {
.wakeup_preempt = wakeup_preempt_scx,
- .balance = balance_scx,
.pick_task = pick_task_scx,
.put_prev_task = put_prev_task_scx,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -779,7 +779,6 @@ enum scx_rq_flags {
*/
SCX_RQ_ONLINE = 1 << 0,
SCX_RQ_CAN_STOP_TICK = 1 << 1,
- SCX_RQ_BAL_PENDING = 1 << 2, /* balance hasn't run yet */
SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
SCX_RQ_BYPASSING = 1 << 4,
SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
* [tip: sched/core] sched/ext: Fold balance_scx() into pick_task_scx()
2025-10-06 10:46 ` [RFC][PATCH 3/3] sched/ext: Fold balance_scx() into pick_task_scx() Peter Zijlstra
@ 2025-10-16 9:33 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-10-16 9:33 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Tejun Heo, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 4c95380701f58b8112f0b891de8d160e4199e19d
Gitweb: https://git.kernel.org/tip/4c95380701f58b8112f0b891de8d160e4199e19d
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 01 Oct 2025 20:41:36 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 16 Oct 2025 11:13:55 +02:00
sched/ext: Fold balance_scx() into pick_task_scx()
With pick_task() having an rf argument, it is possible to do the
lock-break there and get rid of the weird balance/pick_task hack.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/core.c | 13 +-------
kernel/sched/ext.c | 78 ++++++-------------------------------------
kernel/sched/sched.h | 1 +-
3 files changed, 12 insertions(+), 80 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a75d456..096e8d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5845,19 +5845,6 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
const struct sched_class *start_class = prev->sched_class;
const struct sched_class *class;
-#ifdef CONFIG_SCHED_CLASS_EXT
- /*
- * SCX requires a balance() call before every pick_task() including when
- * waking up from SCHED_IDLE. If @start_class is below SCX, start from
- * SCX instead. Also, set a flag to detect missing balance() call.
- */
- if (scx_enabled()) {
- rq->scx.flags |= SCX_RQ_BAL_PENDING;
- if (sched_class_above(&ext_sched_class, start_class))
- start_class = &ext_sched_class;
- }
-#endif
-
/*
* We must do the balancing pass before put_prev_task(), such
* that when we release the rq->lock the task is in the same
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index dc743ca..49f4a9e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2013,7 +2013,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
lockdep_assert_rq_held(rq);
rq->scx.flags |= SCX_RQ_IN_BALANCE;
- rq->scx.flags &= ~(SCX_RQ_BAL_PENDING | SCX_RQ_BAL_KEEP);
+ rq->scx.flags &= ~SCX_RQ_BAL_KEEP;
if ((sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT) &&
unlikely(rq->scx.cpu_released)) {
@@ -2119,40 +2119,6 @@ has_tasks:
return true;
}
-static int balance_scx(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf)
-{
- int ret;
-
- rq_unpin_lock(rq, rf);
-
- ret = balance_one(rq, prev);
-
-#ifdef CONFIG_SCHED_SMT
- /*
- * When core-sched is enabled, this ops.balance() call will be followed
- * by pick_task_scx() on this CPU and the SMT siblings. Balance the
- * siblings too.
- */
- if (sched_core_enabled(rq)) {
- const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
- int scpu;
-
- for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) {
- struct rq *srq = cpu_rq(scpu);
- struct task_struct *sprev = srq->curr;
-
- WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
- update_rq_clock(srq);
- balance_one(srq, sprev);
- }
- }
-#endif
- rq_repin_lock(rq, rf);
-
- return ret;
-}
-
static void process_ddsp_deferred_locals(struct rq *rq)
{
struct task_struct *p;
@@ -2335,38 +2301,19 @@ static struct task_struct *first_local_task(struct rq *rq)
static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf)
{
struct task_struct *prev = rq->curr;
+ bool keep_prev, kick_idle = false;
struct task_struct *p;
- bool keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
- bool kick_idle = false;
- /*
- * WORKAROUND:
- *
- * %SCX_RQ_BAL_KEEP should be set iff $prev is on SCX as it must just
- * have gone through balance_scx(). Unfortunately, there currently is a
- * bug where fair could say yes on balance() but no on pick_task(),
- * which then ends up calling pick_task_scx() without preceding
- * balance_scx().
- *
- * Keep running @prev if possible and avoid stalling from entering idle
- * without balancing.
- *
- * Once fair is fixed, remove the workaround and trigger WARN_ON_ONCE()
- * if pick_task_scx() is called without preceding balance_scx().
- */
- if (unlikely(rq->scx.flags & SCX_RQ_BAL_PENDING)) {
- if (prev->scx.flags & SCX_TASK_QUEUED) {
- keep_prev = true;
- } else {
- keep_prev = false;
- kick_idle = true;
- }
- } else if (unlikely(keep_prev &&
- prev->sched_class != &ext_sched_class)) {
- /*
- * Can happen while enabling as SCX_RQ_BAL_PENDING assertion is
- * conditional on scx_enabled() and may have been skipped.
- */
+ rq_modified_clear(rq);
+ rq_unpin_lock(rq, rf);
+ balance_one(rq, prev);
+ rq_repin_lock(rq, rf);
+ if (rq_modified_above(rq, &ext_sched_class))
+ return RETRY_TASK;
+
+ keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
+ if (unlikely(keep_prev &&
+ prev->sched_class != &ext_sched_class)) {
WARN_ON_ONCE(scx_enable_state() == SCX_ENABLED);
keep_prev = false;
}
@@ -3243,7 +3190,6 @@ DEFINE_SCHED_CLASS(ext) = {
.wakeup_preempt = wakeup_preempt_scx,
- .balance = balance_scx,
.pick_task = pick_task_scx,
.put_prev_task = put_prev_task_scx,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8946294..e7718f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -779,7 +779,6 @@ enum scx_rq_flags {
*/
SCX_RQ_ONLINE = 1 << 0,
SCX_RQ_CAN_STOP_TICK = 1 << 1,
- SCX_RQ_BAL_PENDING = 1 << 2, /* balance hasn't run yet */
SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
SCX_RQ_BYPASSING = 1 << 4,
SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
* Re: [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx()
2025-10-06 10:46 [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Peter Zijlstra
` (2 preceding siblings ...)
2025-10-06 10:46 ` [RFC][PATCH 3/3] sched/ext: Fold balance_scx() into pick_task_scx() Peter Zijlstra
@ 2025-10-07 12:40 ` Christian Loehle
2025-10-07 21:48 ` Tejun Heo
4 siblings, 0 replies; 23+ messages in thread
From: Christian Loehle @ 2025-10-07 12:40 UTC (permalink / raw)
To: Peter Zijlstra, tj
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On 10/6/25 11:46, Peter Zijlstra wrote:
> Hi,
>
> So I had a poke at 'give @rf to pick_task() and fold balance_scx() into
> pick_task_scx()' option to see how terrible it was. Turns out, not terrible at
> all.
>
> I've run the sched_ext selftest and stress-ng --race-sched 0 thing with various
> scx_* thingies on.
>
> These patches were done on top of the 'sched_change' patches posted just now:
>
> https://lkml.kernel.org/r/20251006104402.946760805@infradead.org
>
> The combined set is also available here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup-pick
>
FWIW I didn't run into any issues either running some scx scheduler + workload
combinations.
* Re: [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx()
2025-10-06 10:46 [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Peter Zijlstra
` (3 preceding siblings ...)
2025-10-07 12:40 ` [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Christian Loehle
@ 2025-10-07 21:48 ` Tejun Heo
2025-10-08 9:11 ` Peter Zijlstra
4 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-10-07 21:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Mon, Oct 06, 2025 at 12:46:52PM +0200, Peter Zijlstra wrote:
> Hi,
>
> So I had a poke at 'give @rf to pick_task() and fold balance_scx() into
> pick_task_scx()' option to see how terrible it was. Turns out, not terrible at
> all.
>
> I've run the sched_ext selftest and stress-ng --race-sched 0 thing with various
> scx_* thingies on.
This is great. I was thinking that I needed to call pick_task() of other
classes to detect the retry conditions, but yeah, enqueue() must be the
triggering event and this is way neater. Does this mean that balance() can
be dropped from other classes too?
For the whole series:
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
* Re: [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx()
2025-10-07 21:48 ` Tejun Heo
@ 2025-10-08 9:11 ` Peter Zijlstra
2025-10-08 20:21 ` Tejun Heo
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-10-08 9:11 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx, alexei.starovoitov
On Tue, Oct 07, 2025 at 11:48:15AM -1000, Tejun Heo wrote:
> On Mon, Oct 06, 2025 at 12:46:52PM +0200, Peter Zijlstra wrote:
> > Hi,
> >
> > So I had a poke at 'give @rf to pick_task() and fold balance_scx() into
> > pick_task_scx()' option to see how terrible it was. Turns out, not terrible at
> > all.
> >
> > I've run the sched_ext selftest and stress-ng --race-sched 0 thing with various
> > scx_* thingies on.
>
> This is great. I was thinking that I needed to call pick_task() of other
> classes to detect the retry conditions but yeah enqueue() must be the
> triggering event and this is way neater.
:-)
> Does this mean that balance() can be dropped from other classes too?
Possibly. But let's stick that on a todo list :-) I was also looking to
get rid of sched_class::pick_next_task() -- the only reason it currently
still exists is that fair cgroup nonsense.
I was looking at bringing back Rik's flatten series (easier now that the
cfs bandwidth throttle thing is fixed). That should get rid of that
hierarchical pick; and thus obviate that whole mess in
pick_next_task_fair().
> For the whole series:
>
> Acked-by: Tejun Heo <tj@kernel.org>
>
Thanks!
Now, I do have a few questions from staring at this ext stuff for a
while:
- So the 'RT' problem with balance_one() is due to:
o rq-lock-break allowing higher prio task to come in
o placing a task on the local dsq
which results in that local-dsq getting 'starved'. The patches want
to call switch_class(), which I suppose will work, this is something
like:
if (rq_modified_above(rq, &ext_sched_class)) {
/*
* We don't have a next task at this point, but can guess at its
* class based on the highest set bit in the queue_mask.
*/
scx_cpu_release(reason_from_mask(rq->queue_mask), NULL);
return RETRY_TASK;
}
But I was thinking that we could also just stick that task back onto
some global dsq, right? (presumably the one we just pulled it from is
a good target). This would effectively 'undo' the balance_one().
- finish_dispatch() -- one of the problems that I ran into with the
shared rq lock implementation is that pick_task_scx() will in fact
try and enqueue on a non-local dsq.
The callchain is something like:
pick_task_scx()
bpf__sched_ext_ops_dispatch()
scx_bpf_dsq_move_to_local()
flush_dispatch_buf()
dispatch_enqueue() // dsq->id != SCX_DSQ_LOCAL
And this completely messes up the locking -- I'm not sure how to fix
this yet. But adding this flush to do random other things to various
code paths really complicates things. Per the function what we really
want to do is move-to-local, but then we end up doing random other
things instead :/
- finish_dispatch() -- per the above problem I read this function and
found that:
"the BPF scheduler is allowed to dispatch tasks spuriously"
and I had to go and buy a new WTF'o'meter again :/ Why would you
allow such a thing? Detect the case because the BPF thing is
untrusted and can do crazy things, sure. But then kill it dead; don't
try and excuse badness.
- scx_bpf_dsq_move_to_local() found per the above problem, but per its
comment it is possible BPF calls this with its own locks held. This
then results in:
CPU1
try_to_wake_up()
rq_lock();
enqueue_task() := enqueue_task_scx()
bpf__sched_ext_ops_something_or_other()
your bpf area lock thing
// rq->lock
// bpf area lock
CPU2
bpf__sched_ext_whatever()
bpf area lock
scx_bpf_move_to_local()
rq_lock()
// bpf area lock
// rq->lock
and we have a deadlock -- I thought BPF was supposed to be safe? And
while the recent rqspinlock has a timeout, and there the bpf validator
knows a spinlock exists and can prohibit kernel helper calls, this
bpf area lock you have has no such things (afaict) and BPF can
completely mess up the kernel. How is this okay?
* Re: [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx()
2025-10-08 9:11 ` Peter Zijlstra
@ 2025-10-08 20:21 ` Tejun Heo
0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2025-10-08 20:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx, alexei.starovoitov
Hello,
On Wed, Oct 08, 2025 at 11:11:51AM +0200, Peter Zijlstra wrote:
...
> Possibly. But lets stick that on a todo list :-) I was also looking to
> get rid of sched_class::pick_next_task() -- the only reason that
> currently still exists is because of that fair cgroup nonsense.
>
> I was looking at bringing back Rik's flatten series (easier now that the
> cfs bandwidth throttle thing is fixed). That should get rid of that
> hierarchical pick; and thus obviate that whole mess in
> pick_next_task_fair().
Yeah, that'd be great. IIRC, the blocking issue was dealing with thundering
herds. ISTR Rik discussing a way to resolve that.
> Now, I do have a few questions from staring at this ext stuff for a
> while:
>
> - So the 'RT' problem with balance_one() is due to:
>
> o rq-lock-break allowing higher prio task to come in
> o placing a task on the local dsq
>
> which results in that local-dsq getting 'starved'. The patches want
> to call switch_class(), which I suppose will work, this is something
> like:
>
> if (rq_modified_above(rq, &ext_sched_class)) {
> /*
> * We don't have a next task at this point, but can guess at its
> * class based on the highest set bit in the queue_mask.
> */
> scx_cpu_release(reason_from_mask(rq->queue_mask), NULL);
> return RETRY_TASK;
> }
>
> But I was thinking that we could also just stick that task back onto
> some global dsq, right? (presumably the one we just pulled it from is
> a good target). This would effectively 'undo' the balance_one().
From ops.cpu_release(), the BPF scheduler can call scx_bpf_reenqueue_local()
which runs the enqueue path again for the queued tasks so that the scheduler
can decide what to do with them, including sticking them into a global DSQ
or putting them back on the source DSQ. Yeah, it may make sense to make some
built-in actions available through ops flags.
> - finish_dispatch() -- one of the problems that I ran into with the
> shared rq lock implementation is that pick_task_scx() will in fact
> try and enqueue on a non-local dsq.
>
> The callchain is something like:
>
> pick_task_scx()
> bpf__sched_ext_ops_dispatch()
> scx_bpf_dsq_move_to_local()
> flush_dispatch_buf()
> dispatch_enqueue() // dsq->id != SCX_DSQ_LOCAL
>
> And this completely messes up the locking -- I'm not sure how to fix
> this yet. But adding this flush to do random other things to various
> code paths really complicates things. Per the function what we really
> want to do is move-to-local, but then we end up doing random other
> things instead :/
Hmm... this is deferred execution of tasks being dispatched to non-local
DSQs. The reason why they're deferred is to avoid entanglement with BPF side
synchronization, although BPF side locking developed in a different way than
we were thinking back then, so we may be able to remove the deferred
operations. Need to think more about that.
However, I'm not sure that would change much w.r.t. how it interacts with
core locking. While deferred, it's deferred within ops.dispatch(). ie.
ops.dispatch() is allowed to move any task to any DSQ. Whether that's
deferred or not within ops.dispatch() shouldn't make much difference.
> - finish_dispatch() -- per the above problem I read this function and
> found that:
>
> "the BPF scheduler is allowed to dispatch tasks spuriously"
>
> and I had to go and buy a new WTF'o'meter again :/ Why would you
> allow such a thing? Detect the case because the BPF thing is
> untrusted and can do crazy things, sure. But then kill it dead; don't
> try and excuse badness.
This came to be because the BPF data structures that we were playing with
earlier didn't easily support arbitrary element removals. The kernel side
needs to be able to reliably detect whether the dispatch attempt is allowed
or not anyway, so, as a workaround, instead of aborting the scheduler, I
just made it ignore spurious attempts. With new arena based data structures,
this shouldn't be a problem anymore, and it'd make sense to make this
stricter.
> - scx_bpf_dsq_move_to_local() found per the above problem, but per its
> comment it is possible BPF calls this with its own locks held. This
> then results in:
>
> CPU1
>
> try_to_wake_up()
> rq_lock();
> enqueue_task() := enqueue_task_scx()
> bpf__sched_ext_ops_something_or_other()
> your bpf area lock thing
>
> // rq->lock
> // bpf area lock
>
> CPU2
> bpf__sched_ext_whatever()
> bpf area lock
> scx_bpf_move_to_local()
> rq_lock()
>
> // bpf area lock
> // rq->lock
>
> and we have a deadlock -- I thought BPF was supposed to be safe? And
> while the recent rqspinlock has a timeout, and there the bpf validator
> knows a spinlock exists and can prohibit kernel helper calls, this
> bpf area lock you have has no such things (afaict) and BPF can
> completely mess up the kernel. How is this okay?
BPF arena locks are not allowed to spin indefinitely. It will break out of
spinning and fail the locking attempt. Right now, such failures need to be
handled manually, but, once the BPF program abort support is merged, we
should be able to auto-abort schedulers.
Thanks.
--
tejun