* [PATCH] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
@ 2025-06-21  4:09 liuwenfang
  2025-06-23 19:50 ` 'Tejun Heo'
  0 siblings, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-06-21  4:09 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Suppose an RT task (rt1) is running on a CPU with rq->scx.cpu_released
set to true. When rt1 goes to sleep, the scheduler balances in a remote
SCX task (scx1) because there is no other RT task on its rq, and sets
rq->scx.cpu_released to false. If an RT task (rt2) is placed on this rq
(e.g. by a wakeup or a migration) before scx1 is enqueued, the scheduler
picks rt2. In the end, rt2 is running on this CPU with
rq->scx.cpu_released being false!
The root cause is that consume_remote_task() drops the rq lock.

So, expose switch_class() and check the sched class again to fix the value
of rq->scx.cpu_released.

Signed-off-by: liuwenfang <liuwenfang@honor.com>

---
 kernel/sched/ext.c   | 2 +-
 kernel/sched/sched.h | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5133249f..6bbea0ea1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3187,7 +3187,7 @@ preempt_reason_from_class(const struct sched_class *class)
 	return SCX_CPU_PREEMPT_UNKNOWN;
 }
 
-static void switch_class(struct rq *rq, struct task_struct *next)
+void switch_class(struct rq *rq, struct task_struct *next)
 {
 	const struct sched_class *next_class = next->sched_class;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34e..d377075d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1738,6 +1738,7 @@ static inline void scx_rq_clock_invalidate(struct rq *rq)
 	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
 }
 
+void switch_class(struct rq *rq, struct task_struct *next);
 #else /* !CONFIG_SCHED_CLASS_EXT */
 #define scx_enabled()		false
 #define scx_switched_all()	false
@@ -2470,6 +2471,11 @@ static inline void put_prev_set_next_task(struct rq *rq,
 
 	prev->sched_class->put_prev_task(rq, prev, next);
 	next->sched_class->set_next_task(rq, next, true);
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (scx_enabled())
+		switch_class(rq, next);
+#endif
 }
 
 /*
-- 
2.17.1


* Re: [PATCH] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-06-21  4:09 [PATCH] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently liuwenfang
@ 2025-06-23 19:50 ` 'Tejun Heo'
  2025-06-28  6:50   ` [PATCH v2 1/2] " liuwenfang
  2025-06-28  7:20   ` [PATCH v2 2/2] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
  0 siblings, 2 replies; 29+ messages in thread
From: 'Tejun Heo' @ 2025-06-23 19:50 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,

On Sat, Jun 21, 2025 at 04:09:55AM +0000, liuwenfang wrote:
> Suppose an RT task (rt1) is running on a CPU with rq->scx.cpu_released
> set to true. When rt1 goes to sleep, the scheduler balances in a remote
> SCX task (scx1) because there is no other RT task on its rq, and sets
> rq->scx.cpu_released to false. If an RT task (rt2) is placed on this rq
> (e.g. by a wakeup or a migration) before scx1 is enqueued, the scheduler
> picks rt2. In the end, rt2 is running on this CPU with
> rq->scx.cpu_released being false!
> The root cause is that consume_remote_task() drops the rq lock.

This is rather difficult to follow. Can you please break this down into a
table? People often use a format like the following:

         CPU X                           CPU Y
  A does something
                                  B does something else
  ...
                                  ...
  Boom

> @@ -2470,6 +2471,11 @@ static inline void put_prev_set_next_task(struct rq *rq,
>  
>  	prev->sched_class->put_prev_task(rq, prev, next);
>  	next->sched_class->set_next_task(rq, next, true);
> +
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	if (scx_enabled())
> +		switch_class(rq, next);
> +#endif

You're right that there is a race condition around this and I can't see a
way to solve this in SCX proper as there's no way for balance() to tell
whether a higher priority sched class has queued something while balance()
dropped the rq lock for migration, so adding a hook to
put_prev_set_next_task() seems like a reasonable solution. However, can you
please do the following?

- Improve the description so that the race condition is clearly
  understandable and explain why the extra hook in put_prev_set_next_task()
  is necessary.

- Rename switch_class() to something which fits the new location better -
  maybe scx_put_prev_set_next_task().

- If the function is called from put_prev_set_next_task(), it doesn't need
  to be called from put_prev_task_scx(). Drop that call.
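
IOW, something like the following (untested sketch - the body of the current
switch_class() would stay as-is, only the name and the call site change):

void scx_put_prev_set_next_task(struct rq *rq, struct task_struct *next)
{
	/* former switch_class() body */
}

and in put_prev_set_next_task():

	prev->sched_class->put_prev_task(rq, prev, next);
	next->sched_class->set_next_task(rq, next, true);

	if (scx_enabled() && next->sched_class != &ext_sched_class)
		scx_put_prev_set_next_task(rq, next);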

Thanks.

-- 
tejun


* [PATCH v2 1/2] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-06-23 19:50 ` 'Tejun Heo'
@ 2025-06-28  6:50   ` liuwenfang
  2025-07-17 21:38     ` 'Tejun Heo'
  2025-06-28  7:20   ` [PATCH v2 2/2] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
  1 sibling, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-06-28  6:50 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Suppose an RT task (RT1) is running on CPU0 and an RT task (RT2) is awakened
on CPU1. RT1 goes to sleep, so an SCX task (SCX1) is dispatched to CPU0, and
RT2 is placed on CPU0:

CPU0(schedule)                                     CPU1(try_to_wake_up)
set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
__schedule                                           select_task_rq # CPU0 is selected
LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
  deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
    pick_next_task # no more RT tasks on rq                 |
      prev_balance                                          |
        balance_scx                                         |
          balance_one                                       |
            rq->scx.cpu_released = false;                   |
              consume_global_dsq                            |
                consume_dispatch_q                          |
                  consume_remote_task                       |
                    UNLOCK rq(0)->lock                      V
                                                         LOCK rq(0)->lock # succ
                    deactivate_task # SCX1               ttwu_do_activate
                    LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
                       |                                 UNLOCK rq(0)->lock
                       V
                    LOCK rq(0)->lock # succ
                    activate_task # SCX1
      pick_task # RT2 is picked
      put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
UNLOCK rq(0)->lock

In the end, RT2 is running on CPU0 with rq->scx.cpu_released being false!

So, add scx_next_task_picked() and check the sched class again to fix the
value of rq->scx.cpu_released.

Signed-off-by: l00013971 <l00013971@hihonor.com>
---
 kernel/sched/ext.c   | 24 +++++++++++++++++-------
 kernel/sched/sched.h |  5 +++++
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5133249f..f161156be 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3187,7 +3187,7 @@ preempt_reason_from_class(const struct sched_class *class)
 	return SCX_CPU_PREEMPT_UNKNOWN;
 }
 
-static void switch_class(struct rq *rq, struct task_struct *next)
+static void switch_class(struct rq *rq, struct task_struct *next, bool prev_on_scx)
 {
 	const struct sched_class *next_class = next->sched_class;
 
@@ -3197,7 +3197,8 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
 	 * resched.
 	 */
-	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+	if (prev_on_scx)
+		smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
 #endif
 	if (!static_branch_unlikely(&scx_ops_cpu_preempt))
 		return;
@@ -3233,6 +3234,19 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 	}
 }
 
+void scx_next_task_picked(struct rq *rq, struct task_struct *prev,
+			  struct task_struct *next)
+{
+	bool prev_on_scx = prev && (prev->sched_class == &ext_sched_class);
+
+	if (!scx_enabled() ||
+	    !next ||
+	    next->sched_class == &ext_sched_class)
+		return;
+
+	switch_class(rq, next, prev_on_scx);
+}
+
 static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 			      struct task_struct *next)
 {
@@ -3253,7 +3267,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 */
 		if (p->scx.slice && !scx_rq_bypassing(rq)) {
 			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
-			goto switch_class;
+			return;
 		}
 
 		/*
@@ -3269,10 +3283,6 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 			do_enqueue_task(rq, p, 0, -1);
 		}
 	}
-
-switch_class:
-	if (next && next->sched_class != &ext_sched_class)
-		switch_class(rq, next);
 }
 
 static struct task_struct *first_local_task(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34e..f8e1b2173 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1738,12 +1738,15 @@ static inline void scx_rq_clock_invalidate(struct rq *rq)
 	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
 }
 
+void scx_next_task_picked(struct rq *rq, struct task_struct *prev, struct task_struct *next);
 #else /* !CONFIG_SCHED_CLASS_EXT */
 #define scx_enabled()		false
 #define scx_switched_all()	false
 
 static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
 static inline void scx_rq_clock_invalidate(struct rq *rq) {}
+static inline void scx_next_task_picked(struct rq *rq, struct task_struct *prev,
+					struct task_struct *next) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 
 /*
@@ -2465,6 +2468,8 @@ static inline void put_prev_set_next_task(struct rq *rq,
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
+	scx_next_task_picked(rq, prev, next);
+
 	if (next == prev)
 		return;
 
-- 
2.17.1



* [PATCH v2 2/2] sched_ext: Fix cpu_released while changing sched policy of the running task
  2025-06-23 19:50 ` 'Tejun Heo'
  2025-06-28  6:50   ` [PATCH v2 1/2] " liuwenfang
@ 2025-06-28  7:20   ` liuwenfang
  2025-07-17 21:48     ` 'Tejun Heo'
  1 sibling, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-06-28  7:20 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Priority inheritance needs to be taken care of as well: e.g., an SCX task
can be boosted to real-time priority while holding a mutex lock and then
restored when releasing the lock. Fix the value of cpu_released when the
sched class of the running task changes.

Signed-off-by: liuwenfang <liuwenfang@honor.com>
---
 kernel/sched/ext.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f161156be..7fab0a657 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3903,11 +3903,29 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
+	switch_class(rq, p, true);
+
 	scx_ops_disable_task(p);
 }
 
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
-static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
+static void switched_to_scx(struct rq *rq, struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
+	    unlikely(rq->scx.cpu_released)) {
+		/*
+		 * If the previous sched_class for the current CPU was not SCX,
+		 * notify the BPF scheduler that it again has control of the
+		 * core. This callback complements ->cpu_release(), which is
+		 * emitted in switch_class().
+		 */
+		if (SCX_HAS_OP(cpu_acquire))
+			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
+		rq->scx.cpu_released = false;
+	}
+}
 
 int scx_check_setscheduler(struct task_struct *p, int policy)
 {
-- 
2.17.1


* Re: [PATCH v2 1/2] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-06-28  6:50   ` [PATCH v2 1/2] " liuwenfang
@ 2025-07-17 21:38     ` 'Tejun Heo'
  2025-07-20  9:20       ` liuwenfang
  0 siblings, 1 reply; 29+ messages in thread
From: 'Tejun Heo' @ 2025-07-17 21:38 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,

My apologies for the really late reply. I've been off work and ended up a lot
more offline than I expected.

On Sat, Jun 28, 2025 at 06:50:32AM +0000, liuwenfang wrote:
> Suppose an RT task (RT1) is running on CPU0 and an RT task (RT2) is awakened
> on CPU1. RT1 goes to sleep, so an SCX task (SCX1) is dispatched to CPU0, and
> RT2 is placed on CPU0:
> 
> CPU0(schedule)                                     CPU1(try_to_wake_up)
> set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
> __schedule                                           select_task_rq # CPU0 is selected
> LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
>   deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
>     pick_next_task # no more RT tasks on rq                 |
>       prev_balance                                          |
>         balance_scx                                         |
>           balance_one                                       |
>             rq->scx.cpu_released = false;                   |
>               consume_global_dsq                            |
>                 consume_dispatch_q                          |
>                   consume_remote_task                       |
>                     UNLOCK rq(0)->lock                      V
>                                                          LOCK rq(0)->lock # succ
>                     deactivate_task # SCX1               ttwu_do_activate
>                     LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
>                        |                                 UNLOCK rq(0)->lock
>                        V
>                     LOCK rq(0)->lock # succ
>                     activate_task # SCX1
>       pick_task # RT2 is picked
>       put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
> UNLOCK rq(0)->lock
> 
> In the end, RT2 is running on CPU0 with rq->scx.cpu_released being false!
> 
> So, add scx_next_task_picked() and check the sched class again to fix the
> value of rq->scx.cpu_released.

Yeah, the problem and diagnosis look correct to me. It'd be nice if we didn't
have to add an explicit hook, but ops.cpu_acquire() needs to be called before
dispatching to the CPU and then we can lose the CPU while doing ops.pick_task().

> Signed-off-by: l00013971 <l00013971@hihonor.com>

Can you please use "FIRST_NAME LAST_NAME <EMAIL>" when signing off?

> -static void switch_class(struct rq *rq, struct task_struct *next)
> +static void switch_class(struct rq *rq, struct task_struct *next, bool prev_on_scx)
>  {
>  	const struct sched_class *next_class = next->sched_class;
>  
> @@ -3197,7 +3197,8 @@ static void switch_class(struct rq *rq, struct task_struct *next)
>  	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
>  	 * resched.
>  	 */
> -	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +	if (prev_on_scx)
> +		smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);

It's currently obviously broken as the seq is only incremented on
scx -> !scx transitions but it should be bumped on all transitions. This is
a breakage introduced by b999e365c298 ("sched, sched_ext: Replace
scx_next_task_picked() with sched_class->switch_class()").

> +void scx_next_task_picked(struct rq *rq, struct task_struct *prev,
> +			  struct task_struct *next)
> +{
> +	bool prev_on_scx = prev && (prev->sched_class == &ext_sched_class);

I don't think @prev or @next can ever be NULL, can they?

> +
> +	if (!scx_enabled() ||

Let's make this an inline function in ext.h. The pnt_seq update should be
moved here after scx_enabled() test, I think. This probably should be a
separate patch.
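
E.g. something like this (sketch only; the switch_class() part would still
need an out-of-line call into ext.c):

static inline void scx_next_task_picked(struct rq *rq,
					struct task_struct *prev,
					struct task_struct *next)
{
	if (!scx_enabled())
		return;
#ifdef CONFIG_SMP
	/* pairs with smp_load_acquire() in kick_cpus_irq_workfn() */
	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
#endif
}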

> +	    !next ||
> +	    next->sched_class == &ext_sched_class)
> +		return;
> +
> +	switch_class(rq, next, prev_on_scx);
> +}
>
>  static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  			      struct task_struct *next)
>  {
> @@ -3253,7 +3267,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  		 */
>  		if (p->scx.slice && !scx_rq_bypassing(rq)) {
>  			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
> -			goto switch_class;
> +			return;
...
> @@ -2465,6 +2468,8 @@ static inline void put_prev_set_next_task(struct rq *rq,
>  
>  	__put_prev_set_next_dl_server(rq, prev, next);
>  
> +	scx_next_task_picked(rq, prev, next);

It's a bit unfortunate that we need to add this hook but I can't see another
way around it for both the problem you're reporting and the pnt_seq issue.
Maybe name it scx_put_prev_set_next(rq, prev, next) for consistency?

Thanks.

-- 
tejun


* Re: [PATCH v2 2/2] sched_ext: Fix cpu_released while changing sched policy of the running task
  2025-06-28  7:20   ` [PATCH v2 2/2] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
@ 2025-07-17 21:48     ` 'Tejun Heo'
  2025-07-18  9:06       ` liuwenfang
  0 siblings, 1 reply; 29+ messages in thread
From: 'Tejun Heo' @ 2025-07-17 21:48 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,

On Sat, Jun 28, 2025 at 07:20:59AM +0000, liuwenfang wrote:
>  static void switched_from_scx(struct rq *rq, struct task_struct *p)
>  {
> +	switch_class(rq, p, true);
> +
>  	scx_ops_disable_task(p);
>  }

Hmm... but this function can also be called from the setscheduler() path
when @p is not currently running, and it wouldn't make sense to call
switch_class() in that case.
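
IOW, it'd probably need a task_current() guard, something like (sketch):

	if (task_current(rq, p))
		switch_class(rq, p, true);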

>  static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
> -static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
> +static void switched_to_scx(struct rq *rq, struct task_struct *p)
> +{
> +	lockdep_assert_rq_held(rq);
> +
> +	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
> +	    unlikely(rq->scx.cpu_released)) {
> +		/*
> +		 * If the previous sched_class for the current CPU was not SCX,
> +		 * notify the BPF scheduler that it again has control of the
> +		 * core. This callback complements ->cpu_release(), which is
> +		 * emitted in switch_class().
> +		 */
> +		if (SCX_HAS_OP(cpu_acquire))
> +			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
> +		rq->scx.cpu_released = false;
> +	}
> +}

Ditto. This should only apply if @p is current, right?

Thanks.

-- 
tejun


* RE: [PATCH v2 2/2] sched_ext: Fix cpu_released while changing sched policy of the running task
  2025-07-17 21:48     ` 'Tejun Heo'
@ 2025-07-18  9:06       ` liuwenfang
  2025-07-20  9:36         ` [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation liuwenfang
  0 siblings, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-07-18  9:06 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

> Hello,
> 
> On Sat, Jun 28, 2025 at 07:20:59AM +0000, liuwenfang wrote:
> >  static void switched_from_scx(struct rq *rq, struct task_struct *p)
> > {
> > +	switch_class(rq, p, true);
> > +
> >  	scx_ops_disable_task(p);
> >  }
> 
> Hmm... but this function can be called when @p is not currently running from
> setscheduler() path, and it wouldn't make sense to call switch_class() during
> that.
Yeah, task_current(rq, p) should be checked here.
> 
> >  static void wakeup_preempt_scx(struct rq *rq, struct task_struct
> > *p,int wake_flags) {} -static void switched_to_scx(struct rq *rq,
> > struct task_struct *p) {}
> > +static void switched_to_scx(struct rq *rq, struct task_struct *p) {
> > +	lockdep_assert_rq_held(rq);
> > +
> > +	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
> > +	    unlikely(rq->scx.cpu_released)) {
> > +		/*
> > +		 * If the previous sched_class for the current CPU was not SCX,
> > +		 * notify the BPF scheduler that it again has control of the
> > +		 * core. This callback complements ->cpu_release(), which is
> > +		 * emitted in switch_class().
> > +		 */
> > +		if (SCX_HAS_OP(cpu_acquire))
> > +			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
> > +		rq->scx.cpu_released = false;
> > +	}
> > +}
> 
> Ditto. This should only apply if @p is current, right?
task_current(rq, p) should be added.
> 
> Thanks.
> 
> --
> Tejun

Thanks.

--
Wenfang


* RE: [PATCH v2 1/2] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-07-17 21:38     ` 'Tejun Heo'
@ 2025-07-20  9:20       ` liuwenfang
  2025-07-20  9:38         ` [PATCH v3 2/3] " liuwenfang
  2025-07-20  9:41         ` [PATCH v3 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
  0 siblings, 2 replies; 29+ messages in thread
From: liuwenfang @ 2025-07-20  9:20 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Thanks for your feedback.

> 
> Hello,
> 
> My apologies for the really late reply. I've been off work and ended up a lot
> more offline than I expected.
> 
> On Sat, Jun 28, 2025 at 06:50:32AM +0000, liuwenfang wrote:
> > Suppose an RT task (RT1) is running on CPU0 and an RT task (RT2) is
> > awakened on CPU1. RT1 goes to sleep, so an SCX task (SCX1) is dispatched
> > to CPU0, and RT2 is placed on CPU0:
> >
> > CPU0(schedule)                                     CPU1(try_to_wake_up)
> > set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
> > __schedule                                           select_task_rq # CPU0 is selected
> > LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
> >   deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
> >     pick_next_task # no more RT tasks on rq                 |
> >       prev_balance                                          |
> >         balance_scx                                         |
> >           balance_one                                       |
> >             rq->scx.cpu_released = false;                   |
> >               consume_global_dsq                            |
> >                 consume_dispatch_q                          |
> >                   consume_remote_task                       |
> >                     UNLOCK rq(0)->lock                      V
> >                                                          LOCK rq(0)->lock # succ
> >                     deactivate_task # SCX1               ttwu_do_activate
> >                     LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
> >                        |                                 UNLOCK rq(0)->lock
> >                        V
> >                     LOCK rq(0)->lock # succ
> >                     activate_task # SCX1
> >       pick_task # RT2 is picked
> >       put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
> > UNLOCK rq(0)->lock
> >
> > In the end, RT2 is running on CPU0 with rq->scx.cpu_released being false!
> >
> > So, add scx_next_task_picked() and check the sched class again to fix
> > the value of rq->scx.cpu_released.
> 
> Yeah, the problem and diagnosis look correct to me. It'd be nice if we didn't have
> to add an explicit hook but ops.cpu_acquire() needs to be called before
> dispatching to the CPU and then we can lose the CPU while doing ops.pick_task().
> 
> > Signed-off-by: l00013971 <l00013971@hihonor.com>
> 
> Can you please use "FIRST_NAME LAST_NAME <EMAIL>" when signing off?
> 
> > -static void switch_class(struct rq *rq, struct task_struct *next)
> > +static void switch_class(struct rq *rq, struct task_struct *next,
> > +bool prev_on_scx)
> >  {
> >  	const struct sched_class *next_class = next->sched_class;
> >
> > @@ -3197,7 +3197,8 @@ static void switch_class(struct rq *rq, struct
> task_struct *next)
> >  	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> >  	 * resched.
> >  	 */
> > -	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> > +	if (prev_on_scx)
> > +		smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> 
> It's currently obviously broken as the seq is only incremented on scx
> -> !scx transitions but it should be bumped on all transitions. This is a breakage
> introduced by b999e365c298 ("sched, sched_ext: Replace
> scx_next_task_picked() with sched_class->switch_class()").
Thanks for the suggestion.
> 
> > +void scx_next_task_picked(struct rq *rq, struct task_struct *prev,
> > +			  struct task_struct *next)
> > +{
> > +	bool prev_on_scx = prev && (prev->sched_class == &ext_sched_class);
> 
> I don't think @prev or @next can ever be NULL, can they?
@prev always has a valid value in the core scheduler routines.
> 
> > +
> > +	if (!scx_enabled() ||
> 
> Let's make this an inline function in ext.h. The pnt_seq update should be moved
> here after scx_enabled() test, I think. This probably should be a separate patch.
Makes sense.  Thanks for the suggestion.
> 
> > +	    !next ||
> > +	    next->sched_class == &ext_sched_class)
> > +		return;
> > +
> > +	switch_class(rq, next, prev_on_scx); }
> >
> >  static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
> >  			      struct task_struct *next)
> >  {
> > @@ -3253,7 +3267,7 @@ static void put_prev_task_scx(struct rq *rq, struct
> task_struct *p,
> >  		 */
> >  		if (p->scx.slice && !scx_rq_bypassing(rq)) {
> >  			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
> > -			goto switch_class;
> > +			return;
> ...
> > @@ -2465,6 +2468,8 @@ static inline void put_prev_set_next_task(struct
> > rq *rq,
> >
> >  	__put_prev_set_next_dl_server(rq, prev, next);
> >
> > +	scx_next_task_picked(rq, prev, next);
> 
> It's a bit unfortunate that we need to add this hook but I can't see another way
> around it for both the problem you're reporting and the pnt_seq issue.
> Maybe name it scx_put_prev_set_next(rq, prev, next) for consistency?
Makes sense.  Thanks for the suggestion.
> 
> Thanks.
> 
> --
> Tejun
-- 
Regards.
wenfang


* [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-07-18  9:06       ` liuwenfang
@ 2025-07-20  9:36         ` liuwenfang
  2025-08-12  0:03           ` 'Tejun Heo'
  2025-08-18 17:47           ` Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: liuwenfang @ 2025-07-20  9:36 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Fix pnt_seq calculation for all transitions.

Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
---
 kernel/sched/ext.c   | 23 ++++++++++++++---------
 kernel/sched/fair.c  |  3 +++
 kernel/sched/sched.h |  8 ++++++++
 3 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5133249f..93e03b7d0 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3191,14 +3191,6 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 {
 	const struct sched_class *next_class = next->sched_class;
 
-#ifdef CONFIG_SMP
-	/*
-	 * Pairs with the smp_load_acquire() issued by a CPU in
-	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
-	 * resched.
-	 */
-	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
-#endif
 	if (!static_branch_unlikely(&scx_ops_cpu_preempt))
 		return;
 
@@ -3233,6 +3225,19 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 	}
 }
 
+void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
+			   struct task_struct *next)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Pairs with the smp_load_acquire() issued by a CPU in
+	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+	 * resched.
+	 */
+	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+#endif
+}
+
 static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 			      struct task_struct *next)
 {
@@ -5966,7 +5971,7 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 		if (cpu != cpu_of(this_rq)) {
 			/*
 			 * Pairs with smp_store_release() issued by this CPU in
-			 * switch_class() on the resched path.
+			 * scx_put_prev_set_next() on the resched path.
 			 *
 			 * We busy-wait here to guarantee that no other task can
 			 * be scheduled on our core before the target CPU has
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995..50d757e92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8887,6 +8887,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 
 	__put_prev_set_next_dl_server(rq, prev, p);
 
+	if (scx_enabled())
+		scx_put_prev_set_next(rq, prev, p);
+
 	/*
 	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
 	 * likely that a next task is from the same cgroup as the current.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34e..bcb7f175c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1738,12 +1738,17 @@ static inline void scx_rq_clock_invalidate(struct rq *rq)
 	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
 }
 
+void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
+			   struct task_struct *next);
 #else /* !CONFIG_SCHED_CLASS_EXT */
 #define scx_enabled()		false
 #define scx_switched_all()	false
 
 static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
 static inline void scx_rq_clock_invalidate(struct rq *rq) {}
+static inline void scx_put_prev_set_next(struct rq *rq,
+					 struct task_struct *prev,
+					 struct task_struct *next) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 
 /*
@@ -2465,6 +2470,9 @@ static inline void put_prev_set_next_task(struct rq *rq,
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
+	if (scx_enabled())
+		scx_put_prev_set_next(rq, prev, next);
+
 	if (next == prev)
 		return;
 
-- 
2.17.1


* [PATCH v3 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-07-20  9:20       ` liuwenfang
@ 2025-07-20  9:38         ` liuwenfang
  2025-08-12  1:26           ` 'Tejun Heo'
  2025-07-20  9:41         ` [PATCH v3 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
  1 sibling, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-07-20  9:38 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Suppose an RT task (RT1) is running on CPU0 and an RT task (RT2) is awakened
on CPU1. RT1 goes to sleep, so an SCX task (SCX1) is dispatched to CPU0, and
RT2 is placed on CPU0:

CPU0(schedule)                                     CPU1(try_to_wake_up)
set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
__schedule                                           select_task_rq # CPU0 is selected
LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
  deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
    pick_next_task # no more RT tasks on rq                 |
      prev_balance                                          |
        balance_scx                                         |
          balance_one                                       |
            rq->scx.cpu_released = false;                   |
              consume_global_dsq                            |
                consume_dispatch_q                          |
                  consume_remote_task                       |
                    UNLOCK rq(0)->lock                      V
                                                         LOCK rq(0)->lock # succ
                    deactivate_task # SCX1               ttwu_do_activate
                    LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
                       |                                 UNLOCK rq(0)->lock
                       V
                    LOCK rq(0)->lock # succ
                    activate_task # SCX1
      pick_task # RT2 is picked
      put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
UNLOCK rq(0)->lock

In the end, RT2 is running on CPU0 with rq->scx.cpu_released being false!

So, check the sched class again in scx_put_prev_set_next() to fix the value
of rq->scx.cpu_released.

Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
---
 kernel/sched/ext.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 93e03b7d0..ddf4bd060 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3236,6 +3236,11 @@ void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
 	 */
 	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
 #endif
+
+	if (next->sched_class == &ext_sched_class)
+		return;
+
+	switch_class(rq, next);
 }
 
 static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
@@ -3258,7 +3263,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 */
 		if (p->scx.slice && !scx_rq_bypassing(rq)) {
 			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
-			goto switch_class;
+			return;
 		}
 
 		/*
@@ -3274,10 +3279,6 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 			do_enqueue_task(rq, p, 0, -1);
 		}
 	}
-
-switch_class:
-	if (next && next->sched_class != &ext_sched_class)
-		switch_class(rq, next);
 }
 
 static struct task_struct *first_local_task(struct rq *rq)
-- 
2.17.1



* [PATCH v3 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task
  2025-07-20  9:20       ` liuwenfang
  2025-07-20  9:38         ` [PATCH v3 2/3] " liuwenfang
@ 2025-07-20  9:41         ` liuwenfang
  2025-08-12  1:31           ` 'Tejun Heo'
  2025-08-19  6:52           ` [PATCH v4 1/3] sched_ext: Fix pnt_seq calculation when picking the next task liuwenfang
  1 sibling, 2 replies; 29+ messages in thread
From: liuwenfang @ 2025-07-20  9:41 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Priority inheritance needs to be taken care of as well: e.g., an SCX task
can be boosted to real-time priority while holding a mutex lock and then
restored when releasing the lock. Fix the value of cpu_released when the
sched class of the running task changes.

Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
---
 kernel/sched/ext.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ddf4bd060..e3b9032e2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3899,11 +3899,30 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
+	if (task_current(rq, p))
+		switch_class(rq, p);
+
 	scx_ops_disable_task(p);
 }
 
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
-static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
+static void switched_to_scx(struct rq *rq, struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
+	    unlikely(rq->scx.cpu_released) && task_current(rq, p)) {
+		/*
+		 * If the previous sched_class for the current CPU was not SCX,
+		 * notify the BPF scheduler that it again has control of the
+		 * core. This callback complements ->cpu_release(), which is
+		 * emitted in switch_class().
+		 */
+		if (SCX_HAS_OP(cpu_acquire))
+			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
+		rq->scx.cpu_released = false;
+	}
+}
 
 int scx_check_setscheduler(struct task_struct *p, int policy)
 {
-- 
2.17.1


* Re: [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-07-20  9:36         ` [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation liuwenfang
@ 2025-08-12  0:03           ` 'Tejun Heo'
  2025-08-12  0:30             ` 'Tejun Heo'
  2025-08-18 17:47           ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: 'Tejun Heo' @ 2025-08-12  0:03 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,

Sorry for another delay. I'm finally back from a long vacation and should be
more responsive from now on.

On Sun, Jul 20, 2025 at 09:36:22AM +0000, liuwenfang wrote:
> Fix pnt_seq calculation for all transitions.

This needs a lot more explanation about the bug it fixes and how.

> +void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
> +			   struct task_struct *next)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Pairs with the smp_load_acquire() issued by a CPU in
> +	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> +	 * resched.
> +	 */
> +	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +#endif
> +}

Let's use a more specific name - e.g. something like scx_bump_sched_seq().
Note that pnt_seq is a bit of a misnomer at this point. We probably should
rename it to sched_seq in a separate patch.
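
IOW (sketch, reusing the existing comment and store):

static void scx_bump_sched_seq(struct rq *rq)
{
#ifdef CONFIG_SMP
	/*
	 * Pairs with the smp_load_acquire() issued by a CPU in
	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
	 * resched.
	 */
	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
#endif
}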

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995..50d757e92 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8887,6 +8887,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  
>  	__put_prev_set_next_dl_server(rq, prev, p);
>  
> +	if (scx_enabled())
> +		scx_put_prev_set_next(rq, prev, p);
> +
>  	/*
>  	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
>  	 * likely that a next task is from the same cgroup as the current.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47972f34e..bcb7f175c 100644
> @@ -2465,6 +2470,9 @@ static inline void put_prev_set_next_task(struct rq *rq,
>  
>  	__put_prev_set_next_dl_server(rq, prev, next);
>  
> +	if (scx_enabled())
> +		scx_put_prev_set_next(rq, prev, next);
> +
>  	if (next == prev)
>  		return;

I'm not sure these are the best spots to call this function. How about
putting it in the CONFIG_SCHED_CLASS_EXT section in prev_balance()? The goal
of the seq counter is to wait for the scheduler path to be entered, so that's
a good enough spot and there already is an scx-specific section, so it doesn't
add too much noise.

Thanks.

-- 
tejun


* Re: [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-08-12  0:03           ` 'Tejun Heo'
@ 2025-08-12  0:30             ` 'Tejun Heo'
  2025-08-18 10:45               ` liuwenfang
  0 siblings, 1 reply; 29+ messages in thread
From: 'Tejun Heo' @ 2025-08-12  0:30 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,

On Mon, Aug 11, 2025 at 02:03:16PM -1000, 'Tejun Heo' wrote:
...
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0fb9bf995..50d757e92 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8887,6 +8887,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  
> >  	__put_prev_set_next_dl_server(rq, prev, p);
> >  
> > +	if (scx_enabled())
> > +		scx_put_prev_set_next(rq, prev, p);
> > +
> >  	/*
> >  	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> >  	 * likely that a next task is from the same cgroup as the current.
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 47972f34e..bcb7f175c 100644
> > @@ -2465,6 +2470,9 @@ static inline void put_prev_set_next_task(struct rq *rq,
> >  
> >  	__put_prev_set_next_dl_server(rq, prev, next);
> >  
> > +	if (scx_enabled())
> > +		scx_put_prev_set_next(rq, prev, next);
> > +
> >  	if (next == prev)
> >  		return;
> 
> I'm not sure these are the best spots to call this function. How about
> putting it in the CONFIG_SCHED_CLASS_EXT section in prev_balance()? The goal
> of the seq counter is to wait for the scheduler path to be entered, so that's
> a good enough spot and there already is an scx-specific section, so it doesn't
> add too much noise.

Strike that. I see that we need a hook after the task is picked to resolve the
bug around cpu_released. Can you please move the scx_enabled() test into
scx_put_prev_set_next() and add a helper which calls both
__put_prev_set_next_dl_server() and scx_put_prev_set_next() so that the call
doesn't have to be added in two places?

Thanks.

-- 
tejun


* Re: [PATCH v3 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-07-20  9:38         ` [PATCH v3 2/3] " liuwenfang
@ 2025-08-12  1:26           ` 'Tejun Heo'
  0 siblings, 0 replies; 29+ messages in thread
From: 'Tejun Heo' @ 2025-08-12  1:26 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

On Sun, Jul 20, 2025 at 09:38:50AM +0000, liuwenfang wrote:
> Suppose an RT task (RT1) is running on CPU0 and an RT task (RT2) is awakened
> on CPU1. RT1 goes to sleep, so an SCX task (SCX1) is dispatched to CPU0, and
> RT2 is placed on CPU0:
> 
> CPU0(schedule)                                     CPU1(try_to_wake_up)
> set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
> __schedule                                           select_task_rq # CPU0 is selected
> LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
>   deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
>     pick_next_task # no more RT tasks on rq                 |
>       prev_balance                                          |
>         balance_scx                                         |
>           balance_one                                       |
>             rq->scx.cpu_released = false;                   |
>               consume_global_dsq                            |
>                 consume_dispatch_q                          |
>                   consume_remote_task                       |
>                     UNLOCK rq(0)->lock                      V
>                                                          LOCK rq(0)->lock # succ
>                     deactivate_task # SCX1               ttwu_do_activate
>                     LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
>                        |                                 UNLOCK rq(0)->lock
>                        V
>                     LOCK rq(0)->lock # succ
>                     activate_task # SCX1
>       pick_task # RT2 is picked
>       put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
> UNLOCK rq(0)->lock
> 
> In the end, RT2 is running on CPU0 with rq->scx.cpu_released being false!

It'd be better to describe the result of this - i.e. cpu_release() /
cpu_acquire() calls being missed.

Thanks.

-- 
tejun


* Re: [PATCH v3 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task
  2025-07-20  9:41         ` [PATCH v3 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
@ 2025-08-12  1:31           ` 'Tejun Heo'
  2025-08-19  6:52           ` [PATCH v4 1/3] sched_ext: Fix pnt_seq calculation when picking the next task liuwenfang
  1 sibling, 0 replies; 29+ messages in thread
From: 'Tejun Heo' @ 2025-08-12  1:31 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

On Sun, Jul 20, 2025 at 09:41:03AM +0000, liuwenfang wrote:
> Priority inheritance needs to be taken care of as well: e.g., an SCX task
> can be boosted to real-time priority while holding a mutex lock and then
> restored when releasing the lock. Fix the value of cpu_released when the
> sched class of the running task changes.
> 
> Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
> ---
>  kernel/sched/ext.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index ddf4bd060..e3b9032e2 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3899,11 +3899,30 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
>  
>  static void switched_from_scx(struct rq *rq, struct task_struct *p)
>  {
> +	if (task_current(rq, p))
> +		switch_class(rq, p);

switch_class() should probably be renamed to something more specific - e.g.
maybe_cpu_release()?

>  	scx_ops_disable_task(p);
>  }
>  
>  static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
> -static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
> +static void switched_to_scx(struct rq *rq, struct task_struct *p)
> +{
> +	lockdep_assert_rq_held(rq);
> +
> +	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
> +	    unlikely(rq->scx.cpu_released) && task_current(rq, p)) {
> +		/*
> +		 * If the previous sched_class for the current CPU was not SCX,
> +		 * notify the BPF scheduler that it again has control of the
> +		 * core. This callback complements ->cpu_release(), which is
> +		 * emitted in switch_class().
> +		 */
> +		if (SCX_HAS_OP(cpu_acquire))
> +			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
> +		rq->scx.cpu_released = false;
> +	}
> +}

and most of this can be factored into e.g. maybe_cpu_acquire() so that the
same code is not duplicated between here and balance_one()?
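
IOW, something like (untested sketch, name as suggested above):

static void maybe_cpu_acquire(struct rq *rq)
{
	lockdep_assert_rq_held(rq);

	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
	    unlikely(rq->scx.cpu_released)) {
		if (SCX_HAS_OP(cpu_acquire))
			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
		rq->scx.cpu_released = false;
	}
}

which both balance_one() and switched_to_scx() could then call.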

Also, please target the for-6.17-fixes branch of the sched_ext tree.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.17-fixes

Thank you.

-- 
tejun


* RE: [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-08-12  0:30             ` 'Tejun Heo'
@ 2025-08-18 10:45               ` liuwenfang
  2025-08-18 17:43                 ` 'Tejun Heo'
  0 siblings, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-08-18 10:45 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

> Hello,
> 
> On Mon, Aug 11, 2025 at 02:03:16PM -1000, 'Tejun Heo' wrote:
> ...
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> > > 0fb9bf995..50d757e92 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8887,6 +8887,9 @@ pick_next_task_fair(struct rq *rq, struct
> > > task_struct *prev, struct rq_flags *rf
> > >
> > >  	__put_prev_set_next_dl_server(rq, prev, p);
> > >
> > > +	if (scx_enabled())
> > > +		scx_put_prev_set_next(rq, prev, p);
> > > +
> > >  	/*
> > >  	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> > >  	 * likely that a next task is from the same cgroup as the current.
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> > > 47972f34e..bcb7f175c 100644 @@ -2465,6 +2470,9 @@ static inline void
> > > put_prev_set_next_task(struct rq *rq,
> > >
> > >  	__put_prev_set_next_dl_server(rq, prev, next);
> > >
> > > +	if (scx_enabled())
> > > +		scx_put_prev_set_next(rq, prev, next);
> > > +
> > >  	if (next == prev)
> > >  		return;
> >
> > I'm not sure these are the best spots to call this function. How about
> > putting it in the CONFIG_SCHED_CLASS_EXT section in prev_balance()?
> > The goal of the seq counter is to wait for the scheduler path to be
> > entered, so that's a good enough spot and there already is an
> > scx-specific section, so it doesn't add too much noise.
> 
> Strike that. I see that we need a hook after the task is picked to resolve the bug
> around cpu_released. Can you please move the scx_enabled() test into
> scx_put_prev_set_next() and add a helper which calls both
> __put_prev_set_next_dl_server() and scx_put_prev_set_next() so that the call
> doesn't have to be added in two places?
Thanks for your feedback.

__put_prev_set_next() is added here as the helper; the fixed code is:

+void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
+                          struct task_struct *next)
+{
+       if (!scx_enabled())
+               return;
+
+#ifdef CONFIG_SMP
+       /*
+        * Pairs with the smp_load_acquire() issued by a CPU in
+        * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+        * resched.
+        */
+       smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+#endif
+}

+static inline void __put_prev_set_next(struct rq *rq,
+                                      struct task_struct *prev,
+                                      struct task_struct *next)
+{
+       __put_prev_set_next_dl_server(rq, prev, next);
+       scx_put_prev_set_next(rq, prev, next);
+}
+
 static inline void put_prev_set_next_task(struct rq *rq,
                                          struct task_struct *prev,
                                          struct task_struct *next)
 {
        WARN_ON_ONCE(rq->curr != prev);

-       __put_prev_set_next_dl_server(rq, prev, next);
+       __put_prev_set_next(rq, prev, next);

        if (next == prev)

pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
        if (prev->sched_class != &fair_sched_class)
                goto simple;

-       __put_prev_set_next_dl_server(rq, prev, p);
+       __put_prev_set_next(rq, prev, p);

Any suggestions will be appreciated and a formal patch will be sent out later.
> 
> Thanks.
> 
> --
> Tejun
Thanks.

Wenfang


* Re: [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-08-18 10:45               ` liuwenfang
@ 2025-08-18 17:43                 ` 'Tejun Heo'
  2025-08-19  7:41                   ` liuwenfang
  0 siblings, 1 reply; 29+ messages in thread
From: 'Tejun Heo' @ 2025-08-18 17:43 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,

On Mon, Aug 18, 2025 at 10:45:55AM +0000, liuwenfang wrote:
> +void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
> +                          struct task_struct *next)

Might as well match the surrounding naming convention and use
__put_prev_set_next_scx()

> Any suggestions will be appreciated and a formal patch will be sent out later.

Other than that, looks fine to me.

Thanks.

-- 
tejun


* Re: [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-07-20  9:36         ` [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation liuwenfang
  2025-08-12  0:03           ` 'Tejun Heo'
@ 2025-08-18 17:47           ` Peter Zijlstra
  2025-08-19  7:36             ` liuwenfang
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2025-08-18 17:47 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'Tejun Heo', 'David Vernet',
	'Andrea Righi', 'Changwoo Min',
	'Ingo Molnar', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

On Sun, Jul 20, 2025 at 09:36:22AM +0000, liuwenfang wrote:
> Fix pnt_seq calculation for all transitions.

This doesn't even begin to be an adequate changelog.

And please, don't put an out of line function call in
put_prev_set_next_task().
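
IOW, if all that's needed there is the seq bump, keep it a static inline in
the header, something like (sketch):

static inline void scx_put_prev_set_next(struct rq *rq,
					 struct task_struct *prev,
					 struct task_struct *next)
{
	if (!scx_enabled())
		return;
#ifdef CONFIG_SMP
	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
#endif
}

so the common case doesn't pay for an out of line call.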

> Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
> ---
>  kernel/sched/ext.c   | 23 ++++++++++++++---------
>  kernel/sched/fair.c  |  3 +++
>  kernel/sched/sched.h |  8 ++++++++
>  3 files changed, 25 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f5133249f..93e03b7d0 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3191,14 +3191,6 @@ static void switch_class(struct rq *rq, struct task_struct *next)
>  {
>  	const struct sched_class *next_class = next->sched_class;
>  
> -#ifdef CONFIG_SMP
> -	/*
> -	 * Pairs with the smp_load_acquire() issued by a CPU in
> -	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> -	 * resched.
> -	 */
> -	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> -#endif
>  	if (!static_branch_unlikely(&scx_ops_cpu_preempt))
>  		return;
>  
> @@ -3233,6 +3225,19 @@ static void switch_class(struct rq *rq, struct task_struct *next)
>  	}
>  }
>  
> +void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
> +			   struct task_struct *next)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Pairs with the smp_load_acquire() issued by a CPU in
> +	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> +	 * resched.
> +	 */
> +	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +#endif
> +}
> +
>  static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  			      struct task_struct *next)
>  {
> @@ -5966,7 +5971,7 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>  		if (cpu != cpu_of(this_rq)) {
>  			/*
>  			 * Pairs with smp_store_release() issued by this CPU in
> -			 * switch_class() on the resched path.
> +			 * scx_put_prev_set_next() on the resched path.
>  			 *
>  			 * We busy-wait here to guarantee that no other task can
>  			 * be scheduled on our core before the target CPU has
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995..50d757e92 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8887,6 +8887,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  
>  	__put_prev_set_next_dl_server(rq, prev, p);
>  
> +	if (scx_enabled())
> +		scx_put_prev_set_next(rq, prev, p);
> +
>  	/*
>  	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
>  	 * likely that a next task is from the same cgroup as the current.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47972f34e..bcb7f175c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1738,12 +1738,17 @@ static inline void scx_rq_clock_invalidate(struct rq *rq)
>  	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
>  }
>  
> +void scx_put_prev_set_next(struct rq *rq, struct task_struct *prev,
> +			   struct task_struct *next);
>  #else /* !CONFIG_SCHED_CLASS_EXT */
>  #define scx_enabled()		false
>  #define scx_switched_all()	false
>  
>  static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
>  static inline void scx_rq_clock_invalidate(struct rq *rq) {}
> +static inline void scx_put_prev_set_next(struct rq *rq,
> +					 struct task_struct *prev,
> +					 struct task_struct *next) {}
>  #endif /* !CONFIG_SCHED_CLASS_EXT */
>  
>  /*
> @@ -2465,6 +2470,9 @@ static inline void put_prev_set_next_task(struct rq *rq,
>  
>  	__put_prev_set_next_dl_server(rq, prev, next);
>  
> +	if (scx_enabled())
> +		scx_put_prev_set_next(rq, prev, next);
> +
>  	if (next == prev)
>  		return;
>  
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 1/3] sched_ext: Fix pnt_seq calculation when picking the next task
  2025-07-20  9:41         ` [PATCH v3 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
  2025-08-12  1:31           ` 'Tejun Heo'
@ 2025-08-19  6:52           ` liuwenfang
  2025-08-19  6:55             ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently liuwenfang
  1 sibling, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-08-19  6:52 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Currently, rq->scx.pnt_seq is only incremented when the target CPU
switches from an SCX task to a non-SCX task, so the paired CPU in
scx_pair may not exit its busy-wait state when it should.

In scx_pair, rq->scx.pnt_seq is introduced to improve exclusion
guarantees. The invoking CPU calls scx_bpf_kick_cpu() with
SCX_KICK_WAIT and enters the busy-wait state. It should exit this
state once the target CPU has entered the rescheduling path with
rq->scx.pnt_seq incremented.

So, move the pnt_seq update into put_prev_set_next_task(), where it is
incremented for any task switch on the target CPU; the invoking CPU can
then exit the busy-wait state properly.

Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
---
 kernel/sched/ext.c   | 10 +---------
 kernel/sched/fair.c  |  2 +-
 kernel/sched/sched.h | 30 +++++++++++++++++++++++++++++-
 3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5133249f..ba99739d7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3191,14 +3191,6 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 {
 	const struct sched_class *next_class = next->sched_class;
 
-#ifdef CONFIG_SMP
-	/*
-	 * Pairs with the smp_load_acquire() issued by a CPU in
-	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
-	 * resched.
-	 */
-	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
-#endif
 	if (!static_branch_unlikely(&scx_ops_cpu_preempt))
 		return;
 
@@ -5966,7 +5958,7 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 		if (cpu != cpu_of(this_rq)) {
 			/*
 			 * Pairs with smp_store_release() issued by this CPU in
-			 * switch_class() on the resched path.
+			 * __put_prev_set_next_scx() on the resched path.
 			 *
 			 * We busy-wait here to guarantee that no other task can
 			 * be scheduled on our core before the target CPU has
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995..21214b3fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8885,7 +8885,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	if (prev->sched_class != &fair_sched_class)
 		goto simple;
 
-	__put_prev_set_next_dl_server(rq, prev, p);
+	__put_prev_set_next(rq, prev, p);
 
 	/*
 	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34e..435de61c4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1738,12 +1738,32 @@ static inline void scx_rq_clock_invalidate(struct rq *rq)
 	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
 }
 
+static inline void __put_prev_set_next_scx(struct rq *rq,
+					   struct task_struct *prev,
+					   struct task_struct *next)
+{
+	if (!scx_enabled())
+		return;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Pairs with the smp_load_acquire() issued by a CPU in
+	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+	 * resched.
+	 */
+	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+#endif
+}
+
 #else /* !CONFIG_SCHED_CLASS_EXT */
 #define scx_enabled()		false
 #define scx_switched_all()	false
 
 static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
 static inline void scx_rq_clock_invalidate(struct rq *rq) {}
+static inline void __put_prev_set_next_scx(struct rq *rq,
+					   struct task_struct *prev,
+					   struct task_struct *next) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 
 /*
@@ -2457,13 +2477,21 @@ __put_prev_set_next_dl_server(struct rq *rq,
 	rq->dl_server = NULL;
 }
 
+static inline void __put_prev_set_next(struct rq *rq,
+				       struct task_struct *prev,
+				       struct task_struct *next)
+{
+	__put_prev_set_next_dl_server(rq, prev, next);
+	__put_prev_set_next_scx(rq, prev, next);
+}
+
 static inline void put_prev_set_next_task(struct rq *rq,
 					  struct task_struct *prev,
 					  struct task_struct *next)
 {
 	WARN_ON_ONCE(rq->curr != prev);
 
-	__put_prev_set_next_dl_server(rq, prev, next);
+	__put_prev_set_next(rq, prev, next);
 
 	if (next == prev)
 		return;
-- 
2.17.1
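
[For readers tracing the memory-ordering argument, here is a minimal
user-space sketch of the same release/acquire protocol. It is an
illustration only, not kernel code: "seq" stands in for rq->scx.pnt_seq,
note_pick() models the pick path this patch moves, and waiter() models
kick_cpus_irq_workfn() busy-waiting under SCX_KICK_WAIT.]

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong seq;	/* stands in for rq->scx.pnt_seq */

/* Pick path: pairs with the acquire load in waiter(). */
static void note_pick(void)
{
	unsigned long v = atomic_load_explicit(&seq, memory_order_relaxed);

	atomic_store_explicit(&seq, v + 1, memory_order_release);
}

/* Kicking CPU: spin until the target CPU has gone through a pick. */
static void *waiter(void *arg)
{
	unsigned long before = *(unsigned long *)arg;

	while (atomic_load_explicit(&seq, memory_order_acquire) == before)
		;	/* busy-wait, as kick_cpus_irq_workfn() does */
	return NULL;
}

int main(void)
{
	unsigned long snap = atomic_load(&seq);
	pthread_t t;

	pthread_create(&t, NULL, waiter, &snap);
	note_pick();	/* the target CPU performs a task switch */
	pthread_join(t, NULL);
	printf("waiter released after seq=%lu\n", atomic_load(&seq));
	return 0;
}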

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-08-19  6:52           ` [PATCH v4 1/3] sched_ext: Fix pnt_seq calculation when picking the next task liuwenfang
@ 2025-08-19  6:55             ` liuwenfang
  2025-08-19  7:07               ` [PATCH v4 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
  2025-08-19  7:47               ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: liuwenfang @ 2025-08-19  6:55 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Suppose RT task RT1 is running on CPU0 and RT task RT2 is woken on CPU1.
RT1 goes to sleep, SCX task SCX1 is dispatched to CPU0, and RT2 is
placed on CPU0:

CPU0(schedule)                                     CPU1(try_to_wake_up)
set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
__schedule                                           select_task_rq # CPU0 is selected
LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
  deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
    pick_next_task # no more RT tasks on rq                 |
      prev_balance                                          |
        balance_scx                                         |
          balance_one                                       |
            rq->scx.cpu_released = false;                   |
              consume_global_dsq                            |
                consume_dispatch_q                          |
                  consume_remote_task                       |
                    UNLOCK rq(0)->lock                      V
                                                         LOCK rq(0)->lock # succ
                    deactivate_task # SCX1               ttwu_do_activate
                    LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
                       |                                 UNLOCK rq(0)->lock
                       V
                    LOCK rq(0)->lock # succ
                    activate_task # SCX1
      pick_task # RT2 is picked
      put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
UNLOCK rq(0)->lock

In the end, RT2 runs on CPU0 with rq->scx.cpu_released still false,
which can lead the BPF scheduler to make improper decisions.

So, check the sched class of the next task in __put_prev_set_next_scx()
and fix up the value of rq->scx.cpu_released.

Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
---
 kernel/sched/ext.c   | 8 ++------
 kernel/sched/sched.h | 7 +++++++
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ba99739d7..98a05025b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3187,7 +3187,7 @@ preempt_reason_from_class(const struct sched_class *class)
 	return SCX_CPU_PREEMPT_UNKNOWN;
 }
 
-static void switch_class(struct rq *rq, struct task_struct *next)
+void switch_class(struct rq *rq, struct task_struct *next)
 {
 	const struct sched_class *next_class = next->sched_class;
 
@@ -3245,7 +3245,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 */
 		if (p->scx.slice && !scx_rq_bypassing(rq)) {
 			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
-			goto switch_class;
+			return;
 		}
 
 		/*
@@ -3261,10 +3261,6 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 			do_enqueue_task(rq, p, 0, -1);
 		}
 	}
-
-switch_class:
-	if (next && next->sched_class != &ext_sched_class)
-		switch_class(rq, next);
 }
 
 static struct task_struct *first_local_task(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 435de61c4..e46becfed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1738,6 +1738,8 @@ static inline void scx_rq_clock_invalidate(struct rq *rq)
 	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
 }
 
+extern void switch_class(struct rq *rq, struct task_struct *next);
+
 static inline void __put_prev_set_next_scx(struct rq *rq,
 					   struct task_struct *prev,
 					   struct task_struct *next)
@@ -1753,6 +1755,11 @@ static inline void __put_prev_set_next_scx(struct rq *rq,
 	 */
 	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
 #endif
+
+	if (next->sched_class == &ext_sched_class)
+		return;
+
+	switch_class(rq, next);
 }
 
 #else /* !CONFIG_SCHED_CLASS_EXT */
-- 
2.17.1
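
[The race is easy to model outside the kernel. The toy pthread program
below is only an illustration under invented names: "rq_lock" stands in
for rq->lock, "cpu_released" for rq->scx.cpu_released, and "rt_queued"
for the RT2 wakeup. It shows how a flag computed before a lock drop goes
stale, and why the pick path must re-check after relocking.]

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static bool cpu_released = true;
static bool rt_queued;

/* CPU1 side: ttwu_queue() enqueueing RT2 while CPU0 dropped its lock. */
static void *wakeup_rt2(void *arg)
{
	pthread_mutex_lock(&rq_lock);
	rt_queued = true;
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_mutex_lock(&rq_lock);		/* CPU0: __schedule() */
	cpu_released = false;			/* balance_one() claims the CPU */

	pthread_mutex_unlock(&rq_lock);		/* consume_remote_task() drops it */
	pthread_create(&t, NULL, wakeup_rt2, NULL);
	pthread_join(t, NULL);
	pthread_mutex_lock(&rq_lock);		/* ... and takes it back */

	/* The fix: after the pick, if a non-SCX task won, restore the flag. */
	if (rt_queued && !cpu_released) {
		cpu_released = true;		/* what switch_class() restores */
		printf("RT2 picked; cpu_released corrected to true\n");
	}
	pthread_mutex_unlock(&rq_lock);
	return 0;
}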

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task
  2025-08-19  6:55             ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently liuwenfang
@ 2025-08-19  7:07               ` liuwenfang
  2025-08-19  7:47               ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: liuwenfang @ 2025-08-19  7:07 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Priority inheritance must also be handled: e.g., an SCX task can be
boosted to real-time while holding a mutex lock and then restored when
releasing the lock. Fix up the value of rq->scx.cpu_released when the
sched class of the running task changes.

Signed-off-by: Wenfang Liu <liuwenfang@honor.com>
---
 kernel/sched/ext.c   | 44 ++++++++++++++++++++++++++++++--------------
 kernel/sched/sched.h |  4 ++--
 2 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 98a05025b..bf4512908 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2959,6 +2959,8 @@ static void flush_dispatch_buf(struct rq *rq)
 	dspc->cursor = 0;
 }
 
+static void scx_maybe_cpu_acquire(struct rq *rq);
+
 static int balance_one(struct rq *rq, struct task_struct *prev)
 {
 	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
@@ -2970,18 +2972,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	rq->scx.flags |= SCX_RQ_IN_BALANCE;
 	rq->scx.flags &= ~(SCX_RQ_BAL_PENDING | SCX_RQ_BAL_KEEP);
 
-	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
-	    unlikely(rq->scx.cpu_released)) {
-		/*
-		 * If the previous sched_class for the current CPU was not SCX,
-		 * notify the BPF scheduler that it again has control of the
-		 * core. This callback complements ->cpu_release(), which is
-		 * emitted in switch_class().
-		 */
-		if (SCX_HAS_OP(cpu_acquire))
-			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
-		rq->scx.cpu_released = false;
-	}
+	scx_maybe_cpu_acquire(rq);
 
 	if (prev_on_scx) {
 		update_curr_scx(rq);
@@ -3187,7 +3178,23 @@ preempt_reason_from_class(const struct sched_class *class)
 	return SCX_CPU_PREEMPT_UNKNOWN;
 }
 
-void switch_class(struct rq *rq, struct task_struct *next)
+static void scx_maybe_cpu_acquire(struct rq *rq)
+{
+	if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
+	    unlikely(rq->scx.cpu_released)) {
+		/*
+		 * If the previous sched_class for the current CPU was not SCX,
+		 * notify the BPF scheduler that it again has control of the
+		 * core. This callback complements ->cpu_release(), which is
+		 * emitted in scx_maybe_cpu_release().
+		 */
+		if (SCX_HAS_OP(cpu_acquire))
+			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL);
+		rq->scx.cpu_released = false;
+	}
+}
+
+void scx_maybe_cpu_release(struct rq *rq, struct task_struct *next)
 {
 	const struct sched_class *next_class = next->sched_class;
 
@@ -3881,11 +3888,20 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
+	if (task_current(rq, p))
+		scx_maybe_cpu_release(rq, p);
+
 	scx_ops_disable_task(p);
 }
 
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
-static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
+static void switched_to_scx(struct rq *rq, struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	if (task_current(rq, p))
+		scx_maybe_cpu_acquire(rq);
+}
 
 int scx_check_setscheduler(struct task_struct *p, int policy)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e46becfed..ee0f35d47 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1738,7 +1738,7 @@ static inline void scx_rq_clock_invalidate(struct rq *rq)
 	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
 }
 
-extern void switch_class(struct rq *rq, struct task_struct *next);
+extern void scx_maybe_cpu_release(struct rq *rq, struct task_struct *next);
 
 static inline void __put_prev_set_next_scx(struct rq *rq,
 					   struct task_struct *prev,
@@ -1759,7 +1759,7 @@ static inline void __put_prev_set_next_scx(struct rq *rq,
 	if (next->sched_class == &ext_sched_class)
 		return;
 
-	switch_class(rq, next);
+	scx_maybe_cpu_release(rq, next);
 }
 
 #else /* !CONFIG_SCHED_CLASS_EXT */
-- 
2.17.1
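
[A user-space way to exercise this path, assuming a sched_ext scheduler
is loaded so the lock holder runs in the SCX class: take a
PTHREAD_PRIO_INHERIT mutex from a SCHED_NORMAL thread and block a
SCHED_FIFO thread on it, so the holder is PI-boosted to RT and later
deboosted. Untested sketch; creating the FIFO thread needs the
appropriate privilege.]

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t m;

static void *rt_waiter(void *arg)
{
	pthread_mutex_lock(&m);		/* PI-boosts the current owner */
	pthread_mutex_unlock(&m);
	return NULL;
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };
	pthread_mutexattr_t ma;
	pthread_attr_t ta;
	pthread_t t;

	pthread_mutexattr_init(&ma);
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&m, &ma);

	pthread_mutex_lock(&m);		/* SCX task takes the lock */

	pthread_attr_init(&ta);
	pthread_attr_setinheritsched(&ta, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&ta, SCHED_FIFO);
	pthread_attr_setschedparam(&ta, &sp);
	if (pthread_create(&t, &ta, rt_waiter, NULL)) {	/* needs privilege */
		fprintf(stderr, "pthread_create failed\n");
		return 1;
	}

	sleep(1);			/* let the waiter block and boost us */
	/* While boosted, this thread runs as RT and rq->scx.cpu_released
	 * must follow; unlocking deboosts it back into the SCX class. */
	pthread_mutex_unlock(&m);
	pthread_join(t, NULL);
	puts("boost/deboost cycle done");
	return 0;
}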

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-08-18 17:47           ` Peter Zijlstra
@ 2025-08-19  7:36             ` liuwenfang
  0 siblings, 0 replies; 29+ messages in thread
From: liuwenfang @ 2025-08-19  7:36 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: 'Tejun Heo', 'David Vernet',
	'Andrea Righi', 'Changwoo Min',
	'Ingo Molnar', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello, 
> This doesn't even begin to be an adequate changelog.
> 
> And please, don't put an out of line function call in put_prev_set_next_task(.
A changelog has been added accordingly in [PATCH v4 1/3], and __put_prev_set_next_scx() has been changed to an inline function.

However, switch_class() called in __put_prev_set_next_scx() is difficult to convert into an inline function in [PATCH v4 2/3].

Looking forward to your reply.
--
Regards.

Wenfang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation
  2025-08-18 17:43                 ` 'Tejun Heo'
@ 2025-08-19  7:41                   ` liuwenfang
  0 siblings, 0 replies; 29+ messages in thread
From: liuwenfang @ 2025-08-19  7:41 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Peter Zijlstra', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,
> 
> Might as well match the surrounding naming convention and use
> __put_prev_set_next_scx()
It makes sense, and __put_prev_set_next_scx() has been changed to an inline function in [PATCH v4 1/3].
A changelog has been added accordingly.

switch_class() is renamed to scx_maybe_cpu_release() in [PATCH v4 3/3].
Also scx_maybe_cpu_acquire() is added in [PATCH v4 3/3].

However, scx_maybe_cpu_release() called in __put_prev_set_next_scx() is difficult to convert into an inline function in [PATCH v4 3/3].

Looking forward to your reply.
--
Regards.

Wenfang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-08-19  6:55             ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently liuwenfang
  2025-08-19  7:07               ` [PATCH v4 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
@ 2025-08-19  7:47               ` Peter Zijlstra
  2025-08-19  8:47                 ` Re: " liuwenfang
  2025-08-20  0:28                 ` 'Tejun Heo'
  1 sibling, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2025-08-19  7:47 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'Tejun Heo', 'David Vernet',
	'Andrea Righi', 'Changwoo Min',
	'Ingo Molnar', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'


Could you please not thread your new patches onto the old thread? That
makes them near impossible to find.

On Tue, Aug 19, 2025 at 06:55:38AM +0000, liuwenfang wrote:
> Suppose RT task RT1 is running on CPU0 and RT task RT2 is woken on CPU1.
> RT1 goes to sleep, SCX task SCX1 is dispatched to CPU0, and RT2 is
> placed on CPU0:
> 
> CPU0(schedule)                                     CPU1(try_to_wake_up)
> set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
> __schedule                                           select_task_rq # CPU0 is selected
> LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
>   deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
>     pick_next_task # no more RT tasks on rq                 |
>       prev_balance                                          |
>         balance_scx                                         |
>           balance_one                                       |
>             rq->scx.cpu_released = false;                   |
>               consume_global_dsq                            |
>                 consume_dispatch_q                          |
>                   consume_remote_task                       |
>                     UNLOCK rq(0)->lock                      V
>                                                          LOCK rq(0)->lock # succ
>                     deactivate_task # SCX1               ttwu_do_activate
>                     LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
>                        |                                 UNLOCK rq(0)->lock
>                        V
>                     LOCK rq(0)->lock # succ
>                     activate_task # SCX1
>       pick_task # RT2 is picked
>       put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
> UNLOCK rq(0)->lock
> 
> In the end, RT2 runs on CPU0 with rq->scx.cpu_released still false,
> which can lead the BPF scheduler to make improper decisions.
> 
> So, check the sched class of the next task in __put_prev_set_next_scx()
> and fix up the value of rq->scx.cpu_released.

Oh gawd, this is terrible.

Why would you start the pick in balance and then cry when you fail the
pick in pick ?!?

This is also the reason you need that weird CLASS_EXT exception in
prev_balance(), isn't it?

You're now asking for a 3rd call out to do something like:

  ->balance() -- ready a task for pick
  ->pick() -- picks a random other task
  ->put_prev() -- oops, our task didn't get picked, stick it back

Which is bloody ludicrous. So no. We're not doing this.

Why can't pick DTRT ?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-08-19  7:47               ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently Peter Zijlstra
@ 2025-08-19  8:47                 ` liuwenfang
  2025-08-19 10:08                   ` Peter Zijlstra
  2025-08-20  0:28                 ` 'Tejun Heo'
  1 sibling, 1 reply; 29+ messages in thread
From: liuwenfang @ 2025-08-19  8:47 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: 'Tejun Heo', 'David Vernet',
	'Andrea Righi', 'Changwoo Min',
	'Ingo Molnar', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org'

Hello,
> Could you please not thread your new patches onto the old thread? That makes
> them near impossible to find.
I will try to fix it later.
> 
> On Tue, Aug 19, 2025 at 06:55:38AM +0000, liuwenfang wrote:
> > Suppose RT task RT1 is running on CPU0 and RT task RT2 is woken on
> > CPU1. RT1 goes to sleep, SCX task SCX1 is dispatched to CPU0, and
> > RT2 is placed on CPU0:
> >
> > CPU0(schedule)                                     CPU1(try_to_wake_up)
> > set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
> > __schedule                                           select_task_rq # CPU0 is selected
> > LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
> >   deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
> >     pick_next_task # no more RT tasks on rq                 |
> >       prev_balance                                          |
> >         balance_scx                                         |
> >           balance_one                                       |
> >             rq->scx.cpu_released = false;                   |
> >               consume_global_dsq                            |
> >                 consume_dispatch_q                          |
> >                   consume_remote_task                       |
> >                     UNLOCK rq(0)->lock                      V
> >                                                          LOCK rq(0)->lock # succ
> >                     deactivate_task # SCX1               ttwu_do_activate
> >                     LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
> >                        |                                 UNLOCK rq(0)->lock
> >                        V
> >                     LOCK rq(0)->lock # succ
> >                     activate_task # SCX1
> >       pick_task # RT2 is picked
> >       put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
> > UNLOCK rq(0)->lock
> >
> > In the end, RT2 runs on CPU0 with rq->scx.cpu_released still false,
> > which can lead the BPF scheduler to make improper decisions.
> >
> > So, check the sched class of the next task in __put_prev_set_next_scx()
> > and fix up the value of rq->scx.cpu_released.
> 
> Oh gawd, this is terrible.
> 
> Why would you start the pick in balance and then cry when you fail the pick in
> pick ?!?
> 
> This is also the reason you need that weird CLASS_EXT exception in
> prev_balance(), isn't it?
Yeah, you are right, because there is task migration in our exception process.
> 
> You're now asking for a 3rd call out to do something like:
> 
>   ->balance() -- ready a task for pick
We must clarify that the target SCX task is currently located in the global
queue, and its selected CPU may be CPU2; when CPU0 is about to go idle,
this SCX task should be migrated to CPU0.
>   ->pick() -- picks a random other task
The rq lock of CPU0 is released during task migration, and a higher-priority
task may be placed on CPU0's rq, so CPU0 will not always pick the target
SCX task in time.
>   ->put_prev() -- oops, our task didn't get picked, stick it back
The higher-priority task may occupy CPU0 for a long time, so we need to
get the SCX task back to meet its low-latency demand.
> 
> Which is bloody ludicrous. So no. We're not doing this.
> 
> Why can't pick DTRT ?
This is why CPU0 cannot directly pick an SCX task whose task_cpu() is
not CPU0.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re: [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-08-19  8:47                 ` Re: " liuwenfang
@ 2025-08-19 10:08                   ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2025-08-19 10:08 UTC (permalink / raw)
  To: liuwenfang
  Cc: 'Tejun Heo', 'David Vernet',
	'Andrea Righi', 'Changwoo Min',
	'Ingo Molnar', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	'linux-kernel@vger.kernel.org', joelagnelf

On Tue, Aug 19, 2025 at 08:47:38AM +0000, liuwenfang wrote:


> > Oh gawd, this is terrible.
> > 
> > Why would you start the pick in balance and then cry when you fail the pick in
> > pick ?!?
> > 
> > This is also the reason you need that weird CLASS_EXT exception in
> > prev_balance(), isn't it?
> Yeah, you are right, because there is task migration in our exception process.
> > 
> > You're now asking for a 3rd call out to do something like:
> > 
> >   ->balance() -- ready a task for pick
> We must clarify that the target SCX task is currently located in the global
> queue, and its selected CPU may be CPU2; when CPU0 is about to go idle,
> this SCX task should be migrated to CPU0.
> >   ->pick() -- picks a random other task
> The rq lock of CPU0 is released during task migration, and a higher-priority
> task may be placed on CPU0's rq, so CPU0 will not always pick the target
> SCX task in time.
> >   ->put_prev() -- oops, our task didn't get picked, stick it back
> The higher-priority task may occupy CPU0 for a long time, so we need to
> get the SCX task back to meet its low-latency demand.
> > 
> > Which is bloody ludicrous. So no. We're not doing this.
> > 
> > Why can't pick DTRT ?
> This is why CPU0 cannot directly pick an SCX task whose task_cpu() is
> not CPU0.

Can you please have a look at these patches from Joel:

  https://lkml.kernel.org/r/20250809184800.129831-1-joelagnelf@nvidia.com

I think it is dealing with a similar problem, and possibly making your
issue worse. At the same time, it might just be sufficient to solve
both.

It would be really good if we can get rid of this prev_balance() hack and
instead make sched_class::pick_task() able to deal with the whole thing.

Notably, I'm thinking that by passing &rf to pick_task() it becomes
possible to do something like (and I'm sketchy here because I'm never
quite sure how ext actually works):


  pick_task_scx(rq, rf)
  {
  again:
    p = pick_local_task();
    if (!p) {
       /*
	* comment that explains why we can drop rq-lock here
	*/
       unlock(rq, rf);
       ... get task from global thing ...
       lock(rq, rf);
       if (rq->nr_running != rq->ext.nr_running)
	 return RETRY_PICK;
       goto again;
    }

    return p;
  }


And Joel's code seem very close to that, but doesn't seem to deal with
your issue.

Anyway, the above has this: 'rq->nr_running != rq->ext.nr_running'
thing, which is supposed to check if there is a higher class runnable
task, in which case it must 'abort' and re-try the pick. We should do
this every time we drop rq->lock.

Now in the case of a dl-server, we'll obviously have the dl-server be
runnable (in fact, its pick is what gets you here in the first place).
But its presence will then make you re-try the pick.

It might just work, it might need a little help -- please carefully
consider that case.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-08-19  7:47               ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently Peter Zijlstra
  2025-08-19  8:47                 ` Re: " liuwenfang
@ 2025-08-20  0:28                 ` 'Tejun Heo'
  2025-08-20  9:18                   ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: 'Tejun Heo' @ 2025-08-20  0:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: liuwenfang, 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Juri Lelli', 'Vincent Guittot',
	'Dietmar Eggemann', 'Steven Rostedt',
	'Ben Segall', 'Mel Gorman',
	'Valentin Schneider',
	'linux-kernel@vger.kernel.org', Joel Fernandes

Hello, Peter.

(cc'ing Joel for the @rf addition to pick_task())

On Tue, Aug 19, 2025 at 09:47:36AM +0200, Peter Zijlstra wrote:
...
> You're now asking for a 3rd call out to do something like:
> 
>   ->balance() -- ready a task for pick
>   ->pick() -- picks a random other task
>   ->put_prev() -- oops, our task didn't get picked, stick it back
> 
> Which is bloody ludicrous. So no. We're not doing this.
> 
> Why can't pick DTRT ?

This is unfortunate, but, given how things are set up right now, I think we
probably need the last one. Taking a step back and also considering the
proposed @rf addition to pick():

- The reason why SCX needs to do most of its dispatch operations in
  balance() is because the kernel side doesn't know which tasks are going to
  execute on which CPU until the task is actually picked for execution, so
  all picking must be preceded by balance() where tasks can be moved across
  rqs.

- There's a gap between balance() and pick_task() where a successful return
  from balance() doesn't guarantee that the corresponding pick() would be
  called. This seems intentional to guarantee that no matter what happens
  during balance(), pick_task() of the highest priority class with a pending
  task is guaranteed to get the CPU.

  This guarantee changes if we add @rf to pick_task() and let it unlock and
  relock. A higher priority task may get queued while the rq lock is
  released and then the lower priority pick_task() may still return a task
  of its own. This should be resolvable although it may not be completely
  trivial. We would need to shift clear_tsk_need_resched() to before the
  pick_task() calls, and wakeup_preempt() would probably need more
  complications to guarantee that resched_curr() is not skipped while
  scheduling is taking place.

- SCX's ops.cpu_acquire() and .cpu_release() are to tell the BPF scheduler
  that a CPU is available for running SCX tasks or not. We want to tell the
  BPF side that a CPU became available before its ops.dispatch() is called -
  ie. before balance(). So, IIUC, this is where the problem is. Because
  there's a gap between balance() and pick_task(), the CPU might get taken
  by a higher priority sched class inbetween. If that happens, we need to
  tell the BPF scheduler that it lost the CPU. However, if the previous task
  wasn't a SCX one, there's currently no place to tell so.

  IOW, SCX needs to invoke ops.cpu_released() when a CPU is taken between
  its balance() and pick_task(); however, that can happen when both prev and
  next tasks are !SCX tasks, so it needs something which is always called.

If @rf is added to pick_task() so that we can merge balance() into
pick_task(), that'd simplify these: SCX wouldn't need balance index
boosting and could handle cpu_acquire/release() within pick_task(). What
do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-08-20  0:28                 ` 'Tejun Heo'
@ 2025-08-20  9:18                   ` Peter Zijlstra
  2025-08-20 16:52                     ` 'Tejun Heo'
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2025-08-20  9:18 UTC (permalink / raw)
  To: 'Tejun Heo'
  Cc: liuwenfang, 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Juri Lelli', 'Vincent Guittot',
	'Dietmar Eggemann', 'Steven Rostedt',
	'Ben Segall', 'Mel Gorman',
	'Valentin Schneider',
	'linux-kernel@vger.kernel.org', Joel Fernandes

On Tue, Aug 19, 2025 at 02:28:17PM -1000, 'Tejun Heo' wrote:
> Hello, Peter.
> 
> (cc'ing Joel for the @rf addition to pick_task())
> 
> On Tue, Aug 19, 2025 at 09:47:36AM +0200, Peter Zijlstra wrote:
> ...
> > You're now asking for a 3rd call out to do something like:
> > 
> >   ->balance() -- ready a task for pick
> >   ->pick() -- picks a random other task
> >   ->put_prev() -- oops, our task didn't get picked, stick it back
> > 
> > Which is bloody ludicrous. So no. We're not doing this.
> > 
> > Why can't pick DTRT ?
> 
> This is unfortunate, but, given how things are set up right now, I think we
> probably need the last one. Taking a step back and also considering the
> proposed @rf addition to pick():
> 
> - The reason why SCX needs to do most of its dispatch operations in
>   balance() is because the kernel side doesn't know which tasks are going to
>   execute on which CPU until the task is actually picked for execution, so
>   all picking must be preceded by balance() where tasks can be moved across
>   rqs.
> 
> - There's a gap between balance() and pick_task() where a successful return
>   from balance() doesn't guarantee that the corresponding pick() would be
>   called. This seems intentional to guarantee that no matter what happens
>   during balance(), pick_task() of the highest priority class with a pending
>   task is guaranteed to get the CPU.

Yep, somewhat important that ;-)

>   This guarantee changes if we add @rf to pick_task() and let it unlock and
>   relock. A higher priority task may get queued while the rq lock is
>   released and then the lower priority pick_task() may still return a task
>   of its own.

No, this would be broken. This guarantee must not change.

What you can do however is something like:

again:
   p = pick_local_task();
   if (!p) {
     unlock(rq, rf);
     // get yourself a local task
     lock(rq, rf);
     if (higher-class-task-available(rq)) {
       // roll back whatever state
       return RETRY_TASK;
     }
     goto again;
   }

   return p;

> - SCX's ops.cpu_acquire() and .cpu_release() are to tell the BPF scheduler
>   that a CPU is available for running SCX tasks or not. We want to tell the
>   BPF side that a CPU became available before its ops.dispatch() is called -
>   ie. before balance(). So, IIUC, this is where the problem is. Because
>   there's a gap between balance() and pick_task(), the CPU might get taken
>   by a higher priority sched class inbetween. If that happens, we need to
>   tell the BPF scheduler that it lost the CPU. However, if the previous task
>   wasn't a SCX one, there's currently no place to tell so.
> 
>   IOW, SCX needs to invoke ops.cpu_released() when a CPU is taken between
>   its balance() and pick_task(); however, that can happen when both prev and
>   next tasks are !SCX tasks, so it needs something which is always called.
> 
> If @rf is added to pick_task() so that we can merge balance() into
> pick_task(), that'd simplify these: SCX wouldn't need balance index
> boosting and could handle cpu_acquire/release() within pick_task(). What
> do you think?

That's more or less what I suggested here:

  https://lkml.kernel.org/r/20250819100838.GH3245006@noisy.programming.kicks-ass.net
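
[A compilable toy of the RETRY_TASK shape sketched above, with invented
names: "local_tasks"/"global_tasks" stand in for the local and global
DSQs, "higher_class" for an RT arrival while the lock was dropped. It
only shows the roll-back-and-retry control flow, not sched_ext's actual
interfaces; the toy is single-threaded, so the unlocked access to
global_tasks is safe here.]

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define RETRY_TASK	-1

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static int local_tasks;		/* stands in for the local DSQ */
static int global_tasks = 1;	/* stands in for the global DSQ */
static bool higher_class;	/* RT task arrived while unlocked */

/* Called with rq_lock held; returns a task id, 0 for idle, or RETRY_TASK. */
static int pick_task(void)
{
	int pulled;
again:
	if (local_tasks) {
		local_tasks--;
		return 42;	/* picked a local task */
	}

	pthread_mutex_unlock(&rq_lock);	/* migration may block */
	pulled = global_tasks ? (global_tasks--, 1) : 0;
	pthread_mutex_lock(&rq_lock);

	if (higher_class) {
		local_tasks += pulled;	/* roll back whatever state */
		return RETRY_TASK;	/* let the higher class win the pick */
	}
	if (pulled)
		goto again;
	return 0;
}

int main(void)
{
	pthread_mutex_lock(&rq_lock);
	printf("pick -> %d\n", pick_task());
	pthread_mutex_unlock(&rq_lock);
	return 0;
}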



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently
  2025-08-20  9:18                   ` Peter Zijlstra
@ 2025-08-20 16:52                     ` 'Tejun Heo'
  0 siblings, 0 replies; 29+ messages in thread
From: 'Tejun Heo' @ 2025-08-20 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: liuwenfang, 'David Vernet', 'Andrea Righi',
	'Changwoo Min', 'Ingo Molnar',
	'Juri Lelli', 'Vincent Guittot',
	'Dietmar Eggemann', 'Steven Rostedt',
	'Ben Segall', 'Mel Gorman',
	'Valentin Schneider',
	'linux-kernel@vger.kernel.org', Joel Fernandes

Hello,

On Wed, Aug 20, 2025 at 11:18:10AM +0200, Peter Zijlstra wrote:
> >   This guarantee changes if we add @rf to pick_task() and let it unlock and
> >   relock. A higher priority task may get queued while the rq lock is
> >   released and then the lower priority pick_task() may still return a task
> >   of its own.
> 
> No, this would be broken. This guarantee must not change.
> 
> What you can do however is something like:
> 
> again:
>    p = pick_local_task();
>    if (!p) {
>      unlock(rq, rf);
>      // get yourself a local task
>      lock(rq, rf);
>      if (higher-class-task-available(rq)) {
>        // roll back whatever state
>        return RETRY_TASK;
>      }
>      goto again;
>    }
> 
>    return p;

Isn't that kinda nasty, as we'd have to do that in every pick_task()? What'd
be equivalent but in a central location would be making sure that
wakeup_preempt() asserts a resched event if it hits a race window like that.
I haven't thought too much about how that should be done exactly but the
races should be pretty rare, so it'd be surprising if the behavior
difference is noticeable.

Thanks.

-- 
tejun
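
[Sketching that alternative in the same toy style, under an invented
"in_pick" flag marking the window where the picker has dropped the rq
lock: the waker, not the picker, detects the race and asserts a resched.
This is only an illustration of the idea, not existing kernel code.]

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static bool in_pick, need_resched;

/* wakeup_preempt() side: flag the racing pick instead of being scanned for. */
static void *wakeup_rt(void *arg)
{
	pthread_mutex_lock(&rq_lock);
	if (in_pick)
		need_resched = true;	/* assert a resched event */
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_mutex_lock(&rq_lock);
	in_pick = true;			/* pick_task() begins */

	pthread_mutex_unlock(&rq_lock);	/* pick drops the lock to pull work */
	pthread_create(&t, NULL, wakeup_rt, NULL);
	pthread_join(t, NULL);
	pthread_mutex_lock(&rq_lock);	/* pick retakes the lock */

	in_pick = false;
	if (need_resched)
		puts("retry the pick: a higher class arrived in the window");
	pthread_mutex_unlock(&rq_lock);
	return 0;
}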

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-08-20 16:52 UTC | newest]

Thread overview: 29+ messages
2025-06-21  4:09 [PATCH] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently liuwenfang
2025-06-23 19:50 ` 'Tejun Heo'
2025-06-28  6:50   ` [PATCH v2 1/2] " liuwenfang
2025-07-17 21:38     ` 'Tejun Heo'
2025-07-20  9:20       ` liuwenfang
2025-07-20  9:38         ` [PATCH v3 2/3] " liuwenfang
2025-08-12  1:26           ` 'Tejun Heo'
2025-07-20  9:41         ` [PATCH v3 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
2025-08-12  1:31           ` 'Tejun Heo'
2025-08-19  6:52           ` [PATCH v4 1/3] sched_ext: Fix pnt_seq calculation when picking the next task liuwenfang
2025-08-19  6:55             ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently liuwenfang
2025-08-19  7:07               ` [PATCH v4 3/3] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
2025-08-19  7:47               ` [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently Peter Zijlstra
2025-08-19  8:47                 ` Re: " liuwenfang
2025-08-19 10:08                   ` Peter Zijlstra
2025-08-20  0:28                 ` 'Tejun Heo'
2025-08-20  9:18                   ` Peter Zijlstra
2025-08-20 16:52                     ` 'Tejun Heo'
2025-06-28  7:20   ` [PATCH v2 2/2] sched_ext: Fix cpu_released while changing sched policy of the running task liuwenfang
2025-07-17 21:48     ` 'Tejun Heo'
2025-07-18  9:06       ` liuwenfang
2025-07-20  9:36         ` [PATCH v3 1/3] sched_ext: Fix pnt_seq calculation liuwenfang
2025-08-12  0:03           ` 'Tejun Heo'
2025-08-12  0:30             ` 'Tejun Heo'
2025-08-18 10:45               ` liuwenfang
2025-08-18 17:43                 ` 'Tejun Heo'
2025-08-19  7:41                   ` liuwenfang
2025-08-18 17:47           ` Peter Zijlstra
2025-08-19  7:36             ` liuwenfang
