* [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
@ 2026-06-27 8:16 Sechang Lim
2026-06-29 4:11 ` K Prateek Nayak
0 siblings, 1 reply; 12+ messages in thread
From: Sechang Lim @ 2026-06-27 8:16 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Gabriele Monaco,
linux-kernel
set_tsk_need_resched() tests TIF_NEED_RESCHED, calls
__trace_set_need_resched() if the flag is clear, then sets it via
set_tsk_thread_flag(). A BPF raw_tp program attached to
sched_set_need_resched executes synchronously inside __bpf_trace_run().
On return, __bpf_trace_run() drops the RCU lock with
rcu_read_unlock_migrate(), which on the preempt-or-BH-disabled path
calls set_need_resched_current() -> set_tsk_need_resched() again.
set_tsk_thread_flag() follows the tracepoint call, so every re-entrant
frame sees TIF_NEED_RESCHED clear and calls __trace_set_need_resched()
again:
BUG: TASK stack guard page was hit at ffffc9001224ff98
Oops: stack guard page: 0000 [#1] SMP KASAN PTI
RIP: 0010:__bpf_trace_sched_set_need_resched_tp+0x1c/0x190
Call Trace:
trace_sched_set_need_resched_tp+0x110/0x130
set_tsk_need_resched include/linux/sched.h:2076
set_need_resched_current include/linux/sched.h:2094
rcu_read_unlock_special+0x43a/0x440
__rcu_read_unlock+0x9e/0x120
rcu_read_unlock_migrate+0xa9/0x240
__bpf_trace_run+0x131/0x180
bpf_trace_run3+0x333/0x430
__bpf_trace_sched_set_need_resched_tp+0x13a/0x190
trace_sched_set_need_resched_tp+0x110/0x130
set_tsk_need_resched include/linux/sched.h:2076
...
__resched_curr() has the same ordering, firing the tracepoint before
setting the flag via set_ti_thread_flag() or set_nr_and_not_polling().
Fix it for consistency.
Replace the separate test_tsk_thread_flag() + set_tsk_thread_flag() pair
in set_tsk_need_resched() with test_and_set_tsk_thread_flag(). In
__resched_curr(), move the tracepoint call after the flag is set in
each path.
Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
v2:
- also fix __resched_curr (Peter Zijlstra)
v1:
- https://lore.kernel.org/all/20260625065656.392182-1-rhkrqnwk98@gmail.com/
include/linux/sched.h | 5 ++---
kernel/sched/core.c | 7 +++++--
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..c9efd08dae92 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2071,10 +2071,9 @@ static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
- if (tracepoint_enabled(sched_set_need_resched_tp) &&
- !test_tsk_thread_flag(tsk, TIF_NEED_RESCHED))
+ if (!test_and_set_tsk_thread_flag(tsk, TIF_NEED_RESCHED) &&
+ tracepoint_enabled(sched_set_need_resched_tp))
__trace_set_need_resched(tsk, TIF_NEED_RESCHED);
- set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}
static inline void clear_tsk_need_resched(struct task_struct *tsk)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d3c6..b358fac315d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1172,6 +1172,7 @@ static void __resched_curr(struct rq *rq, int tif)
struct task_struct *curr = rq->curr;
struct thread_info *cti = task_thread_info(curr);
int cpu;
+ bool need_ipi;
lockdep_assert_rq_held(rq);
@@ -1187,15 +1188,17 @@ static void __resched_curr(struct rq *rq, int tif)
cpu = cpu_of(rq);
- trace_sched_set_need_resched_tp(curr, cpu, tif);
if (cpu == smp_processor_id()) {
set_ti_thread_flag(cti, tif);
if (tif == TIF_NEED_RESCHED)
set_preempt_need_resched();
+ trace_sched_set_need_resched_tp(curr, cpu, tif);
return;
}
- if (set_nr_and_not_polling(cti, tif)) {
+ need_ipi = set_nr_and_not_polling(cti, tif);
+ trace_sched_set_need_resched_tp(curr, cpu, tif);
+ if (need_ipi) {
if (tif == TIF_NEED_RESCHED)
smp_send_reschedule(cpu);
} else {
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-27 8:16 [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() Sechang Lim
@ 2026-06-29 4:11 ` K Prateek Nayak
2026-06-29 12:40 ` Gabriele Monaco
2026-06-30 7:58 ` Sechang Lim
0 siblings, 2 replies; 12+ messages in thread
From: K Prateek Nayak @ 2026-06-29 4:11 UTC (permalink / raw)
To: Sechang Lim, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Gabriele Monaco, linux-kernel
Hello Sechang,
On 6/27/2026 1:46 PM, Sechang Lim wrote:
> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d3c6..b358fac315d0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1172,6 +1172,7 @@ static void __resched_curr(struct rq *rq, int tif)
> struct task_struct *curr = rq->curr;
> struct thread_info *cti = task_thread_info(curr);
> int cpu;
> + bool need_ipi;
nit. This declaration can go before "int cpu;" to preserve the
reverse x-mas arrangement for the independent variables.
>
> lockdep_assert_rq_held(rq);
>
> @@ -1187,15 +1188,17 @@ static void __resched_curr(struct rq *rq, int tif)
>
> cpu = cpu_of(rq);
>
> - trace_sched_set_need_resched_tp(curr, cpu, tif);
> if (cpu == smp_processor_id()) {
> set_ti_thread_flag(cti, tif);
> if (tif == TIF_NEED_RESCHED)
> set_preempt_need_resched();
> + trace_sched_set_need_resched_tp(curr, cpu, tif);
> return;
> }
>
> - if (set_nr_and_not_polling(cti, tif)) {
> + need_ipi = set_nr_and_not_polling(cti, tif);
> + trace_sched_set_need_resched_tp(curr, cpu, tif);
Unrelated to changes here: This tracepoint seems to be missing from
wake_up_idle_cpu() which modifies the ti->flags without taking the
rq_lock.
A newly enqueued timer can race with the wakeup on idle CPU and you can
see a wakeup + sched_switch without seeing a corresponding
trace_sched_set_need_resched_tp() for the CPU.
Gabriele, is that a concern for RV?
> + if (need_ipi) {
> if (tif == TIF_NEED_RESCHED)
> smp_send_reschedule(cpu);
> } else {
> --
> 2.43.0
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-29 4:11 ` K Prateek Nayak
@ 2026-06-29 12:40 ` Gabriele Monaco
2026-06-29 17:35 ` K Prateek Nayak
2026-06-30 7:58 ` Sechang Lim
1 sibling, 1 reply; 12+ messages in thread
From: Gabriele Monaco @ 2026-06-29 12:40 UTC (permalink / raw)
To: K Prateek Nayak, Sechang Lim, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel
On Mon, 2026-06-29 at 09:41 +0530, K Prateek Nayak wrote:
> Unrelated to changes here: This tracepoint seems to be missing from
> wake_up_idle_cpu() which modifies the ti->flags without taking the
> rq_lock.
>
> A newly enqueued timer can race with the wakeup on idle CPU and you can
> see a wakeup + sched_switch without seeing a corresponding
> trace_sched_set_need_resched_tp() for the CPU.
>
> Gabriele, is that a concern for RV?
Thanks Prateek for pointing this out. So as I'm understanding, we are missing
the tracepoint in that path, but even after putting it, this race might cause a
sched_switch to be visible concurrently or before the corresponding need_resched
for that CPU, right?
That's potentially an issue.
Currently the only model really relying on the need_resched event is nrp,
validating every kernel preemption has a corresponding need_resched set.
The scenario passing through wake_up_idle_cpu() is setting the flag on a CPU
that is idle (causing __schedule(SM_IDLE)), this doesn't count as a preemption
for the model. Or am I missing something?
That said, we should indeed add the tracepoint to that path and probably adapt
the monitor if that's making it fail indirectly.
Thanks,
Gabriele
> > + if (need_ipi) {
> > if (tif == TIF_NEED_RESCHED)
> > smp_send_reschedule(cpu);
> > } else {
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-29 12:40 ` Gabriele Monaco
@ 2026-06-29 17:35 ` K Prateek Nayak
2026-06-30 8:58 ` Gabriele Monaco
0 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2026-06-29 17:35 UTC (permalink / raw)
To: Gabriele Monaco, Sechang Lim, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel
Hello Gabriele,
On 6/29/2026 6:10 PM, Gabriele Monaco wrote:
> On Mon, 2026-06-29 at 09:41 +0530, K Prateek Nayak wrote:
>> Unrelated to changes here: This tracepoint seems to be missing from
>> wake_up_idle_cpu() which modifies the ti->flags without taking the
>> rq_lock.
>>
>> A newly enqueued timer can race with the wakeup on idle CPU and you can
>> see a wakeup + sched_switch without seeing a corresponding
>> trace_sched_set_need_resched_tp() for the CPU.
>>
>> Gabriele, is that a concern for RV?
>
> Thanks Prateek for pointing this out. So as I'm understanding, we are missing
> the tracepoint in that path, but even after putting it, this race might cause a
> sched_switch to be visible concurrently or before the corresponding need_resched
> for that CPU, right?
Ack but if that comment in wake_up_idle_cpu() is accurate, it is a very
infrequent operation and we might be able to get away with grabbing the
rq_lock and synchronizing it too.
>
> That's potentially an issue.
>
> Currently the only model really relying on the need_resched event is nrp,
> validating every kernel preemption has a corresponding need_resched set.
> The scenario passing through wake_up_idle_cpu() is setting the flag on a CPU
> that is idle (causing __schedule(SM_IDLE)), this doesn't count as a preemption
> for the model. Or am I missing something?
I just saw the npr monitor and yes, the SM_IDLE path will not count as
preemption and the state machine should just stay at
"any_thread_running" from the "schedule_entry" edge.
Sorry for the scare!
> That said, we should indeed add the tracepoint to that path and probably adapt
> the monitor if that's making it fail indirectly.
For the npr use-case, I think the current scheme is fine since
only SM_PREEMPt counts as a "schedule_entry_preempt" transition
and only that can transition the state machine out of the
"any_thread_running" state.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-29 4:11 ` K Prateek Nayak
2026-06-29 12:40 ` Gabriele Monaco
@ 2026-06-30 7:58 ` Sechang Lim
1 sibling, 0 replies; 12+ messages in thread
From: Sechang Lim @ 2026-06-30 7:58 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Gabriele Monaco, linux-kernel
On Mon, Jun 29, 2026 at 09:41:15AM +0530, K Prateek Nayak wrote:
>Hello Sechang,
>
>On 6/27/2026 1:46 PM, Sechang Lim wrote:
>> static inline void clear_tsk_need_resched(struct task_struct *tsk)
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index b8871449d3c6..b358fac315d0 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1172,6 +1172,7 @@ static void __resched_curr(struct rq *rq, int tif)
>> struct task_struct *curr = rq->curr;
>> struct thread_info *cti = task_thread_info(curr);
>> int cpu;
>> + bool need_ipi;
>
>nit. This declaration can go before "int cpu;" to preserve the
>reverse x-mas arrangement for the independent variables.
>
Will fix in v3. Thanks!
Bests,
Sechang
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-29 17:35 ` K Prateek Nayak
@ 2026-06-30 8:58 ` Gabriele Monaco
2026-06-30 16:16 ` K Prateek Nayak
0 siblings, 1 reply; 12+ messages in thread
From: Gabriele Monaco @ 2026-06-30 8:58 UTC (permalink / raw)
To: K Prateek Nayak, Sechang Lim, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel
On Mon, 2026-06-29 at 23:05 +0530, K Prateek Nayak wrote:
> > That said, we should indeed add the tracepoint to that path and probably
> > adapt the monitor if that's making it fail indirectly.
>
> For the npr use-case, I think the current scheme is fine since
> only SM_PREEMPt counts as a "schedule_entry_preempt" transition
> and only that can transition the state machine out of the
> "any_thread_running" state.
Right, the monitor can live without it, but I wonder if we need to put that
tracepoint for correctness sake. After all, however unlikely, that's a
need_resched too.
(then if the monitor really saw a need_resched after it's sched_entry, it would
stay erroneously in rescheduling).
Anyway this all isn't related to the patch.
Thanks,
Gabriele
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-30 8:58 ` Gabriele Monaco
@ 2026-06-30 16:16 ` K Prateek Nayak
2026-06-30 20:34 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2026-06-30 16:16 UTC (permalink / raw)
To: Gabriele Monaco, Sechang Lim, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel
Hello Gabriele,
On 6/30/2026 2:28 PM, Gabriele Monaco wrote:
> On Mon, 2026-06-29 at 23:05 +0530, K Prateek Nayak wrote:
>>> That said, we should indeed add the tracepoint to that path and probably
>>> adapt the monitor if that's making it fail indirectly.
>>
>> For the npr use-case, I think the current scheme is fine since
>> only SM_PREEMPt counts as a "schedule_entry_preempt" transition
>> and only that can transition the state machine out of the
>> "any_thread_running" state.
>
> Right, the monitor can live without it, but I wonder if we need to put that
> tracepoint for correctness sake. After all, however unlikely, that's a
> need_resched too.
>
> (then if the monitor really saw a need_resched after it's sched_entry, it would
> stay erroneously in rescheduling).
>
> Anyway this all isn't related to the patch.
Ack! Would something like this work for completeness:
(Lightly tested; Based on current tip:sched/core)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96226707c2f6..934f540d0d3f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { }
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
+static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
{
- return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
+ struct task_struct *curr = rq->curr;
+ struct thread_info *ti = task_thread_info(curr);
+ unsigned long old_flags = fetch_or(&ti->flags, 1 << tif);
+
+ if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif)))
+ trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
+
+ return !(old_flags & _TIF_POLLING_NRFLAG);
}
/*
@@ -1076,8 +1083,11 @@ static bool set_nr_if_polling(struct task_struct *p)
}
#else
-static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
+static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
{
+ struct task_struct *curr = rq->curr;
+ struct thread_info *ti = task_thread_info(curr);
+
set_ti_thread_flag(ti, tif);
return true;
}
@@ -1202,15 +1212,17 @@ static void __resched_curr(struct rq *rq, int tif)
cpu = cpu_of(rq);
- trace_sched_set_need_resched_tp(curr, cpu, tif);
if (cpu == smp_processor_id()) {
- set_ti_thread_flag(cti, tif);
+ int set = test_and_set_ti_thread_flag(cti, tif);
+
+ if (trace_sched_set_need_resched_tp_enabled() && !set)
+ trace_call__sched_set_need_resched_tp(curr, cpu, tif);
if (tif == TIF_NEED_RESCHED)
set_preempt_need_resched();
return;
}
- if (set_nr_and_not_polling(cti, tif)) {
+ if (set_nr_and_not_polling(rq, tif)) {
if (tif == TIF_NEED_RESCHED)
smp_send_reschedule(cpu);
} else {
@@ -1350,7 +1362,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
+ if (set_nr_and_not_polling(rq, TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
---
>
> Thanks,
> Gabriele
>
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-30 16:16 ` K Prateek Nayak
@ 2026-06-30 20:34 ` Peter Zijlstra
2026-07-01 6:51 ` K Prateek Nayak
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-06-30 20:34 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
On Tue, Jun 30, 2026 at 09:46:11PM +0530, K Prateek Nayak wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 96226707c2f6..934f540d0d3f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { }
> * this avoids any races wrt polling state changes and thereby avoids
> * spurious IPIs.
> */
> -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> +static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
> {
> - return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
> + struct task_struct *curr = rq->curr;
> + struct thread_info *ti = task_thread_info(curr);
> + unsigned long old_flags = fetch_or(&ti->flags, 1 << tif);
> +
> + if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif)))
> + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
> +
> + return !(old_flags & _TIF_POLLING_NRFLAG);
> }
>
> /*
> @@ -1076,8 +1083,11 @@ static bool set_nr_if_polling(struct task_struct *p)
> }
>
> #else
> -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> +static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
> {
> + struct task_struct *curr = rq->curr;
> + struct thread_info *ti = task_thread_info(curr);
> +
> set_ti_thread_flag(ti, tif);
> return true;
> }
This !POLLING thing also needs tracing, no?
> @@ -1202,15 +1212,17 @@ static void __resched_curr(struct rq *rq, int tif)
>
> cpu = cpu_of(rq);
>
> - trace_sched_set_need_resched_tp(curr, cpu, tif);
> if (cpu == smp_processor_id()) {
> - set_ti_thread_flag(cti, tif);
> + int set = test_and_set_ti_thread_flag(cti, tif);
> +
> + if (trace_sched_set_need_resched_tp_enabled() && !set)
> + trace_call__sched_set_need_resched_tp(curr, cpu, tif);
> if (tif == TIF_NEED_RESCHED)
> set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(cti, tif)) {
> + if (set_nr_and_not_polling(rq, tif)) {
> if (tif == TIF_NEED_RESCHED)
> smp_send_reschedule(cpu);
> } else {
> @@ -1350,7 +1362,7 @@ static void wake_up_idle_cpu(int cpu)
> * and testing of the above solutions didn't appear to report
> * much benefits.
> */
> - if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
> + if (set_nr_and_not_polling(rq, TIF_NEED_RESCHED))
> smp_send_reschedule(cpu);
> else
> trace_sched_wake_idle_without_ipi(cpu);
> ---
>
> >
> > Thanks,
> > Gabriele
> >
>
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-06-30 20:34 ` Peter Zijlstra
@ 2026-07-01 6:51 ` K Prateek Nayak
2026-07-01 6:54 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2026-07-01 6:51 UTC (permalink / raw)
To: Peter Zijlstra, Gabriele Monaco
Cc: Sechang Lim, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel
Hello Peter, Gabriele,
On 7/1/2026 2:04 AM, Peter Zijlstra wrote:
>> -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
>> +static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
>> {
>> + struct task_struct *curr = rq->curr;
>> + struct thread_info *ti = task_thread_info(curr);
>> +
>> set_ti_thread_flag(ti, tif);
>> return true;
>> }
>
> This !POLLING thing also needs tracing, no?
Indeed! I'm stupid.
Here is the full diff with set_nr_if_polling() covered too on top of
tip:sched/core + Sechang's v3 at
https://lore.kernel.org/lkml/20260630084750.2792851-1-rhkrqnwk98@gmail.com/
for completeness.
Let me know if you prefer folding it in or if this needs to be a
separate patch.
(lightly testet)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7cbd541f656f..bd2f7fb87dc9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { }
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
+static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
{
- return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
+ struct task_struct *curr = rq->curr;
+ struct thread_info *ti = task_thread_info(curr);
+ unsigned long old_flags = fetch_or(&ti->flags, 1 << tif);
+
+ if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif)))
+ trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
+
+ return !(old_flags & _TIF_POLLING_NRFLAG);
}
/*
@@ -1072,13 +1079,20 @@ static bool set_nr_if_polling(struct task_struct *p)
return true;
} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));
+ trace_sched_set_need_resched_tp(p, task_cpu(p), TIF_NEED_RESCHED);
return true;
}
#else
-static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
+static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
{
- set_ti_thread_flag(ti, tif);
+ struct task_struct *curr = rq->curr;
+ struct thread_info *ti = task_thread_info(curr);
+ int set = test_and_set_ti_thread_flag(ti, tif);
+
+ if (trace_sched_set_need_resched_tp_enabled() && !set)
+ trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
+
return true;
}
@@ -1186,7 +1200,6 @@ static void __resched_curr(struct rq *rq, int tif)
{
struct task_struct *curr = rq->curr;
struct thread_info *cti = task_thread_info(curr);
- bool need_ipi;
int cpu;
lockdep_assert_rq_held(rq);
@@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif)
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
- set_ti_thread_flag(cti, tif);
+ int set = test_and_set_ti_thread_flag(cti, tif);
+
+ if (trace_sched_set_need_resched_tp_enabled() && !set)
+ trace_call__sched_set_need_resched_tp(curr, cpu, tif);
if (tif == TIF_NEED_RESCHED)
set_preempt_need_resched();
- trace_sched_set_need_resched_tp(curr, cpu, tif);
return;
}
- need_ipi = set_nr_and_not_polling(cti, tif);
- trace_sched_set_need_resched_tp(curr, cpu, tif);
- if (need_ipi) {
+ if (set_nr_and_not_polling(rq, tif)) {
if (tif == TIF_NEED_RESCHED)
smp_send_reschedule(cpu);
} else {
@@ -1353,7 +1366,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
+ if (set_nr_and_not_polling(rq, TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
---
Now, on a separate note to Gabriele, this trips up RV's sched:nrp monitor
when running sched-messaging because of the following race:
CPU0 CPU1
==== ====
<Any thread running>
... /* Wakes up a task on CPU0 */
<IRQ> rq_lock(rq0)
__resched_curr(rq0)
set = test_and_set_ti_thread_flag(curr, 0, TIF_NEED_RESCHED);
<... interrupted by something before tracepoint hits>
!!! rq0->curr has NEED_RESCHED set but trace_sched_set_need_resched_tp() is not called !!!
</IRQ>
raw_irqentry_exit_cond_resched()
if (!preempt_count() /* Masks PREEMPT_NEED_RESCHED */) {
if (need_resched() /* Only check TIF flags */) {
preempt_schedule_irq()
schedule(SM_PREEMPT)
trace_sched_entry_tp(true /* SM_PREEMPT */)
!!! SPLAT: rv: monitor nrp does not allow event schedule_entry_preempt on state any_thread_running !!!
Easy way to solve this is by moving the trace_sched_entry_tp() within
the rq_lock critical section. SM_PREEMPT has to happen on a
!POLLING CPU and __resched_curr() is always under the rq_lock() so
the __schedule() on a remote CPU cannot race with it before
trace_sched_set_need_resched_tp() is executed.
(lightly tested again)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd2f7fb87dc93..961c325feca84 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7086,9 +7086,6 @@ static void __sched notrace __schedule(int sched_mode)
struct rq *rq;
int cpu;
- /* Trace preemptions consistently with task switches */
- trace_sched_entry_tp(sched_mode == SM_PREEMPT);
-
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
@@ -7121,6 +7118,9 @@ static void __sched notrace __schedule(int sched_mode)
rq_lock(rq, &rf);
smp_mb__after_spinlock();
+ /* Trace preemptions consistently with task switches */
+ trace_sched_entry_tp(sched_mode == SM_PREEMPT);
+
hrtick_schedule_enter(rq);
/* Promote REQ to ACT */
---
I'm not sure if this is a problem with current mainline but I did trip
it in my testing and the above seems to solve it - your mileage may vary
:-)
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-07-01 6:51 ` K Prateek Nayak
@ 2026-07-01 6:54 ` Peter Zijlstra
2026-07-01 8:09 ` K Prateek Nayak
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-07-01 6:54 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
On Wed, Jul 01, 2026 at 12:21:14PM +0530, K Prateek Nayak wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7cbd541f656f..bd2f7fb87dc9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { }
> * this avoids any races wrt polling state changes and thereby avoids
> * spurious IPIs.
> */
> -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> +static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
> {
> - return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
> + struct task_struct *curr = rq->curr;
> + struct thread_info *ti = task_thread_info(curr);
> + unsigned long old_flags = fetch_or(&ti->flags, 1 << tif);
> +
> + if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif)))
> + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
> +
> + return !(old_flags & _TIF_POLLING_NRFLAG);
> }
>
> /*
> @@ -1072,13 +1079,20 @@ static bool set_nr_if_polling(struct task_struct *p)
> return true;
> } while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));
>
> + trace_sched_set_need_resched_tp(p, task_cpu(p), TIF_NEED_RESCHED);
> return true;
> }
>
> #else
> -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> +static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
> {
> - set_ti_thread_flag(ti, tif);
> + struct task_struct *curr = rq->curr;
> + struct thread_info *ti = task_thread_info(curr);
> + int set = test_and_set_ti_thread_flag(ti, tif);
> +
> + if (trace_sched_set_need_resched_tp_enabled() && !set)
> + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
> +
> return true;
> }
>
> @@ -1186,7 +1200,6 @@ static void __resched_curr(struct rq *rq, int tif)
> {
> struct task_struct *curr = rq->curr;
> struct thread_info *cti = task_thread_info(curr);
> - bool need_ipi;
> int cpu;
>
> lockdep_assert_rq_held(rq);
> @@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif)
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - set_ti_thread_flag(cti, tif);
> + int set = test_and_set_ti_thread_flag(cti, tif);
> +
> + if (trace_sched_set_need_resched_tp_enabled() && !set)
> + trace_call__sched_set_need_resched_tp(curr, cpu, tif);
> if (tif == TIF_NEED_RESCHED)
> set_preempt_need_resched();
> - trace_sched_set_need_resched_tp(curr, cpu, tif);
> return;
> }
I can't help but notice that the local and !POLLING cases show
remarkable similarity. Just not sure extracting that isn't going to make
a mess.
Anyway, yes this looks about right.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-07-01 6:54 ` Peter Zijlstra
@ 2026-07-01 8:09 ` K Prateek Nayak
2026-07-01 8:49 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2026-07-01 8:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel
Hello Peter,
On 7/1/2026 12:24 PM, Peter Zijlstra wrote:
>> @@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif)
>> cpu = cpu_of(rq);
>>
>> if (cpu == smp_processor_id()) {
>> - set_ti_thread_flag(cti, tif);
>> + int set = test_and_set_ti_thread_flag(cti, tif);
>> +
>> + if (trace_sched_set_need_resched_tp_enabled() && !set)
>> + trace_call__sched_set_need_resched_tp(curr, cpu, tif);
>> if (tif == TIF_NEED_RESCHED)
>> set_preempt_need_resched();
>> - trace_sched_set_need_resched_tp(curr, cpu, tif);
>> return;
>> }
>
> I can't help but notice that the local and !POLLING cases show
> remarkable similarity. Just not sure extracting that isn't going to make
> a mess.
>
> Anyway, yes this looks about right.
If the fetch_or() based path is okay for !POLLING and local cases which
uses a slightly (vastly?) worse instruction to set the ti->flags, we can
instead do:
(Lightly tested after removing "HAVE_TIF_POLLING_NRFLAG" for x86)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd2f7fb87dc93..ea793e8a94a8f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1029,6 +1029,15 @@ static inline void hrtick_schedule_enter(struct rq *rq) { }
static inline void hrtick_schedule_exit(struct rq *rq) { }
#endif /* !CONFIG_SCHED_HRTICK */
+#ifndef TIF_POLLING_NRFLAG
+/*
+ * If arch doesn't define _TIF_POLLING_NRFLAG, set it 0 to
+ * allow compilers to optimize (val & _TIF_POLLING_NRFLAG)
+ * based branches during build.
+ */
+#define _TIF_POLLING_NRFLAG 0U
+#endif
+
/*
* try_cmpxchg based fetch_or() macro so it works for different integer types:
*/
@@ -1043,7 +1052,6 @@ static inline void hrtick_schedule_exit(struct rq *rq) { }
_val; \
})
-#ifdef TIF_POLLING_NRFLAG
/*
* Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
@@ -1083,25 +1091,6 @@ static bool set_nr_if_polling(struct task_struct *p)
return true;
}
-#else
-static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
-{
- struct task_struct *curr = rq->curr;
- struct thread_info *ti = task_thread_info(curr);
- int set = test_and_set_ti_thread_flag(ti, tif);
-
- if (trace_sched_set_need_resched_tp_enabled() && !set)
- trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
-
- return true;
-}
-
-static inline bool set_nr_if_polling(struct task_struct *p)
-{
- return false;
-}
-#endif
-
static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
{
struct wake_q_node *node = &task->wake_q;
@@ -1216,19 +1205,19 @@ static void __resched_curr(struct rq *rq, int tif)
cpu = cpu_of(rq);
- if (cpu == smp_processor_id()) {
- int set = test_and_set_ti_thread_flag(cti, tif);
-
- if (trace_sched_set_need_resched_tp_enabled() && !set)
- trace_call__sched_set_need_resched_tp(curr, cpu, tif);
- if (tif == TIF_NEED_RESCHED)
- set_preempt_need_resched();
- return;
- }
-
if (set_nr_and_not_polling(rq, tif)) {
- if (tif == TIF_NEED_RESCHED)
- smp_send_reschedule(cpu);
+ if (tif != TIF_NEED_RESCHED)
+ return;
+ /*
+ * For local CPU, folding the NEED_RESCHED
+ * into preempt_count() is sufficient.
+ */
+ if (cpu == smp_processor_id()) {
+ set_preempt_need_resched();
+ return;
+ }
+ /* Use an IPI for remote CPUs. */
+ smp_send_reschedule(cpu);
} else {
trace_sched_wake_idle_without_ipi(cpu);
}
---
set_nr_and_not_polling() will always return true for !POLLING which will
go down the above path in __resched_curr() that deals appropriately with
local CPU case.
set_nr_if_polling() will always return false for !POLLING from the first
condition in the do-while loop.
Thoughts?
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
2026-07-01 8:09 ` K Prateek Nayak
@ 2026-07-01 8:49 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2026-07-01 8:49 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Mark Rutland,
Will Deacon
On Wed, Jul 01, 2026 at 01:39:23PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 7/1/2026 12:24 PM, Peter Zijlstra wrote:
> >> @@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif)
> >> cpu = cpu_of(rq);
> >>
> >> if (cpu == smp_processor_id()) {
> >> - set_ti_thread_flag(cti, tif);
> >> + int set = test_and_set_ti_thread_flag(cti, tif);
> >> +
> >> + if (trace_sched_set_need_resched_tp_enabled() && !set)
> >> + trace_call__sched_set_need_resched_tp(curr, cpu, tif);
> >> if (tif == TIF_NEED_RESCHED)
> >> set_preempt_need_resched();
> >> - trace_sched_set_need_resched_tp(curr, cpu, tif);
> >> return;
> >> }
> >
> > I can't help but notice that the local and !POLLING cases show
> > remarkable similarity. Just not sure extracting that isn't going to make
> > a mess.
> >
> > Anyway, yes this looks about right.
>
> If the fetch_or() based path is okay for !POLLING and local cases which
> uses a slightly (vastly?) worse instruction to set the ti->flags, we can
> instead do:
>
Tempting, but at least arm64 would have to agree I think. That said, I
wonder why ARM64 doesn't use LDXR+WFE for idle, just like we have
MONITOR+MWAIT.
Probably too damn many idle flavours to deal with or somesuch.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bd2f7fb87dc93..ea793e8a94a8f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1029,6 +1029,15 @@ static inline void hrtick_schedule_enter(struct rq *rq) { }
> static inline void hrtick_schedule_exit(struct rq *rq) { }
> #endif /* !CONFIG_SCHED_HRTICK */
>
> +#ifndef TIF_POLLING_NRFLAG
> +/*
> + * If arch doesn't define _TIF_POLLING_NRFLAG, set it 0 to
> + * allow compilers to optimize (val & _TIF_POLLING_NRFLAG)
> + * based branches during build.
> + */
> +#define _TIF_POLLING_NRFLAG 0U
> +#endif
> +
> /*
> * try_cmpxchg based fetch_or() macro so it works for different integer types:
> */
> @@ -1043,7 +1052,6 @@ static inline void hrtick_schedule_exit(struct rq *rq) { }
> _val; \
> })
>
> -#ifdef TIF_POLLING_NRFLAG
> /*
> * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
> * this avoids any races wrt polling state changes and thereby avoids
> @@ -1083,25 +1091,6 @@ static bool set_nr_if_polling(struct task_struct *p)
> return true;
> }
>
> -#else
> -static inline bool set_nr_and_not_polling(struct rq *rq, int tif)
> -{
> - struct task_struct *curr = rq->curr;
> - struct thread_info *ti = task_thread_info(curr);
> - int set = test_and_set_ti_thread_flag(ti, tif);
> -
> - if (trace_sched_set_need_resched_tp_enabled() && !set)
> - trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif);
> -
> - return true;
> -}
> -
> -static inline bool set_nr_if_polling(struct task_struct *p)
> -{
> - return false;
> -}
> -#endif
> -
> static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
> {
> struct wake_q_node *node = &task->wake_q;
> @@ -1216,19 +1205,19 @@ static void __resched_curr(struct rq *rq, int tif)
>
> cpu = cpu_of(rq);
>
> - if (cpu == smp_processor_id()) {
> - int set = test_and_set_ti_thread_flag(cti, tif);
> -
> - if (trace_sched_set_need_resched_tp_enabled() && !set)
> - trace_call__sched_set_need_resched_tp(curr, cpu, tif);
> - if (tif == TIF_NEED_RESCHED)
> - set_preempt_need_resched();
> - return;
> - }
> -
> if (set_nr_and_not_polling(rq, tif)) {
> - if (tif == TIF_NEED_RESCHED)
> - smp_send_reschedule(cpu);
> + if (tif != TIF_NEED_RESCHED)
> + return;
> + /*
> + * For local CPU, folding the NEED_RESCHED
> + * into preempt_count() is sufficient.
> + */
> + if (cpu == smp_processor_id()) {
> + set_preempt_need_resched();
> + return;
> + }
> + /* Use an IPI for remote CPUs. */
> + smp_send_reschedule(cpu);
> } else {
> trace_sched_wake_idle_without_ipi(cpu);
> }
> ---
>
> set_nr_and_not_polling() will always return true for !POLLING which will
> go down the above path in __resched_curr() that deals appropriately with
> local CPU case.
>
> set_nr_if_polling() will always return false for !POLLING from the first
> condition in the do-while loop.
>
> Thoughts?
>
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-07-01 8:49 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-27 8:16 [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() Sechang Lim
2026-06-29 4:11 ` K Prateek Nayak
2026-06-29 12:40 ` Gabriele Monaco
2026-06-29 17:35 ` K Prateek Nayak
2026-06-30 8:58 ` Gabriele Monaco
2026-06-30 16:16 ` K Prateek Nayak
2026-06-30 20:34 ` Peter Zijlstra
2026-07-01 6:51 ` K Prateek Nayak
2026-07-01 6:54 ` Peter Zijlstra
2026-07-01 8:09 ` K Prateek Nayak
2026-07-01 8:49 ` Peter Zijlstra
2026-06-30 7:58 ` Sechang Lim
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox