* [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched()
@ 2026-06-27 8:16 Sechang Lim
2026-06-29 4:11 ` K Prateek Nayak
0 siblings, 1 reply; 12+ messages in thread
From: Sechang Lim @ 2026-06-27 8:16 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Gabriele Monaco,
linux-kernel
set_tsk_need_resched() tests TIF_NEED_RESCHED, calls
__trace_set_need_resched() if the flag is clear, then sets it via
set_tsk_thread_flag(). A BPF raw_tp program attached to
sched_set_need_resched executes synchronously inside __bpf_trace_run().
On return, __bpf_trace_run() drops the RCU lock with
rcu_read_unlock_migrate(), which on the preempt-or-BH-disabled path
calls set_need_resched_current() -> set_tsk_need_resched() again.
set_tsk_thread_flag() follows the tracepoint call, so every re-entrant
frame sees TIF_NEED_RESCHED clear and calls __trace_set_need_resched()
again:
BUG: TASK stack guard page was hit at ffffc9001224ff98
Oops: stack guard page: 0000 [#1] SMP KASAN PTI
RIP: 0010:__bpf_trace_sched_set_need_resched_tp+0x1c/0x190
Call Trace:
trace_sched_set_need_resched_tp+0x110/0x130
set_tsk_need_resched include/linux/sched.h:2076
set_need_resched_current include/linux/sched.h:2094
rcu_read_unlock_special+0x43a/0x440
__rcu_read_unlock+0x9e/0x120
rcu_read_unlock_migrate+0xa9/0x240
__bpf_trace_run+0x131/0x180
bpf_trace_run3+0x333/0x430
__bpf_trace_sched_set_need_resched_tp+0x13a/0x190
trace_sched_set_need_resched_tp+0x110/0x130
set_tsk_need_resched include/linux/sched.h:2076
...
__resched_curr() has the same ordering, firing the tracepoint before
setting the flag via set_ti_thread_flag() or set_nr_and_not_polling().
Fix it for consistency.
Replace the separate test_tsk_thread_flag() + set_tsk_thread_flag() pair
in set_tsk_need_resched() with test_and_set_tsk_thread_flag(). In
__resched_curr(), move the tracepoint call after the flag is set in
each path.
Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
v2:
- also fix __resched_curr (Peter Zijlstra)
v1:
- https://lore.kernel.org/all/20260625065656.392182-1-rhkrqnwk98@gmail.com/
include/linux/sched.h | 5 ++---
kernel/sched/core.c | 7 +++++--
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..c9efd08dae92 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2071,10 +2071,9 @@ static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
- if (tracepoint_enabled(sched_set_need_resched_tp) &&
- !test_tsk_thread_flag(tsk, TIF_NEED_RESCHED))
+ if (!test_and_set_tsk_thread_flag(tsk, TIF_NEED_RESCHED) &&
+ tracepoint_enabled(sched_set_need_resched_tp))
__trace_set_need_resched(tsk, TIF_NEED_RESCHED);
- set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}
static inline void clear_tsk_need_resched(struct task_struct *tsk)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d3c6..b358fac315d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1172,6 +1172,7 @@ static void __resched_curr(struct rq *rq, int tif)
struct task_struct *curr = rq->curr;
struct thread_info *cti = task_thread_info(curr);
int cpu;
+ bool need_ipi;
lockdep_assert_rq_held(rq);
@@ -1187,15 +1188,17 @@ static void __resched_curr(struct rq *rq, int tif)
cpu = cpu_of(rq);
- trace_sched_set_need_resched_tp(curr, cpu, tif);
if (cpu == smp_processor_id()) {
set_ti_thread_flag(cti, tif);
if (tif == TIF_NEED_RESCHED)
set_preempt_need_resched();
+ trace_sched_set_need_resched_tp(curr, cpu, tif);
return;
}
- if (set_nr_and_not_polling(cti, tif)) {
+ need_ipi = set_nr_and_not_polling(cti, tif);
+ trace_sched_set_need_resched_tp(curr, cpu, tif);
+ if (need_ipi) {
if (tif == TIF_NEED_RESCHED)
smp_send_reschedule(cpu);
} else {
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-27 8:16 [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() Sechang Lim @ 2026-06-29 4:11 ` K Prateek Nayak 2026-06-29 12:40 ` Gabriele Monaco 2026-06-30 7:58 ` Sechang Lim 0 siblings, 2 replies; 12+ messages in thread From: K Prateek Nayak @ 2026-06-29 4:11 UTC (permalink / raw) To: Sechang Lim, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Gabriele Monaco, linux-kernel Hello Sechang, On 6/27/2026 1:46 PM, Sechang Lim wrote: > static inline void clear_tsk_need_resched(struct task_struct *tsk) > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index b8871449d3c6..b358fac315d0 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1172,6 +1172,7 @@ static void __resched_curr(struct rq *rq, int tif) > struct task_struct *curr = rq->curr; > struct thread_info *cti = task_thread_info(curr); > int cpu; > + bool need_ipi; nit. This declaration can go before "int cpu;" to preserve the reverse x-mas arrangement for the independent variables. > > lockdep_assert_rq_held(rq); > > @@ -1187,15 +1188,17 @@ static void __resched_curr(struct rq *rq, int tif) > > cpu = cpu_of(rq); > > - trace_sched_set_need_resched_tp(curr, cpu, tif); > if (cpu == smp_processor_id()) { > set_ti_thread_flag(cti, tif); > if (tif == TIF_NEED_RESCHED) > set_preempt_need_resched(); > + trace_sched_set_need_resched_tp(curr, cpu, tif); > return; > } > > - if (set_nr_and_not_polling(cti, tif)) { > + need_ipi = set_nr_and_not_polling(cti, tif); > + trace_sched_set_need_resched_tp(curr, cpu, tif); Unrelated to changes here: This tracepoint seems to be missing from wake_up_idle_cpu() which modifies the ti->flags without taking the rq_lock. A newly enqueued timer can race with the wakeup on idle CPU and you can see a wakeup + sched_switch without seeing a corresponding trace_sched_set_need_resched_tp() for the CPU. Gabriele, is that a concern for RV? > + if (need_ipi) { > if (tif == TIF_NEED_RESCHED) > smp_send_reschedule(cpu); > } else { > -- > 2.43.0 > -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-29 4:11 ` K Prateek Nayak @ 2026-06-29 12:40 ` Gabriele Monaco 2026-06-29 17:35 ` K Prateek Nayak 2026-06-30 7:58 ` Sechang Lim 1 sibling, 1 reply; 12+ messages in thread From: Gabriele Monaco @ 2026-06-29 12:40 UTC (permalink / raw) To: K Prateek Nayak, Sechang Lim, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel On Mon, 2026-06-29 at 09:41 +0530, K Prateek Nayak wrote: > Unrelated to changes here: This tracepoint seems to be missing from > wake_up_idle_cpu() which modifies the ti->flags without taking the > rq_lock. > > A newly enqueued timer can race with the wakeup on idle CPU and you can > see a wakeup + sched_switch without seeing a corresponding > trace_sched_set_need_resched_tp() for the CPU. > > Gabriele, is that a concern for RV? Thanks Prateek for pointing this out. So as I'm understanding, we are missing the tracepoint in that path, but even after putting it, this race might cause a sched_switch to be visible concurrently or before the corresponding need_resched for that CPU, right? That's potentially an issue. Currently the only model really relying on the need_resched event is nrp, validating every kernel preemption has a corresponding need_resched set. The scenario passing through wake_up_idle_cpu() is setting the flag on a CPU that is idle (causing __schedule(SM_IDLE)), this doesn't count as a preemption for the model. Or am I missing something? That said, we should indeed add the tracepoint to that path and probably adapt the monitor if that's making it fail indirectly. Thanks, Gabriele > > + if (need_ipi) { > > if (tif == TIF_NEED_RESCHED) > > smp_send_reschedule(cpu); > > } else { > > -- > > 2.43.0 > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-29 12:40 ` Gabriele Monaco @ 2026-06-29 17:35 ` K Prateek Nayak 2026-06-30 8:58 ` Gabriele Monaco 0 siblings, 1 reply; 12+ messages in thread From: K Prateek Nayak @ 2026-06-29 17:35 UTC (permalink / raw) To: Gabriele Monaco, Sechang Lim, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel Hello Gabriele, On 6/29/2026 6:10 PM, Gabriele Monaco wrote: > On Mon, 2026-06-29 at 09:41 +0530, K Prateek Nayak wrote: >> Unrelated to changes here: This tracepoint seems to be missing from >> wake_up_idle_cpu() which modifies the ti->flags without taking the >> rq_lock. >> >> A newly enqueued timer can race with the wakeup on idle CPU and you can >> see a wakeup + sched_switch without seeing a corresponding >> trace_sched_set_need_resched_tp() for the CPU. >> >> Gabriele, is that a concern for RV? > > Thanks Prateek for pointing this out. So as I'm understanding, we are missing > the tracepoint in that path, but even after putting it, this race might cause a > sched_switch to be visible concurrently or before the corresponding need_resched > for that CPU, right? Ack but if that comment in wake_up_idle_cpu() is accurate, it is a very infrequent operation and we might be able to get away with grabbing the rq_lock and synchronizing it too. > > That's potentially an issue. > > Currently the only model really relying on the need_resched event is nrp, > validating every kernel preemption has a corresponding need_resched set. > The scenario passing through wake_up_idle_cpu() is setting the flag on a CPU > that is idle (causing __schedule(SM_IDLE)), this doesn't count as a preemption > for the model. Or am I missing something? I just saw the npr monitor and yes, the SM_IDLE path will not count as preemption and the state machine should just stay at "any_thread_running" from the "schedule_entry" edge. Sorry for the scare! > That said, we should indeed add the tracepoint to that path and probably adapt > the monitor if that's making it fail indirectly. For the npr use-case, I think the current scheme is fine since only SM_PREEMPt counts as a "schedule_entry_preempt" transition and only that can transition the state machine out of the "any_thread_running" state. -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-29 17:35 ` K Prateek Nayak @ 2026-06-30 8:58 ` Gabriele Monaco 2026-06-30 16:16 ` K Prateek Nayak 0 siblings, 1 reply; 12+ messages in thread From: Gabriele Monaco @ 2026-06-30 8:58 UTC (permalink / raw) To: K Prateek Nayak, Sechang Lim, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel On Mon, 2026-06-29 at 23:05 +0530, K Prateek Nayak wrote: > > That said, we should indeed add the tracepoint to that path and probably > > adapt the monitor if that's making it fail indirectly. > > For the npr use-case, I think the current scheme is fine since > only SM_PREEMPt counts as a "schedule_entry_preempt" transition > and only that can transition the state machine out of the > "any_thread_running" state. Right, the monitor can live without it, but I wonder if we need to put that tracepoint for correctness sake. After all, however unlikely, that's a need_resched too. (then if the monitor really saw a need_resched after it's sched_entry, it would stay erroneously in rescheduling). Anyway this all isn't related to the patch. Thanks, Gabriele ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-30 8:58 ` Gabriele Monaco @ 2026-06-30 16:16 ` K Prateek Nayak 2026-06-30 20:34 ` Peter Zijlstra 0 siblings, 1 reply; 12+ messages in thread From: K Prateek Nayak @ 2026-06-30 16:16 UTC (permalink / raw) To: Gabriele Monaco, Sechang Lim, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel Hello Gabriele, On 6/30/2026 2:28 PM, Gabriele Monaco wrote: > On Mon, 2026-06-29 at 23:05 +0530, K Prateek Nayak wrote: >>> That said, we should indeed add the tracepoint to that path and probably >>> adapt the monitor if that's making it fail indirectly. >> >> For the npr use-case, I think the current scheme is fine since >> only SM_PREEMPt counts as a "schedule_entry_preempt" transition >> and only that can transition the state machine out of the >> "any_thread_running" state. > > Right, the monitor can live without it, but I wonder if we need to put that > tracepoint for correctness sake. After all, however unlikely, that's a > need_resched too. > > (then if the monitor really saw a need_resched after it's sched_entry, it would > stay erroneously in rescheduling). > > Anyway this all isn't related to the patch. Ack! Would something like this work for completeness: (Lightly tested; Based on current tip:sched/core) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 96226707c2f6..934f540d0d3f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { } * this avoids any races wrt polling state changes and thereby avoids * spurious IPIs. */ -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) { - return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG); + struct task_struct *curr = rq->curr; + struct thread_info *ti = task_thread_info(curr); + unsigned long old_flags = fetch_or(&ti->flags, 1 << tif); + + if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif))) + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); + + return !(old_flags & _TIF_POLLING_NRFLAG); } /* @@ -1076,8 +1083,11 @@ static bool set_nr_if_polling(struct task_struct *p) } #else -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) { + struct task_struct *curr = rq->curr; + struct thread_info *ti = task_thread_info(curr); + set_ti_thread_flag(ti, tif); return true; } @@ -1202,15 +1212,17 @@ static void __resched_curr(struct rq *rq, int tif) cpu = cpu_of(rq); - trace_sched_set_need_resched_tp(curr, cpu, tif); if (cpu == smp_processor_id()) { - set_ti_thread_flag(cti, tif); + int set = test_and_set_ti_thread_flag(cti, tif); + + if (trace_sched_set_need_resched_tp_enabled() && !set) + trace_call__sched_set_need_resched_tp(curr, cpu, tif); if (tif == TIF_NEED_RESCHED) set_preempt_need_resched(); return; } - if (set_nr_and_not_polling(cti, tif)) { + if (set_nr_and_not_polling(rq, tif)) { if (tif == TIF_NEED_RESCHED) smp_send_reschedule(cpu); } else { @@ -1350,7 +1362,7 @@ static void wake_up_idle_cpu(int cpu) * and testing of the above solutions didn't appear to report * much benefits. */ - if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED)) + if (set_nr_and_not_polling(rq, TIF_NEED_RESCHED)) smp_send_reschedule(cpu); else trace_sched_wake_idle_without_ipi(cpu); --- > > Thanks, > Gabriele > -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-30 16:16 ` K Prateek Nayak @ 2026-06-30 20:34 ` Peter Zijlstra 2026-07-01 6:51 ` K Prateek Nayak 0 siblings, 1 reply; 12+ messages in thread From: Peter Zijlstra @ 2026-06-30 20:34 UTC (permalink / raw) To: K Prateek Nayak Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel On Tue, Jun 30, 2026 at 09:46:11PM +0530, K Prateek Nayak wrote: > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 96226707c2f6..934f540d0d3f 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { } > * this avoids any races wrt polling state changes and thereby avoids > * spurious IPIs. > */ > -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) > +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) > { > - return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG); > + struct task_struct *curr = rq->curr; > + struct thread_info *ti = task_thread_info(curr); > + unsigned long old_flags = fetch_or(&ti->flags, 1 << tif); > + > + if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif))) > + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); > + > + return !(old_flags & _TIF_POLLING_NRFLAG); > } > > /* > @@ -1076,8 +1083,11 @@ static bool set_nr_if_polling(struct task_struct *p) > } > > #else > -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) > +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) > { > + struct task_struct *curr = rq->curr; > + struct thread_info *ti = task_thread_info(curr); > + > set_ti_thread_flag(ti, tif); > return true; > } This !POLLING thing also needs tracing, no? > @@ -1202,15 +1212,17 @@ static void __resched_curr(struct rq *rq, int tif) > > cpu = cpu_of(rq); > > - trace_sched_set_need_resched_tp(curr, cpu, tif); > if (cpu == smp_processor_id()) { > - set_ti_thread_flag(cti, tif); > + int set = test_and_set_ti_thread_flag(cti, tif); > + > + if (trace_sched_set_need_resched_tp_enabled() && !set) > + trace_call__sched_set_need_resched_tp(curr, cpu, tif); > if (tif == TIF_NEED_RESCHED) > set_preempt_need_resched(); > return; > } > > - if (set_nr_and_not_polling(cti, tif)) { > + if (set_nr_and_not_polling(rq, tif)) { > if (tif == TIF_NEED_RESCHED) > smp_send_reschedule(cpu); > } else { > @@ -1350,7 +1362,7 @@ static void wake_up_idle_cpu(int cpu) > * and testing of the above solutions didn't appear to report > * much benefits. > */ > - if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED)) > + if (set_nr_and_not_polling(rq, TIF_NEED_RESCHED)) > smp_send_reschedule(cpu); > else > trace_sched_wake_idle_without_ipi(cpu); > --- > > > > > Thanks, > > Gabriele > > > > -- > Thanks and Regards, > Prateek > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-30 20:34 ` Peter Zijlstra @ 2026-07-01 6:51 ` K Prateek Nayak 2026-07-01 6:54 ` Peter Zijlstra 0 siblings, 1 reply; 12+ messages in thread From: K Prateek Nayak @ 2026-07-01 6:51 UTC (permalink / raw) To: Peter Zijlstra, Gabriele Monaco Cc: Sechang Lim, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel Hello Peter, Gabriele, On 7/1/2026 2:04 AM, Peter Zijlstra wrote: >> -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) >> +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) >> { >> + struct task_struct *curr = rq->curr; >> + struct thread_info *ti = task_thread_info(curr); >> + >> set_ti_thread_flag(ti, tif); >> return true; >> } > > This !POLLING thing also needs tracing, no? Indeed! I'm stupid. Here is the full diff with set_nr_if_polling() covered too on top of tip:sched/core + Sechang's v3 at https://lore.kernel.org/lkml/20260630084750.2792851-1-rhkrqnwk98@gmail.com/ for completeness. Let me know if you prefer folding it in or if this needs to be a separate patch. (lightly testet) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7cbd541f656f..bd2f7fb87dc9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { } * this avoids any races wrt polling state changes and thereby avoids * spurious IPIs. */ -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) { - return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG); + struct task_struct *curr = rq->curr; + struct thread_info *ti = task_thread_info(curr); + unsigned long old_flags = fetch_or(&ti->flags, 1 << tif); + + if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif))) + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); + + return !(old_flags & _TIF_POLLING_NRFLAG); } /* @@ -1072,13 +1079,20 @@ static bool set_nr_if_polling(struct task_struct *p) return true; } while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)); + trace_sched_set_need_resched_tp(p, task_cpu(p), TIF_NEED_RESCHED); return true; } #else -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) { - set_ti_thread_flag(ti, tif); + struct task_struct *curr = rq->curr; + struct thread_info *ti = task_thread_info(curr); + int set = test_and_set_ti_thread_flag(ti, tif); + + if (trace_sched_set_need_resched_tp_enabled() && !set) + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); + return true; } @@ -1186,7 +1200,6 @@ static void __resched_curr(struct rq *rq, int tif) { struct task_struct *curr = rq->curr; struct thread_info *cti = task_thread_info(curr); - bool need_ipi; int cpu; lockdep_assert_rq_held(rq); @@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif) cpu = cpu_of(rq); if (cpu == smp_processor_id()) { - set_ti_thread_flag(cti, tif); + int set = test_and_set_ti_thread_flag(cti, tif); + + if (trace_sched_set_need_resched_tp_enabled() && !set) + trace_call__sched_set_need_resched_tp(curr, cpu, tif); if (tif == TIF_NEED_RESCHED) set_preempt_need_resched(); - trace_sched_set_need_resched_tp(curr, cpu, tif); return; } - need_ipi = set_nr_and_not_polling(cti, tif); - trace_sched_set_need_resched_tp(curr, cpu, tif); - if (need_ipi) { + if (set_nr_and_not_polling(rq, tif)) { if (tif == TIF_NEED_RESCHED) smp_send_reschedule(cpu); } else { @@ -1353,7 +1366,7 @@ static void wake_up_idle_cpu(int cpu) * and testing of the above solutions didn't appear to report * much benefits. */ - if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED)) + if (set_nr_and_not_polling(rq, TIF_NEED_RESCHED)) smp_send_reschedule(cpu); else trace_sched_wake_idle_without_ipi(cpu); --- Now, on a separate note to Gabriele, this trips up RV's sched:nrp monitor when running sched-messaging because of the following race: CPU0 CPU1 ==== ==== <Any thread running> ... /* Wakes up a task on CPU0 */ <IRQ> rq_lock(rq0) __resched_curr(rq0) set = test_and_set_ti_thread_flag(curr, 0, TIF_NEED_RESCHED); <... interrupted by something before tracepoint hits> !!! rq0->curr has NEED_RESCHED set but trace_sched_set_need_resched_tp() is not called !!! </IRQ> raw_irqentry_exit_cond_resched() if (!preempt_count() /* Masks PREEMPT_NEED_RESCHED */) { if (need_resched() /* Only check TIF flags */) { preempt_schedule_irq() schedule(SM_PREEMPT) trace_sched_entry_tp(true /* SM_PREEMPT */) !!! SPLAT: rv: monitor nrp does not allow event schedule_entry_preempt on state any_thread_running !!! Easy way to solve this is by moving the trace_sched_entry_tp() within the rq_lock critical section. SM_PREEMPT has to happen on a !POLLING CPU and __resched_curr() is always under the rq_lock() so the __schedule() on a remote CPU cannot race with it before trace_sched_set_need_resched_tp() is executed. (lightly tested again) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bd2f7fb87dc93..961c325feca84 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7086,9 +7086,6 @@ static void __sched notrace __schedule(int sched_mode) struct rq *rq; int cpu; - /* Trace preemptions consistently with task switches */ - trace_sched_entry_tp(sched_mode == SM_PREEMPT); - cpu = smp_processor_id(); rq = cpu_rq(cpu); prev = rq->curr; @@ -7121,6 +7118,9 @@ static void __sched notrace __schedule(int sched_mode) rq_lock(rq, &rf); smp_mb__after_spinlock(); + /* Trace preemptions consistently with task switches */ + trace_sched_entry_tp(sched_mode == SM_PREEMPT); + hrtick_schedule_enter(rq); /* Promote REQ to ACT */ --- I'm not sure if this is a problem with current mainline but I did trip it in my testing and the above seems to solve it - your mileage may vary :-) -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-07-01 6:51 ` K Prateek Nayak @ 2026-07-01 6:54 ` Peter Zijlstra 2026-07-01 8:09 ` K Prateek Nayak 0 siblings, 1 reply; 12+ messages in thread From: Peter Zijlstra @ 2026-07-01 6:54 UTC (permalink / raw) To: K Prateek Nayak Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel On Wed, Jul 01, 2026 at 12:21:14PM +0530, K Prateek Nayak wrote: > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 7cbd541f656f..bd2f7fb87dc9 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1049,9 +1049,16 @@ static inline void hrtick_schedule_exit(struct rq *rq) { } > * this avoids any races wrt polling state changes and thereby avoids > * spurious IPIs. > */ > -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) > +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) > { > - return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG); > + struct task_struct *curr = rq->curr; > + struct thread_info *ti = task_thread_info(curr); > + unsigned long old_flags = fetch_or(&ti->flags, 1 << tif); > + > + if (trace_sched_set_need_resched_tp_enabled() && !(old_flags & (1 << tif))) > + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); > + > + return !(old_flags & _TIF_POLLING_NRFLAG); > } > > /* > @@ -1072,13 +1079,20 @@ static bool set_nr_if_polling(struct task_struct *p) > return true; > } while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)); > > + trace_sched_set_need_resched_tp(p, task_cpu(p), TIF_NEED_RESCHED); > return true; > } > > #else > -static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif) > +static inline bool set_nr_and_not_polling(struct rq *rq, int tif) > { > - set_ti_thread_flag(ti, tif); > + struct task_struct *curr = rq->curr; > + struct thread_info *ti = task_thread_info(curr); > + int set = test_and_set_ti_thread_flag(ti, tif); > + > + if (trace_sched_set_need_resched_tp_enabled() && !set) > + trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); > + > return true; > } > > @@ -1186,7 +1200,6 @@ static void __resched_curr(struct rq *rq, int tif) > { > struct task_struct *curr = rq->curr; > struct thread_info *cti = task_thread_info(curr); > - bool need_ipi; > int cpu; > > lockdep_assert_rq_held(rq); > @@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif) > cpu = cpu_of(rq); > > if (cpu == smp_processor_id()) { > - set_ti_thread_flag(cti, tif); > + int set = test_and_set_ti_thread_flag(cti, tif); > + > + if (trace_sched_set_need_resched_tp_enabled() && !set) > + trace_call__sched_set_need_resched_tp(curr, cpu, tif); > if (tif == TIF_NEED_RESCHED) > set_preempt_need_resched(); > - trace_sched_set_need_resched_tp(curr, cpu, tif); > return; > } I can't help but notice that the local and !POLLING cases show remarkable similarity. Just not sure extracting that isn't going to make a mess. Anyway, yes this looks about right. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-07-01 6:54 ` Peter Zijlstra @ 2026-07-01 8:09 ` K Prateek Nayak 2026-07-01 8:49 ` Peter Zijlstra 0 siblings, 1 reply; 12+ messages in thread From: K Prateek Nayak @ 2026-07-01 8:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel Hello Peter, On 7/1/2026 12:24 PM, Peter Zijlstra wrote: >> @@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif) >> cpu = cpu_of(rq); >> >> if (cpu == smp_processor_id()) { >> - set_ti_thread_flag(cti, tif); >> + int set = test_and_set_ti_thread_flag(cti, tif); >> + >> + if (trace_sched_set_need_resched_tp_enabled() && !set) >> + trace_call__sched_set_need_resched_tp(curr, cpu, tif); >> if (tif == TIF_NEED_RESCHED) >> set_preempt_need_resched(); >> - trace_sched_set_need_resched_tp(curr, cpu, tif); >> return; >> } > > I can't help but notice that the local and !POLLING cases show > remarkable similarity. Just not sure extracting that isn't going to make > a mess. > > Anyway, yes this looks about right. If the fetch_or() based path is okay for !POLLING and local cases which uses a slightly (vastly?) worse instruction to set the ti->flags, we can instead do: (Lightly tested after removing "HAVE_TIF_POLLING_NRFLAG" for x86) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bd2f7fb87dc93..ea793e8a94a8f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1029,6 +1029,15 @@ static inline void hrtick_schedule_enter(struct rq *rq) { } static inline void hrtick_schedule_exit(struct rq *rq) { } #endif /* !CONFIG_SCHED_HRTICK */ +#ifndef TIF_POLLING_NRFLAG +/* + * If arch doesn't define _TIF_POLLING_NRFLAG, set it 0 to + * allow compilers to optimize (val & _TIF_POLLING_NRFLAG) + * based branches during build. + */ +#define _TIF_POLLING_NRFLAG 0U +#endif + /* * try_cmpxchg based fetch_or() macro so it works for different integer types: */ @@ -1043,7 +1052,6 @@ static inline void hrtick_schedule_exit(struct rq *rq) { } _val; \ }) -#ifdef TIF_POLLING_NRFLAG /* * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, * this avoids any races wrt polling state changes and thereby avoids @@ -1083,25 +1091,6 @@ static bool set_nr_if_polling(struct task_struct *p) return true; } -#else -static inline bool set_nr_and_not_polling(struct rq *rq, int tif) -{ - struct task_struct *curr = rq->curr; - struct thread_info *ti = task_thread_info(curr); - int set = test_and_set_ti_thread_flag(ti, tif); - - if (trace_sched_set_need_resched_tp_enabled() && !set) - trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); - - return true; -} - -static inline bool set_nr_if_polling(struct task_struct *p) -{ - return false; -} -#endif - static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task) { struct wake_q_node *node = &task->wake_q; @@ -1216,19 +1205,19 @@ static void __resched_curr(struct rq *rq, int tif) cpu = cpu_of(rq); - if (cpu == smp_processor_id()) { - int set = test_and_set_ti_thread_flag(cti, tif); - - if (trace_sched_set_need_resched_tp_enabled() && !set) - trace_call__sched_set_need_resched_tp(curr, cpu, tif); - if (tif == TIF_NEED_RESCHED) - set_preempt_need_resched(); - return; - } - if (set_nr_and_not_polling(rq, tif)) { - if (tif == TIF_NEED_RESCHED) - smp_send_reschedule(cpu); + if (tif != TIF_NEED_RESCHED) + return; + /* + * For local CPU, folding the NEED_RESCHED + * into preempt_count() is sufficient. + */ + if (cpu == smp_processor_id()) { + set_preempt_need_resched(); + return; + } + /* Use an IPI for remote CPUs. */ + smp_send_reschedule(cpu); } else { trace_sched_wake_idle_without_ipi(cpu); } --- set_nr_and_not_polling() will always return true for !POLLING which will go down the above path in __resched_curr() that deals appropriately with local CPU case. set_nr_if_polling() will always return false for !POLLING from the first condition in the do-while loop. Thoughts? -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-07-01 8:09 ` K Prateek Nayak @ 2026-07-01 8:49 ` Peter Zijlstra 0 siblings, 0 replies; 12+ messages in thread From: Peter Zijlstra @ 2026-07-01 8:49 UTC (permalink / raw) To: K Prateek Nayak Cc: Gabriele Monaco, Sechang Lim, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel, Mark Rutland, Will Deacon On Wed, Jul 01, 2026 at 01:39:23PM +0530, K Prateek Nayak wrote: > Hello Peter, > > On 7/1/2026 12:24 PM, Peter Zijlstra wrote: > >> @@ -1204,16 +1217,16 @@ static void __resched_curr(struct rq *rq, int tif) > >> cpu = cpu_of(rq); > >> > >> if (cpu == smp_processor_id()) { > >> - set_ti_thread_flag(cti, tif); > >> + int set = test_and_set_ti_thread_flag(cti, tif); > >> + > >> + if (trace_sched_set_need_resched_tp_enabled() && !set) > >> + trace_call__sched_set_need_resched_tp(curr, cpu, tif); > >> if (tif == TIF_NEED_RESCHED) > >> set_preempt_need_resched(); > >> - trace_sched_set_need_resched_tp(curr, cpu, tif); > >> return; > >> } > > > > I can't help but notice that the local and !POLLING cases show > > remarkable similarity. Just not sure extracting that isn't going to make > > a mess. > > > > Anyway, yes this looks about right. > > If the fetch_or() based path is okay for !POLLING and local cases which > uses a slightly (vastly?) worse instruction to set the ti->flags, we can > instead do: > Tempting, but at least arm64 would have to agree I think. That said, I wonder why ARM64 doesn't use LDXR+WFE for idle, just like we have MONITOR+MWAIT. Probably too damn many idle flavours to deal with or somesuch. > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index bd2f7fb87dc93..ea793e8a94a8f 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1029,6 +1029,15 @@ static inline void hrtick_schedule_enter(struct rq *rq) { } > static inline void hrtick_schedule_exit(struct rq *rq) { } > #endif /* !CONFIG_SCHED_HRTICK */ > > +#ifndef TIF_POLLING_NRFLAG > +/* > + * If arch doesn't define _TIF_POLLING_NRFLAG, set it 0 to > + * allow compilers to optimize (val & _TIF_POLLING_NRFLAG) > + * based branches during build. > + */ > +#define _TIF_POLLING_NRFLAG 0U > +#endif > + > /* > * try_cmpxchg based fetch_or() macro so it works for different integer types: > */ > @@ -1043,7 +1052,6 @@ static inline void hrtick_schedule_exit(struct rq *rq) { } > _val; \ > }) > > -#ifdef TIF_POLLING_NRFLAG > /* > * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, > * this avoids any races wrt polling state changes and thereby avoids > @@ -1083,25 +1091,6 @@ static bool set_nr_if_polling(struct task_struct *p) > return true; > } > > -#else > -static inline bool set_nr_and_not_polling(struct rq *rq, int tif) > -{ > - struct task_struct *curr = rq->curr; > - struct thread_info *ti = task_thread_info(curr); > - int set = test_and_set_ti_thread_flag(ti, tif); > - > - if (trace_sched_set_need_resched_tp_enabled() && !set) > - trace_call__sched_set_need_resched_tp(curr, cpu_of(rq), tif); > - > - return true; > -} > - > -static inline bool set_nr_if_polling(struct task_struct *p) > -{ > - return false; > -} > -#endif > - > static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task) > { > struct wake_q_node *node = &task->wake_q; > @@ -1216,19 +1205,19 @@ static void __resched_curr(struct rq *rq, int tif) > > cpu = cpu_of(rq); > > - if (cpu == smp_processor_id()) { > - int set = test_and_set_ti_thread_flag(cti, tif); > - > - if (trace_sched_set_need_resched_tp_enabled() && !set) > - trace_call__sched_set_need_resched_tp(curr, cpu, tif); > - if (tif == TIF_NEED_RESCHED) > - set_preempt_need_resched(); > - return; > - } > - > if (set_nr_and_not_polling(rq, tif)) { > - if (tif == TIF_NEED_RESCHED) > - smp_send_reschedule(cpu); > + if (tif != TIF_NEED_RESCHED) > + return; > + /* > + * For local CPU, folding the NEED_RESCHED > + * into preempt_count() is sufficient. > + */ > + if (cpu == smp_processor_id()) { > + set_preempt_need_resched(); > + return; > + } > + /* Use an IPI for remote CPUs. */ > + smp_send_reschedule(cpu); > } else { > trace_sched_wake_idle_without_ipi(cpu); > } > --- > > set_nr_and_not_polling() will always return true for !POLLING which will > go down the above path in __resched_curr() that deals appropriately with > local CPU case. > > set_nr_if_polling() will always return false for !POLLING from the first > condition in the do-while loop. > > Thoughts? > > -- > Thanks and Regards, > Prateek > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() 2026-06-29 4:11 ` K Prateek Nayak 2026-06-29 12:40 ` Gabriele Monaco @ 2026-06-30 7:58 ` Sechang Lim 1 sibling, 0 replies; 12+ messages in thread From: Sechang Lim @ 2026-06-30 7:58 UTC (permalink / raw) To: K Prateek Nayak Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Gabriele Monaco, linux-kernel On Mon, Jun 29, 2026 at 09:41:15AM +0530, K Prateek Nayak wrote: >Hello Sechang, > >On 6/27/2026 1:46 PM, Sechang Lim wrote: >> static inline void clear_tsk_need_resched(struct task_struct *tsk) >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index b8871449d3c6..b358fac315d0 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1172,6 +1172,7 @@ static void __resched_curr(struct rq *rq, int tif) >> struct task_struct *curr = rq->curr; >> struct thread_info *cti = task_thread_info(curr); >> int cpu; >> + bool need_ipi; > >nit. This declaration can go before "int cpu;" to preserve the >reverse x-mas arrangement for the independent variables. > Will fix in v3. Thanks! Bests, Sechang ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-07-01 8:49 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-06-27 8:16 [PATCH v2] sched: set TIF_NEED_RESCHED before calling __trace_set_need_resched() Sechang Lim 2026-06-29 4:11 ` K Prateek Nayak 2026-06-29 12:40 ` Gabriele Monaco 2026-06-29 17:35 ` K Prateek Nayak 2026-06-30 8:58 ` Gabriele Monaco 2026-06-30 16:16 ` K Prateek Nayak 2026-06-30 20:34 ` Peter Zijlstra 2026-07-01 6:51 ` K Prateek Nayak 2026-07-01 6:54 ` Peter Zijlstra 2026-07-01 8:09 ` K Prateek Nayak 2026-07-01 8:49 ` Peter Zijlstra 2026-06-30 7:58 ` Sechang Lim
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox