* [PATCH v6 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy
@ 2026-05-31 15:25 Cheng-Yang Chou
2026-05-31 15:25 ` [PATCH 1/2] sched_ext: Fix exit_cpu accuracy for lockup paths Cheng-Yang Chou
2026-05-31 15:25 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou
0 siblings, 2 replies; 18+ messages in thread
From: Cheng-Yang Chou @ 2026-05-31 15:25 UTC (permalink / raw)
To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
Paul E . McKenney, rcu
Cc: Ching-Chun Huang, Chia-Ping Tsai, Cheng-Yang Chou
Follow-up of Tejun's suggestion [1], discussed in v2 and the subsequent
review thread [2].
Patch 1 fixes exit_cpu accuracy for the hard lockup and softlockup
paths. For the RCU stall path, -1 is used as a placeholder until
patch 2 threads the cpumask.
Patch 2 upgrades panic_on_rcu_stall() and scx_rcu_cpu_stall() to
accept a cpumask of stalled CPUs, stored in scx_sched and piped
directly to scx_dump_state().
Based on sched_ext/for-next (b565a73baec2).
Changes in v6:
- Drop stalled_mask parameter from synchronize_rcu_expedited_stall()
(Paul McKenney).
- Link to v5:
https://lore.kernel.org/r/20260521161636.1893894-1-yphbchou0911@gmail.com/
Changes in v5:
- Replace dynamic cpumask allocation in both RCU stall paths with static
rcu_stall_cpumask and rcu_exp_stall_cpumask (Paul McKenney).
- Link to v4:
https://lore.kernel.org/r/20260519171745.1551340-1-yphbchou0911@gmail.com/
Changes in v4:
- Store stall mask in scx_sched rather than scx_exit_info (Tejun Heo).
- Use __GFP_NOWARN for cpumask allocations in both RCU stall paths.
- Link to v3:
https://lore.kernel.org/r/20260518131311.1170786-1-yphbchou0911@gmail.com/
Changes in v3:
- Drop patch 1 of v2 ("Normalize exit dump header to 'on CPU N'"),
already applied to sched_ext/for-7.2.
- Replace single stalled_cpu int with const struct cpumask * in
panic_on_rcu_stall() and scx_rcu_cpu_stall() (Paul McKenney, Tejun Heo).
- Thread cpumask through the expedited stall path (Tejun Heo).
- Falls back to cpu_none_mask on allocation failure.
- Link to v2:
https://lore.kernel.org/r/20260504161543.674488-1-yphbchou0911@gmail.com/
Changes in v2:
- Use raw_smp_processor_id() in synchronize_rcu_expedited_wait() to
avoid CONFIG_DEBUG_PREEMPT splat.
- Link to v1:
https://lore.kernel.org/r/20260501131521.161852-1-yphbchou0911@gmail.com/
[1]: https://lore.kernel.org/r/e7cbfc99b52b4b7059267bb81498179f@kernel.org/
[2]: https://lore.kernel.org/r/af0Gn47y1Lj2BAqd@slm.duckdns.org/
Thanks,
Cheng-Yang
---
Cheng-Yang Chou (2):
sched_ext: Fix exit_cpu accuracy for lockup paths
sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled
CPUs
include/linux/sched/ext.h | 4 +-
kernel/rcu/tree.c | 3 ++
kernel/rcu/tree_exp.h | 5 ++-
kernel/rcu/tree_stall.h | 13 ++++--
kernel/sched/ext.c | 85 +++++++++++++++++++++++++++++++------
kernel/sched/ext_internal.h | 3 +-
6 files changed, 91 insertions(+), 22 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 18+ messages in thread* [PATCH 1/2] sched_ext: Fix exit_cpu accuracy for lockup paths 2026-05-31 15:25 [PATCH v6 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy Cheng-Yang Chou @ 2026-05-31 15:25 ` Cheng-Yang Chou 2026-06-09 5:10 ` Andrea Righi 2026-05-31 15:25 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou 1 sibling, 1 reply; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-31 15:25 UTC (permalink / raw) To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, Paul E . McKenney, rcu Cc: Ching-Chun Huang, Chia-Ping Tsai, Cheng-Yang Chou handle_lockup() uses raw_smp_processor_id() for exit_cpu, which is wrong for two paths: - scx_hardlockup_irq_workfn() has the hung CPU in a local variable but irq_work may run elsewhere. Pass the local cpu explicitly. - scx_rcu_cpu_stall() records the detector CPU rather than the stalled one. Pass -1 for now. The next patch fixes this properly. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> --- kernel/sched/ext.c | 12 +++++++----- kernel/sched/ext_internal.h | 2 -- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index ffad1a90196f..0c37b5fd58b0 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5205,6 +5205,7 @@ bool scx_allow_ttwu_queue(const struct task_struct *p) /** * handle_lockup - sched_ext common lockup handler + * @exit_cpu: CPU to record in exit_info. Pass the stalled/hung CPU, not current. * @fmt: format string * * Called on system stall or lockup condition and initiates abort of sched_ext @@ -5214,7 +5215,7 @@ bool scx_allow_ttwu_queue(const struct task_struct *p) * resolve the lockup. %false if sched_ext is not enabled or abort was already * initiated by someone else. */ -static __printf(1, 2) bool handle_lockup(const char *fmt, ...) +static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) { struct scx_sched *sch; va_list args; @@ -5230,7 +5231,7 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...) case SCX_ENABLING: case SCX_ENABLED: va_start(args, fmt); - ret = scx_verror(sch, fmt, args); + ret = scx_vexit(sch, SCX_EXIT_ERROR, 0, exit_cpu, fmt, args); va_end(args); return ret; default: @@ -5252,7 +5253,7 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...) */ bool scx_rcu_cpu_stall(void) { - return handle_lockup("RCU CPU stall detected!"); + return handle_lockup(-1, "RCU CPU stall detected!"); } /** @@ -5267,7 +5268,8 @@ bool scx_rcu_cpu_stall(void) */ void scx_softlockup(u32 dur_s) { - if (!handle_lockup("soft lockup - CPU %d stuck for %us", smp_processor_id(), dur_s)) + if (!handle_lockup(smp_processor_id(), "soft lockup - CPU %d stuck for %us", + smp_processor_id(), dur_s)) return; printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU %d stuck for %us, disabling BPF scheduler\n", @@ -5286,7 +5288,7 @@ static void scx_hardlockup_irq_workfn(struct irq_work *work) { int cpu = atomic_xchg(&scx_hardlockup_cpu, -1); - if (cpu >= 0 && handle_lockup("hard lockup - CPU %d", cpu)) + if (cpu >= 0 && handle_lockup(cpu, "hard lockup - CPU %d", cpu)) printk_deferred(KERN_ERR "sched_ext: Hard lockup - CPU %d, disabling BPF scheduler\n", cpu); } diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 9bb65367f510..2f15a4d3c534 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -1509,8 +1509,6 @@ __printf(5, 6) bool __scx_exit(struct scx_sched *sch, enum scx_exit_kind kind, __scx_exit(sch, kind, exit_code, raw_smp_processor_id(), fmt, ##args) #define scx_error(sch, fmt, args...) \ scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args) -#define scx_verror(sch, fmt, args) \ - scx_vexit((sch), SCX_EXIT_ERROR, 0, raw_smp_processor_id(), fmt, args) /* * Return the rq currently locked from an scx callback, or NULL if no rq is -- 2.48.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix exit_cpu accuracy for lockup paths 2026-05-31 15:25 ` [PATCH 1/2] sched_ext: Fix exit_cpu accuracy for lockup paths Cheng-Yang Chou @ 2026-06-09 5:10 ` Andrea Righi 0 siblings, 0 replies; 18+ messages in thread From: Andrea Righi @ 2026-06-09 5:10 UTC (permalink / raw) To: Cheng-Yang Chou Cc: sched-ext, Tejun Heo, David Vernet, Changwoo Min, Paul E . McKenney, rcu, Ching-Chun Huang, Chia-Ping Tsai Hi Cheng-Yang, On Sun, May 31, 2026 at 11:25:26PM +0800, Cheng-Yang Chou wrote: > handle_lockup() uses raw_smp_processor_id() for exit_cpu, which is wrong > for two paths: > > - scx_hardlockup_irq_workfn() has the hung CPU in a local variable but > irq_work may run elsewhere. Pass the local cpu explicitly. > - scx_rcu_cpu_stall() records the detector CPU rather than the stalled > one. Pass -1 for now. The next patch fixes this properly. > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Small nit below, apart than that looks good me. Reviewed-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext.c | 12 +++++++----- > kernel/sched/ext_internal.h | 2 -- > 2 files changed, 7 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index ffad1a90196f..0c37b5fd58b0 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -5205,6 +5205,7 @@ bool scx_allow_ttwu_queue(const struct task_struct *p) > > /** > * handle_lockup - sched_ext common lockup handler > + * @exit_cpu: CPU to record in exit_info. Pass the stalled/hung CPU, not current. > * @fmt: format string > * > * Called on system stall or lockup condition and initiates abort of sched_ext > @@ -5214,7 +5215,7 @@ bool scx_allow_ttwu_queue(const struct task_struct *p) > * resolve the lockup. %false if sched_ext is not enabled or abort was already > * initiated by someone else. > */ > -static __printf(1, 2) bool handle_lockup(const char *fmt, ...) > +static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > { > struct scx_sched *sch; > va_list args; > @@ -5230,7 +5231,7 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...) > case SCX_ENABLING: > case SCX_ENABLED: > va_start(args, fmt); > - ret = scx_verror(sch, fmt, args); > + ret = scx_vexit(sch, SCX_EXIT_ERROR, 0, exit_cpu, fmt, args); > va_end(args); > return ret; > default: > @@ -5252,7 +5253,7 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...) > */ > bool scx_rcu_cpu_stall(void) > { > - return handle_lockup("RCU CPU stall detected!"); > + return handle_lockup(-1, "RCU CPU stall detected!"); > } > > /** > @@ -5267,7 +5268,8 @@ bool scx_rcu_cpu_stall(void) > */ > void scx_softlockup(u32 dur_s) > { > - if (!handle_lockup("soft lockup - CPU %d stuck for %us", smp_processor_id(), dur_s)) > + if (!handle_lockup(smp_processor_id(), "soft lockup - CPU %d stuck for %us", > + smp_processor_id(), dur_s)) nit: maybe we can use smp_processor_id() once here, like: int cpu = smp_processor_id(); if (!handle_lockup(cpu, ..., cpu, dur_s)) Thanks, -Andrea ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-31 15:25 [PATCH v6 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy Cheng-Yang Chou 2026-05-31 15:25 ` [PATCH 1/2] sched_ext: Fix exit_cpu accuracy for lockup paths Cheng-Yang Chou @ 2026-05-31 15:25 ` Cheng-Yang Chou 2026-06-04 17:57 ` Paul E. McKenney 2026-06-09 8:06 ` Andrea Righi 1 sibling, 2 replies; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-31 15:25 UTC (permalink / raw) To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, Paul E . McKenney, rcu Cc: Ching-Chun Huang, Chia-Ping Tsai, Cheng-Yang Chou scx_rcu_cpu_stall() previously recorded the detector CPU rather than the stalled one, and the expedited grace period path had no stalled CPU to report at all. Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. Store the stall mask in scx_sched rather than scx_exit_info, keeping the BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus directly and dumps all stalled CPUs first to avoid losing them to truncation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> --- include/linux/sched/ext.h | 4 +- kernel/rcu/tree.c | 3 ++ kernel/rcu/tree_exp.h | 5 ++- kernel/rcu/tree_stall.h | 13 +++++-- kernel/sched/ext.c | 73 ++++++++++++++++++++++++++++++++----- kernel/sched/ext_internal.h | 1 + 6 files changed, 83 insertions(+), 16 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 20b2343aa344..75cb8b119fb7 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); void print_scx_info(const char *log_lvl, struct task_struct *p); void scx_softlockup(u32 dur_s); bool scx_hardlockup(int cpu); -bool scx_rcu_cpu_stall(void); +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); #else /* !CONFIG_SCHED_CLASS_EXT */ @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} static inline void scx_softlockup(u32 dur_s) {} static inline bool scx_hardlockup(int cpu) { return false; } -static inline bool scx_rcu_cpu_stall(void) { return false; } +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } #endif /* CONFIG_SCHED_CLASS_EXT */ diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 55df6d37145e..03c9651be5c0 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4871,6 +4871,9 @@ static void __init rcu_dump_rcu_node_tree(void) struct workqueue_struct *rcu_gp_wq; +static struct cpumask rcu_stall_cpumask; +static struct cpumask rcu_exp_stall_cpumask; + void __init rcu_init(void) { int cpu = smp_processor_id(); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 82cada459e5d..46b6907f1b09 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -578,6 +578,7 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne if (!(READ_ONCE(rnp->expmask) & mask)) continue; ndetected++; + cpumask_set_cpu(cpu, &rcu_exp_stall_cpumask); rdp = per_cpu_ptr(&rcu_data, cpu); pr_cont(" %d-%c%c%c%c", cpu, "O."[!!cpu_online(cpu)], @@ -665,6 +666,8 @@ static void synchronize_rcu_expedited_wait(void) if (rcu_stall_is_suppressed()) continue; + cpumask_clear(&rcu_exp_stall_cpumask); + nbcon_cpu_emergency_enter(); j = jiffies; @@ -675,7 +678,7 @@ static void synchronize_rcu_expedited_wait(void) nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + panic_on_rcu_stall(&rcu_exp_stall_cpumask); } } diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index b67532cb8770..d0c4f193f17e 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) early_initcall(check_cpu_stall_init); /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ -static void panic_on_rcu_stall(void) +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) { static int cpu_stall; @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) * Attempt to kick out the BPF scheduler if it's installed and defer * the panic to give the system a chance to recover. */ - if (scx_rcu_cpu_stall()) + if (scx_rcu_cpu_stall(stalled_mask)) return; if (++cpu_stall < sysctl_max_rcu_stall_to_panic) @@ -645,6 +645,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) if (rcu_stall_is_suppressed()) return; + cpumask_clear(&rcu_stall_cpumask); + nbcon_cpu_emergency_enter(); /* @@ -660,6 +662,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) for_each_leaf_node_possible_cpu(rnp, cpu) if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { print_cpu_stall_info(cpu); + cpumask_set_cpu(cpu, &rcu_stall_cpumask); ndetected++; } } @@ -701,7 +704,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + panic_on_rcu_stall(&rcu_stall_cpumask); rcu_force_quiescent_state(); /* Kick them all. */ } @@ -754,7 +757,9 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + cpumask_clear(&rcu_stall_cpumask); + cpumask_set_cpu(smp_processor_id(), &rcu_stall_cpumask); + panic_on_rcu_stall(&rcu_stall_cpumask); /* * Attempt to revive the RCU machinery by forcing a context switch. diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 0c37b5fd58b0..28009d08762b 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4966,6 +4966,8 @@ static const struct attribute_group scx_global_attr_group = { static void free_pnode(struct scx_sched_pnode *pnode); static void free_exit_info(struct scx_exit_info *ei); +static const char *scx_exit_reason(enum scx_exit_kind kind); +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); static s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch) { @@ -5022,6 +5024,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) timer_shutdown_sync(&sch->bypass_lb_timer); free_cpumask_var(sch->bypass_lb_donee_cpumask); free_cpumask_var(sch->bypass_lb_resched_cpumask); + free_cpumask_var(sch->stall_cpus); #ifdef CONFIG_EXT_SUB_SCHED kfree(sch->cgrp_path); @@ -5251,9 +5254,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) * resolve the reported RCU stall. %false if sched_ext is not enabled or someone * else already initiated abort. */ -bool scx_rcu_cpu_stall(void) +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { - return handle_lockup(-1, "RCU CPU stall detected!"); + struct scx_sched *sch; + struct scx_exit_info *ei; + int exit_cpu; + + guard(rcu)(); + + sch = rcu_dereference(scx_root); + if (unlikely(!sch)) + return false; + + switch (scx_enable_state()) { + case SCX_ENABLING: + case SCX_ENABLED: + break; + default: + return false; + } + + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); + ei = sch->exit_info; + + guard(preempt)(); + + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) + return false; + +#ifdef CONFIG_STACKTRACE + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); +#endif + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", + cpumask_pr_args(stalled_mask)); + ei->kind = SCX_EXIT_ERROR; + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); + ei->exit_cpu = exit_cpu; + cpumask_copy(sch->stall_cpus, stalled_mask); + + irq_work_queue(&sch->disable_irq_work); + return true; } /** @@ -6672,14 +6712,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, dump_line(&s, "----------"); /* - * Dump the exit CPU first so it isn't lost to dump truncation, then - * walk the rest in order, skipping the one already dumped. + * Dump stalled CPUs first so they aren't lost to dump truncation, then + * walk the rest in order. Fall back to exit_cpu if no stall mask set. */ - if (ei->exit_cpu >= 0) - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); - for_each_possible_cpu(cpu) { - if (cpu != ei->exit_cpu) + if (!cpumask_empty(sch->stall_cpus)) { + for_each_cpu(cpu, sch->stall_cpus) scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + for_each_possible_cpu(cpu) { + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + } + } else { + if (ei->exit_cpu >= 0) + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); + for_each_possible_cpu(cpu) { + if (cpu != ei->exit_cpu) + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + } } dump_newline(&s); @@ -6916,6 +6965,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, ret = -ENOMEM; goto err_free_lb_cpumask; } + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_free_lb_resched_cpumask; + } /* * Copy ops through the right union view. For cid-form the source is * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ @@ -6994,8 +7047,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, #ifdef CONFIG_EXT_SUB_SCHED err_free_lb_resched: RCU_INIT_POINTER(ops->priv, NULL); - free_cpumask_var(sch->bypass_lb_resched_cpumask); + free_cpumask_var(sch->stall_cpus); #endif +err_free_lb_resched_cpumask: + free_cpumask_var(sch->bypass_lb_resched_cpumask); err_free_lb_cpumask: free_cpumask_var(sch->bypass_lb_donee_cpumask); err_stop_helper: diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 2f15a4d3c534..f48dfda3facb 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -1199,6 +1199,7 @@ struct scx_sched { struct timer_list bypass_lb_timer; cpumask_var_t bypass_lb_donee_cpumask; cpumask_var_t bypass_lb_resched_cpumask; + cpumask_var_t stall_cpus; struct rcu_work rcu_work; /* all ancestors including self */ -- 2.48.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-31 15:25 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou @ 2026-06-04 17:57 ` Paul E. McKenney 2026-06-05 14:33 ` Cheng-Yang Chou 2026-06-09 8:06 ` Andrea Righi 1 sibling, 1 reply; 18+ messages in thread From: Paul E. McKenney @ 2026-06-04 17:57 UTC (permalink / raw) To: Cheng-Yang Chou Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai On Sun, May 31, 2026 at 11:25:27PM +0800, Cheng-Yang Chou wrote: > scx_rcu_cpu_stall() previously recorded the detector CPU rather than the > stalled one, and the expedited grace period path had no stalled CPU to > report at all. > > Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() > to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the > full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. > > Store the stall mask in scx_sched rather than scx_exit_info, keeping the > BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus > directly and dumps all stalled CPUs first to avoid losing them to > truncation. > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> From an RCU perspective: Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > --- > include/linux/sched/ext.h | 4 +- > kernel/rcu/tree.c | 3 ++ > kernel/rcu/tree_exp.h | 5 ++- > kernel/rcu/tree_stall.h | 13 +++++-- > kernel/sched/ext.c | 73 ++++++++++++++++++++++++++++++++----- > kernel/sched/ext_internal.h | 1 + > 6 files changed, 83 insertions(+), 16 deletions(-) > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index 20b2343aa344..75cb8b119fb7 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); > void print_scx_info(const char *log_lvl, struct task_struct *p); > void scx_softlockup(u32 dur_s); > bool scx_hardlockup(int cpu); > -bool scx_rcu_cpu_stall(void); > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); > > #else /* !CONFIG_SCHED_CLASS_EXT */ > > @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} > static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} > static inline void scx_softlockup(u32 dur_s) {} > static inline bool scx_hardlockup(int cpu) { return false; } > -static inline bool scx_rcu_cpu_stall(void) { return false; } > +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } > > #endif /* CONFIG_SCHED_CLASS_EXT */ > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 55df6d37145e..03c9651be5c0 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4871,6 +4871,9 @@ static void __init rcu_dump_rcu_node_tree(void) > > struct workqueue_struct *rcu_gp_wq; > > +static struct cpumask rcu_stall_cpumask; > +static struct cpumask rcu_exp_stall_cpumask; > + > void __init rcu_init(void) > { > int cpu = smp_processor_id(); > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 82cada459e5d..46b6907f1b09 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -578,6 +578,7 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne > if (!(READ_ONCE(rnp->expmask) & mask)) > continue; > ndetected++; > + cpumask_set_cpu(cpu, &rcu_exp_stall_cpumask); > rdp = per_cpu_ptr(&rcu_data, cpu); > pr_cont(" %d-%c%c%c%c", cpu, > "O."[!!cpu_online(cpu)], > @@ -665,6 +666,8 @@ static void synchronize_rcu_expedited_wait(void) > if (rcu_stall_is_suppressed()) > continue; > > + cpumask_clear(&rcu_exp_stall_cpumask); > + > nbcon_cpu_emergency_enter(); > > j = jiffies; > @@ -675,7 +678,7 @@ static void synchronize_rcu_expedited_wait(void) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(&rcu_exp_stall_cpumask); > } > } > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > index b67532cb8770..d0c4f193f17e 100644 > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h > @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) > early_initcall(check_cpu_stall_init); > > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ > -static void panic_on_rcu_stall(void) > +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) > { > static int cpu_stall; > > @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) > * Attempt to kick out the BPF scheduler if it's installed and defer > * the panic to give the system a chance to recover. > */ > - if (scx_rcu_cpu_stall()) > + if (scx_rcu_cpu_stall(stalled_mask)) > return; > > if (++cpu_stall < sysctl_max_rcu_stall_to_panic) > @@ -645,6 +645,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > if (rcu_stall_is_suppressed()) > return; > > + cpumask_clear(&rcu_stall_cpumask); > + > nbcon_cpu_emergency_enter(); > > /* > @@ -660,6 +662,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > for_each_leaf_node_possible_cpu(rnp, cpu) > if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { > print_cpu_stall_info(cpu); > + cpumask_set_cpu(cpu, &rcu_stall_cpumask); > ndetected++; > } > } > @@ -701,7 +704,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(&rcu_stall_cpumask); > > rcu_force_quiescent_state(); /* Kick them all. */ > } > @@ -754,7 +757,9 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + cpumask_clear(&rcu_stall_cpumask); > + cpumask_set_cpu(smp_processor_id(), &rcu_stall_cpumask); > + panic_on_rcu_stall(&rcu_stall_cpumask); > > /* > * Attempt to revive the RCU machinery by forcing a context switch. > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 0c37b5fd58b0..28009d08762b 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -4966,6 +4966,8 @@ static const struct attribute_group scx_global_attr_group = { > > static void free_pnode(struct scx_sched_pnode *pnode); > static void free_exit_info(struct scx_exit_info *ei); > +static const char *scx_exit_reason(enum scx_exit_kind kind); > +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); > > static s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch) > { > @@ -5022,6 +5024,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) > timer_shutdown_sync(&sch->bypass_lb_timer); > free_cpumask_var(sch->bypass_lb_donee_cpumask); > free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > > #ifdef CONFIG_EXT_SUB_SCHED > kfree(sch->cgrp_path); > @@ -5251,9 +5254,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > * else already initiated abort. > */ > -bool scx_rcu_cpu_stall(void) > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) > { > - return handle_lockup(-1, "RCU CPU stall detected!"); > + struct scx_sched *sch; > + struct scx_exit_info *ei; > + int exit_cpu; > + > + guard(rcu)(); > + > + sch = rcu_dereference(scx_root); > + if (unlikely(!sch)) > + return false; > + > + switch (scx_enable_state()) { > + case SCX_ENABLING: > + case SCX_ENABLED: > + break; > + default: > + return false; > + } > + > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); > + ei = sch->exit_info; > + > + guard(preempt)(); > + > + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) > + return false; > + > +#ifdef CONFIG_STACKTRACE > + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); > +#endif > + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", > + cpumask_pr_args(stalled_mask)); > + ei->kind = SCX_EXIT_ERROR; > + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); > + ei->exit_cpu = exit_cpu; > + cpumask_copy(sch->stall_cpus, stalled_mask); > + > + irq_work_queue(&sch->disable_irq_work); > + return true; > } > > /** > @@ -6672,14 +6712,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > dump_line(&s, "----------"); > > /* > - * Dump the exit CPU first so it isn't lost to dump truncation, then > - * walk the rest in order, skipping the one already dumped. > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > */ > - if (ei->exit_cpu >= 0) > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > - for_each_possible_cpu(cpu) { > - if (cpu != ei->exit_cpu) > + if (!cpumask_empty(sch->stall_cpus)) { > + for_each_cpu(cpu, sch->stall_cpus) > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > + } else { > + if (ei->exit_cpu >= 0) > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (cpu != ei->exit_cpu) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > } > > dump_newline(&s); > @@ -6916,6 +6965,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > ret = -ENOMEM; > goto err_free_lb_cpumask; > } > + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { > + ret = -ENOMEM; > + goto err_free_lb_resched_cpumask; > + } > /* > * Copy ops through the right union view. For cid-form the source is > * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ > @@ -6994,8 +7047,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > #ifdef CONFIG_EXT_SUB_SCHED > err_free_lb_resched: > RCU_INIT_POINTER(ops->priv, NULL); > - free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > #endif > +err_free_lb_resched_cpumask: > + free_cpumask_var(sch->bypass_lb_resched_cpumask); > err_free_lb_cpumask: > free_cpumask_var(sch->bypass_lb_donee_cpumask); > err_stop_helper: > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 2f15a4d3c534..f48dfda3facb 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -1199,6 +1199,7 @@ struct scx_sched { > struct timer_list bypass_lb_timer; > cpumask_var_t bypass_lb_donee_cpumask; > cpumask_var_t bypass_lb_resched_cpumask; > + cpumask_var_t stall_cpus; > struct rcu_work rcu_work; > > /* all ancestors including self */ > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-06-04 17:57 ` Paul E. McKenney @ 2026-06-05 14:33 ` Cheng-Yang Chou 0 siblings, 0 replies; 18+ messages in thread From: Cheng-Yang Chou @ 2026-06-05 14:33 UTC (permalink / raw) To: Paul E. McKenney Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai, chengyang.chou Hi Paul, On Thu, Jun 04, 2026 at 10:57:38AM -0700, Paul E. McKenney wrote: > On Sun, May 31, 2026 at 11:25:27PM +0800, Cheng-Yang Chou wrote: > > scx_rcu_cpu_stall() previously recorded the detector CPU rather than the > > stalled one, and the expedited grace period path had no stalled CPU to > > report at all. > > > > Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() > > to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the > > full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. > > > > Store the stall mask in scx_sched rather than scx_exit_info, keeping the > > BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus > > directly and dumps all stalled CPUs first to avoid losing them to > > truncation. > > > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> > > From an RCU perspective: > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Thanks for your guidance and review! ^0^ -- Cheers, Cheng-Yang ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-31 15:25 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou 2026-06-04 17:57 ` Paul E. McKenney @ 2026-06-09 8:06 ` Andrea Righi 1 sibling, 0 replies; 18+ messages in thread From: Andrea Righi @ 2026-06-09 8:06 UTC (permalink / raw) To: Cheng-Yang Chou Cc: sched-ext, Tejun Heo, David Vernet, Changwoo Min, Paul E . McKenney, Joel Fernandes, rcu, Ching-Chun Huang, Chia-Ping Tsai Hi Cheng-Yang, On Sun, May 31, 2026 at 11:25:27PM +0800, Cheng-Yang Chou wrote: > scx_rcu_cpu_stall() previously recorded the detector CPU rather than the > stalled one, and the expedited grace period path had no stalled CPU to > report at all. > > Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() > to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the > full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. > > Store the stall mask in scx_sched rather than scx_exit_info, keeping the > BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus > directly and dumps all stalled CPUs first to avoid losing them to > truncation. > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Looks good from a sched_ext perspective. Reviewed-by: Andrea Righi <arighi@nvidia.com> Thanks, -Andrea > --- > include/linux/sched/ext.h | 4 +- > kernel/rcu/tree.c | 3 ++ > kernel/rcu/tree_exp.h | 5 ++- > kernel/rcu/tree_stall.h | 13 +++++-- > kernel/sched/ext.c | 73 ++++++++++++++++++++++++++++++++----- > kernel/sched/ext_internal.h | 1 + > 6 files changed, 83 insertions(+), 16 deletions(-) > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index 20b2343aa344..75cb8b119fb7 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); > void print_scx_info(const char *log_lvl, struct task_struct *p); > void scx_softlockup(u32 dur_s); > bool scx_hardlockup(int cpu); > -bool scx_rcu_cpu_stall(void); > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); > > #else /* !CONFIG_SCHED_CLASS_EXT */ > > @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} > static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} > static inline void scx_softlockup(u32 dur_s) {} > static inline bool scx_hardlockup(int cpu) { return false; } > -static inline bool scx_rcu_cpu_stall(void) { return false; } > +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } > > #endif /* CONFIG_SCHED_CLASS_EXT */ > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 55df6d37145e..03c9651be5c0 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4871,6 +4871,9 @@ static void __init rcu_dump_rcu_node_tree(void) > > struct workqueue_struct *rcu_gp_wq; > > +static struct cpumask rcu_stall_cpumask; > +static struct cpumask rcu_exp_stall_cpumask; > + > void __init rcu_init(void) > { > int cpu = smp_processor_id(); > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 82cada459e5d..46b6907f1b09 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -578,6 +578,7 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne > if (!(READ_ONCE(rnp->expmask) & mask)) > continue; > ndetected++; > + cpumask_set_cpu(cpu, &rcu_exp_stall_cpumask); > rdp = per_cpu_ptr(&rcu_data, cpu); > pr_cont(" %d-%c%c%c%c", cpu, > "O."[!!cpu_online(cpu)], > @@ -665,6 +666,8 @@ static void synchronize_rcu_expedited_wait(void) > if (rcu_stall_is_suppressed()) > continue; > > + cpumask_clear(&rcu_exp_stall_cpumask); > + > nbcon_cpu_emergency_enter(); > > j = jiffies; > @@ -675,7 +678,7 @@ static void synchronize_rcu_expedited_wait(void) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(&rcu_exp_stall_cpumask); > } > } > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > index b67532cb8770..d0c4f193f17e 100644 > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h > @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) > early_initcall(check_cpu_stall_init); > > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ > -static void panic_on_rcu_stall(void) > +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) > { > static int cpu_stall; > > @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) > * Attempt to kick out the BPF scheduler if it's installed and defer > * the panic to give the system a chance to recover. > */ > - if (scx_rcu_cpu_stall()) > + if (scx_rcu_cpu_stall(stalled_mask)) > return; > > if (++cpu_stall < sysctl_max_rcu_stall_to_panic) > @@ -645,6 +645,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > if (rcu_stall_is_suppressed()) > return; > > + cpumask_clear(&rcu_stall_cpumask); > + > nbcon_cpu_emergency_enter(); > > /* > @@ -660,6 +662,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > for_each_leaf_node_possible_cpu(rnp, cpu) > if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { > print_cpu_stall_info(cpu); > + cpumask_set_cpu(cpu, &rcu_stall_cpumask); > ndetected++; > } > } > @@ -701,7 +704,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(&rcu_stall_cpumask); > > rcu_force_quiescent_state(); /* Kick them all. */ > } > @@ -754,7 +757,9 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + cpumask_clear(&rcu_stall_cpumask); > + cpumask_set_cpu(smp_processor_id(), &rcu_stall_cpumask); > + panic_on_rcu_stall(&rcu_stall_cpumask); > > /* > * Attempt to revive the RCU machinery by forcing a context switch. > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 0c37b5fd58b0..28009d08762b 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -4966,6 +4966,8 @@ static const struct attribute_group scx_global_attr_group = { > > static void free_pnode(struct scx_sched_pnode *pnode); > static void free_exit_info(struct scx_exit_info *ei); > +static const char *scx_exit_reason(enum scx_exit_kind kind); > +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); > > static s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch) > { > @@ -5022,6 +5024,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) > timer_shutdown_sync(&sch->bypass_lb_timer); > free_cpumask_var(sch->bypass_lb_donee_cpumask); > free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > > #ifdef CONFIG_EXT_SUB_SCHED > kfree(sch->cgrp_path); > @@ -5251,9 +5254,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > * else already initiated abort. > */ > -bool scx_rcu_cpu_stall(void) > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) > { > - return handle_lockup(-1, "RCU CPU stall detected!"); > + struct scx_sched *sch; > + struct scx_exit_info *ei; > + int exit_cpu; > + > + guard(rcu)(); > + > + sch = rcu_dereference(scx_root); > + if (unlikely(!sch)) > + return false; > + > + switch (scx_enable_state()) { > + case SCX_ENABLING: > + case SCX_ENABLED: > + break; > + default: > + return false; > + } > + > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); > + ei = sch->exit_info; > + > + guard(preempt)(); > + > + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) > + return false; > + > +#ifdef CONFIG_STACKTRACE > + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); > +#endif > + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", > + cpumask_pr_args(stalled_mask)); > + ei->kind = SCX_EXIT_ERROR; > + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); > + ei->exit_cpu = exit_cpu; > + cpumask_copy(sch->stall_cpus, stalled_mask); > + > + irq_work_queue(&sch->disable_irq_work); > + return true; > } > > /** > @@ -6672,14 +6712,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > dump_line(&s, "----------"); > > /* > - * Dump the exit CPU first so it isn't lost to dump truncation, then > - * walk the rest in order, skipping the one already dumped. > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > */ > - if (ei->exit_cpu >= 0) > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > - for_each_possible_cpu(cpu) { > - if (cpu != ei->exit_cpu) > + if (!cpumask_empty(sch->stall_cpus)) { > + for_each_cpu(cpu, sch->stall_cpus) > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > + } else { > + if (ei->exit_cpu >= 0) > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (cpu != ei->exit_cpu) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > } > > dump_newline(&s); > @@ -6916,6 +6965,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > ret = -ENOMEM; > goto err_free_lb_cpumask; > } > + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { > + ret = -ENOMEM; > + goto err_free_lb_resched_cpumask; > + } > /* > * Copy ops through the right union view. For cid-form the source is > * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ > @@ -6994,8 +7047,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > #ifdef CONFIG_EXT_SUB_SCHED > err_free_lb_resched: > RCU_INIT_POINTER(ops->priv, NULL); > - free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > #endif > +err_free_lb_resched_cpumask: > + free_cpumask_var(sch->bypass_lb_resched_cpumask); > err_free_lb_cpumask: > free_cpumask_var(sch->bypass_lb_donee_cpumask); > err_stop_helper: > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 2f15a4d3c534..f48dfda3facb 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -1199,6 +1199,7 @@ struct scx_sched { > struct timer_list bypass_lb_timer; > cpumask_var_t bypass_lb_donee_cpumask; > cpumask_var_t bypass_lb_resched_cpumask; > + cpumask_var_t stall_cpus; > struct rcu_work rcu_work; > > /* all ancestors including self */ > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v5 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy
@ 2026-05-21 16:16 Cheng-Yang Chou
2026-05-21 16:16 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou
0 siblings, 1 reply; 18+ messages in thread
From: Cheng-Yang Chou @ 2026-05-21 16:16 UTC (permalink / raw)
To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
Paul E . McKenney, rcu
Cc: Ching-Chun Huang, Chia-Ping Tsai, Cheng-Yang Chou
Follow-up of Tejun's suggestion [1], discussed in v2 and the subsequent
review thread [2].
Patch 1 fixes exit_cpu accuracy for the hard lockup and softlockup
paths. For the RCU stall path, -1 is used as a placeholder until
patch 2 threads the cpumask.
Patch 2 upgrades panic_on_rcu_stall() and scx_rcu_cpu_stall() to
accept a cpumask of stalled CPUs, stored in scx_sched and piped
directly to scx_dump_state().
Based on sched_ext/for-next (515e3996a4c2).
Changes in v5:
- Replace dynamic cpumask allocation in both RCU stall paths with static
rcu_stall_cpumask and rcu_exp_stall_cpumask (Paul McKenney).
- Link to v4:
https://lore.kernel.org/r/20260519171745.1551340-1-yphbchou0911@gmail.com/
Changes in v4:
- Store stall mask in scx_sched rather than scx_exit_info (Tejun Heo).
- Use __GFP_NOWARN for cpumask allocations in both RCU stall paths.
- Link to v3:
https://lore.kernel.org/r/20260518131311.1170786-1-yphbchou0911@gmail.com/
Changes in v3:
- Drop patch 1 of v2 ("Normalize exit dump header to 'on CPU N'"),
already applied to sched_ext/for-7.2.
- Replace single stalled_cpu int with const struct cpumask * in
panic_on_rcu_stall() and scx_rcu_cpu_stall() (Paul McKenney, Tejun Heo).
- Thread cpumask through the expedited stall path (Tejun Heo).
- Falls back to cpu_none_mask on allocation failure.
- Link to v2:
https://lore.kernel.org/r/20260504161543.674488-1-yphbchou0911@gmail.com/
Changes in v2:
- Use raw_smp_processor_id() in synchronize_rcu_expedited_wait() to
avoid CONFIG_DEBUG_PREEMPT splat.
- Link to v1:
https://lore.kernel.org/r/20260501131521.161852-1-yphbchou0911@gmail.com/
[1]: https://lore.kernel.org/r/e7cbfc99b52b4b7059267bb81498179f@kernel.org/
[2]: https://lore.kernel.org/r/af0Gn47y1Lj2BAqd@slm.duckdns.org/
Thanks,
Cheng-Yang
---
Cheng-Yang Chou (2):
sched_ext: Fix exit_cpu accuracy for lockup paths
sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled
CPUs
include/linux/sched/ext.h | 4 +-
kernel/rcu/tree.c | 3 ++
kernel/rcu/tree_exp.h | 10 +++--
kernel/rcu/tree_stall.h | 13 ++++--
kernel/sched/ext.c | 83 +++++++++++++++++++++++++++++++------
kernel/sched/ext_internal.h | 3 +-
6 files changed, 92 insertions(+), 24 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 18+ messages in thread* [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-21 16:16 [PATCH v5 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy Cheng-Yang Chou @ 2026-05-21 16:16 ` Cheng-Yang Chou 2026-05-21 17:05 ` sashiko-bot 2026-05-27 23:19 ` Paul E. McKenney 0 siblings, 2 replies; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-21 16:16 UTC (permalink / raw) To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, Paul E . McKenney, rcu Cc: Ching-Chun Huang, Chia-Ping Tsai, Cheng-Yang Chou scx_rcu_cpu_stall() previously recorded the detector CPU rather than the stalled one, and the expedited grace period path had no stalled CPU to report at all. Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. Store the stall mask in scx_sched rather than scx_exit_info, keeping the BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus directly and dumps all stalled CPUs first to avoid losing them to truncation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> --- include/linux/sched/ext.h | 4 +- kernel/rcu/tree.c | 3 ++ kernel/rcu/tree_exp.h | 10 +++-- kernel/rcu/tree_stall.h | 13 +++++-- kernel/sched/ext.c | 73 ++++++++++++++++++++++++++++++++----- kernel/sched/ext_internal.h | 1 + 6 files changed, 86 insertions(+), 18 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 20b2343aa344..75cb8b119fb7 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); void print_scx_info(const char *log_lvl, struct task_struct *p); void scx_softlockup(u32 dur_s); bool scx_hardlockup(int cpu); -bool scx_rcu_cpu_stall(void); +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); #else /* !CONFIG_SCHED_CLASS_EXT */ @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} static inline void scx_softlockup(u32 dur_s) {} static inline bool scx_hardlockup(int cpu) { return false; } -static inline bool scx_rcu_cpu_stall(void) { return false; } +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } #endif /* CONFIG_SCHED_CLASS_EXT */ diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 55df6d37145e..03c9651be5c0 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4871,6 +4871,9 @@ static void __init rcu_dump_rcu_node_tree(void) struct workqueue_struct *rcu_gp_wq; +static struct cpumask rcu_stall_cpumask; +static struct cpumask rcu_exp_stall_cpumask; + void __init rcu_init(void) { int cpu = smp_processor_id(); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 82cada459e5d..0a9a509c8c91 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -555,7 +555,8 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) /* * Print out an expedited RCU CPU stall warning message. */ -static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j) +static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j, + struct cpumask *stalled_mask) { int cpu; unsigned long mask; @@ -578,6 +579,7 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne if (!(READ_ONCE(rnp->expmask) & mask)) continue; ndetected++; + cpumask_set_cpu(cpu, stalled_mask); rdp = per_cpu_ptr(&rcu_data, cpu); pr_cont(" %d-%c%c%c%c", cpu, "O."[!!cpu_online(cpu)], @@ -665,17 +667,19 @@ static void synchronize_rcu_expedited_wait(void) if (rcu_stall_is_suppressed()) continue; + cpumask_clear(&rcu_exp_stall_cpumask); + nbcon_cpu_emergency_enter(); j = jiffies; rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start)); trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); - synchronize_rcu_expedited_stall(jiffies_start, j); + synchronize_rcu_expedited_stall(jiffies_start, j, &rcu_exp_stall_cpumask); jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3; nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + panic_on_rcu_stall(&rcu_exp_stall_cpumask); } } diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index b67532cb8770..d0c4f193f17e 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) early_initcall(check_cpu_stall_init); /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ -static void panic_on_rcu_stall(void) +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) { static int cpu_stall; @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) * Attempt to kick out the BPF scheduler if it's installed and defer * the panic to give the system a chance to recover. */ - if (scx_rcu_cpu_stall()) + if (scx_rcu_cpu_stall(stalled_mask)) return; if (++cpu_stall < sysctl_max_rcu_stall_to_panic) @@ -645,6 +645,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) if (rcu_stall_is_suppressed()) return; + cpumask_clear(&rcu_stall_cpumask); + nbcon_cpu_emergency_enter(); /* @@ -660,6 +662,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) for_each_leaf_node_possible_cpu(rnp, cpu) if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { print_cpu_stall_info(cpu); + cpumask_set_cpu(cpu, &rcu_stall_cpumask); ndetected++; } } @@ -701,7 +704,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + panic_on_rcu_stall(&rcu_stall_cpumask); rcu_force_quiescent_state(); /* Kick them all. */ } @@ -754,7 +757,9 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + cpumask_clear(&rcu_stall_cpumask); + cpumask_set_cpu(smp_processor_id(), &rcu_stall_cpumask); + panic_on_rcu_stall(&rcu_stall_cpumask); /* * Attempt to revive the RCU machinery by forcing a context switch. diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 22b1356aa4eb..7bf3c22d573e 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4948,6 +4948,8 @@ static const struct attribute_group scx_global_attr_group = { static void free_pnode(struct scx_sched_pnode *pnode); static void free_exit_info(struct scx_exit_info *ei); +static const char *scx_exit_reason(enum scx_exit_kind kind); +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); static void scx_sched_free_rcu_work(struct work_struct *work) { @@ -4962,6 +4964,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) timer_shutdown_sync(&sch->bypass_lb_timer); free_cpumask_var(sch->bypass_lb_donee_cpumask); free_cpumask_var(sch->bypass_lb_resched_cpumask); + free_cpumask_var(sch->stall_cpus); #ifdef CONFIG_EXT_SUB_SCHED kfree(sch->cgrp_path); @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) * resolve the reported RCU stall. %false if sched_ext is not enabled or someone * else already initiated abort. */ -bool scx_rcu_cpu_stall(void) +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { - return handle_lockup(-1, "RCU CPU stall detected!"); + struct scx_sched *sch; + struct scx_exit_info *ei; + int exit_cpu; + + guard(rcu)(); + + sch = rcu_dereference(scx_root); + if (unlikely(!sch)) + return false; + + switch (scx_enable_state()) { + case SCX_ENABLING: + case SCX_ENABLED: + break; + default: + return false; + } + + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); + ei = sch->exit_info; + + guard(preempt)(); + + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) + return false; + +#ifdef CONFIG_STACKTRACE + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); +#endif + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", + cpumask_pr_args(stalled_mask)); + ei->kind = SCX_EXIT_ERROR; + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); + ei->exit_cpu = exit_cpu; + cpumask_copy(sch->stall_cpus, stalled_mask); + + irq_work_queue(&sch->disable_irq_work); + return true; } /** @@ -6608,14 +6648,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, dump_line(&s, "----------"); /* - * Dump the exit CPU first so it isn't lost to dump truncation, then - * walk the rest in order, skipping the one already dumped. + * Dump stalled CPUs first so they aren't lost to dump truncation, then + * walk the rest in order. Fall back to exit_cpu if no stall mask set. */ - if (ei->exit_cpu >= 0) - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); - for_each_possible_cpu(cpu) { - if (cpu != ei->exit_cpu) + if (!cpumask_empty(sch->stall_cpus)) { + for_each_cpu(cpu, sch->stall_cpus) scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + for_each_possible_cpu(cpu) { + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + } + } else { + if (ei->exit_cpu >= 0) + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); + for_each_possible_cpu(cpu) { + if (cpu != ei->exit_cpu) + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + } } dump_newline(&s); @@ -6851,6 +6900,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, ret = -ENOMEM; goto err_free_lb_cpumask; } + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_free_lb_resched_cpumask; + } /* * Copy ops through the right union view. For cid-form the source is * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ @@ -6920,8 +6973,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, #ifdef CONFIG_EXT_SUB_SCHED err_free_lb_resched: RCU_INIT_POINTER(ops->priv, NULL); - free_cpumask_var(sch->bypass_lb_resched_cpumask); + free_cpumask_var(sch->stall_cpus); #endif +err_free_lb_resched_cpumask: + free_cpumask_var(sch->bypass_lb_resched_cpumask); err_free_lb_cpumask: free_cpumask_var(sch->bypass_lb_donee_cpumask); err_stop_helper: diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 3ccfea06cba0..f51f03868ff7 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -1178,6 +1178,7 @@ struct scx_sched { struct timer_list bypass_lb_timer; cpumask_var_t bypass_lb_donee_cpumask; cpumask_var_t bypass_lb_resched_cpumask; + cpumask_var_t stall_cpus; struct rcu_work rcu_work; /* all ancestors including self */ -- 2.48.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-21 16:16 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou @ 2026-05-21 17:05 ` sashiko-bot 2026-05-27 23:19 ` Paul E. McKenney 1 sibling, 0 replies; 18+ messages in thread From: sashiko-bot @ 2026-05-21 17:05 UTC (permalink / raw) To: Cheng-Yang Chou; +Cc: sched-ext Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider: - [High] Concurrent execution of RCU stall detection introduces a data race on the global `rcu_stall_cpumask`, leading to a Time-Of-Check to Time-Of-Use (TOCTOU) bug that causes an out-of-bounds array access in `scx_dump_state()`. -- commit f19732cfebfb0ed5b2408491122054e9d0b8b9ff Author: Cheng-Yang Chou <yphbchou0911@gmail.com> sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs This commit threads a cpumask through the RCU stall warning paths to capture all stalled CPUs. It stores this mask in the sched_ext dump structures so that all stalled CPUs can be dumped first, avoiding truncation. > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4871,6 +4871,9 @@ static void __init rcu_dump_rcu_node_tree(void) > > struct workqueue_struct *rcu_gp_wq; > > +static struct cpumask rcu_stall_cpumask; > +static struct cpumask rcu_exp_stall_cpumask; > + > void __init rcu_init(void) [ ... ] > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h [ ... ] > @@ -645,6 +645,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > if (rcu_stall_is_suppressed()) > return; > > + cpumask_clear(&rcu_stall_cpumask); > + > nbcon_cpu_emergency_enter(); > > /* [ ... ] > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c [ ... ] > @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > * else already initiated abort. > */ > -bool scx_rcu_cpu_stall(void) > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) > { > - return handle_lockup(-1, "RCU CPU stall detected!"); > + struct scx_sched *sch; > + struct scx_exit_info *ei; > + int exit_cpu; > + > + guard(rcu)(); > + > + sch = rcu_dereference(scx_root); > + if (unlikely(!sch)) > + return false; > + > + switch (scx_enable_state()) { > + case SCX_ENABLING: > + case SCX_ENABLED: > + break; > + default: > + return false; > + } > + > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); [Severity: High] If multiple CPUs concurrently execute the RCU stall detection, can they race on the global rcu_stall_cpumask variable? If the mask is concurrently cleared between cpumask_empty() and cpumask_first(), couldn't cpumask_first() return nr_cpumask_bits? This would result in exit_cpu being set to nr_cpumask_bits (which is greater than or equal to nr_cpu_ids). > + ei = sch->exit_info; [ ... ] > @@ -6608,14 +6648,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > dump_line(&s, "----------"); > > /* > - * Dump the exit CPU first so it isn't lost to dump truncation, then > - * walk the rest in order, skipping the one already dumped. > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > */ > - if (ei->exit_cpu >= 0) > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > - for_each_possible_cpu(cpu) { > - if (cpu != ei->exit_cpu) > + if (!cpumask_empty(sch->stall_cpus)) { > + for_each_cpu(cpu, sch->stall_cpus) > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > + } else { > + if (ei->exit_cpu >= 0) > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); [Severity: High] If ei->exit_cpu was set to nr_cpumask_bits from the race described above, will this cause an out-of-bounds array access? Since scx_dump_cpu() uses the cpu_rq(cpu) macro without bounds checking, does this risk an out-of-bounds read in the per-cpu runqueues array that could disrupt the system's crash dump process? > + for_each_possible_cpu(cpu) { > + if (cpu != ei->exit_cpu) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > } > > dump_newline(&s); -- Sashiko AI review · https://sashiko.dev/#/patchset/20260521161636.1893894-1-yphbchou0911@gmail.com?part=2 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-21 16:16 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou 2026-05-21 17:05 ` sashiko-bot @ 2026-05-27 23:19 ` Paul E. McKenney 2026-05-29 15:51 ` Cheng-Yang Chou 1 sibling, 1 reply; 18+ messages in thread From: Paul E. McKenney @ 2026-05-27 23:19 UTC (permalink / raw) To: Cheng-Yang Chou Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai On Fri, May 22, 2026 at 12:16:10AM +0800, Cheng-Yang Chou wrote: > scx_rcu_cpu_stall() previously recorded the detector CPU rather than the > stalled one, and the expedited grace period path had no stalled CPU to > report at all. > > Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() > to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the > full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. > > Store the stall mask in scx_sched rather than scx_exit_info, keeping the > BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus > directly and dumps all stalled CPUs first to avoid losing them to > truncation. > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Much better! One question below. Thanx, Paul > --- > include/linux/sched/ext.h | 4 +- > kernel/rcu/tree.c | 3 ++ > kernel/rcu/tree_exp.h | 10 +++-- > kernel/rcu/tree_stall.h | 13 +++++-- > kernel/sched/ext.c | 73 ++++++++++++++++++++++++++++++++----- > kernel/sched/ext_internal.h | 1 + > 6 files changed, 86 insertions(+), 18 deletions(-) > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index 20b2343aa344..75cb8b119fb7 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); > void print_scx_info(const char *log_lvl, struct task_struct *p); > void scx_softlockup(u32 dur_s); > bool scx_hardlockup(int cpu); > -bool scx_rcu_cpu_stall(void); > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); > > #else /* !CONFIG_SCHED_CLASS_EXT */ > > @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} > static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} > static inline void scx_softlockup(u32 dur_s) {} > static inline bool scx_hardlockup(int cpu) { return false; } > -static inline bool scx_rcu_cpu_stall(void) { return false; } > +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } > > #endif /* CONFIG_SCHED_CLASS_EXT */ > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 55df6d37145e..03c9651be5c0 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4871,6 +4871,9 @@ static void __init rcu_dump_rcu_node_tree(void) > > struct workqueue_struct *rcu_gp_wq; > > +static struct cpumask rcu_stall_cpumask; > +static struct cpumask rcu_exp_stall_cpumask; This works! ;-) > void __init rcu_init(void) > { > int cpu = smp_processor_id(); > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 82cada459e5d..0a9a509c8c91 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -555,7 +555,8 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) > /* > * Print out an expedited RCU CPU stall warning message. > */ > -static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j) > +static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j, > + struct cpumask *stalled_mask) > { > int cpu; > unsigned long mask; > @@ -578,6 +579,7 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne > if (!(READ_ONCE(rnp->expmask) & mask)) > continue; > ndetected++; > + cpumask_set_cpu(cpu, stalled_mask); > rdp = per_cpu_ptr(&rcu_data, cpu); > pr_cont(" %d-%c%c%c%c", cpu, > "O."[!!cpu_online(cpu)], > @@ -665,17 +667,19 @@ static void synchronize_rcu_expedited_wait(void) > if (rcu_stall_is_suppressed()) > continue; > > + cpumask_clear(&rcu_exp_stall_cpumask); > + > nbcon_cpu_emergency_enter(); > > j = jiffies; > rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start)); > trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); > - synchronize_rcu_expedited_stall(jiffies_start, j); > + synchronize_rcu_expedited_stall(jiffies_start, j, &rcu_exp_stall_cpumask); Are we ever going to pass some other cpumask to this function? I would not expect that the rcu_stall_cpumask would ever be passed in here, for example. And the two definitions in tree.c are visible to the synchronize_rcu_expedited_stall() function. Or am I missing something here? > jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3; > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(&rcu_exp_stall_cpumask); Now this one I get: This function is called from both normal and expedited RCU CPU stall code. > } > } > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > index b67532cb8770..d0c4f193f17e 100644 > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h > @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) > early_initcall(check_cpu_stall_init); > > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ > -static void panic_on_rcu_stall(void) > +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) > { > static int cpu_stall; > > @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) > * Attempt to kick out the BPF scheduler if it's installed and defer > * the panic to give the system a chance to recover. > */ > - if (scx_rcu_cpu_stall()) > + if (scx_rcu_cpu_stall(stalled_mask)) > return; > > if (++cpu_stall < sysctl_max_rcu_stall_to_panic) > @@ -645,6 +645,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > if (rcu_stall_is_suppressed()) > return; > > + cpumask_clear(&rcu_stall_cpumask); > + > nbcon_cpu_emergency_enter(); > > /* > @@ -660,6 +662,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > for_each_leaf_node_possible_cpu(rnp, cpu) > if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { > print_cpu_stall_info(cpu); > + cpumask_set_cpu(cpu, &rcu_stall_cpumask); > ndetected++; > } > } > @@ -701,7 +704,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(&rcu_stall_cpumask); > > rcu_force_quiescent_state(); /* Kick them all. */ > } > @@ -754,7 +757,9 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + cpumask_clear(&rcu_stall_cpumask); > + cpumask_set_cpu(smp_processor_id(), &rcu_stall_cpumask); > + panic_on_rcu_stall(&rcu_stall_cpumask); > > /* > * Attempt to revive the RCU machinery by forcing a context switch. > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 22b1356aa4eb..7bf3c22d573e 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -4948,6 +4948,8 @@ static const struct attribute_group scx_global_attr_group = { > > static void free_pnode(struct scx_sched_pnode *pnode); > static void free_exit_info(struct scx_exit_info *ei); > +static const char *scx_exit_reason(enum scx_exit_kind kind); > +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); > > static void scx_sched_free_rcu_work(struct work_struct *work) > { > @@ -4962,6 +4964,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) > timer_shutdown_sync(&sch->bypass_lb_timer); > free_cpumask_var(sch->bypass_lb_donee_cpumask); > free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > > #ifdef CONFIG_EXT_SUB_SCHED > kfree(sch->cgrp_path); > @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > * else already initiated abort. > */ > -bool scx_rcu_cpu_stall(void) > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) > { > - return handle_lockup(-1, "RCU CPU stall detected!"); > + struct scx_sched *sch; > + struct scx_exit_info *ei; > + int exit_cpu; > + > + guard(rcu)(); > + > + sch = rcu_dereference(scx_root); > + if (unlikely(!sch)) > + return false; > + > + switch (scx_enable_state()) { > + case SCX_ENABLING: > + case SCX_ENABLED: > + break; > + default: > + return false; > + } > + > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); > + ei = sch->exit_info; > + > + guard(preempt)(); > + > + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) > + return false; > + > +#ifdef CONFIG_STACKTRACE > + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); > +#endif > + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", > + cpumask_pr_args(stalled_mask)); > + ei->kind = SCX_EXIT_ERROR; > + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); > + ei->exit_cpu = exit_cpu; > + cpumask_copy(sch->stall_cpus, stalled_mask); > + > + irq_work_queue(&sch->disable_irq_work); > + return true; > } > > /** > @@ -6608,14 +6648,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > dump_line(&s, "----------"); > > /* > - * Dump the exit CPU first so it isn't lost to dump truncation, then > - * walk the rest in order, skipping the one already dumped. > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > */ > - if (ei->exit_cpu >= 0) > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > - for_each_possible_cpu(cpu) { > - if (cpu != ei->exit_cpu) > + if (!cpumask_empty(sch->stall_cpus)) { > + for_each_cpu(cpu, sch->stall_cpus) > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > + } else { > + if (ei->exit_cpu >= 0) > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (cpu != ei->exit_cpu) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > } > > dump_newline(&s); > @@ -6851,6 +6900,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > ret = -ENOMEM; > goto err_free_lb_cpumask; > } > + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { > + ret = -ENOMEM; > + goto err_free_lb_resched_cpumask; > + } > /* > * Copy ops through the right union view. For cid-form the source is > * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ > @@ -6920,8 +6973,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > #ifdef CONFIG_EXT_SUB_SCHED > err_free_lb_resched: > RCU_INIT_POINTER(ops->priv, NULL); > - free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > #endif > +err_free_lb_resched_cpumask: > + free_cpumask_var(sch->bypass_lb_resched_cpumask); > err_free_lb_cpumask: > free_cpumask_var(sch->bypass_lb_donee_cpumask); > err_stop_helper: > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 3ccfea06cba0..f51f03868ff7 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -1178,6 +1178,7 @@ struct scx_sched { > struct timer_list bypass_lb_timer; > cpumask_var_t bypass_lb_donee_cpumask; > cpumask_var_t bypass_lb_resched_cpumask; > + cpumask_var_t stall_cpus; > struct rcu_work rcu_work; > > /* all ancestors including self */ > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-27 23:19 ` Paul E. McKenney @ 2026-05-29 15:51 ` Cheng-Yang Chou 0 siblings, 0 replies; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-29 15:51 UTC (permalink / raw) To: Paul E. McKenney Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai Hi Paul, On Wed, May 27, 2026 at 04:19:08PM -0700, Paul E. McKenney wrote: [...] > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 82cada459e5d..0a9a509c8c91 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > @@ -555,7 +555,8 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) > > /* > > * Print out an expedited RCU CPU stall warning message. > > */ > > -static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j) > > +static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j, > > + struct cpumask *stalled_mask) > > { > > int cpu; > > unsigned long mask; > > @@ -578,6 +579,7 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne > > if (!(READ_ONCE(rnp->expmask) & mask)) > > continue; > > ndetected++; > > + cpumask_set_cpu(cpu, stalled_mask); > > rdp = per_cpu_ptr(&rcu_data, cpu); > > pr_cont(" %d-%c%c%c%c", cpu, > > "O."[!!cpu_online(cpu)], > > @@ -665,17 +667,19 @@ static void synchronize_rcu_expedited_wait(void) > > if (rcu_stall_is_suppressed()) > > continue; > > > > + cpumask_clear(&rcu_exp_stall_cpumask); > > + > > nbcon_cpu_emergency_enter(); > > > > j = jiffies; > > rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start)); > > trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); > > - synchronize_rcu_expedited_stall(jiffies_start, j); > > + synchronize_rcu_expedited_stall(jiffies_start, j, &rcu_exp_stall_cpumask); > > Are we ever going to pass some other cpumask to this function? > > I would not expect that the rcu_stall_cpumask would ever be passed in > here, for example. And the two definitions in tree.c are visible to > the synchronize_rcu_expedited_stall() function. Agree, drop in v6, thanks! > > Or am I missing something here? > > > jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3; > > > > nbcon_cpu_emergency_exit(); > > > > - panic_on_rcu_stall(); > > + panic_on_rcu_stall(&rcu_exp_stall_cpumask); > > Now this one I get: This function is called from both normal and expedited > RCU CPU stall code. > > > } > > } > > > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > > index b67532cb8770..d0c4f193f17e 100644 > > --- a/kernel/rcu/tree_stall.h > > +++ b/kernel/rcu/tree_stall.h > > @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) > > early_initcall(check_cpu_stall_init); > > > > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ > > -static void panic_on_rcu_stall(void) > > +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) > > { > > static int cpu_stall; > > > > @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) > > * Attempt to kick out the BPF scheduler if it's installed and defer > > * the panic to give the system a chance to recover. > > */ > > - if (scx_rcu_cpu_stall()) > > + if (scx_rcu_cpu_stall(stalled_mask)) > > return; > > > > if (++cpu_stall < sysctl_max_rcu_stall_to_panic) > > @@ -645,6 +645,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > if (rcu_stall_is_suppressed()) > > return; > > > > + cpumask_clear(&rcu_stall_cpumask); > > + > > nbcon_cpu_emergency_enter(); > > > > /* > > @@ -660,6 +662,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > for_each_leaf_node_possible_cpu(rnp, cpu) > > if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { > > print_cpu_stall_info(cpu); > > + cpumask_set_cpu(cpu, &rcu_stall_cpumask); > > ndetected++; > > } > > } > > @@ -701,7 +704,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > > nbcon_cpu_emergency_exit(); > > > > - panic_on_rcu_stall(); > > + panic_on_rcu_stall(&rcu_stall_cpumask); > > > > rcu_force_quiescent_state(); /* Kick them all. */ > > } > > @@ -754,7 +757,9 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > > nbcon_cpu_emergency_exit(); > > > > - panic_on_rcu_stall(); > > + cpumask_clear(&rcu_stall_cpumask); > > + cpumask_set_cpu(smp_processor_id(), &rcu_stall_cpumask); > > + panic_on_rcu_stall(&rcu_stall_cpumask); > > > > /* > > * Attempt to revive the RCU machinery by forcing a context switch. > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index 22b1356aa4eb..7bf3c22d573e 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > > @@ -4948,6 +4948,8 @@ static const struct attribute_group scx_global_attr_group = { > > > > static void free_pnode(struct scx_sched_pnode *pnode); > > static void free_exit_info(struct scx_exit_info *ei); > > +static const char *scx_exit_reason(enum scx_exit_kind kind); > > +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); > > > > static void scx_sched_free_rcu_work(struct work_struct *work) > > { > > @@ -4962,6 +4964,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) > > timer_shutdown_sync(&sch->bypass_lb_timer); > > free_cpumask_var(sch->bypass_lb_donee_cpumask); > > free_cpumask_var(sch->bypass_lb_resched_cpumask); > > + free_cpumask_var(sch->stall_cpus); > > > > #ifdef CONFIG_EXT_SUB_SCHED > > kfree(sch->cgrp_path); > > @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > > * else already initiated abort. > > */ > > -bool scx_rcu_cpu_stall(void) > > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) > > { > > - return handle_lockup(-1, "RCU CPU stall detected!"); > > + struct scx_sched *sch; > > + struct scx_exit_info *ei; > > + int exit_cpu; > > + > > + guard(rcu)(); > > + > > + sch = rcu_dereference(scx_root); > > + if (unlikely(!sch)) > > + return false; > > + > > + switch (scx_enable_state()) { > > + case SCX_ENABLING: > > + case SCX_ENABLED: > > + break; > > + default: > > + return false; > > + } > > + > > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); > > + ei = sch->exit_info; > > + > > + guard(preempt)(); > > + > > + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) > > + return false; > > + > > +#ifdef CONFIG_STACKTRACE > > + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); > > +#endif > > + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", > > + cpumask_pr_args(stalled_mask)); > > + ei->kind = SCX_EXIT_ERROR; > > + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); > > + ei->exit_cpu = exit_cpu; > > + cpumask_copy(sch->stall_cpus, stalled_mask); > > + > > + irq_work_queue(&sch->disable_irq_work); > > + return true; > > } > > > > /** > > @@ -6608,14 +6648,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > > dump_line(&s, "----------"); > > > > /* > > - * Dump the exit CPU first so it isn't lost to dump truncation, then > > - * walk the rest in order, skipping the one already dumped. > > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > > */ > > - if (ei->exit_cpu >= 0) > > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > > - for_each_possible_cpu(cpu) { > > - if (cpu != ei->exit_cpu) > > + if (!cpumask_empty(sch->stall_cpus)) { > > + for_each_cpu(cpu, sch->stall_cpus) > > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > + for_each_possible_cpu(cpu) { > > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > + } > > + } else { > > + if (ei->exit_cpu >= 0) > > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > > + for_each_possible_cpu(cpu) { > > + if (cpu != ei->exit_cpu) > > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > + } > > } > > > > dump_newline(&s); > > @@ -6851,6 +6900,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > > ret = -ENOMEM; > > goto err_free_lb_cpumask; > > } > > + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { > > + ret = -ENOMEM; > > + goto err_free_lb_resched_cpumask; > > + } > > /* > > * Copy ops through the right union view. For cid-form the source is > > * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ > > @@ -6920,8 +6973,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > > #ifdef CONFIG_EXT_SUB_SCHED > > err_free_lb_resched: > > RCU_INIT_POINTER(ops->priv, NULL); > > - free_cpumask_var(sch->bypass_lb_resched_cpumask); > > + free_cpumask_var(sch->stall_cpus); > > #endif > > +err_free_lb_resched_cpumask: > > + free_cpumask_var(sch->bypass_lb_resched_cpumask); > > err_free_lb_cpumask: > > free_cpumask_var(sch->bypass_lb_donee_cpumask); > > err_stop_helper: > > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > > index 3ccfea06cba0..f51f03868ff7 100644 > > --- a/kernel/sched/ext_internal.h > > +++ b/kernel/sched/ext_internal.h > > @@ -1178,6 +1178,7 @@ struct scx_sched { > > struct timer_list bypass_lb_timer; > > cpumask_var_t bypass_lb_donee_cpumask; > > cpumask_var_t bypass_lb_resched_cpumask; > > + cpumask_var_t stall_cpus; > > struct rcu_work rcu_work; > > > > /* all ancestors including self */ > > -- > > 2.48.1 > > -- Cheers, Cheng-Yang ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy
@ 2026-05-19 17:17 Cheng-Yang Chou
2026-05-19 17:17 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou
0 siblings, 1 reply; 18+ messages in thread
From: Cheng-Yang Chou @ 2026-05-19 17:17 UTC (permalink / raw)
To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
Paul E . McKenney, rcu
Cc: Ching-Chun Huang, Chia-Ping Tsai, Cheng-Yang Chou
Follow-up of Tejun's suggestion [1], discussed in v2 and the subsequent
review thread [2].
Patch 1 fixes exit_cpu accuracy for the hard lockup and softlockup
paths. For the RCU stall path, -1 is used as a placeholder until
patch 2 threads the cpumask.
Patch 2 upgrades panic_on_rcu_stall() and scx_rcu_cpu_stall() to
accept a cpumask of stalled CPUs, stored in scx_sched and piped
directly to scx_dump_state().
Based on sched_ext/for-next (d83f90561532).
Changes in v4:
- Store stall mask in scx_sched rather than scx_exit_info (Tejun Heo).
- Use __GFP_NOWARN for cpumask allocations in both RCU stall paths.
- Link to v3:
https://lore.kernel.org/r/20260518131311.1170786-1-yphbchou0911@gmail.com/
Changes in v3:
- Drop patch 1 of v2 ("Normalize exit dump header to 'on CPU N'"),
already applied to sched_ext/for-7.2.
- Replace single stalled_cpu int with const struct cpumask * in
panic_on_rcu_stall() and scx_rcu_cpu_stall() (Paul McKenney, Tejun Heo).
- Thread cpumask through the expedited stall path (Tejun Heo).
- Falls back to cpu_none_mask on allocation failure.
- Link to v2:
https://lore.kernel.org/r/20260504161543.674488-1-yphbchou0911@gmail.com/
Changes in v2:
- Use raw_smp_processor_id() in synchronize_rcu_expedited_wait() to
avoid CONFIG_DEBUG_PREEMPT splat.
- Link to v1:
https://lore.kernel.org/r/20260501131521.161852-1-yphbchou0911@gmail.com/
[1]: https://lore.kernel.org/r/e7cbfc99b52b4b7059267bb81498179f@kernel.org/
[2]: https://lore.kernel.org/r/af0Gn47y1Lj2BAqd@slm.duckdns.org/
Thanks,
Cheng-Yang
---
Cheng-Yang Chou (2):
sched_ext: Fix exit_cpu accuracy for lockup paths
sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled
CPUs
include/linux/sched/ext.h | 4 +-
kernel/rcu/tree_exp.h | 14 +++++--
kernel/rcu/tree_stall.h | 15 +++++--
kernel/sched/ext.c | 84 +++++++++++++++++++++++++++++++------
kernel/sched/ext_internal.h | 3 +-
5 files changed, 96 insertions(+), 24 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 18+ messages in thread* [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-19 17:17 [PATCH v4 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy Cheng-Yang Chou @ 2026-05-19 17:17 ` Cheng-Yang Chou 2026-05-19 17:52 ` sashiko-bot 2026-05-19 23:48 ` Paul E. McKenney 0 siblings, 2 replies; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-19 17:17 UTC (permalink / raw) To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, Paul E . McKenney, rcu Cc: Ching-Chun Huang, Chia-Ping Tsai, Cheng-Yang Chou scx_rcu_cpu_stall() previously recorded the detector CPU rather than the stalled one, and the expedited grace period path had no stalled CPU to report at all. Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. Falls back to cpu_none_mask on allocation failure. Store the stall mask in scx_sched rather than scx_exit_info, keeping the BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus directly and dumps all stalled CPUs first to avoid losing them to truncation. Use __GFP_NOWARN for the cpumask allocations in print_other_cpu_stall() and synchronize_rcu_expedited_wait(). Both allocate before calling nbcon_cpu_emergency_enter(). A warn_alloc() failure printout would use the normal printk path and risk deadlock if a stalled CPU holds a console lock. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> --- include/linux/sched/ext.h | 4 +- kernel/rcu/tree_exp.h | 14 +++++-- kernel/rcu/tree_stall.h | 15 ++++++-- kernel/sched/ext.c | 76 ++++++++++++++++++++++++++++++++----- kernel/sched/ext_internal.h | 1 + 5 files changed, 91 insertions(+), 19 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 20b2343aa344..75cb8b119fb7 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); void print_scx_info(const char *log_lvl, struct task_struct *p); void scx_softlockup(u32 dur_s); bool scx_hardlockup(int cpu); -bool scx_rcu_cpu_stall(void); +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); #else /* !CONFIG_SCHED_CLASS_EXT */ @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} static inline void scx_softlockup(u32 dur_s) {} static inline bool scx_hardlockup(int cpu) { return false; } -static inline bool scx_rcu_cpu_stall(void) { return false; } +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } #endif /* CONFIG_SCHED_CLASS_EXT */ diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 82cada459e5d..b8f4ca984c1c 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -555,7 +555,8 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) /* * Print out an expedited RCU CPU stall warning message. */ -static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j) +static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j, + struct cpumask *stalled_mask) { int cpu; unsigned long mask; @@ -578,6 +579,8 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne if (!(READ_ONCE(rnp->expmask) & mask)) continue; ndetected++; + if (stalled_mask) + cpumask_set_cpu(cpu, stalled_mask); rdp = per_cpu_ptr(&rcu_data, cpu); pr_cont(" %d-%c%c%c%c", cpu, "O."[!!cpu_online(cpu)], @@ -659,23 +662,28 @@ static void synchronize_rcu_expedited_wait(void) for (;;) { unsigned long j; + cpumask_var_t stalled_mask; + bool have_mask; if (synchronize_rcu_expedited_wait_once(jiffies_stall)) return; if (rcu_stall_is_suppressed()) continue; + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); + nbcon_cpu_emergency_enter(); j = jiffies; rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start)); trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); - synchronize_rcu_expedited_stall(jiffies_start, j); + synchronize_rcu_expedited_stall(jiffies_start, j, have_mask ? stalled_mask : NULL); jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3; nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); + free_cpumask_var(stalled_mask); } } diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index b67532cb8770..90d6eaf1b841 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) early_initcall(check_cpu_stall_init); /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ -static void panic_on_rcu_stall(void) +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) { static int cpu_stall; @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) * Attempt to kick out the BPF scheduler if it's installed and defer * the panic to give the system a chance to recover. */ - if (scx_rcu_cpu_stall()) + if (scx_rcu_cpu_stall(stalled_mask)) return; if (++cpu_stall < sysctl_max_rcu_stall_to_panic) @@ -631,6 +631,8 @@ static void rcu_check_gp_kthread_expired_fqs_timer(void) static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) { int cpu; + cpumask_var_t stalled_mask; + bool have_mask; unsigned long flags; unsigned long gpa; unsigned long j; @@ -645,6 +647,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) if (rcu_stall_is_suppressed()) return; + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); + nbcon_cpu_emergency_enter(); /* @@ -660,6 +664,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) for_each_leaf_node_possible_cpu(rnp, cpu) if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { print_cpu_stall_info(cpu); + if (have_mask) + cpumask_set_cpu(cpu, stalled_mask); ndetected++; } } @@ -701,7 +707,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); + free_cpumask_var(stalled_mask); rcu_force_quiescent_state(); /* Kick them all. */ } @@ -754,7 +761,7 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) nbcon_cpu_emergency_exit(); - panic_on_rcu_stall(); + panic_on_rcu_stall(cpumask_of(smp_processor_id())); /* * Attempt to revive the RCU machinery by forcing a context switch. diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 22b1356aa4eb..28797467f82a 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4948,6 +4948,8 @@ static const struct attribute_group scx_global_attr_group = { static void free_pnode(struct scx_sched_pnode *pnode); static void free_exit_info(struct scx_exit_info *ei); +static const char *scx_exit_reason(enum scx_exit_kind kind); +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); static void scx_sched_free_rcu_work(struct work_struct *work) { @@ -4962,6 +4964,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) timer_shutdown_sync(&sch->bypass_lb_timer); free_cpumask_var(sch->bypass_lb_donee_cpumask); free_cpumask_var(sch->bypass_lb_resched_cpumask); + free_cpumask_var(sch->stall_cpus); #ifdef CONFIG_EXT_SUB_SCHED kfree(sch->cgrp_path); @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) * resolve the reported RCU stall. %false if sched_ext is not enabled or someone * else already initiated abort. */ -bool scx_rcu_cpu_stall(void) +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { - return handle_lockup(-1, "RCU CPU stall detected!"); + struct scx_sched *sch; + struct scx_exit_info *ei; + int exit_cpu; + + guard(rcu)(); + + sch = rcu_dereference(scx_root); + if (unlikely(!sch)) + return false; + + switch (scx_enable_state()) { + case SCX_ENABLING: + case SCX_ENABLED: + break; + default: + return false; + } + + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); + ei = sch->exit_info; + + guard(preempt)(); + + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) + return false; + +#ifdef CONFIG_STACKTRACE + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); +#endif + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", + cpumask_pr_args(stalled_mask)); + ei->kind = SCX_EXIT_ERROR; + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); + ei->exit_cpu = exit_cpu; + cpumask_copy(sch->stall_cpus, stalled_mask); + + irq_work_queue(&sch->disable_irq_work); + return true; } /** @@ -5204,7 +5244,8 @@ bool scx_rcu_cpu_stall(void) */ void scx_softlockup(u32 dur_s) { - if (!handle_lockup(smp_processor_id(), "soft lockup - CPU %d stuck for %us", + if (!handle_lockup(smp_processor_id(), + "soft lockup - CPU %d stuck for %us", smp_processor_id(), dur_s)) return; @@ -6608,14 +6649,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, dump_line(&s, "----------"); /* - * Dump the exit CPU first so it isn't lost to dump truncation, then - * walk the rest in order, skipping the one already dumped. + * Dump stalled CPUs first so they aren't lost to dump truncation, then + * walk the rest in order. Fall back to exit_cpu if no stall mask set. */ - if (ei->exit_cpu >= 0) - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); - for_each_possible_cpu(cpu) { - if (cpu != ei->exit_cpu) + if (!cpumask_empty(sch->stall_cpus)) { + for_each_cpu(cpu, sch->stall_cpus) scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + for_each_possible_cpu(cpu) { + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + } + } else { + if (ei->exit_cpu >= 0) + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); + for_each_possible_cpu(cpu) { + if (cpu != ei->exit_cpu) + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); + } } dump_newline(&s); @@ -6851,6 +6901,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, ret = -ENOMEM; goto err_free_lb_cpumask; } + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_free_lb_resched_cpumask; + } /* * Copy ops through the right union view. For cid-form the source is * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ @@ -6920,8 +6974,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, #ifdef CONFIG_EXT_SUB_SCHED err_free_lb_resched: RCU_INIT_POINTER(ops->priv, NULL); - free_cpumask_var(sch->bypass_lb_resched_cpumask); + free_cpumask_var(sch->stall_cpus); #endif +err_free_lb_resched_cpumask: + free_cpumask_var(sch->bypass_lb_resched_cpumask); err_free_lb_cpumask: free_cpumask_var(sch->bypass_lb_donee_cpumask); err_stop_helper: diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 3ccfea06cba0..f51f03868ff7 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -1178,6 +1178,7 @@ struct scx_sched { struct timer_list bypass_lb_timer; cpumask_var_t bypass_lb_donee_cpumask; cpumask_var_t bypass_lb_resched_cpumask; + cpumask_var_t stall_cpus; struct rcu_work rcu_work; /* all ancestors including self */ -- 2.48.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-19 17:17 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou @ 2026-05-19 17:52 ` sashiko-bot 2026-05-19 18:22 ` Cheng-Yang Chou 2026-05-19 23:48 ` Paul E. McKenney 1 sibling, 1 reply; 18+ messages in thread From: sashiko-bot @ 2026-05-19 17:52 UTC (permalink / raw) To: Cheng-Yang Chou; +Cc: sched-ext Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider: - [High] Dynamic memory allocation (`GFP_ATOMIC`) in the RCU stall detector path can cause a hard deadlock, masking the system hang instead of reporting it. -- commit ffd18e78b03bfaba9705c61803e8bd0bb6325d38 Author: Cheng-Yang Chou <yphbchou0911@gmail.com> sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs This commit updates the RCU stall paths to capture all stalled CPUs in a cpumask rather than just the detecting CPU. The cpumask is passed through the stall handlers and dumped first during a sched_ext state dump to avoid losing the stalled CPUs to truncation. > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 82cada459e5d0..b8f4ca984c1c9 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h [ ... ] > @@ -659,23 +662,28 @@ static void synchronize_rcu_expedited_wait(void) > > for (;;) { > unsigned long j; > + cpumask_var_t stalled_mask; > + bool have_mask; > > if (synchronize_rcu_expedited_wait_once(jiffies_stall)) > return; > if (rcu_stall_is_suppressed()) > continue; > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > + [Severity: High] Could this allocation cause a hard deadlock during an RCU stall? If CONFIG_CPUMASK_OFFSTACK=y is set, zalloc_cpumask_var() performs a dynamic memory allocation via kmalloc(). If the stalled CPU is stuck holding a memory allocator lock (such as zone->lock or the SLUB list_lock), the detecting CPU will deadlock trying to acquire it. Additionally, GFP_ATOMIC includes __GFP_KSWAPD_RECLAIM, which can wake up kswapd and acquire scheduler locks (pi_lock, rq->lock). If the stalled CPU holds one of these scheduler locks, it could also deadlock here. This would completely prevent the RCU stall warning from being printed and mask the root cause of the hang. Would it be safer to use a statically preallocated cpumask for stall reporting? > nbcon_cpu_emergency_enter(); > > j = jiffies; [ ... ] > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > index b67532cb87705..90d6eaf1b8415 100644 > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h [ ... ] > @@ -645,6 +647,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > if (rcu_stall_is_suppressed()) > return; > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > + [Severity: High] Can this dynamic allocation also trigger a hard deadlock? print_other_cpu_stall() executes in a critical diagnostic context with IRQs disabled (enforced by lockdep_assert_irqs_disabled()). Similar to the issue in synchronize_rcu_expedited_wait(), allocating memory here might deadlock if the stalled CPU holds a lock required by the memory allocator or scheduler. > nbcon_cpu_emergency_enter(); > > /* -- Sashiko AI review · https://sashiko.dev/#/patchset/20260519171745.1551340-1-yphbchou0911@gmail.com?part=2 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-19 17:52 ` sashiko-bot @ 2026-05-19 18:22 ` Cheng-Yang Chou 0 siblings, 0 replies; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-19 18:22 UTC (permalink / raw) To: sashiko-reviews; +Cc: sched-ext On Tue, May 19, 2026 at 05:52:29PM +0000, sashiko-bot@kernel.org wrote: > Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider: > - [High] Dynamic memory allocation (`GFP_ATOMIC`) in the RCU stall detector path can cause a hard deadlock, masking the system hang instead of reporting it. > -- > > commit ffd18e78b03bfaba9705c61803e8bd0bb6325d38 > Author: Cheng-Yang Chou <yphbchou0911@gmail.com> > > sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs > > This commit updates the RCU stall paths to capture all stalled CPUs in a > cpumask rather than just the detecting CPU. The cpumask is passed through > the stall handlers and dumped first during a sched_ext state dump to avoid > losing the stalled CPUs to truncation. > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 82cada459e5d0..b8f4ca984c1c9 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > [ ... ] > > > @@ -659,23 +662,28 @@ static void synchronize_rcu_expedited_wait(void) > > > > for (;;) { > > unsigned long j; > > + cpumask_var_t stalled_mask; > > + bool have_mask; > > > > if (synchronize_rcu_expedited_wait_once(jiffies_stall)) > > return; > > if (rcu_stall_is_suppressed()) > > continue; > > > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > > + > > [Severity: High] > Could this allocation cause a hard deadlock during an RCU stall? > > If CONFIG_CPUMASK_OFFSTACK=y is set, zalloc_cpumask_var() performs a dynamic > memory allocation via kmalloc(). If the stalled CPU is stuck holding a memory > allocator lock (such as zone->lock or the SLUB list_lock), the detecting CPU > will deadlock trying to acquire it. > > Additionally, GFP_ATOMIC includes __GFP_KSWAPD_RECLAIM, which can wake up > kswapd and acquire scheduler locks (pi_lock, rq->lock). If the stalled > CPU holds one of these scheduler locks, it could also deadlock here. > > This would completely prevent the RCU stall warning from being printed > and mask the root cause of the hang. Would it be safer to use a > statically preallocated cpumask for stall reporting? I think the concern is valid, will fix this by pre-allocating static cpumask in rcu_init() and clearing at the start of each stall report. > > > nbcon_cpu_emergency_enter(); > > > > j = jiffies; > > [ ... ] > > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > > index b67532cb87705..90d6eaf1b8415 100644 > > --- a/kernel/rcu/tree_stall.h > > +++ b/kernel/rcu/tree_stall.h > > [ ... ] > > > @@ -645,6 +647,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > if (rcu_stall_is_suppressed()) > > return; > > > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > > + > > [Severity: High] > Can this dynamic allocation also trigger a hard deadlock? > > print_other_cpu_stall() executes in a critical diagnostic context with > IRQs disabled (enforced by lockdep_assert_irqs_disabled()). Similar to > the issue in synchronize_rcu_expedited_wait(), allocating memory here > might deadlock if the stalled CPU holds a lock required by the memory > allocator or scheduler. > Ditto. > > nbcon_cpu_emergency_enter(); > > > > /* > > -- > Sashiko AI review · https://sashiko.dev/#/patchset/20260519171745.1551340-1-yphbchou0911@gmail.com?part=2 -- Cheers, Cheng-Yang ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-19 17:17 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou 2026-05-19 17:52 ` sashiko-bot @ 2026-05-19 23:48 ` Paul E. McKenney 2026-05-20 14:56 ` Cheng-Yang Chou 1 sibling, 1 reply; 18+ messages in thread From: Paul E. McKenney @ 2026-05-19 23:48 UTC (permalink / raw) To: Cheng-Yang Chou Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai On Wed, May 20, 2026 at 01:17:31AM +0800, Cheng-Yang Chou wrote: > scx_rcu_cpu_stall() previously recorded the detector CPU rather than the > stalled one, and the expedited grace period path had no stalled CPU to > report at all. > > Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() > to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the > full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. > Falls back to cpu_none_mask on allocation failure. > > Store the stall mask in scx_sched rather than scx_exit_info, keeping the > BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus > directly and dumps all stalled CPUs first to avoid losing them to > truncation. > > Use __GFP_NOWARN for the cpumask allocations in print_other_cpu_stall() > and synchronize_rcu_expedited_wait(). Both allocate before calling > nbcon_cpu_emergency_enter(). A warn_alloc() failure printout would use > the normal printk path and risk deadlock if a stalled CPU holds a > console lock. > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> More comments below, mostly about care and feeding of cpumasks. > --- > include/linux/sched/ext.h | 4 +- > kernel/rcu/tree_exp.h | 14 +++++-- > kernel/rcu/tree_stall.h | 15 ++++++-- > kernel/sched/ext.c | 76 ++++++++++++++++++++++++++++++++----- > kernel/sched/ext_internal.h | 1 + > 5 files changed, 91 insertions(+), 19 deletions(-) > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index 20b2343aa344..75cb8b119fb7 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); > void print_scx_info(const char *log_lvl, struct task_struct *p); > void scx_softlockup(u32 dur_s); > bool scx_hardlockup(int cpu); > -bool scx_rcu_cpu_stall(void); > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); See below for cpumask_var_t discussion. > #else /* !CONFIG_SCHED_CLASS_EXT */ > > @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} > static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} > static inline void scx_softlockup(u32 dur_s) {} > static inline bool scx_hardlockup(int cpu) { return false; } > -static inline bool scx_rcu_cpu_stall(void) { return false; } > +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } Ditto. > #endif /* CONFIG_SCHED_CLASS_EXT */ > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 82cada459e5d..b8f4ca984c1c 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -555,7 +555,8 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) > /* > * Print out an expedited RCU CPU stall warning message. > */ > -static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j) > +static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j, > + struct cpumask *stalled_mask) See below for cpumask_var_t discussion. > { > int cpu; > unsigned long mask; > @@ -578,6 +579,8 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne > if (!(READ_ONCE(rnp->expmask) & mask)) > continue; > ndetected++; > + if (stalled_mask) To his should instead be: if (cpumask_available(stalled_mask)) Otherwise, this breaks in kernels built with CONFIG_CPUMASK_OFFSTACK=n. > + cpumask_set_cpu(cpu, stalled_mask); > rdp = per_cpu_ptr(&rcu_data, cpu); > pr_cont(" %d-%c%c%c%c", cpu, > "O."[!!cpu_online(cpu)], > @@ -659,23 +662,28 @@ static void synchronize_rcu_expedited_wait(void) > > for (;;) { > unsigned long j; > + cpumask_var_t stalled_mask; > + bool have_mask; > > if (synchronize_rcu_expedited_wait_once(jiffies_stall)) > return; > if (rcu_stall_is_suppressed()) > continue; > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > + > nbcon_cpu_emergency_enter(); > > j = jiffies; > rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start)); > trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); > - synchronize_rcu_expedited_stall(jiffies_start, j); > + synchronize_rcu_expedited_stall(jiffies_start, j, have_mask ? stalled_mask : NULL); An alternative to the ?: is to just pass the mask in and have synchronize_rcu_expedited_stall(), or whatever it calls, invoke cpumask_available() on it. > jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3; > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); And the same here, but for panic_on_rcu_stall(). > + free_cpumask_var(stalled_mask); And free_cpumask_var() is happy to ignore a NULL pointer, so it is OK to invoke it on the pointer passed to a failing call to zalloc_cpumask_var(). > } > } > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > index b67532cb8770..90d6eaf1b841 100644 > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h > @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) > early_initcall(check_cpu_stall_init); > > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ > -static void panic_on_rcu_stall(void) > +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) I suggest "cpumask_var_t" instead of "struct cpumask *", because cpumask_var_t better handles on-stack vs. offstack cpumasks. > { > static int cpu_stall; > > @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) > * Attempt to kick out the BPF scheduler if it's installed and defer > * the panic to give the system a chance to recover. > */ > - if (scx_rcu_cpu_stall()) > + if (scx_rcu_cpu_stall(stalled_mask)) > return; > > if (++cpu_stall < sysctl_max_rcu_stall_to_panic) > @@ -631,6 +631,8 @@ static void rcu_check_gp_kthread_expired_fqs_timer(void) > static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > { > int cpu; > + cpumask_var_t stalled_mask; > + bool have_mask; Use of cpumask_available() should allow you to drop have_mask. > unsigned long flags; > unsigned long gpa; > unsigned long j; > @@ -645,6 +647,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > if (rcu_stall_is_suppressed()) > return; > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > + > nbcon_cpu_emergency_enter(); > > /* > @@ -660,6 +664,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > for_each_leaf_node_possible_cpu(rnp, cpu) > if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { > print_cpu_stall_info(cpu); > + if (have_mask) > + cpumask_set_cpu(cpu, stalled_mask); > ndetected++; > } > } > @@ -701,7 +707,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); > + free_cpumask_var(stalled_mask); > > rcu_force_quiescent_state(); /* Kick them all. */ > } > @@ -754,7 +761,7 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > > nbcon_cpu_emergency_exit(); > > - panic_on_rcu_stall(); > + panic_on_rcu_stall(cpumask_of(smp_processor_id())); > > /* > * Attempt to revive the RCU machinery by forcing a context switch. > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 22b1356aa4eb..28797467f82a 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -4948,6 +4948,8 @@ static const struct attribute_group scx_global_attr_group = { > > static void free_pnode(struct scx_sched_pnode *pnode); > static void free_exit_info(struct scx_exit_info *ei); > +static const char *scx_exit_reason(enum scx_exit_kind kind); > +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); > > static void scx_sched_free_rcu_work(struct work_struct *work) > { > @@ -4962,6 +4964,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) > timer_shutdown_sync(&sch->bypass_lb_timer); > free_cpumask_var(sch->bypass_lb_donee_cpumask); > free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > > #ifdef CONFIG_EXT_SUB_SCHED > kfree(sch->cgrp_path); > @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > * else already initiated abort. > */ > -bool scx_rcu_cpu_stall(void) > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) And cpumask_var_t here as well. > { > - return handle_lockup(-1, "RCU CPU stall detected!"); > + struct scx_sched *sch; > + struct scx_exit_info *ei; > + int exit_cpu; > + > + guard(rcu)(); > + > + sch = rcu_dereference(scx_root); > + if (unlikely(!sch)) > + return false; > + > + switch (scx_enable_state()) { > + case SCX_ENABLING: > + case SCX_ENABLED: > + break; > + default: > + return false; > + } > + > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); And here you need a cpumask_available(). Please note that cpumask_empty() does not play nicely with NULL pointers. So maybe something like this: exit_cpu = (!cpumask_available(stalled_mask) || cpumask_empty(stalled_mask)) ? -1 : (int)cpumask_first(stalled_mask); Or maybe the same thing, but with an "if" statement. > + ei = sch->exit_info; > + > + guard(preempt)(); > + > + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) > + return false; > + > +#ifdef CONFIG_STACKTRACE > + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); > +#endif > + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", > + cpumask_pr_args(stalled_mask)); > + ei->kind = SCX_EXIT_ERROR; > + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); > + ei->exit_cpu = exit_cpu; > + cpumask_copy(sch->stall_cpus, stalled_mask); > + > + irq_work_queue(&sch->disable_irq_work); > + return true; > } > > /** > @@ -5204,7 +5244,8 @@ bool scx_rcu_cpu_stall(void) > */ > void scx_softlockup(u32 dur_s) > { > - if (!handle_lockup(smp_processor_id(), "soft lockup - CPU %d stuck for %us", > + if (!handle_lockup(smp_processor_id(), > + "soft lockup - CPU %d stuck for %us", > smp_processor_id(), dur_s)) > return; > > @@ -6608,14 +6649,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > dump_line(&s, "----------"); > > /* > - * Dump the exit CPU first so it isn't lost to dump truncation, then > - * walk the rest in order, skipping the one already dumped. > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > */ > - if (ei->exit_cpu >= 0) > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > - for_each_possible_cpu(cpu) { > - if (cpu != ei->exit_cpu) > + if (!cpumask_empty(sch->stall_cpus)) { > + for_each_cpu(cpu, sch->stall_cpus) > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > + } else { > + if (ei->exit_cpu >= 0) > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > + for_each_possible_cpu(cpu) { > + if (cpu != ei->exit_cpu) > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > + } > } > > dump_newline(&s); > @@ -6851,6 +6901,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > ret = -ENOMEM; > goto err_free_lb_cpumask; > } > + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { > + ret = -ENOMEM; > + goto err_free_lb_resched_cpumask; This is sufficient to make sure that sch->stall_cpus is never used, correct? Otherwise, there needs to be a bunch of cpumask_available() checks guarding this uses of ->stall_cpus. Thanx, Paul > + } > /* > * Copy ops through the right union view. For cid-form the source is > * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ > @@ -6920,8 +6974,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > #ifdef CONFIG_EXT_SUB_SCHED > err_free_lb_resched: > RCU_INIT_POINTER(ops->priv, NULL); > - free_cpumask_var(sch->bypass_lb_resched_cpumask); > + free_cpumask_var(sch->stall_cpus); > #endif > +err_free_lb_resched_cpumask: > + free_cpumask_var(sch->bypass_lb_resched_cpumask); > err_free_lb_cpumask: > free_cpumask_var(sch->bypass_lb_donee_cpumask); > err_stop_helper: > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 3ccfea06cba0..f51f03868ff7 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -1178,6 +1178,7 @@ struct scx_sched { > struct timer_list bypass_lb_timer; > cpumask_var_t bypass_lb_donee_cpumask; > cpumask_var_t bypass_lb_resched_cpumask; > + cpumask_var_t stall_cpus; > struct rcu_work rcu_work; > > /* all ancestors including self */ > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-19 23:48 ` Paul E. McKenney @ 2026-05-20 14:56 ` Cheng-Yang Chou 2026-05-20 16:35 ` Paul E. McKenney 0 siblings, 1 reply; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-20 14:56 UTC (permalink / raw) To: Paul E. McKenney Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai Hi Paul, On Tue, May 19, 2026 at 04:48:01PM -0700, Paul E. McKenney wrote: > On Wed, May 20, 2026 at 01:17:31AM +0800, Cheng-Yang Chou wrote: > > scx_rcu_cpu_stall() previously recorded the detector CPU rather than the > > stalled one, and the expedited grace period path had no stalled CPU to > > report at all. > > > > Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() > > to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the > > full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. > > Falls back to cpu_none_mask on allocation failure. > > > > Store the stall mask in scx_sched rather than scx_exit_info, keeping the > > BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus > > directly and dumps all stalled CPUs first to avoid losing them to > > truncation. > > > > Use __GFP_NOWARN for the cpumask allocations in print_other_cpu_stall() > > and synchronize_rcu_expedited_wait(). Both allocate before calling > > nbcon_cpu_emergency_enter(). A warn_alloc() failure printout would use > > the normal printk path and risk deadlock if a stalled CPU holds a > > console lock. > > > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> > > More comments below, mostly about care and feeding of cpumasks. Thanks for the review, and only one question below. > > > --- > > include/linux/sched/ext.h | 4 +- > > kernel/rcu/tree_exp.h | 14 +++++-- > > kernel/rcu/tree_stall.h | 15 ++++++-- > > kernel/sched/ext.c | 76 ++++++++++++++++++++++++++++++++----- > > kernel/sched/ext_internal.h | 1 + > > 5 files changed, 91 insertions(+), 19 deletions(-) > > > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > > index 20b2343aa344..75cb8b119fb7 100644 > > --- a/include/linux/sched/ext.h > > +++ b/include/linux/sched/ext.h > > @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); > > void print_scx_info(const char *log_lvl, struct task_struct *p); > > void scx_softlockup(u32 dur_s); > > bool scx_hardlockup(int cpu); > > -bool scx_rcu_cpu_stall(void); > > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); > > See below for cpumask_var_t discussion. > > > #else /* !CONFIG_SCHED_CLASS_EXT */ > > > > @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} > > static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} > > static inline void scx_softlockup(u32 dur_s) {} > > static inline bool scx_hardlockup(int cpu) { return false; } > > -static inline bool scx_rcu_cpu_stall(void) { return false; } > > +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } > > Ditto. > > > #endif /* CONFIG_SCHED_CLASS_EXT */ > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 82cada459e5d..b8f4ca984c1c 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > @@ -555,7 +555,8 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) > > /* > > * Print out an expedited RCU CPU stall warning message. > > */ > > -static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j) > > +static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j, > > + struct cpumask *stalled_mask) > > See below for cpumask_var_t discussion. > > > { > > int cpu; > > unsigned long mask; > > @@ -578,6 +579,8 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne > > if (!(READ_ONCE(rnp->expmask) & mask)) > > continue; > > ndetected++; > > + if (stalled_mask) > > To his should instead be: > > if (cpumask_available(stalled_mask)) > > Otherwise, this breaks in kernels built with CONFIG_CPUMASK_OFFSTACK=n. Ack. > > > + cpumask_set_cpu(cpu, stalled_mask); > > rdp = per_cpu_ptr(&rcu_data, cpu); > > pr_cont(" %d-%c%c%c%c", cpu, > > "O."[!!cpu_online(cpu)], > > @@ -659,23 +662,28 @@ static void synchronize_rcu_expedited_wait(void) > > > > for (;;) { > > unsigned long j; > > + cpumask_var_t stalled_mask; > > + bool have_mask; > > > > if (synchronize_rcu_expedited_wait_once(jiffies_stall)) > > return; > > if (rcu_stall_is_suppressed()) > > continue; > > > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > > + > > nbcon_cpu_emergency_enter(); > > > > j = jiffies; > > rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start)); > > trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); > > - synchronize_rcu_expedited_stall(jiffies_start, j); > > + synchronize_rcu_expedited_stall(jiffies_start, j, have_mask ? stalled_mask : NULL); > > An alternative to the ?: is to just pass the mask in and have > synchronize_rcu_expedited_stall(), or whatever it calls, invoke > cpumask_available() on it. Ack. > > > jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3; > > > > nbcon_cpu_emergency_exit(); > > > > - panic_on_rcu_stall(); > > + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); > > And the same here, but for panic_on_rcu_stall(). > Ack. > > + free_cpumask_var(stalled_mask); > > And free_cpumask_var() is happy to ignore a NULL pointer, so it is OK to > invoke it on the pointer passed to a failing call to zalloc_cpumask_var(). > One question here: Sashiko flagged that the GFP_ATOMIC allocation risks deadlock if the stalled CPU holds a scheduler or allocator lock. Rather than keeping dynamic allocation with cpumask_available() guards, I was planning to pre-allocate rcu_stall_cpumask and rcu_exp_stall_cpumask in rcu_init() with GFP_KERNEL and clear them at the start of each stall event. That would eliminate have_mask, cpumask_available(), and free_cpumask_var() from the stall paths entirely. Does that direction work for you, or would you prefer the dynamic allocation path with proper cpumask_available() guards? > > } > > } > > > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > > index b67532cb8770..90d6eaf1b841 100644 > > --- a/kernel/rcu/tree_stall.h > > +++ b/kernel/rcu/tree_stall.h > > @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) > > early_initcall(check_cpu_stall_init); > > > > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ > > -static void panic_on_rcu_stall(void) > > +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) > > I suggest "cpumask_var_t" instead of "struct cpumask *", because > cpumask_var_t better handles on-stack vs. offstack cpumasks. Ack, will update to cpumask_var_t in v5. > > > { > > static int cpu_stall; > > > > @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) > > * Attempt to kick out the BPF scheduler if it's installed and defer > > * the panic to give the system a chance to recover. > > */ > > - if (scx_rcu_cpu_stall()) > > + if (scx_rcu_cpu_stall(stalled_mask)) > > return; > > > > if (++cpu_stall < sysctl_max_rcu_stall_to_panic) > > @@ -631,6 +631,8 @@ static void rcu_check_gp_kthread_expired_fqs_timer(void) > > static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > { > > int cpu; > > + cpumask_var_t stalled_mask; > > + bool have_mask; > > Use of cpumask_available() should allow you to drop have_mask. > Ack. > > unsigned long flags; > > unsigned long gpa; > > unsigned long j; > > @@ -645,6 +647,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > if (rcu_stall_is_suppressed()) > > return; > > > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > > + > > nbcon_cpu_emergency_enter(); > > > > /* > > @@ -660,6 +664,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > for_each_leaf_node_possible_cpu(rnp, cpu) > > if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { > > print_cpu_stall_info(cpu); > > + if (have_mask) > > + cpumask_set_cpu(cpu, stalled_mask); > > ndetected++; > > } > > } > > @@ -701,7 +707,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > > nbcon_cpu_emergency_exit(); > > > > - panic_on_rcu_stall(); > > + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); > > + free_cpumask_var(stalled_mask); > > > > rcu_force_quiescent_state(); /* Kick them all. */ > > } > > @@ -754,7 +761,7 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > > nbcon_cpu_emergency_exit(); > > > > - panic_on_rcu_stall(); > > + panic_on_rcu_stall(cpumask_of(smp_processor_id())); > > > > /* > > * Attempt to revive the RCU machinery by forcing a context switch. > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index 22b1356aa4eb..28797467f82a 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > > @@ -4948,6 +4948,8 @@ static const struct attribute_group scx_global_attr_group = { > > > > static void free_pnode(struct scx_sched_pnode *pnode); > > static void free_exit_info(struct scx_exit_info *ei); > > +static const char *scx_exit_reason(enum scx_exit_kind kind); > > +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); > > > > static void scx_sched_free_rcu_work(struct work_struct *work) > > { > > @@ -4962,6 +4964,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) > > timer_shutdown_sync(&sch->bypass_lb_timer); > > free_cpumask_var(sch->bypass_lb_donee_cpumask); > > free_cpumask_var(sch->bypass_lb_resched_cpumask); > > + free_cpumask_var(sch->stall_cpus); > > > > #ifdef CONFIG_EXT_SUB_SCHED > > kfree(sch->cgrp_path); > > @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > > * else already initiated abort. > > */ > > -bool scx_rcu_cpu_stall(void) > > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) > > And cpumask_var_t here as well. > i Ack, will update. > > { > > - return handle_lockup(-1, "RCU CPU stall detected!"); > > + struct scx_sched *sch; > > + struct scx_exit_info *ei; > > + int exit_cpu; > > + > > + guard(rcu)(); > > + > > + sch = rcu_dereference(scx_root); > > + if (unlikely(!sch)) > > + return false; > > + > > + switch (scx_enable_state()) { > > + case SCX_ENABLING: > > + case SCX_ENABLED: > > + break; > > + default: > > + return false; > > + } > > + > > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); > > And here you need a cpumask_available(). Please note that cpumask_empty() > does not play nicely with NULL pointers. > > So maybe something like this: > > exit_cpu = (!cpumask_available(stalled_mask) || cpumask_empty(stalled_mask)) ? -1 : (int)cpumask_first(stalled_mask); > > Or maybe the same thing, but with an "if" statement. > Ack, will fix with if-statement. If pre-allocation is used this becomes unnecessary, but I will handle it regardless. > > + ei = sch->exit_info; > > + > > + guard(preempt)(); > > + > > + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) > > + return false; > > + > > +#ifdef CONFIG_STACKTRACE > > + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); > > +#endif > > + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", > > + cpumask_pr_args(stalled_mask)); > > + ei->kind = SCX_EXIT_ERROR; > > + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); > > + ei->exit_cpu = exit_cpu; > > + cpumask_copy(sch->stall_cpus, stalled_mask); > > + > > + irq_work_queue(&sch->disable_irq_work); > > + return true; > > } > > > > /** > > @@ -5204,7 +5244,8 @@ bool scx_rcu_cpu_stall(void) > > */ > > void scx_softlockup(u32 dur_s) > > { > > - if (!handle_lockup(smp_processor_id(), "soft lockup - CPU %d stuck for %us", > > + if (!handle_lockup(smp_processor_id(), > > + "soft lockup - CPU %d stuck for %us", > > smp_processor_id(), dur_s)) > > return; > > > > @@ -6608,14 +6649,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > > dump_line(&s, "----------"); > > > > /* > > - * Dump the exit CPU first so it isn't lost to dump truncation, then > > - * walk the rest in order, skipping the one already dumped. > > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > > */ > > - if (ei->exit_cpu >= 0) > > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > > - for_each_possible_cpu(cpu) { > > - if (cpu != ei->exit_cpu) > > + if (!cpumask_empty(sch->stall_cpus)) { > > + for_each_cpu(cpu, sch->stall_cpus) > > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > + for_each_possible_cpu(cpu) { > > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > + } > > + } else { > > + if (ei->exit_cpu >= 0) > > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > > + for_each_possible_cpu(cpu) { > > + if (cpu != ei->exit_cpu) > > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > + } > > } > > > > dump_newline(&s); > > @@ -6851,6 +6901,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > > ret = -ENOMEM; > > goto err_free_lb_cpumask; > > } > > + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { > > + ret = -ENOMEM; > > + goto err_free_lb_resched_cpumask; > > This is sufficient to make sure that sch->stall_cpus is never used, > correct? > > Otherwise, there needs to be a bunch of cpumask_available() checks > guarding this uses of ->stall_cpus. Yes. Failure returns -ENOMEM before the scheduler is registered, so sch->stall_cpus is never accessible through scx_root. Thanks! > > Thanx, Paul > > > + } > > /* > > * Copy ops through the right union view. For cid-form the source is > > * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ > > @@ -6920,8 +6974,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > > #ifdef CONFIG_EXT_SUB_SCHED > > err_free_lb_resched: > > RCU_INIT_POINTER(ops->priv, NULL); > > - free_cpumask_var(sch->bypass_lb_resched_cpumask); > > + free_cpumask_var(sch->stall_cpus); > > #endif > > +err_free_lb_resched_cpumask: > > + free_cpumask_var(sch->bypass_lb_resched_cpumask); > > err_free_lb_cpumask: > > free_cpumask_var(sch->bypass_lb_donee_cpumask); > > err_stop_helper: > > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > > index 3ccfea06cba0..f51f03868ff7 100644 > > --- a/kernel/sched/ext_internal.h > > +++ b/kernel/sched/ext_internal.h > > @@ -1178,6 +1178,7 @@ struct scx_sched { > > struct timer_list bypass_lb_timer; > > cpumask_var_t bypass_lb_donee_cpumask; > > cpumask_var_t bypass_lb_resched_cpumask; > > + cpumask_var_t stall_cpus; > > struct rcu_work rcu_work; > > > > /* all ancestors including self */ > > -- > > 2.48.1 > > -- Cheers, Cheng-Yang ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-20 14:56 ` Cheng-Yang Chou @ 2026-05-20 16:35 ` Paul E. McKenney 2026-05-21 7:00 ` Cheng-Yang Chou 0 siblings, 1 reply; 18+ messages in thread From: Paul E. McKenney @ 2026-05-20 16:35 UTC (permalink / raw) To: Cheng-Yang Chou Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai On Wed, May 20, 2026 at 10:56:26PM +0800, Cheng-Yang Chou wrote: > Hi Paul, > > On Tue, May 19, 2026 at 04:48:01PM -0700, Paul E. McKenney wrote: > > On Wed, May 20, 2026 at 01:17:31AM +0800, Cheng-Yang Chou wrote: > > > scx_rcu_cpu_stall() previously recorded the detector CPU rather than the > > > stalled one, and the expedited grace period path had no stalled CPU to > > > report at all. > > > > > > Thread a cpumask through panic_on_rcu_stall() and scx_rcu_cpu_stall() > > > to capture all stalled CPUs. Report cpumask_first() as exit_cpu and the > > > full CPU list in the exit message. Task-only stalls yield exit_cpu = -1. > > > Falls back to cpu_none_mask on allocation failure. > > > > > > Store the stall mask in scx_sched rather than scx_exit_info, keeping the > > > BPF-visible struct unchanged. scx_dump_state() reads sch->stall_cpus > > > directly and dumps all stalled CPUs first to avoid losing them to > > > truncation. > > > > > > Use __GFP_NOWARN for the cpumask allocations in print_other_cpu_stall() > > > and synchronize_rcu_expedited_wait(). Both allocate before calling > > > nbcon_cpu_emergency_enter(). A warn_alloc() failure printout would use > > > the normal printk path and risk deadlock if a stalled CPU holds a > > > console lock. > > > > > > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> > > > > More comments below, mostly about care and feeding of cpumasks. > > Thanks for the review, and only one question below. > > > > > > --- > > > include/linux/sched/ext.h | 4 +- > > > kernel/rcu/tree_exp.h | 14 +++++-- > > > kernel/rcu/tree_stall.h | 15 ++++++-- > > > kernel/sched/ext.c | 76 ++++++++++++++++++++++++++++++++----- > > > kernel/sched/ext_internal.h | 1 + > > > 5 files changed, 91 insertions(+), 19 deletions(-) > > > > > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > > > index 20b2343aa344..75cb8b119fb7 100644 > > > --- a/include/linux/sched/ext.h > > > +++ b/include/linux/sched/ext.h > > > @@ -263,7 +263,7 @@ void sched_ext_dead(struct task_struct *p); > > > void print_scx_info(const char *log_lvl, struct task_struct *p); > > > void scx_softlockup(u32 dur_s); > > > bool scx_hardlockup(int cpu); > > > -bool scx_rcu_cpu_stall(void); > > > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask); > > > > See below for cpumask_var_t discussion. > > > > > #else /* !CONFIG_SCHED_CLASS_EXT */ > > > > > > @@ -271,7 +271,7 @@ static inline void sched_ext_dead(struct task_struct *p) {} > > > static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} > > > static inline void scx_softlockup(u32 dur_s) {} > > > static inline bool scx_hardlockup(int cpu) { return false; } > > > -static inline bool scx_rcu_cpu_stall(void) { return false; } > > > +static inline bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) { return false; } > > > > Ditto. > > > > > #endif /* CONFIG_SCHED_CLASS_EXT */ > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > > index 82cada459e5d..b8f4ca984c1c 100644 > > > --- a/kernel/rcu/tree_exp.h > > > +++ b/kernel/rcu/tree_exp.h > > > @@ -555,7 +555,8 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) > > > /* > > > * Print out an expedited RCU CPU stall warning message. > > > */ > > > -static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j) > > > +static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j, > > > + struct cpumask *stalled_mask) > > > > See below for cpumask_var_t discussion. > > > > > { > > > int cpu; > > > unsigned long mask; > > > @@ -578,6 +579,8 @@ static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigne > > > if (!(READ_ONCE(rnp->expmask) & mask)) > > > continue; > > > ndetected++; > > > + if (stalled_mask) > > > > To his should instead be: > > > > if (cpumask_available(stalled_mask)) > > > > Otherwise, this breaks in kernels built with CONFIG_CPUMASK_OFFSTACK=n. > > Ack. > > > > > > + cpumask_set_cpu(cpu, stalled_mask); > > > rdp = per_cpu_ptr(&rcu_data, cpu); > > > pr_cont(" %d-%c%c%c%c", cpu, > > > "O."[!!cpu_online(cpu)], > > > @@ -659,23 +662,28 @@ static void synchronize_rcu_expedited_wait(void) > > > > > > for (;;) { > > > unsigned long j; > > > + cpumask_var_t stalled_mask; > > > + bool have_mask; > > > > > > if (synchronize_rcu_expedited_wait_once(jiffies_stall)) > > > return; > > > if (rcu_stall_is_suppressed()) > > > continue; > > > > > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > > > + > > > nbcon_cpu_emergency_enter(); > > > > > > j = jiffies; > > > rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start)); > > > trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); > > > - synchronize_rcu_expedited_stall(jiffies_start, j); > > > + synchronize_rcu_expedited_stall(jiffies_start, j, have_mask ? stalled_mask : NULL); > > > > An alternative to the ?: is to just pass the mask in and have > > synchronize_rcu_expedited_stall(), or whatever it calls, invoke > > cpumask_available() on it. > > Ack. > > > > > > jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3; > > > > > > nbcon_cpu_emergency_exit(); > > > > > > - panic_on_rcu_stall(); > > > + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); > > > > And the same here, but for panic_on_rcu_stall(). > > > > Ack. > > > > + free_cpumask_var(stalled_mask); > > > > And free_cpumask_var() is happy to ignore a NULL pointer, so it is OK to > > invoke it on the pointer passed to a failing call to zalloc_cpumask_var(). > > One question here: Sashiko flagged that the GFP_ATOMIC allocation > risks deadlock if the stalled CPU holds a scheduler or allocator lock. That sounds like a bug in either the combination of GFP_ flags used or in the memory allocator itself. It should be possible to allocate so that you either get memory or get a NULL pointer, without risk of deadlock. What am I missing here? > Rather than keeping dynamic allocation with cpumask_available() guards, > I was planning to pre-allocate rcu_stall_cpumask and rcu_exp_stall_cpumask > in rcu_init() with GFP_KERNEL and clear them at the start of each stall > event. That would eliminate have_mask, cpumask_available(), and > free_cpumask_var() from the stall paths entirely. Please note that rcu_init() runs when there is only one CPU online that is running with interrupts disabled. So the fact that GFP_KERNEL works in that environment is because the memory allocator is interpreting it specially (see GFP_BOOT_MASK). Does static allocation work? > Does that direction work for you, or would you prefer the dynamic > allocation path with proper cpumask_available() guards? On the one hand, memory is getting expensive. On the other hand, cpumasks are not *that* big. I prefer dynamic allocation, but am not inalterably opposed to static (or, *almost* equivalently, boot-time) allocation. The "*almost*" refers to the fact that boot-time allocation can fail. Usually, the system fails to boot in that situation, but some sort of check would be good. Thanx, Paul > > > } > > > } > > > > > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > > > index b67532cb8770..90d6eaf1b841 100644 > > > --- a/kernel/rcu/tree_stall.h > > > +++ b/kernel/rcu/tree_stall.h > > > @@ -159,7 +159,7 @@ static int __init check_cpu_stall_init(void) > > > early_initcall(check_cpu_stall_init); > > > > > > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ > > > -static void panic_on_rcu_stall(void) > > > +static void panic_on_rcu_stall(const struct cpumask *stalled_mask) > > > > I suggest "cpumask_var_t" instead of "struct cpumask *", because > > cpumask_var_t better handles on-stack vs. offstack cpumasks. > > Ack, will update to cpumask_var_t in v5. > > > > > > { > > > static int cpu_stall; > > > > > > @@ -167,7 +167,7 @@ static void panic_on_rcu_stall(void) > > > * Attempt to kick out the BPF scheduler if it's installed and defer > > > * the panic to give the system a chance to recover. > > > */ > > > - if (scx_rcu_cpu_stall()) > > > + if (scx_rcu_cpu_stall(stalled_mask)) > > > return; > > > > > > if (++cpu_stall < sysctl_max_rcu_stall_to_panic) > > > @@ -631,6 +631,8 @@ static void rcu_check_gp_kthread_expired_fqs_timer(void) > > > static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > { > > > int cpu; > > > + cpumask_var_t stalled_mask; > > > + bool have_mask; > > > > Use of cpumask_available() should allow you to drop have_mask. > > > > Ack. > > > > unsigned long flags; > > > unsigned long gpa; > > > unsigned long j; > > > @@ -645,6 +647,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > if (rcu_stall_is_suppressed()) > > > return; > > > > > > + have_mask = zalloc_cpumask_var(&stalled_mask, GFP_ATOMIC | __GFP_NOWARN); > > > + > > > nbcon_cpu_emergency_enter(); > > > > > > /* > > > @@ -660,6 +664,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > for_each_leaf_node_possible_cpu(rnp, cpu) > > > if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { > > > print_cpu_stall_info(cpu); > > > + if (have_mask) > > > + cpumask_set_cpu(cpu, stalled_mask); > > > ndetected++; > > > } > > > } > > > @@ -701,7 +707,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > > > > nbcon_cpu_emergency_exit(); > > > > > > - panic_on_rcu_stall(); > > > + panic_on_rcu_stall(have_mask ? stalled_mask : cpu_none_mask); > > > + free_cpumask_var(stalled_mask); > > > > > > rcu_force_quiescent_state(); /* Kick them all. */ > > > } > > > @@ -754,7 +761,7 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > > > > > > nbcon_cpu_emergency_exit(); > > > > > > - panic_on_rcu_stall(); > > > + panic_on_rcu_stall(cpumask_of(smp_processor_id())); > > > > > > /* > > > * Attempt to revive the RCU machinery by forcing a context switch. > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > > index 22b1356aa4eb..28797467f82a 100644 > > > --- a/kernel/sched/ext.c > > > +++ b/kernel/sched/ext.c > > > @@ -4948,6 +4948,8 @@ static const struct attribute_group scx_global_attr_group = { > > > > > > static void free_pnode(struct scx_sched_pnode *pnode); > > > static void free_exit_info(struct scx_exit_info *ei); > > > +static const char *scx_exit_reason(enum scx_exit_kind kind); > > > +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind); > > > > > > static void scx_sched_free_rcu_work(struct work_struct *work) > > > { > > > @@ -4962,6 +4964,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work) > > > timer_shutdown_sync(&sch->bypass_lb_timer); > > > free_cpumask_var(sch->bypass_lb_donee_cpumask); > > > free_cpumask_var(sch->bypass_lb_resched_cpumask); > > > + free_cpumask_var(sch->stall_cpus); > > > > > > #ifdef CONFIG_EXT_SUB_SCHED > > > kfree(sch->cgrp_path); > > > @@ -5187,9 +5190,46 @@ static __printf(2, 3) bool handle_lockup(int exit_cpu, const char *fmt, ...) > > > * resolve the reported RCU stall. %false if sched_ext is not enabled or someone > > > * else already initiated abort. > > > */ > > > -bool scx_rcu_cpu_stall(void) > > > +bool scx_rcu_cpu_stall(const struct cpumask *stalled_mask) > > > > And cpumask_var_t here as well. > > i > > Ack, will update. > > > > { > > > - return handle_lockup(-1, "RCU CPU stall detected!"); > > > + struct scx_sched *sch; > > > + struct scx_exit_info *ei; > > > + int exit_cpu; > > > + > > > + guard(rcu)(); > > > + > > > + sch = rcu_dereference(scx_root); > > > + if (unlikely(!sch)) > > > + return false; > > > + > > > + switch (scx_enable_state()) { > > > + case SCX_ENABLING: > > > + case SCX_ENABLED: > > > + break; > > > + default: > > > + return false; > > > + } > > > + > > > + exit_cpu = cpumask_empty(stalled_mask) ? -1 : (int)cpumask_first(stalled_mask); > > > > And here you need a cpumask_available(). Please note that cpumask_empty() > > does not play nicely with NULL pointers. > > > > So maybe something like this: > > > > exit_cpu = (!cpumask_available(stalled_mask) || cpumask_empty(stalled_mask)) ? -1 : (int)cpumask_first(stalled_mask); > > > > Or maybe the same thing, but with an "if" statement. > > > > Ack, will fix with if-statement. If pre-allocation is used this > becomes unnecessary, but I will handle it regardless. > > > > + ei = sch->exit_info; > > > + > > > + guard(preempt)(); > > > + > > > + if (!scx_claim_exit(sch, SCX_EXIT_ERROR)) > > > + return false; > > > + > > > +#ifdef CONFIG_STACKTRACE > > > + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); > > > +#endif > > > + scnprintf(ei->msg, SCX_EXIT_MSG_LEN, "RCU CPU stall on CPUs (%*pbl)", > > > + cpumask_pr_args(stalled_mask)); > > > + ei->kind = SCX_EXIT_ERROR; > > > + ei->reason = scx_exit_reason(SCX_EXIT_ERROR); > > > + ei->exit_cpu = exit_cpu; > > > + cpumask_copy(sch->stall_cpus, stalled_mask); > > > + > > > + irq_work_queue(&sch->disable_irq_work); > > > + return true; > > > } > > > > > > /** > > > @@ -5204,7 +5244,8 @@ bool scx_rcu_cpu_stall(void) > > > */ > > > void scx_softlockup(u32 dur_s) > > > { > > > - if (!handle_lockup(smp_processor_id(), "soft lockup - CPU %d stuck for %us", > > > + if (!handle_lockup(smp_processor_id(), > > > + "soft lockup - CPU %d stuck for %us", > > > smp_processor_id(), dur_s)) > > > return; > > > > > > @@ -6608,14 +6649,23 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > > > dump_line(&s, "----------"); > > > > > > /* > > > - * Dump the exit CPU first so it isn't lost to dump truncation, then > > > - * walk the rest in order, skipping the one already dumped. > > > + * Dump stalled CPUs first so they aren't lost to dump truncation, then > > > + * walk the rest in order. Fall back to exit_cpu if no stall mask set. > > > */ > > > - if (ei->exit_cpu >= 0) > > > - scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > > > - for_each_possible_cpu(cpu) { > > > - if (cpu != ei->exit_cpu) > > > + if (!cpumask_empty(sch->stall_cpus)) { > > > + for_each_cpu(cpu, sch->stall_cpus) > > > scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > > + for_each_possible_cpu(cpu) { > > > + if (!cpumask_test_cpu(cpu, sch->stall_cpus)) > > > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > > + } > > > + } else { > > > + if (ei->exit_cpu >= 0) > > > + scx_dump_cpu(sch, &s, &dctx, ei->exit_cpu, dump_all_tasks); > > > + for_each_possible_cpu(cpu) { > > > + if (cpu != ei->exit_cpu) > > > + scx_dump_cpu(sch, &s, &dctx, cpu, dump_all_tasks); > > > + } > > > } > > > > > > dump_newline(&s); > > > @@ -6851,6 +6901,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > > > ret = -ENOMEM; > > > goto err_free_lb_cpumask; > > > } > > > + if (!zalloc_cpumask_var(&sch->stall_cpus, GFP_KERNEL)) { > > > + ret = -ENOMEM; > > > + goto err_free_lb_resched_cpumask; > > > > This is sufficient to make sure that sch->stall_cpus is never used, > > correct? > > > > Otherwise, there needs to be a bunch of cpumask_available() checks > > guarding this uses of ->stall_cpus. > > Yes. Failure returns -ENOMEM before the scheduler is registered, so > sch->stall_cpus is never accessible through scx_root. > > Thanks! > > > > > Thanx, Paul > > > > > + } > > > /* > > > * Copy ops through the right union view. For cid-form the source is > > > * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/ > > > @@ -6920,8 +6974,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd, > > > #ifdef CONFIG_EXT_SUB_SCHED > > > err_free_lb_resched: > > > RCU_INIT_POINTER(ops->priv, NULL); > > > - free_cpumask_var(sch->bypass_lb_resched_cpumask); > > > + free_cpumask_var(sch->stall_cpus); > > > #endif > > > +err_free_lb_resched_cpumask: > > > + free_cpumask_var(sch->bypass_lb_resched_cpumask); > > > err_free_lb_cpumask: > > > free_cpumask_var(sch->bypass_lb_donee_cpumask); > > > err_stop_helper: > > > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > > > index 3ccfea06cba0..f51f03868ff7 100644 > > > --- a/kernel/sched/ext_internal.h > > > +++ b/kernel/sched/ext_internal.h > > > @@ -1178,6 +1178,7 @@ struct scx_sched { > > > struct timer_list bypass_lb_timer; > > > cpumask_var_t bypass_lb_donee_cpumask; > > > cpumask_var_t bypass_lb_resched_cpumask; > > > + cpumask_var_t stall_cpus; > > > struct rcu_work rcu_work; > > > > > > /* all ancestors including self */ > > > -- > > > 2.48.1 > > > > > -- > Cheers, > Cheng-Yang > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs 2026-05-20 16:35 ` Paul E. McKenney @ 2026-05-21 7:00 ` Cheng-Yang Chou 0 siblings, 0 replies; 18+ messages in thread From: Cheng-Yang Chou @ 2026-05-21 7:00 UTC (permalink / raw) To: Paul E. McKenney Cc: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, rcu, Ching-Chun Huang, Chia-Ping Tsai Hi Paul, On Wed, May 20, 2026 at 09:35:35AM -0700, Paul E. McKenney wrote: [...] > > > > + free_cpumask_var(stalled_mask); > > > > > > And free_cpumask_var() is happy to ignore a NULL pointer, so it is OK to > > > invoke it on the pointer passed to a failing call to zalloc_cpumask_var(). > > > > One question here: Sashiko flagged that the GFP_ATOMIC allocation > > risks deadlock if the stalled CPU holds a scheduler or allocator lock. > > That sounds like a bug in either the combination of GFP_ flags used or in > the memory allocator itself. It should be possible to allocate so that > you either get memory or get a NULL pointer, without risk of deadlock. > > What am I missing here? > Ah, you are right here, thanks for the correction. > > Rather than keeping dynamic allocation with cpumask_available() guards, > > I was planning to pre-allocate rcu_stall_cpumask and rcu_exp_stall_cpumask > > in rcu_init() with GFP_KERNEL and clear them at the start of each stall > > event. That would eliminate have_mask, cpumask_available(), and > > free_cpumask_var() from the stall paths entirely. > > Please note that rcu_init() runs when there is only one CPU online that > is running with interrupts disabled. So the fact that GFP_KERNEL works > in that environment is because the memory allocator is interpreting it > specially (see GFP_BOOT_MASK). > > Does static allocation work? > Yes, static allocation works. Declaring: static struct cpumask rcu_stall_cpumask; static struct cpumask rcu_exp_stall_cpumask; and calling cpumask_clear() at the start of each stall event removes all allocation, NULL checks, cpumask_available() guards, and free calls from the stall paths. No boot-time failure path to worry about either. > > Does that direction work for you, or would you prefer the dynamic > > allocation path with proper cpumask_available() guards? > > On the one hand, memory is getting expensive. On the other hand, cpumasks > are not *that* big. I prefer dynamic allocation, but am not inalterably > opposed to static (or, *almost* equivalently, boot-time) allocation. > > The "*almost*" refers to the fact that boot-time allocation can fail. > Usually, the system fails to boot in that situation, but some sort of > check would be good. Given that static allocation avoids the failure check concern and simplifies the stall paths, I will use static allocation for v5. Thanks! -- Cheers, Cheng-Yang ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2026-06-09 8:07 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-31 15:25 [PATCH v6 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy Cheng-Yang Chou 2026-05-31 15:25 ` [PATCH 1/2] sched_ext: Fix exit_cpu accuracy for lockup paths Cheng-Yang Chou 2026-06-09 5:10 ` Andrea Righi 2026-05-31 15:25 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou 2026-06-04 17:57 ` Paul E. McKenney 2026-06-05 14:33 ` Cheng-Yang Chou 2026-06-09 8:06 ` Andrea Righi -- strict thread matches above, loose matches on Subject: below -- 2026-05-21 16:16 [PATCH v5 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy Cheng-Yang Chou 2026-05-21 16:16 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou 2026-05-21 17:05 ` sashiko-bot 2026-05-27 23:19 ` Paul E. McKenney 2026-05-29 15:51 ` Cheng-Yang Chou 2026-05-19 17:17 [PATCH v4 sched_ext/for-7.2 0/2] sched_ext: Follow-up fixes for exit_cpu accuracy Cheng-Yang Chou 2026-05-19 17:17 ` [PATCH 2/2] sched_ext, rcu: Upgrade RCU stall paths to report cpumask of stalled CPUs Cheng-Yang Chou 2026-05-19 17:52 ` sashiko-bot 2026-05-19 18:22 ` Cheng-Yang Chou 2026-05-19 23:48 ` Paul E. McKenney 2026-05-20 14:56 ` Cheng-Yang Chou 2026-05-20 16:35 ` Paul E. McKenney 2026-05-21 7:00 ` Cheng-Yang Chou
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.