[RFT] sched_ext: Skip stack trace capture in NMI context

public inbox for linux-kernel@vger.kernel.org

* [RFT] sched_ext: Skip stack trace capture in NMI context
@ 2025-12-23  0:50 Joel Fernandes
  2025-12-23  2:44 ` Tejun Heo
  2025-12-23  6:39 ` Andrea Righi
  0 siblings, 2 replies; 8+ messages in thread
From: Joel Fernandes @ 2025-12-23  0:50 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: Joel Fernandes, sched-ext

stack_trace_save() is not guaranteed to be NMI-safe on all
architectures.

The hardlockup detector calls into sched_ext via the following call
chain when an NMI occurs:

  watchdog_overflow_callback()
    watchdog_hardlockup_check()
      scx_hardlockup()
        stack_trace_save()

Skip stack trace capture when in_nmi() returns true to prevent
potential deadlocks.

Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/ext.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 05f5a49e9649..a96255ca3a08 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4678,7 +4678,8 @@ static bool scx_vexit(struct scx_sched *sch,
 
 	ei->exit_code = exit_code;
 #ifdef CONFIG_STACKTRACE
-	if (kind >= SCX_EXIT_ERROR)
+	/* Skip stack trace capture in NMI context as its unsafe. */
+	if (kind >= SCX_EXIT_ERROR && !in_nmi())
 		ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);
 #endif
 	vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
-- 
2.34.1
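
The guard added by the diff above can be modeled outside the kernel. The following is a minimal userspace sketch, not kernel code: the `model_*` names, the stubbed trace function, and the numeric exit-kind values are assumptions made for illustration. Only the shape of the condition (`kind >= SCX_EXIT_ERROR && !in_nmi()`) is taken from the patch.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's exit kinds; values are assumed. */
enum model_exit_kind {
	MODEL_EXIT_NONE = 0,
	MODEL_EXIT_DONE = 1,
	MODEL_EXIT_ERROR = 1024,
};

#define MODEL_BT_LEN 64

struct model_exit_info {
	unsigned long bt[MODEL_BT_LEN];
	unsigned int bt_len;
};

/* Hypothetical stub for stack_trace_save(): pretends to record three frames. */
static unsigned int model_stack_trace_save(unsigned long *store, unsigned int size)
{
	unsigned int n = size < 3 ? size : 3;

	for (unsigned int i = 0; i < n; i++)
		store[i] = 0x1000UL + i;
	return n;
}

/* Mirrors the patched condition: capture only for error exits outside NMI. */
static void model_record_bt(struct model_exit_info *ei, int kind, bool in_nmi)
{
	ei->bt_len = 0;
	if (kind >= MODEL_EXIT_ERROR && !in_nmi)
		ei->bt_len = model_stack_trace_save(ei->bt, MODEL_BT_LEN);
}
```

In this model, calling `model_record_bt(&ei, MODEL_EXIT_ERROR, true)` leaves `ei.bt_len` at 0: the exit is still recorded, but without a backtrace.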



* Re: [RFT] sched_ext: Skip stack trace capture in NMI context
  2025-12-23  0:50 [RFT] sched_ext: Skip stack trace capture in NMI context Joel Fernandes
@ 2025-12-23  2:44 ` Tejun Heo
  2025-12-23  4:34   ` Joel Fernandes
  2025-12-23  6:39 ` Andrea Righi
  1 sibling, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-12-23  2:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, David Vernet, Andrea Righi, Changwoo Min,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, sched-ext

Hello,

On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
> stack_trace_save() is not guaranteed to be NMI-safe on all
> architectures.
> 
> The hardlockup detector calls into sched_ext via the following call
> chain when an NMI occurs:
> 
>   watchdog_overflow_callback()
>     watchdog_hardlockup_check()
>       scx_hardlockup()
>         stack_trace_save()
> 
> Skip stack trace capture when in_nmi() returns true to prevent
> potential deadlocks.
> 
> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

This does work on x86 (right?) and is useful in understanding what the
underlying problem is. It'd be great if there's a config flag we can test
but if not can we specifically exclude archs which are known to not work?

Thanks.

-- 
tejun
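
One way to picture the per-architecture exclusion suggested above is a compile-time switch. This is only a sketch under stated assumptions: `MODEL_ARCH_NMI_BT_UNSAFE` and `model_may_capture_bt()` are hypothetical names, and no existing kernel config symbol is implied.

```c
#include <stdbool.h>

/*
 * Hypothetical opt-out: an architecture known not to work would define
 * MODEL_ARCH_NMI_BT_UNSAFE. Everything else keeps NMI backtraces.
 */
#ifdef MODEL_ARCH_NMI_BT_UNSAFE
#define MODEL_NMI_BT_OK false
#else
#define MODEL_NMI_BT_OK true
#endif

/* Decide whether a backtrace should be captured for this exit. */
static bool model_may_capture_bt(bool error_exit, bool in_nmi)
{
	if (!error_exit)
		return false;
	if (in_nmi && !MODEL_NMI_BT_OK)
		return false;
	return true;
}
```

Compared with an unconditional `!in_nmi()` check, this shape keeps the backtrace on architectures where NMI unwinding works and drops it only where it is known to be a problem.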


* Re: [RFT] sched_ext: Skip stack trace capture in NMI context
  2025-12-23  2:44 ` Tejun Heo
@ 2025-12-23  4:34   ` Joel Fernandes
  2025-12-23 20:31     ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2025-12-23  4:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel@vger.kernel.org, David Vernet, Andrea Righi,
	Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, sched-ext@lists.linux.dev

> On Dec 22, 2025, at 9:44 PM, Tejun Heo <tj@kernel.org> wrote:
> 
> Hello,
> 
>> On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
>> stack_trace_save() is not guaranteed to be NMI-safe on all
>> architectures.
>> 
>> The hardlockup detector calls into sched_ext via the following call
>> chain when an NMI occurs:
>> 
>>  watchdog_overflow_callback()
>>    watchdog_hardlockup_check()
>>      scx_hardlockup()
>>        stack_trace_save()
>> 
>> Skip stack trace capture when in_nmi() returns true to prevent
>> potential deadlocks.
>> 
>> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> 
> This does work on x86 (right?) and is useful in understanding what the
> underlying problem is. It'd be great if there's a config flag we can test
> but if not can we specifically exclude archs which are known to not work?

You are right that we will miss out on architectures where this is safe. We should make it more specific. I am wondering if Steven Rostedt has any thoughts here since he is actively working on stack tracing/unwinding and has made similar commits in the past where he restricted stack tracing in an NMI context.

Per my understanding, stack trace unwinding is not safe/valid to do on architectures where the NMI context does not have its own stack. But I could stand corrected, hence I marked this as an RFT.  It is safe to do on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.

If we feel that this is not an issue, then that is fine with me (and sorry for the noise), but I just wanted to raise it anyway just in case. Sooner or later someone running scx on an odd architecture might complain.

Thanks!

 - Joel

> 
> Thanks.
> 
> --
> tejun


* Re: [RFT] sched_ext: Skip stack trace capture in NMI context
  2025-12-23  0:50 [RFT] sched_ext: Skip stack trace capture in NMI context Joel Fernandes
  2025-12-23  2:44 ` Tejun Heo
@ 2025-12-23  6:39 ` Andrea Righi
  1 sibling, 0 replies; 8+ messages in thread
From: Andrea Righi @ 2025-12-23  6:39 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	sched-ext

On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
> stack_trace_save() is not guaranteed to be NMI-safe on all
> architectures.
> 
> The hardlockup detector calls into sched_ext via the following call
> chain when an NMI occurs:
> 
>   watchdog_overflow_callback()
>     watchdog_hardlockup_check()
>       scx_hardlockup()
>         stack_trace_save()
> 
> Skip stack trace capture when in_nmi() returns true to prevent
> potential deadlocks.
> 
> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/ext.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 05f5a49e9649..a96255ca3a08 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -4678,7 +4678,8 @@ static bool scx_vexit(struct scx_sched *sch,
>  
>  	ei->exit_code = exit_code;
>  #ifdef CONFIG_STACKTRACE
> -	if (kind >= SCX_EXIT_ERROR)
> +	/* Skip stack trace capture in NMI context as its unsafe. */

nit: s/its/it's/

> +	if (kind >= SCX_EXIT_ERROR && !in_nmi())
>  		ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);

If stack_trace_save() isn't NMI-safe on certain architectures, shouldn't we
fix this inside stack_trace_save()?

There are probably other places where we call stack_trace_save() without
checking in_nmi(). Making stack_trace_save() handle the NMI case would
solve all of them.

>  #endif
>  	vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
> -- 
> 2.34.1
> 

Thanks,
-Andrea
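
The alternative raised above, handling NMI inside the trace function so that callers need no check, might look roughly like this. It is a userspace sketch with hypothetical `model_*` names and a simulated context flag; it does not reflect how `stack_trace_save()` is actually implemented.

```c
#include <stdbool.h>

/* Simulated context flag; in the kernel this would be the in_nmi() check. */
static bool model_in_nmi;

/* Hypothetical stand-in for the architecture unwinder: records two frames. */
static unsigned int model_arch_unwind(unsigned long *store, unsigned int size)
{
	unsigned int n = size < 2 ? size : 2;

	for (unsigned int i = 0; i < n; i++)
		store[i] = 0x2000UL + i;
	return n;
}

/* Central check: every caller gets an empty trace in simulated NMI context. */
static unsigned int model_stack_trace_save(unsigned long *store, unsigned int size)
{
	if (model_in_nmi)
		return 0;
	return model_arch_unwind(store, size);
}
```

The design trade-off is that a central check covers every caller at once, but it would also suppress traces on architectures where unwinding from NMI works.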


* Re: [RFT] sched_ext: Skip stack trace capture in NMI context
  2025-12-23  4:34   ` Joel Fernandes
@ 2025-12-23 20:31     ` Steven Rostedt
  2025-12-23 23:58       ` Joel Fernandes
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2025-12-23 20:31 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Tejun Heo, linux-kernel@vger.kernel.org, David Vernet,
	Andrea Righi, Changwoo Min, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, sched-ext@lists.linux.dev

On Tue, 23 Dec 2025 04:34:00 +0000
Joel Fernandes <joelagnelf@nvidia.com> wrote:

> > This does work on x86 (right?) and is useful in understanding what the
> > underlying problem is. It'd be great if there's a config flag we can test
> > but if not can we specifically exclude archs which are known to not work?  
> 
> You are right that we will miss out on architectures where this is safe.
> We should make it more specific. I am wondering if Steven Rostedt has any
> thoughts here since he is actively working on stack tracing/unwinding and
> has made similar commits in the past where he restricted stack tracing in
> an NMI context.

[ Fixes line wrap, ug it's hard to read emails that go across 300 characters! ]

Well, we do kernel stack tracing in NMI context all the time with no issue
(but I mostly work on x86).

> 
> Per my understanding, stack trace unwinding is not safe/valid to do on
> architectures where the NMI context does not have its own stack. But I

Hmm, no, I think it's fine to do it on archs where NMI doesn't have its own
stack. It works on 32bit x86, where the NMI shares the kernel stack.

Which architecture had an issue with a stack trace?

-- Steve


> could stand corrected, hence I marked this as an RFT.  It is safe to do
> on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.
> 
> If we feel that this is not an issue, then that is fine with me (and
> sorry for the noise), but I just wanted to raise it anyway just in case.
> Sooner or later someone running scx on an odd architecture might
> complain.




* Re: [RFT] sched_ext: Skip stack trace capture in NMI context
  2025-12-23 20:31     ` Steven Rostedt
@ 2025-12-23 23:58       ` Joel Fernandes
  2025-12-24 14:18         ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2025-12-23 23:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, linux-kernel@vger.kernel.org, David Vernet,
	Andrea Righi, Changwoo Min, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, sched-ext@lists.linux.dev

On Tue, Dec 23, 2025 at 03:31:36PM -0500, Steven Rostedt wrote:
> On Tue, 23 Dec 2025 04:34:00 +0000
> Joel Fernandes <joelagnelf@nvidia.com> wrote:
> 
> > > This does work on x86 (right?) and is useful in understanding what the
> > > underlying problem is. It'd be great if there's a config flag we can test
> > > but if not can we specifically exclude archs which are known to not work?  
> > 
> > You are right that we will miss out on architectures where this is safe.
> > We should make it more specific. I am wondering if Steven Rostedt has any
> > thoughts here since he is actively working on stack tracing/unwinding and
> > has made similar commits in the past where he restricted stack tracing in
> > an NMI context.
> 
> [ Fixes line wrap, ug it's hard to read emails that go across 300 characters! ]

Sorry about that. Thank you.

> Well, we do kernel stack tracing in NMI context all the time with no issue
> (but I mostly work on x86).
> 
> > 
> > Per my understanding, stack trace unwinding is not safe/valid to do on
> > architectures where the NMI context does not have its own stack. But I
> 
> Hmm, no, I think it's fine to do it on archs where NMI doesn't have its own
> stack. It works on 32bit x86, where the NMI shares the kernel stack.
> 
> Which architecture had an issue with a stack trace?

On 32-bit, what happens if an NMI hits during stack frame setup? Can the
unwinder misbehave if the base pointer has not yet been set up and the NMI
starts using the same stack?

Not sure.

Some documentation suggests IST is required for reliable NMI stack tracing
[1] [2] which 32-bit does not have.
"If an interrupt or other exception is taken while the stack or other unwind
state is in an inconsistent state, it may not be possible to reliably unwind,
and it may not be possible to identify whether such unwinding will be
reliable. See below for examples."

Probably the issue happens to be more of printing garbage than crashing the
kernel, but I am not convinced it is stable. Hmm.

[1] https://www.kernel.org/doc/html/v6.16/arch/x86/kernel-stacks.html
[2] https://docs.kernel.org/livepatch/reliable-stacktrace.html

thanks,

 - Joel


> 
> -- Steve
> 
> 
> > could stand corrected, hence I marked this as an RFT.  It is safe to do
> > on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.
> > 
> > If we feel that this is not an issue, then that is fine with me (and
> > sorry for the noise), but I just wanted to raise it anyway just in case.
> > Sooner or later someone running scx on an odd architecture might
> > complain.
> 
> 


* Re: [RFT] sched_ext: Skip stack trace capture in NMI context
  2025-12-23 23:58       ` Joel Fernandes
@ 2025-12-24 14:18         ` Steven Rostedt
  2025-12-24 17:42           ` Joel Fernandes
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2025-12-24 14:18 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Tejun Heo, linux-kernel@vger.kernel.org, David Vernet,
	Andrea Righi, Changwoo Min, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, sched-ext@lists.linux.dev

On Tue, 23 Dec 2025 18:58:33 -0500
Joel Fernandes <joelagnelf@nvidia.com> wrote:

> Some documentation suggests IST is required for reliable NMI stack tracing
> [1] [2] which 32-bit does not have.
> "If an interrupt or other exception is taken while the stack or other unwind
> state is in an inconsistent state, it may not be possible to reliably unwind,
> and it may not be possible to identify whether such unwinding will be
> reliable. See below for examples."
> 
> Probably the issue happens to be more of printing garbage than crashing the
> kernel, but I am not convinced it is stable. Hmm.

Correct. It's about reliable stack traces, as live kernel patching requires
that the stack it looks at is reliable before it can modify the code. If the
trace is not reliable, it will just stop at the interrupt handler and you
don't get to see the rest (or you'll see a bunch of functions with "?" in
front of them).

-- Steve
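
The "stops early instead of crashing" behaviour described above can be illustrated with a toy frame-chain walker. This is a simplified userspace model, not the kernel's unwinder: the `model_frame` layout, the index-based links, and the consistency rule (a caller's frame must sit at a higher index) are all assumptions chosen to keep the sketch short.

```c
#include <stddef.h>

struct model_frame {
	int next;		/* index of the caller's frame, -1 ends the chain */
	unsigned long ret;	/* return address recorded in this frame */
};

/*
 * Walk the chain from 'start' and store return addresses in 'out'.
 * A link is trusted only if it points to a higher index (an older frame).
 * Anything else is treated as inconsistent state: the walk stops and the
 * entries collected so far are kept.
 */
static size_t model_unwind(const struct model_frame *frames, size_t nframes,
			   int start, unsigned long *out, size_t max)
{
	size_t n = 0;
	int cur = start;

	while (cur >= 0 && (size_t)cur < nframes && n < max) {
		int next = frames[cur].next;

		out[n++] = frames[cur].ret;
		if (next != -1 && next <= cur)
			break;	/* inconsistent link: truncate the trace */
		cur = next;
	}
	return n;
}
```

With a consistent chain of three frames the walker returns three entries. If the second frame's link points backwards, it returns two and stops, which corresponds to a truncated trace rather than a fault.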


* Re: [RFT] sched_ext: Skip stack trace capture in NMI context
  2025-12-24 14:18         ` Steven Rostedt
@ 2025-12-24 17:42           ` Joel Fernandes
  0 siblings, 0 replies; 8+ messages in thread
From: Joel Fernandes @ 2025-12-24 17:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, linux-kernel@vger.kernel.org, David Vernet,
	Andrea Righi, Changwoo Min, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, sched-ext@lists.linux.dev



> On Dec 24, 2025, at 9:17 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue, 23 Dec 2025 18:58:33 -0500
> Joel Fernandes <joelagnelf@nvidia.com> wrote:
> 
>> Some documentation suggests IST is required for reliable NMI stack tracing
>> [1] [2] which 32-bit does not have.
>> "If an interrupt or other exception is taken while the stack or other unwind
>> state is in an inconsistent state, it may not be possible to reliably unwind,
>> and it may not be possible to identify whether such unwinding will be
>> reliable. See below for examples."
>> 
>> Probably the issue happens to be more of printing garbage than crashing the
>> kernel, but I am not convinced it is stable. Hmm.
> 
> Correct. It's about reliable stack traces, as live kernel patching requires
> that the stack it looks at is reliable before it can modify the code. If the
> trace is not reliable, it will just stop at the interrupt handler and you
> don't get to see the rest (or you'll see a bunch of functions with "?" in
> front of them).

Ah, thanks Steve for clarifying!

 - Joel


> 
> -- Steve

