* [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs
@ 2025-06-20 12:51 Gabriele Monaco
2025-06-20 21:07 ` Thomas Gleixner
2025-06-21 8:57 ` Peter Zijlstra
0 siblings, 2 replies; 3+ messages in thread
From: Gabriele Monaco @ 2025-06-20 12:51 UTC (permalink / raw)
To: linux-kernel, Thomas Gleixner, Peter Zijlstra, Andy Lutomirski,
Ingo Molnar
Cc: Gabriele Monaco, Steven Rostedt, Masami Hiramatsu,
linux-trace-kernel
The irq_enable/irq_disable tracepoints fire only when there's an actual
transition (enabled->disabled and vice versa), this needs special care
in NMIs, as they could potentially start with IRQs already disabled.
The current implementation takes care of this by tracking the lockdep
state before the NMI (on nmi_entry) and not tracing on nmi_exit in case
IRQs were already disabled, we don't trace on nmi_entry following the
tracing_irq_cpu variable, which can be racy:
local_irq_enable()
void trace_hardirqs_on(void)
{
if (tracing_irq_cpu) {
trace(irq_enable);
tracing_irq_cpu = 0;
}
/*
* NMI here
* tracing_irq_cpu == 0 (done tracing)
* lockdep_hardirqs_enabled == 0 (IRQs still disabled)
*/
irqentry_nmi_enter()
irq_state.lockdep = 0
trace(irq_disable);
irqentry_nmi_exit()
// irq_state.lockdep == 0
// do not trace(irq_enable)
lockdep_hardirqs_on();
}
The error is visible with the sncid RV monitor and particularly likely
on machines with the following setup:
- x86 bare-metal with 40+ CPUs
- tuned throughput-performance (activating regular perf NMIs)
- workload: stress-ng --cpu-sched 21 --timer 11 --signal 11
The presence of the RV monitor is useful to see the error but it is not
necessary to trigger it.
Prevent this scenario by checking lockdep_hardirqs_enabled to trace also
on nmi_entry.
Fixes: ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: linux-trace-kernel@vger.kernel.org
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
kernel/entry/common.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index a8dd1f27417cf..7369132c00ba4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -326,13 +326,15 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
irq_state.lockdep = lockdep_hardirqs_enabled();
__nmi_enter();
- lockdep_hardirqs_off(CALLER_ADDR0);
+ if (irq_state.lockdep)
+ lockdep_hardirqs_off(CALLER_ADDR0);
lockdep_hardirq_enter();
ct_nmi_enter();
instrumentation_begin();
kmsan_unpoison_entry_regs(regs);
- trace_hardirqs_off_finish();
+ if (irq_state.lockdep)
+ trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
base-commit: 75f5f23f8787c5e184fcb2fbcd02d8e9317dc5e7
--
2.49.0
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs
2025-06-20 12:51 [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs Gabriele Monaco
@ 2025-06-20 21:07 ` Thomas Gleixner
2025-06-21 8:57 ` Peter Zijlstra
1 sibling, 0 replies; 3+ messages in thread
From: Thomas Gleixner @ 2025-06-20 21:07 UTC (permalink / raw)
To: Gabriele Monaco, linux-kernel, Peter Zijlstra, Andy Lutomirski,
Ingo Molnar
Cc: Gabriele Monaco, Steven Rostedt, Masami Hiramatsu,
linux-trace-kernel
On Fri, Jun 20 2025 at 14:51, Gabriele Monaco wrote:
The subject prefix is misleading. The problem is not in lockdep, the
problem is in the generic entry NMI code, no?
> The irq_enable/irq_disable tracepoints fire only when there's an actual
Now it gets truly confusing. Above you say lockdep, now it's trace
points...
> transition (enabled->disabled and vice versa), this needs special care
> in NMIs, as they could potentially start with IRQs already disabled.
s/IRQs/interrupts/ please. This is not twatter.
> The current implementation takes care of this by tracking the lockdep
> state before the NMI (on nmi_entry) and not tracing on nmi_exit in case
> IRQs were already disabled, we don't trace on nmi_entry following the
> tracing_irq_cpu variable, which can be racy:
This sentence does not parse, especially the subordinate clause starting
with 'we' does not make sense.
> The error is visible with the sncid RV monitor and particularly likely
> on machines with the following setup:
> - x86 bare-metal with 40+ CPUs
> - tuned throughput-performance (activating regular perf NMIs)
> - workload: stress-ng --cpu-sched 21 --timer 11 --signal 11
>
> The presence of the RV monitor is useful to see the error but it is not
> necessary to trigger it.
Not sure whether this information is useful in the change log itself. It
can go into the notes section after the '---' seperator.
> @@ -326,13 +326,15 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
> irq_state.lockdep = lockdep_hardirqs_enabled();
>
> __nmi_enter();
> - lockdep_hardirqs_off(CALLER_ADDR0);
> + if (irq_state.lockdep)
> + lockdep_hardirqs_off(CALLER_ADDR0);
This avoids the lockdep call, which has nothing to do with your tracing
problem. lockdep already handles that case. So making this conditional
is a cosmetic noop, nothing else. It's actually questionable whether
this conditional makes sense performance wise. It only makes sense when
the amount of lockdep state == disabled instances is significant.
Otherwise it introduces an extra conditional into the hot path.
> lockdep_hardirq_enter();
> ct_nmi_enter();
>
> instrumentation_begin();
> kmsan_unpoison_entry_regs(regs);
> - trace_hardirqs_off_finish();
> + if (irq_state.lockdep)
> + trace_hardirqs_off_finish();
Now this is the real thing you are interested in, because otherwise you
end up with a trace_irqsoff() event without the corresponding
trace_irqson() event, right?
So in short what you are trying to explain in the change log is:
irqentry_nmi_enter() tracks the lockdep interrupt state on entry to
prevent a lockdep_hardirqs_on() invocation in the NMI exit path, when
lockdep was interrupted by the NMI before it could mark interrupts
enabled in the lockdep tracking. This works correctly, but a similar
problem exists for the local_irq_* tracepoints.
local_irq_enable() invokes trace_hardirqs_on(), which invokes
lockdep_hardirqs_on() after establishing the tracer state.
If the NMI hits before lockdep_hardirqs_on() has established the lockdep
enabled state, then unconditional trace_hardirqs_off_finish() invocation
in irqentry_nmi_enter() flips the tracer state to disabled.
In irqentry_nmi_exit() the counterpart trace_hardirqs_on_prepare() is
only invoked when the lockdep interrupt state on entry was 'enabled',
which means the trace lacks the corresponding interrupt enable entry.
Fix this by making the invocation of trace_hardirqs_off_finish() in
irqentry_nmi_enter() conditional on the lockdep state.
Right?
But you missed something completely here. This problem exists also when
lockdep is disabled and then it is way simpler to trigger. Why?
If lockdep is off, then lockdep_hardirqs_enabled() always returns
false. So any entry where the tracer state is 'enabled' will record a
off, but the exit will not record the on.
As you correctly described, the two states are asynchronous in the
context of NMIs, so the obvious thing is to treat them as seperate
entities so that the trace really contains the off/on events independent
of lockdep enablement and lockdep's internal state. No?
The only other sensible option is to remove the trace points from the
NMI code all together because with the lockdep conditional they are
inconsistent no matter what.
Aside of that you missed to fix the corresponding problem in the arm64
entry code, which still has it's own version of the generic entry code.
Thanks,
tglx
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs
2025-06-20 12:51 [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs Gabriele Monaco
2025-06-20 21:07 ` Thomas Gleixner
@ 2025-06-21 8:57 ` Peter Zijlstra
1 sibling, 0 replies; 3+ messages in thread
From: Peter Zijlstra @ 2025-06-21 8:57 UTC (permalink / raw)
To: Gabriele Monaco
Cc: linux-kernel, Thomas Gleixner, Andy Lutomirski, Ingo Molnar,
Steven Rostedt, Masami Hiramatsu, linux-trace-kernel
On Fri, Jun 20, 2025 at 02:51:13PM +0200, Gabriele Monaco wrote:
> local_irq_enable()
> void trace_hardirqs_on(void)
> {
> if (tracing_irq_cpu) {
> trace(irq_enable);
> tracing_irq_cpu = 0;
> }
>
> /*
> * NMI here
> * tracing_irq_cpu == 0 (done tracing)
> * lockdep_hardirqs_enabled == 0 (IRQs still disabled)
> */
>
> irqentry_nmi_enter()
> irq_state.lockdep = 0
> trace(irq_disable);
So you're saying this ^^^^^ is the
actual problem?
>
> irqentry_nmi_exit()
> // irq_state.lockdep == 0
> // do not trace(irq_enable)
Because this ^^^^ might lead one to
believe the lack of trace(irq_enable)
is the problem.
> lockdep_hardirqs_on();
> }
Because I'm thinking the trace(irq_disable) is actually correct. We are
entering an NMI handler, and that very much has IRQs disabled.
> Prevent this scenario by checking lockdep_hardirqs_enabled to trace also
> on nmi_entry.
>
> Fixes: ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: linux-trace-kernel@vger.kernel.org
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
> kernel/entry/common.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index a8dd1f27417cf..7369132c00ba4 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -326,13 +326,15 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
> irq_state.lockdep = lockdep_hardirqs_enabled();
>
> __nmi_enter();
> - lockdep_hardirqs_off(CALLER_ADDR0);
> + if (irq_state.lockdep)
> + lockdep_hardirqs_off(CALLER_ADDR0);
This isn't needed... it is perfectly fine calling lockdep_hardirq_off()
again here. You'll hit the redundant_hardirqs_off counter.
> lockdep_hardirq_enter();
> ct_nmi_enter();
>
> instrumentation_begin();
> kmsan_unpoison_entry_regs(regs);
> - trace_hardirqs_off_finish();
> + if (irq_state.lockdep)
> + trace_hardirqs_off_finish();
So I really think you're doing the wrong thing here. We traced IRQs are
enabled, but then take an NMI, meaning IRQs are very much disabled. So
we want this irqs_off to fire.
The much more fun case is:
if (tracing_irq_cpu) {
trace(irq_enable);
<NMI>
Because then it will see tracing_irq_cpu set, but also have issued
irq_enable and not issue irq_disable, and then things are really messed
up.
So yes, you found a fun case, but your solution seemed aimed at pleasing
the model, rather than reality.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2025-06-21 8:57 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-20 12:51 [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs Gabriele Monaco
2025-06-20 21:07 ` Thomas Gleixner
2025-06-21 8:57 ` Peter Zijlstra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).