[PATCH 0/2] arm64/entry: Fix involuntary preemption exception masking

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] arm64/entry: Fix involuntary preemption exception masking
@ 2026-03-20 11:30 Mark Rutland
  2026-03-20 11:30 ` [PATCH 1/2] " Mark Rutland
  2026-03-20 11:30 ` [PATCH 2/2] arm64/entry: Remove arch_irqentry_exit_need_resched() Mark Rutland
  0 siblings, 2 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-20 11:30 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, peterz, tglx, vladimir.murzin,
	will
  Cc: ada.coupriediaz, linux-kernel, luto, mark.rutland, ruanjinjie

Hi all,

Since the move to generic IRQ entry, arm64's involuntary kernel
preemption logic has been subtly broken, and preemption can lead to
tasks running with some exceptions masked unexpectedly.

Patch 1 describes the gory details, but the gist is that due to the way
exceptions work on arm64 architecturally, and due to the number of
independently maskable exceptions, some aspects of the generic irq entry
code aren't a great fit, and arm64 needs to manage more of the
sequencing and state management itself. Doing so will also make it
possible to implement new stuff in the near future (e.g. architectural
NMI).

Patch 2 is purely a cleanup atop patch 1.

Thomas, Peter, I have some questions at the end of the patch 1 commit
message, but otherwise I hope this looks good to you. I'm assuming that
this should go via the arm64 tree, so I'm looking for your acks on both
patches.

Mark.

Mark Rutland (2):
  arm64/entry: Fix involuntary preemption exception masking
  arm64/entry: Remove arch_irqentry_exit_need_resched()

 arch/Kconfig                          |  3 +++
 arch/arm64/Kconfig                    |  1 +
 arch/arm64/include/asm/entry-common.h | 27 ---------------------------
 arch/arm64/kernel/entry-common.c      | 27 +++++++++++++++++++++++++++
 kernel/entry/common.c                 | 20 ++++----------------
 5 files changed, 35 insertions(+), 43 deletions(-)

-- 
2.30.2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 11:30 [PATCH 0/2] arm64/entry: Fix involuntary preemption exception masking Mark Rutland
@ 2026-03-20 11:30 ` Mark Rutland
  2026-03-20 13:04   ` Peter Zijlstra
                     ` (2 more replies)
  2026-03-20 11:30 ` [PATCH 2/2] arm64/entry: Remove arch_irqentry_exit_need_resched() Mark Rutland
  1 sibling, 3 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-20 11:30 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: ada.coupriediaz, catalin.marinas, linux-kernel, luto,
	mark.rutland, peterz, ruanjinjie, tglx, vladimir.murzin, will

On arm64, involuntary kernel preemption has been subtly broken since the
move to the generic irq entry code. When preemption occurs, the new task
may run with SError and Debug exceptions masked unexpectedly, leading to
a loss of RAS events, breakpoints, watchpoints, and single-step
exceptions.

We can fix this relatively simply by moving the preemption logic out of
irqentry_exit(), which is desirable for a number of other reasons on
arm64. Context and rationale below:

1) Architecturally, several groups of exceptions can be masked
   independently, including 'Debug', 'SError', 'IRQ', and 'FIQ', whose
   mask bits can be read/written via the 'DAIF' register.

   Other mask bits exist, including 'PM' and 'AllInt', which we will
   need to use in future (e.g. for architectural NMI support).

   The entry code needs to manipulate all of these, but the generic
   entry code only knows about interrupts (which means both IRQ and FIQ
   on arm64), and the other exception masks aren't generic.

2) Architecturally, all maskable exceptions MUST be masked during
   exception entry and exception return.

   Upon exception entry, hardware places exception context into
   exception registers (e.g. the PC is saved into ELR_ELx). Upon
   exception return, hardware restores exception context from those
   exception registers (e.g. the PC is restored from ELR_ELx).

   To ensure the exception registers aren't clobbered by recursive
   exceptions, all maskable exceptions must be masked early during entry
   and late during exit. Hardware masks all maskable exceptions
   automatically at exception entry. Software must unmask these as
   required, and must mask them prior to exception return.

3) Architecturally, hardware masks all maskable exceptions upon any
   exception entry. A synchronous exception (e.g. a fault on a memory
   access) can be taken from any context (e.g. where IRQ+FIQ might be
   masked), and the entry code must explicitly 'inherit' the unmasking
   from the original context by reading the exception registers (e.g.
   SPSR_ELx) and writing to DAIF, etc.

4) When 'pseudo-NMI' is used, Linux masks interrupts via a combination
   of DAIF and the 'PMR' priority mask register. At entry and exit,
   interrupts must be masked via DAIF, but most kernel code will
   mask/unmask regular interrupts using PMR (e.g. in local_irq_save()
   and local_irq_restore()).

   This requires more complicated transitions at entry and exit. Early
   during entry or late during return, interrupts are masked via DAIF,
   and kernel code which manipulates PMR to mask/unmask interrupts will
   not function correctly in this state.

   This also requires fairly complicated management of DAIF and PMR when
   handling interrupts, and arm64 has special logic to avoid preempting
   from pseudo-NMIs which currently lives in
   arch_irqentry_exit_need_resched().

5) Most kernel code runs with all exceptions unmasked. When scheduling,
   only interrupts should be masked (by PMR pseudo-NMI is used, and by
   DAIF otherwise).

For most exceptions, arm64's entry code has a sequence similar to that
of el1_abort(), which is used for faults:

| static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
| {
|         unsigned long far = read_sysreg(far_el1);
|         irqentry_state_t state;
|
|         state = enter_from_kernel_mode(regs);
|         local_daif_inherit(regs);
|         do_mem_abort(far, esr, regs);
|         local_daif_mask();
|         exit_to_kernel_mode(regs, state);
| }

... where enter_from_kernel_mode() and exit_to_kernel_mode() are
wrappers around irqentry_enter() and irqentry_exit() which perform
additional arm64-specific entry/exit logic.

Currently, the generic irq entry code will attempt to preempt from any
exception under irqentry_exit() where interrupts were unmasked in the
original context. As arm64's entry code will have already masked
exceptions via DAIF, this results in the problems described above.

Fix this by opting out of preemption in irqentry_exit(), and restoring
arm64's old behaivour of explicitly preempting when returning from IRQ
or FIQ, before calling exit_to_kernel_mode() / irqentry_exit(). This
ensures that preemption occurs when only interrupts are masked, and
where that masking is compatible with most kernel code (e.g. using PMR
when pseudo-NMI is in use).

Fixes: 99eb057ccd67 ("arm64: entry: Move arm64_preempt_schedule_irq() into __exit_to_kernel_mode()")
Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jinjie Ruan <ruanjinjie@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/Kconfig                     | 3 +++
 arch/arm64/Kconfig               | 1 +
 arch/arm64/kernel/entry-common.c | 2 ++
 kernel/entry/common.c            | 4 +++-
 4 files changed, 9 insertions(+), 1 deletion(-)

Thomas, Peter, I have a couple of things I'd like to check:

(1) The generic irq entry code will preempt from any exception (e.g. a
    synchronous fault) where interrupts were unmasked in the original
    context. Is that intentional/necessary, or was that just the way the
    x86 code happened to be implemented?

    I assume that it'd be fine if arm64 only preempted from true
    interrupts, but if that was intentional/necessary I can go rework
    this.

(2) The generic irq entry code only preempts when RCU was watching in
    the original context. IIUC that's just to avoid preempting from the
    idle thread. Is it functionally necessary to avoid that, or is that
    just an optimization?

    I'm asking because historically arm64 didn't check that, and I
    haven't bothered checking here. I don't know whether we have a
    latent functional bug.

Mark.

diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298ef..c8c99cd955281 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -102,6 +102,9 @@ config HOTPLUG_PARALLEL
 	bool
 	select HOTPLUG_SPLIT_STARTUP

+config ARCH_HAS_OWN_IRQ_PREEMPTION
+	bool
+
 config GENERIC_IRQ_ENTRY
 	bool

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 38dba5f7e4d2d..bf0ec8237de45 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -42,6 +42,7 @@ config ARM64
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT
+	select ARCH_HAS_OWN_IRQ_PREEMPTION
 	select ARCH_HAS_PREEMPT_LAZY
 	select ARCH_HAS_PTDUMP
 	select ARCH_HAS_PTE_SPECIAL
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 3625797e9ee8f..1aedadf09eb4d 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -497,6 +497,8 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
 	do_interrupt_handler(regs, handler);
 	irq_exit_rcu();

+	irqentry_exit_cond_resched();
+
 	exit_to_kernel_mode(regs, state);
 }
 static void noinstr el1_interrupt(struct pt_regs *regs,
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 9ef63e4147913..af9cae1f225e3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -235,8 +235,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}

 		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION))
+		if (IS_ENABLED(CONFIG_PREEMPTION) &&
+		    !IS_ENABLED(CONFIG_ARCH_HAS_OWN_IRQ_PREEMPTION)) {
 			irqentry_exit_cond_resched();
+		}

 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
-- 
2.30.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 11:30 ` [PATCH 1/2] " Mark Rutland
@ 2026-03-20 13:04   ` Peter Zijlstra
  2026-03-20 14:11     ` Thomas Gleixner
  2026-03-20 14:59   ` Thomas Gleixner
  2026-03-24  3:14   ` Jinjie Ruan
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-03-20 13:04 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, ruanjinjie, tglx, vladimir.murzin, will

On Fri, Mar 20, 2026 at 11:30:25AM +0000, Mark Rutland wrote:

> Thomas, Peter, I have a couple of things I'd like to check:
> 
> (1) The generic irq entry code will preempt from any exception (e.g. a
>     synchronous fault) where interrupts were unmasked in the original
>     context. Is that intentional/necessary, or was that just the way the
>     x86 code happened to be implemented?
> 
>     I assume that it'd be fine if arm64 only preempted from true
>     interrupts, but if that was intentional/necessary I can go rework
>     this.

So NMI-from-kernel must not trigger resched IIRC. There is some code
that relies on this somewhere. And on x86 many of those synchronous
exceptions are marked as NMI, since they can happen with IRQs disabled
inside locks etc.

But for the rest I don't think we care particularly. Notably page-fault
will already schedule itself when possible (faults leading to IO and
blocking).

> (2) The generic irq entry code only preempts when RCU was watching in
>     the original context. IIUC that's just to avoid preempting from the
>     idle thread. Is it functionally necessary to avoid that, or is that
>     just an optimization?
> 
>     I'm asking because historically arm64 didn't check that, and I
>     haven't bothered checking here. I don't know whether we have a
>     latent functional bug.

Like I told you on IRC, I *think* this is just an optimization, since if
you hit idle, the idle loop will take care of scheduling. But I can't
quite remember the details here, and wish we'd have written a sensible
comment at that spot.

Other places where RCU isn't watching are userspace and KVM. The first
isn't relevant because this is return-to-kernel, and the second I'm not
sure about.

Thomas, can you remember?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 13:04   ` Peter Zijlstra
@ 2026-03-20 14:11     ` Thomas Gleixner
  2026-03-20 14:57       ` Mark Rutland
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-20 14:11 UTC (permalink / raw)
  To: Peter Zijlstra, Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, ruanjinjie, vladimir.murzin, will

On Fri, Mar 20 2026 at 14:04, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 11:30:25AM +0000, Mark Rutland wrote:
>> Thomas, Peter, I have a couple of things I'd like to check:
>> 
>> (1) The generic irq entry code will preempt from any exception (e.g. a
>>     synchronous fault) where interrupts were unmasked in the original
>>     context. Is that intentional/necessary, or was that just the way the
>>     x86 code happened to be implemented?
>> 
>>     I assume that it'd be fine if arm64 only preempted from true
>>     interrupts, but if that was intentional/necessary I can go rework
>>     this.
>
> So NMI-from-kernel must not trigger resched IIRC. There is some code
> that relies on this somewhere. And on x86 many of those synchronous
> exceptions are marked as NMI, since they can happen with IRQs disabled
> inside locks etc.
>
> But for the rest I don't think we care particularly. Notably page-fault
> will already schedule itself when possible (faults leading to IO and
> blocking).

Right. In general we allow preemption on any interrupt, trap and exception
when:

  1) the interrupted context had interrupts enabled

  2) RCU was watching in the original context

This _is_ intentional as there is no reason to defer preemption in such
a case. The RT people might get upset if you do so.

NMI like exceptions, which are not allowed to schedule, should therefore
never go through irqentry_irq_entry() and irqentry_irq_exit().

irqentry_nmi_enter() and irqentry_nmi_exit() exist for a technical
reason and are not just of decorative nature. :)

>> (2) The generic irq entry code only preempts when RCU was watching in
>>     the original context. IIUC that's just to avoid preempting from the
>>     idle thread. Is it functionally necessary to avoid that, or is that
>>     just an optimization?
>> 
>>     I'm asking because historically arm64 didn't check that, and I
>>     haven't bothered checking here. I don't know whether we have a
>>     latent functional bug.
>
> Like I told you on IRC, I *think* this is just an optimization, since if
> you hit idle, the idle loop will take care of scheduling. But I can't
> quite remember the details here, and wish we'd have written a sensible
> comment at that spot.

There is one, but it's obviously not detailed enough.

> Other places where RCU isn't watching are userspace and KVM. The first
> isn't relevant because this is return-to-kernel, and the second I'm not
> sure about.
>
> Thomas, can you remember?

Yes. It's not an optimization. It's a correctness issue.

If the interrupted context is RCU idle then you have to carefully go
back to that context. So that the context can tell RCU it is done with
the idle state and RCU has to pay attention again. Otherwise all of this
becomes imbalanced.

This is about context-level nesting:

        ...
L1.A    ct_cpuidle_enter();

                        -> interrupt
 L2.A                           ct_irq_enter();
                                ...             // Set NEED_RESCHED
 L2.B                           ct_irq_exit();
                               
        ...
L1.B    ct_cpuidle_exit();

Scheduling between #L2.B and #L1.B makes RCU rightfully upset. Think
about it this way:

L1.A    preempt_disable();
L2.A    local_bh_disable();
        ..
L2.B    local_bh_enable();
        if (need_resched())
           schedule();
L1.B    preempt_enable();

RCU is not any different. For context-level nesting of any kind the only
valid order is:

   L1.A -> L2.A -> L2.B -> L1.B

Pretty obvious if you actually think about it, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 14:11     ` Thomas Gleixner
@ 2026-03-20 14:57       ` Mark Rutland
  2026-03-20 15:34         ` Peter Zijlstra
  2026-03-20 15:50         ` Thomas Gleixner
  0 siblings, 2 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-20 14:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, linux-arm-kernel, ada.coupriediaz,
	catalin.marinas, linux-kernel, luto, ruanjinjie, vladimir.murzin,
	will

On Fri, Mar 20, 2026 at 03:11:20PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 14:04, Peter Zijlstra wrote:
> > On Fri, Mar 20, 2026 at 11:30:25AM +0000, Mark Rutland wrote:
> >> Thomas, Peter, I have a couple of things I'd like to check:
> >> 
> >> (1) The generic irq entry code will preempt from any exception (e.g. a
> >>     synchronous fault) where interrupts were unmasked in the original
> >>     context. Is that intentional/necessary, or was that just the way the
> >>     x86 code happened to be implemented?
> >> 
> >>     I assume that it'd be fine if arm64 only preempted from true
> >>     interrupts, but if that was intentional/necessary I can go rework
> >>     this.
> >
> > So NMI-from-kernel must not trigger resched IIRC. There is some code
> > that relies on this somewhere. And on x86 many of those synchronous
> > exceptions are marked as NMI, since they can happen with IRQs disabled
> > inside locks etc.
> >
> > But for the rest I don't think we care particularly. Notably page-fault
> > will already schedule itself when possible (faults leading to IO and
> > blocking).
> 
> Right. In general we allow preemption on any interrupt, trap and exception
> when:
> 
>   1) the interrupted context had interrupts enabled
> 
>   2) RCU was watching in the original context
> 
> This _is_ intentional as there is no reason to defer preemption in such
> a case. The RT people might get upset if you do so.

Ok. Thanks for confirming!

As above, I'll go see what I can do to address that. I suspect I'll need
something like irqentry_exit_to_kernel_mode_prepare(), analogous to
irqentry_exit_to_user_mode_prepare(), so that the preemption can happen
before the exception masking, but the rest of the exit logic can happen
afterwards.

I know that arm64 currently uses exit_to_user_mode_prepare_legacy(), and
I want to go clean that up too. :)

> NMI like exceptions, which are not allowed to schedule, should therefore
> never go through irqentry_irq_entry() and irqentry_irq_exit().
> 
> irqentry_nmi_enter() and irqentry_nmi_exit() exist for a technical
> reason and are not just of decorative nature. :)

Sorry, I should have been clearer that I was only trying to ask about
cases where irqentry_exit() would preempt. I understand
irqentry_nmi_exit() won't preempt.

Understood and agreed for NMI!

> >> (2) The generic irq entry code only preempts when RCU was watching in
> >>     the original context. IIUC that's just to avoid preempting from the
> >>     idle thread. Is it functionally necessary to avoid that, or is that
> >>     just an optimization?
> >> 
> >>     I'm asking because historically arm64 didn't check that, and I
> >>     haven't bothered checking here. I don't know whether we have a
> >>     latent functional bug.
> >
> > Like I told you on IRC, I *think* this is just an optimization, since if
> > you hit idle, the idle loop will take care of scheduling. But I can't
> > quite remember the details here, and wish we'd have written a sensible
> > comment at that spot.
> 
> There is one, but it's obviously not detailed enough.
> 
> > Other places where RCU isn't watching are userspace and KVM. The first
> > isn't relevant because this is return-to-kernel, and the second I'm not
> > sure about.
> >
> > Thomas, can you remember?
> 
> Yes. It's not an optimization. It's a correctness issue.
> 
> If the interrupted context is RCU idle then you have to carefully go
> back to that context. So that the context can tell RCU it is done with
> the idle state and RCU has to pay attention again. Otherwise all of this
> becomes imbalanced.
> 
> This is about context-level nesting:
> 
>         ...
> L1.A    ct_cpuidle_enter();
> 
>                         -> interrupt
>  L2.A                           ct_irq_enter();
>                                 ...             // Set NEED_RESCHED
>  L2.B                           ct_irq_exit();
>                                
>         ...
> L1.B    ct_cpuidle_exit();
> 
> Scheduling between #L2.B and #L1.B makes RCU rightfully upset. 

I suspect I'm missing something obvious here:

* Regardless of nesting, I see that scheduling between L2.B and L1.B is
  broken because RCU isn't watching.

* I'm not sure whether there's a problem with scheduling between L2.A
  and L2.B, which is what arm64 used to do, and what arm64 would do
  after this patch.

I *think* I just don't understand how context tracking actually works,
so I'll go dig into that and go learn how the struct context_tracking
fields are manipulated by ct_cpuidle_{enter,exit}() and
ct_irq_{enter,exit}().

If there's something else I should go look at, please let me know!

> Think about it this way:
> 
> L1.A    preempt_disable();
> L2.A    local_bh_disable();
>         ..
> L2.B    local_bh_enable();
>         if (need_resched())
>            schedule();
> L1.B    preempt_enable();
> 
> RCU is not any different. For context-level nesting of any kind the only
> valid order is:
> 
>    L1.A -> L2.A -> L2.B -> L1.B
> 
> Pretty obvious if you actually think about it, no?

I guess I'll need to think a bit harder ;)

Thanks for all of this. Even if I'm confused right now, it's very
helpful!

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 14:57       ` Mark Rutland
@ 2026-03-20 15:34         ` Peter Zijlstra
  2026-03-20 16:16           ` Mark Rutland
  2026-03-20 15:50         ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-03-20 15:34 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Thomas Gleixner, linux-arm-kernel, ada.coupriediaz,
	catalin.marinas, linux-kernel, luto, ruanjinjie, vladimir.murzin,
	will

On Fri, Mar 20, 2026 at 02:57:37PM +0000, Mark Rutland wrote:

> I know that arm64 currently uses exit_to_user_mode_prepare_legacy(), and
> I want to go clean that up too. :)

This series; and this patch in particular:

https://lkml.kernel.org/r/20260320102620.1336796-10-ruanjinjie@huawei.com

seem to already take care of that.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 15:34         ` Peter Zijlstra
@ 2026-03-20 16:16           ` Mark Rutland
  0 siblings, 0 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-20 16:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-arm-kernel, ada.coupriediaz,
	catalin.marinas, linux-kernel, luto, ruanjinjie, vladimir.murzin,
	will

On Fri, Mar 20, 2026 at 04:34:21PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 02:57:37PM +0000, Mark Rutland wrote:
> 
> > I know that arm64 currently uses exit_to_user_mode_prepare_legacy(), and
> > I want to go clean that up too. :)
> 
> This series; and this patch in particular:
> 
> https://lkml.kernel.org/r/20260320102620.1336796-10-ruanjinjie@huawei.com
> 
> seem to already take care of that.

Sure, and Jinjie's patch might be the right option.

I'd like to fix the irqentry stuff as a whle *before* we convert the
syscall stuff, so that we're not just creating more work for ourselves.

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 14:57       ` Mark Rutland
  2026-03-20 15:34         ` Peter Zijlstra
@ 2026-03-20 15:50         ` Thomas Gleixner
  2026-03-23 17:21           ` Mark Rutland
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-20 15:50 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, linux-arm-kernel, ada.coupriediaz,
	catalin.marinas, linux-kernel, luto, ruanjinjie, vladimir.murzin,
	will

On Fri, Mar 20 2026 at 14:57, Mark Rutland wrote:
> On Fri, Mar 20, 2026 at 03:11:20PM +0100, Thomas Gleixner wrote:
>> Yes. It's not an optimization. It's a correctness issue.
>> 
>> If the interrupted context is RCU idle then you have to carefully go
>> back to that context. So that the context can tell RCU it is done with
>> the idle state and RCU has to pay attention again. Otherwise all of this
>> becomes imbalanced.
>> 
>> This is about context-level nesting:
>> 
>>         ...
>> L1.A    ct_cpuidle_enter();
>> 
>>                         -> interrupt
>>  L2.A                           ct_irq_enter();
>>                                 ...             // Set NEED_RESCHED
>>  L2.B                           ct_irq_exit();
>>                                
>>         ...
>> L1.B    ct_cpuidle_exit();
>> 
>> Scheduling between #L2.B and #L1.B makes RCU rightfully upset. 
>
> I suspect I'm missing something obvious here:
>
> * Regardless of nesting, I see that scheduling between L2.B and L1.B is
>   broken because RCU isn't watching.
>
> * I'm not sure whether there's a problem with scheduling between L2.A
>   and L2.B, which is what arm64 used to do, and what arm64 would do
>   after this patch.

The only reason why it "works" is that the idle task has preemption
permanently disabled, so it won't really schedule even if need_resched()
is set. So it "works" by chance and not by design.

Apply the patch below and watch the show.

> Thanks for all of this. Even if I'm confused right now, it's very
> helpful!

RCU induced confusion is perfectly normal. Everyone suffers from that at
some point. Welcome to the club.

Thanks,

        tglx
---
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -187,9 +187,10 @@ static inline bool arch_irqentry_exit_ne
 
 void raw_irqentry_exit_cond_resched(void)
 {
+	rcu_irq_exit_check_preempt();
+
 	if (!preempt_count()) {
 		/* Sanity check RCU and thread stack */
-		rcu_irq_exit_check_preempt();
 		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
 			WARN_ON_ONCE(!on_thread_stack());
 		if (need_resched() && arch_irqentry_exit_need_resched())



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 15:50         ` Thomas Gleixner
@ 2026-03-23 17:21           ` Mark Rutland
  0 siblings, 0 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-23 17:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, linux-arm-kernel, ada.coupriediaz,
	catalin.marinas, linux-kernel, luto, ruanjinjie, vladimir.murzin,
	will

On Fri, Mar 20, 2026 at 04:50:03PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 14:57, Mark Rutland wrote:
> > On Fri, Mar 20, 2026 at 03:11:20PM +0100, Thomas Gleixner wrote:
> >> Yes. It's not an optimization. It's a correctness issue.
> >> 
> >> If the interrupted context is RCU idle then you have to carefully go
> >> back to that context. So that the context can tell RCU it is done with
> >> the idle state and RCU has to pay attention again. Otherwise all of this
> >> becomes imbalanced.
> >> 
> >> This is about context-level nesting:
> >> 
> >>         ...
> >> L1.A    ct_cpuidle_enter();
> >> 
> >>                         -> interrupt
> >>  L2.A                           ct_irq_enter();
> >>                                 ...             // Set NEED_RESCHED
> >>  L2.B                           ct_irq_exit();
> >>                                
> >>         ...
> >> L1.B    ct_cpuidle_exit();
> >> 
> >> Scheduling between #L2.B and #L1.B makes RCU rightfully upset. 
> >
> > I suspect I'm missing something obvious here:
> >
> > * Regardless of nesting, I see that scheduling between L2.B and L1.B is
> >   broken because RCU isn't watching.
> >
> > * I'm not sure whether there's a problem with scheduling between L2.A
> >   and L2.B, which is what arm64 used to do, and what arm64 would do
> >   after this patch.
> 
> The only reason why it "works" is that the idle task has preemption
> permanently disabled, so it won't really schedule even if need_resched()
> is set. So it "works" by chance and not by design.

Ah, I see.

Thanks -- that relieves my fear that we'd have to backport a fix to
stable kernels. Since that's safe by accident, I think we can leave
stable kernels as-is.

> Apply the patch below and watch the show.

Thanks for this too; I hadn't spotted rcu_irq_exit_check_preempt().

Info dump below, but this is just agreeing with what you said above. :)

Since rcu_irq_exit_check_preempt() doesn't dump the actual values, I
hacked up something similar and tested arm64's old logic (from v6.17).
CT_NESTING_IRQ_NONIDLE would be 0x4000000000000001, so that would
be off-by-one if we were to preempt. However, as you say, preemption is
disabled, and that happens to save us.

Thanks again!

Mark.

| ------------[ cut here ]------------
| HARK: arm64_preempt_schedule_irq() called with:
|       CT nesting: 0x0000000000000001
|   CT NMI nesting: 0x4000000000000002
|     RCU watching: yes
|    preempt_count: 0x00000001
| WARNING: CPU: 0 PID: 0 at arch/arm64/kernel/entry-common.c:286 el1_interrupt+0xf8/0x100
| Modules linked in:
| CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.17.0-00001-gc02e86492f52-dirty #8 PREEMPT 
| Hardware name: linux,dummy-virt (DT)
| pstate: 600000c9 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : el1_interrupt+0xf8/0x100
| lr : el1_interrupt+0xf8/0x100
| sp : ffffa1efd4333be0
| x29: ffffa1efd4333be0 x28: ffffa1efd434d280 x27: ffffa1efd4342360
| x26: ffffa1efd4345000 x25: 0000000000000000 x24: ffffa1efd434d280
| x23: 0000000060000009 x22: ffffa1efd31f0154 x21: ffffa1efd4333d70
| x20: 0000000000000000 x19: ffffa1efd4333c20 x18: 000000000000000a
| x17: 72702020200a7365 x16: 79203a676e696863 x15: 7461772055435220
| x14: 2020200a32303030 x13: 3130303030303030 x12: 7830203a746e756f
| x11: 0000000000000058 x10: 0000000000000018 x9 : fff000003c7e5000
| x8 : 00000000000affa8 x7 : 0000000000000084 x6 : fff000003fc7b6c0
| x5 : fff000003fc7b6c0 x4 : 0000000000000000 x3 : 0000000000000000
| x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffa1efd434d280
| Call trace:
|  el1_interrupt+0xf8/0x100 (P)
|  el1h_64_irq_handler+0x18/0x24
|  el1h_64_irq+0x6c/0x70
|  default_idle_call+0xb4/0x2a0 (P)
|  do_idle+0x210/0x270
|  cpu_startup_entry+0x34/0x40
|  rest_init+0x174/0x180
|  console_on_rootfs+0x0/0x6c
|  __primary_switched+0x88/0x90
| irq event stamp: 848
| hardirqs last  enabled at (846): [<ffffa1efd1fb0da8>] rcu_core+0xc88/0x1048
| hardirqs last disabled at (847): [<ffffa1efd1ee2444>] handle_softirqs+0x434/0x4a0
| softirqs last  enabled at (848): [<ffffa1efd1ee245c>] handle_softirqs+0x44c/0x4a0
| softirqs last disabled at (841): [<ffffa1efd1e10794>] __do_softirq+0x14/0x20
| ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 11:30 ` [PATCH 1/2] " Mark Rutland
  2026-03-20 13:04   ` Peter Zijlstra
@ 2026-03-20 14:59   ` Thomas Gleixner
  2026-03-20 15:37     ` Mark Rutland
  2026-03-24  3:14   ` Jinjie Ruan
  2 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-20 14:59 UTC (permalink / raw)
  To: Mark Rutland, linux-arm-kernel
  Cc: ada.coupriediaz, catalin.marinas, linux-kernel, luto,
	mark.rutland, peterz, ruanjinjie, vladimir.murzin, will

On Fri, Mar 20 2026 at 11:30, Mark Rutland wrote:
> We can fix this relatively simply by moving the preemption logic out of
> irqentry_exit(), which is desirable for a number of other reasons on
> arm64. Context and rationale below:
>
> 1) Architecturally, several groups of exceptions can be masked
>    independently, including 'Debug', 'SError', 'IRQ', and 'FIQ', whose
>    mask bits can be read/written via the 'DAIF' register.
>
>    Other mask bits exist, including 'PM' and 'AllInt', which we will
>    need to use in future (e.g. for architectural NMI support).
>
>    The entry code needs to manipulate all of these, but the generic
>    entry code only knows about interrupts (which means both IRQ and FIQ
>    on arm64), and the other exception masks aren't generic.

Right, but that's what the architecture specific parts are for.

> 2) Architecturally, all maskable exceptions MUST be masked during
>    exception entry and exception return.
>
>    Upon exception entry, hardware places exception context into
>    exception registers (e.g. the PC is saved into ELR_ELx). Upon
>    exception return, hardware restores exception context from those
>    exception registers (e.g. the PC is restored from ELR_ELx).
>
>    To ensure the exception registers aren't clobbered by recursive
>    exceptions, all maskable exceptions must be masked early during entry
>    and late during exit. Hardware masks all maskable exceptions
>    automatically at exception entry. Software must unmask these as
>    required, and must mask them prior to exception return.

That's not much different from any other architecture.

> 3) Architecturally, hardware masks all maskable exceptions upon any
>    exception entry. A synchronous exception (e.g. a fault on a memory
>    access) can be taken from any context (e.g. where IRQ+FIQ might be
>    masked), and the entry code must explicitly 'inherit' the unmasking
>    from the original context by reading the exception registers (e.g.
>    SPSR_ELx) and writing to DAIF, etc.

The amount of mask bits/registers is obviously architecture specific,
but conceptually it's the same everywhere.

> 4) When 'pseudo-NMI' is used, Linux masks interrupts via a combination
>    of DAIF and the 'PMR' priority mask register. At entry and exit,
>    interrupts must be masked via DAIF, but most kernel code will
>    mask/unmask regular interrupts using PMR (e.g. in local_irq_save()
>    and local_irq_restore()).
>
>    This requires more complicated transitions at entry and exit. Early
>    during entry or late during return, interrupts are masked via DAIF,
>    and kernel code which manipulates PMR to mask/unmask interrupts will
>    not function correctly in this state.
>
>    This also requires fairly complicated management of DAIF and PMR when
>    handling interrupts, and arm64 has special logic to avoid preempting
>    from pseudo-NMIs which currently lives in
>    arch_irqentry_exit_need_resched().

Why are you routing NMI like exceptions through irqentry_enter() and
irqentry_exit() in the first place? That's just wrong.

> 5) Most kernel code runs with all exceptions unmasked. When scheduling,
>    only interrupts should be masked (by PMR pseudo-NMI is used, and by
>    DAIF otherwise).
>
> For most exceptions, arm64's entry code has a sequence similar to that
> of el1_abort(), which is used for faults:
>
> | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> | {
> |         unsigned long far = read_sysreg(far_el1);
> |         irqentry_state_t state;
> |
> |         state = enter_from_kernel_mode(regs);
> |         local_daif_inherit(regs);
> |         do_mem_abort(far, esr, regs);
> |         local_daif_mask();
> |         exit_to_kernel_mode(regs, state);
> | }
>
> ... where enter_from_kernel_mode() and exit_to_kernel_mode() are
> wrappers around irqentry_enter() and irqentry_exit() which perform
> additional arm64-specific entry/exit logic.
>
> Currently, the generic irq entry code will attempt to preempt from any
> exception under irqentry_exit() where interrupts were unmasked in the
> original context. As arm64's entry code will have already masked
> exceptions via DAIF, this results in the problems described above.

See below.

> Fix this by opting out of preemption in irqentry_exit(), and restoring
> arm64's old behaivour of explicitly preempting when returning from IRQ
> or FIQ, before calling exit_to_kernel_mode() / irqentry_exit(). This
> ensures that preemption occurs when only interrupts are masked, and
> where that masking is compatible with most kernel code (e.g. using PMR
> when pseudo-NMI is in use).

My gut feeling tells me that there is a fundamental design flaw
somewhere and the below is papering over it.

> @@ -497,6 +497,8 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
>  	do_interrupt_handler(regs, handler);
>  	irq_exit_rcu();
>  
> +	irqentry_exit_cond_resched();
> +
>  	exit_to_kernel_mode(regs, state);
>  }
>  static void noinstr el1_interrupt(struct pt_regs *regs,
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 9ef63e4147913..af9cae1f225e3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -235,8 +235,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  		}
>  
>  		instrumentation_begin();
> -		if (IS_ENABLED(CONFIG_PREEMPTION))
> +		if (IS_ENABLED(CONFIG_PREEMPTION) &&
> +		    !IS_ENABLED(CONFIG_ARCH_HAS_OWN_IRQ_PREEMPTION)) {

These 'But my architecture is sooo special' switches cause immediate review
nausea and just confirm that there is a fundamental flaw somewhere else.

>  			irqentry_exit_cond_resched();

Let's look at how this is supposed to work. I'm just looking at
irqentry_enter()/exit() and not the NMI variant.

Interrupt/exception is raised

  1) low level architecture specific entry code does all the magic state
     saving, setup etc.

  2) irqentry_enter() is invoked

      - checks for user mode or kernel mode entry

      - handles RCU on enter from user and if kernel entry hits the idle
        task

      - Sets up lockdep, tracing, kminsanity

  3) the interrupt/exception handler is invoked

  4) irqentry_exit() is invoked

      - handles exit to user and exit to kernel

      - exit to user handles the TIF and other pending work, which can
        schedule and then prepares for return

      - exit to kernel

        When interrupt were disabled on entry, it just handles RCU and
        returns.

        When enabled on entry, it checks whether RCU was watching on
        entry or not. If not it tells RCU that the interrupt nesting is
        done and returns. When RCU was watching it can schedule
 
  5) Undoes #1 so that it can return to the originally interrupted
     context.

That means at the point where irqentry_entry() is invoked, the
architecture side should have made sure that everything is set up for
the kernel to operate until irqentry_exit() returns.

Looking at your example:

> | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> | {
> |         unsigned long far = read_sysreg(far_el1);
> |         irqentry_state_t state;
> |
> |         state = enter_from_kernel_mode(regs);
> |         local_daif_inherit(regs);
> |         do_mem_abort(far, esr, regs);
> |         local_daif_mask();
> |         exit_to_kernel_mode(regs, state);

and the paragraph right below that:

> Currently, the generic irq entry code will attempt to preempt from any
> exception under irqentry_exit() where interrupts were unmasked in the
> original context. As arm64's entry code will have already masked
> exceptions via DAIF, this results in the problems described above.

To me this looks like your ordering is wrong. Why are you doing the DAIF
inherit _after_ irqentry_enter() and the mask _before_ irqentry_exit()?

I might be missing something, but this smells more than fishy.

As no other architecture has that problem I'm pretty sure that the
problem is not in the way how the generic code was designed. Why?

Because your architecture is _not_ sooo special! :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 14:59   ` Thomas Gleixner
@ 2026-03-20 15:37     ` Mark Rutland
  2026-03-20 16:26       ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2026-03-20 15:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Fri, Mar 20, 2026 at 03:59:40PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 11:30, Mark Rutland wrote:
> > 4) When 'pseudo-NMI' is used, Linux masks interrupts via a combination
> >    of DAIF and the 'PMR' priority mask register. At entry and exit,
> >    interrupts must be masked via DAIF, but most kernel code will
> >    mask/unmask regular interrupts using PMR (e.g. in local_irq_save()
> >    and local_irq_restore()).
> >
> >    This requires more complicated transitions at entry and exit. Early
> >    during entry or late during return, interrupts are masked via DAIF,
> >    and kernel code which manipulates PMR to mask/unmask interrupts will
> >    not function correctly in this state.
> >
> >    This also requires fairly complicated management of DAIF and PMR when
> >    handling interrupts, and arm64 has special logic to avoid preempting
> >    from pseudo-NMIs which currently lives in
> >    arch_irqentry_exit_need_resched().
> 
> Why are you routing NMI like exceptions through irqentry_enter() and
> irqentry_exit() in the first place? That's just wrong.

Sorry, the above was not clear, and some of this logic is gunk that has
been carried over unnnecessarily from our old exception handling flow.

The issue with pseudo-NMI is that it uses the same exception as regular
interrupts, but we don't know whether we have a pseudo-NMI until we
acknowledge the event at the irqchip level. When a pseudo-NMI is taken,
there are two possibilities:

(1) The pseudo-NMI is taken from a context where interrupts were
    *disabled*. The entry code immediately knows it must be a
    pseudo-NMI, and we call irqentry_nmi_{enter,exit}(), NOT
    irqentry_{enter,exit}(), treating it as an NMI.


(2) The pseudo-NMI was taken from a context where interrupts were
    *enabled*. The entry code doesn't know whether it's a pseudo-NMI or
    a regular interrupt, so it calls irqentry_{enter,exit}(), and then
    within that we'll call nmi_{enter,exit}() to transiently enter NMI
    context.

I realise this is crazy. I would love to delete pseudo-NMI.
Unfortunately people are using it.

Putting aside the nesting here, I think it's fine to preempt upon return
from case (2), and we can delete the logic to avoid preempting.

> > 5) Most kernel code runs with all exceptions unmasked. When scheduling,
> >    only interrupts should be masked (by PMR pseudo-NMI is used, and by
> >    DAIF otherwise).
> >
> > For most exceptions, arm64's entry code has a sequence similar to that
> > of el1_abort(), which is used for faults:
> >
> > | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> > | {
> > |         unsigned long far = read_sysreg(far_el1);
> > |         irqentry_state_t state;
> > |
> > |         state = enter_from_kernel_mode(regs);
> > |         local_daif_inherit(regs);
> > |         do_mem_abort(far, esr, regs);
> > |         local_daif_mask();
> > |         exit_to_kernel_mode(regs, state);
> > | }
> >
> > ... where enter_from_kernel_mode() and exit_to_kernel_mode() are
> > wrappers around irqentry_enter() and irqentry_exit() which perform
> > additional arm64-specific entry/exit logic.
> >
> > Currently, the generic irq entry code will attempt to preempt from any
> > exception under irqentry_exit() where interrupts were unmasked in the
> > original context. As arm64's entry code will have already masked
> > exceptions via DAIF, this results in the problems described above.
> 
> See below.
> 
> > Fix this by opting out of preemption in irqentry_exit(), and restoring
> > arm64's old behaivour of explicitly preempting when returning from IRQ
> > or FIQ, before calling exit_to_kernel_mode() / irqentry_exit(). This
> > ensures that preemption occurs when only interrupts are masked, and
> > where that masking is compatible with most kernel code (e.g. using PMR
> > when pseudo-NMI is in use).
> 
> My gut feeling tells me that there is a fundamental design flaw
> somewhere and the below is papering over it.
> 
> > @@ -497,6 +497,8 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
> >  	do_interrupt_handler(regs, handler);
> >  	irq_exit_rcu();
> >  
> > +	irqentry_exit_cond_resched();
> > +
> >  	exit_to_kernel_mode(regs, state);
> >  }
> >  static void noinstr el1_interrupt(struct pt_regs *regs,
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 9ef63e4147913..af9cae1f225e3 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -235,8 +235,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> >  		}
> >  
> >  		instrumentation_begin();
> > -		if (IS_ENABLED(CONFIG_PREEMPTION))
> > +		if (IS_ENABLED(CONFIG_PREEMPTION) &&
> > +		    !IS_ENABLED(CONFIG_ARCH_HAS_OWN_IRQ_PREEMPTION)) {
> 
> These 'But my architecture is sooo special' switches cause immediate review
> nausea and just confirm that there is a fundamental flaw somewhere else.
> 
> >  			irqentry_exit_cond_resched();
> 
> Let's look at how this is supposed to work. I'm just looking at
> irqentry_enter()/exit() and not the NMI variant.
> 
> Interrupt/exception is raised
> 
>   1) low level architecture specific entry code does all the magic state
>      saving, setup etc.
> 
>   2) irqentry_enter() is invoked
> 
>       - checks for user mode or kernel mode entry
> 
>       - handles RCU on enter from user and if kernel entry hits the idle
>         task
> 
>       - Sets up lockdep, tracing, kminsanity
> 
>   3) the interrupt/exception handler is invoked
> 
>   4) irqentry_exit() is invoked
> 
>       - handles exit to user and exit to kernel
> 
>       - exit to user handles the TIF and other pending work, which can
>         schedule and then prepares for return
> 
>       - exit to kernel
> 
>         When interrupt were disabled on entry, it just handles RCU and
>         returns.
> 
>         When enabled on entry, it checks whether RCU was watching on
>         entry or not. If not it tells RCU that the interrupt nesting is
>         done and returns. When RCU was watching it can schedule
>  
>   5) Undoes #1 so that it can return to the originally interrupted
>      context.
> 
> That means at the point where irqentry_entry() is invoked, the
> architecture side should have made sure that everything is set up for
> the kernel to operate until irqentry_exit() returns.

Ok. I think you're saying I should try:

* At entry, *before* irqentry_enter():
  - unmask everything EXCEPT regular interrupts.
  - fix up all the necessary state.

* At exception exit, *after* irqentry_exit():
  - mask everything.
  - fix up all the necessary state.

... right?

> Looking at your example:
> 
> > | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> > | {
> > |         unsigned long far = read_sysreg(far_el1);
> > |         irqentry_state_t state;
> > |
> > |         state = enter_from_kernel_mode(regs);
> > |         local_daif_inherit(regs);
> > |         do_mem_abort(far, esr, regs);
> > |         local_daif_mask();
> > |         exit_to_kernel_mode(regs, state);
> 
> and the paragraph right below that:
> 
> > Currently, the generic irq entry code will attempt to preempt from any
> > exception under irqentry_exit() where interrupts were unmasked in the
> > original context. As arm64's entry code will have already masked
> > exceptions via DAIF, this results in the problems described above.
> 
> To me this looks like your ordering is wrong. Why are you doing the DAIF
> inherit _after_ irqentry_enter() and the mask _before_ irqentry_exit()?

As above, I can go look at reworking this. 

For context, we do it this way today for several reasons, including:

(1) Because some of the arch-specific bits (such as checking the TFSR
    for MTE) in enter_from_kernel_mode() and exit_to_kernel_mode() need
    to be done while RCU is watching, etc, but needs other exceptions
    masked. I can look at reworking that.

(2) To minimize the number of times we have to write to things like
    DAIF, as that can be expensive.

(3) To simplify the management of things like DAIF, so that we don't
    have several points in time at which we need to inherit different
    pieces of state.

(4) Historical, as that's the flow we had in assembly, and prior to the
    move to generic irq entry.

> I might be missing something, but this smells more than fishy.
> 
> As no other architecture has that problem I'm pretty sure that the
> problem is not in the way how the generic code was designed. Why?

Hey, I'm not saying the generic entry code is wrong, just that there's a
mismatch between it and what would be optimal for arm64.

> Because your architecture is _not_ sooo special! :)

I think it's pretty special, but not necessarily in the same sense. ;)

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 15:37     ` Mark Rutland
@ 2026-03-20 16:26       ` Thomas Gleixner
  2026-03-20 17:31         ` Mark Rutland
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-20 16:26 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Fri, Mar 20 2026 at 15:37, Mark Rutland wrote:
> On Fri, Mar 20, 2026 at 03:59:40PM +0100, Thomas Gleixner wrote:
>> Why are you routing NMI like exceptions through irqentry_enter() and
>> irqentry_exit() in the first place? That's just wrong.
>
> Sorry, the above was not clear, and some of this logic is gunk that has
> been carried over unnnecessarily from our old exception handling flow.
>
> The issue with pseudo-NMI is that it uses the same exception as regular
> interrupts, but we don't know whether we have a pseudo-NMI until we
> acknowledge the event at the irqchip level. When a pseudo-NMI is taken,
> there are two possibilities:
>
> (1) The pseudo-NMI is taken from a context where interrupts were
>     *disabled*. The entry code immediately knows it must be a
>     pseudo-NMI, and we call irqentry_nmi_{enter,exit}(), NOT
>     irqentry_{enter,exit}(), treating it as an NMI.
>
>
> (2) The pseudo-NMI was taken from a context where interrupts were
>     *enabled*. The entry code doesn't know whether it's a pseudo-NMI or
>     a regular interrupt, so it calls irqentry_{enter,exit}(), and then
>     within that we'll call nmi_{enter,exit}() to transiently enter NMI
>     context.
>
> I realise this is crazy. I would love to delete pseudo-NMI.
> Unfortunately people are using it.

What is it used for?

> Putting aside the nesting here, I think it's fine to preempt upon return
> from case (2), and we can delete the logic to avoid preempting.

Correct.

>> 
>> That means at the point where irqentry_entry() is invoked, the
>> architecture side should have made sure that everything is set up for
>> the kernel to operate until irqentry_exit() returns.
>
> Ok. I think you're saying I should try:
>
> * At entry, *before* irqentry_enter():
>   - unmask everything EXCEPT regular interrupts.
>   - fix up all the necessary state.
>
> * At exception exit, *after* irqentry_exit():
>   - mask everything.
>   - fix up all the necessary state.
>
> ... right?

Yes.

>> Looking at your example:
>> 
>> > | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
>> > | {
>> > |         unsigned long far = read_sysreg(far_el1);
>> > |         irqentry_state_t state;
>> > |
>> > |         state = enter_from_kernel_mode(regs);
>> > |         local_daif_inherit(regs);
>> > |         do_mem_abort(far, esr, regs);
>> > |         local_daif_mask();
>> > |         exit_to_kernel_mode(regs, state);
>> 
>> and the paragraph right below that:
>> 
>> > Currently, the generic irq entry code will attempt to preempt from any
>> > exception under irqentry_exit() where interrupts were unmasked in the
>> > original context. As arm64's entry code will have already masked
>> > exceptions via DAIF, this results in the problems described above.
>> 
>> To me this looks like your ordering is wrong. Why are you doing the DAIF
>> inherit _after_ irqentry_enter() and the mask _before_ irqentry_exit()?
>
> As above, I can go look at reworking this. 
>
> For context, we do it this way today for several reasons, including:
>
> (1) Because some of the arch-specific bits (such as checking the TFSR
>     for MTE) in enter_from_kernel_mode() and exit_to_kernel_mode() need
>     to be done while RCU is watching, etc, but needs other exceptions
>     masked. I can look at reworking that.

Something like the below should do that for you. If you need more than
regs, then you can either stick something on your stack frame or we go
and extend irqentry_enter()/exit() with an additional argument which
allows you to hand some exception/interrupt specific cookie in. The core
code would just hand it through to arch_irqenter_enter/exit_rcu() along
with @regs. That cookie might be data or even a function pointer. The
core does not have to know about it.

> (2) To minimize the number of times we have to write to things like
>     DAIF, as that can be expensive.
>
> (3) To simplify the management of things like DAIF, so that we don't
>     have several points in time at which we need to inherit different
>     pieces of state.

The above should cover your #2/3 too, no?

> (4) Historical, as that's the flow we had in assembly, and prior to the
>     move to generic irq entry.

No comment :)

>> I might be missing something, but this smells more than fishy.
>> 
>> As no other architecture has that problem I'm pretty sure that the
>> problem is not in the way how the generic code was designed. Why?
>
> Hey, I'm not saying the generic entry code is wrong, just that there's a
> mismatch between it and what would be optimal for arm64.
>
>> Because your architecture is _not_ sooo special! :)
>
> I think it's pretty special, but not necessarily in the same sense. ;)

Hehehe.

Thanks,

        tglx
---        
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -149,6 +149,7 @@ noinstr irqentry_state_t irqentry_enter(
 		instrumentation_begin();
 		kmsan_unpoison_entry_regs(regs);
 		trace_hardirqs_off_finish();
+		arch_irqentry_enter_rcu(regs);
 		instrumentation_end();
 
 		ret.exit_rcu = true;
@@ -166,6 +167,7 @@ noinstr irqentry_state_t irqentry_enter(
 	kmsan_unpoison_entry_regs(regs);
 	rcu_irq_enter_check_tick();
 	trace_hardirqs_off_finish();
+	arch_irqentry_enter_rcu(regs);
 	instrumentation_end();
 
 	return ret;
@@ -225,6 +227,7 @@ noinstr void irqentry_exit(struct pt_reg
 		 */
 		if (state.exit_rcu) {
 			instrumentation_begin();
+			arch_irqentry_exit_rcu(regs);
 			hrtimer_rearm_deferred();
 			/* Tell the tracer that IRET will enable interrupts */
 			trace_hardirqs_on_prepare();
@@ -239,11 +242,13 @@ noinstr void irqentry_exit(struct pt_reg
 		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
 
+		arch_irqentry_exit_rcu(regs);
 		hrtimer_rearm_deferred();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
 	} else {
+		arch_irqentry_exit_rcu(regs);
 		/*
 		 * IRQ flags state is correct already. Just tell RCU if it
 		 * was not watching on entry.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 16:26       ` Thomas Gleixner
@ 2026-03-20 17:31         ` Mark Rutland
  2026-03-21 23:25           ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2026-03-20 17:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Fri, Mar 20, 2026 at 05:26:47PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 15:37, Mark Rutland wrote:
> > On Fri, Mar 20, 2026 at 03:59:40PM +0100, Thomas Gleixner wrote:
> >> Why are you routing NMI like exceptions through irqentry_enter() and
> >> irqentry_exit() in the first place? That's just wrong.
> >
> > Sorry, the above was not clear, and some of this logic is gunk that has
> > been carried over unnnecessarily from our old exception handling flow.
> >
> > The issue with pseudo-NMI is that it uses the same exception as regular
> > interrupts, but we don't know whether we have a pseudo-NMI until we
> > acknowledge the event at the irqchip level. When a pseudo-NMI is taken,
> > there are two possibilities:
> >
> > (1) The pseudo-NMI is taken from a context where interrupts were
> >     *disabled*. The entry code immediately knows it must be a
> >     pseudo-NMI, and we call irqentry_nmi_{enter,exit}(), NOT
> >     irqentry_{enter,exit}(), treating it as an NMI.
> >
> >
> > (2) The pseudo-NMI was taken from a context where interrupts were
> >     *enabled*. The entry code doesn't know whether it's a pseudo-NMI or
> >     a regular interrupt, so it calls irqentry_{enter,exit}(), and then
> >     within that we'll call nmi_{enter,exit}() to transiently enter NMI
> >     context.
> >
> > I realise this is crazy. I would love to delete pseudo-NMI.
> > Unfortunately people are using it.
> 
> What is it used for?

It's used where people would want an NMI; specifically today that's:

* The PMU interrupt (so people can profile code that has IRQs off).
* IPI_CPU_BACKTRACE (so we can get a backtrace when code has IRQs off).
* IPI_CPU_STOP_NMI (so panic can stop secondaries more reliably).
* IPI_KGDB_ROUNDUP (so KGDB can stop secondaries more reliably).

I mostly care about the first two. People *really* want the PMU interrupt as an
NMI.

> > Putting aside the nesting here, I think it's fine to preempt upon return
> > from case (2), and we can delete the logic to avoid preempting.
> 
> Correct.

Cool; thanks for confirming!

> >> That means at the point where irqentry_entry() is invoked, the
> >> architecture side should have made sure that everything is set up for
> >> the kernel to operate until irqentry_exit() returns.
> >
> > Ok. I think you're saying I should try:
> >
> > * At entry, *before* irqentry_enter():
> >   - unmask everything EXCEPT regular interrupts.
> >   - fix up all the necessary state.
> >
> > * At exception exit, *after* irqentry_exit():
> >   - mask everything.
> >   - fix up all the necessary state.
> >
> > ... right?
> 
> Yes.

Ok; I'll go experiment...

My major concern is that this is liable to make the arm64 entry
sequences substantially more expensive and complicated (see notes
below), but I should go see how bad that is in practice.

My other concern is that I'd like to backport a fix for the issue I
mentioned in the commit message, and I'd like to have something that's
simpler than "rewrite the entire entry code" for that -- for backporting
it'd be easier to revert the move to generic irq entry.

> >> Looking at your example:
> >> 
> >> > | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> >> > | {
> >> > |         unsigned long far = read_sysreg(far_el1);
> >> > |         irqentry_state_t state;
> >> > |
> >> > |         state = enter_from_kernel_mode(regs);
> >> > |         local_daif_inherit(regs);
> >> > |         do_mem_abort(far, esr, regs);
> >> > |         local_daif_mask();
> >> > |         exit_to_kernel_mode(regs, state);
> >> 
> >> and the paragraph right below that:
> >> 
> >> > Currently, the generic irq entry code will attempt to preempt from any
> >> > exception under irqentry_exit() where interrupts were unmasked in the
> >> > original context. As arm64's entry code will have already masked
> >> > exceptions via DAIF, this results in the problems described above.
> >> 
> >> To me this looks like your ordering is wrong. Why are you doing the DAIF
> >> inherit _after_ irqentry_enter() and the mask _before_ irqentry_exit()?
> >
> > As above, I can go look at reworking this. 
> >
> > For context, we do it this way today for several reasons, including:
> >
> > (1) Because some of the arch-specific bits (such as checking the TFSR
> >     for MTE) in enter_from_kernel_mode() and exit_to_kernel_mode() need
> >     to be done while RCU is watching, etc, but needs other exceptions
> >     masked. I can look at reworking that.
> 
> Something like the below should do that for you. If you need more than
> regs, then you can either stick something on your stack frame or we go
> and extend irqentry_enter()/exit() with an additional argument which
> allows you to hand some exception/interrupt specific cookie in. The core
> code would just hand it through to arch_irqenter_enter/exit_rcu() along
> with @regs. That cookie might be data or even a function pointer. The
> core does not have to know about it.

I don't think that helps for exit, because the contradiction is "while
RCU is watching, etc, but needs other exceptions masked", and as above,
we can't have that, because (with the flow you've suggested)  exceptions
aren't masked until after irqentry_exit().

Let me go think a bit harder about that. The exit path for TFSR is
already somewhat best-effort. Maybe the right thing to do is push that
entirely out of the way and re-enter when it indicates a problem.

> > (2) To minimize the number of times we have to write to things like
> >     DAIF, as that can be expensive.
> >
> > (3) To simplify the management of things like DAIF, so that we don't
> >     have several points in time at which we need to inherit different
> >     pieces of state.
> 
> The above should cover your #2/3 too, no?

Not really, but we might be talking past one another.

I *think* you're saying that because the arch code would manage DAIF
early during entry and late during exit, that would all be in one place.

However, that doubles the number of times we have to write to DAIF: at
entry we'd have to poke it once to unmask everything except IRQs, then
again to unmask IRQs, and exit would need the inverse. We'd also have to
split the inheritance logic into inherit-everything-but-interrupt and
inherit-only-interrupt, which is doable but not ideal. With pseudo-NMI
it's even worse, but that's largely because pseudo-NMI is
over-complicated today.

As above, I'll go experiment and see how bad this is in practice.

> > (4) Historical, as that's the flow we had in assembly, and prior to the
> >     move to generic irq entry.
> 
> No comment :)

:)

Mark.

> >> I might be missing something, but this smells more than fishy.
> >> 
> >> As no other architecture has that problem I'm pretty sure that the
> >> problem is not in the way how the generic code was designed. Why?
> >
> > Hey, I'm not saying the generic entry code is wrong, just that there's a
> > mismatch between it and what would be optimal for arm64.
> >
> >> Because your architecture is _not_ sooo special! :)
> >
> > I think it's pretty special, but not necessarily in the same sense. ;)
> 
> Hehehe.
> 
> Thanks,
> 
>         tglx
> ---        
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -149,6 +149,7 @@ noinstr irqentry_state_t irqentry_enter(
>  		instrumentation_begin();
>  		kmsan_unpoison_entry_regs(regs);
>  		trace_hardirqs_off_finish();
> +		arch_irqentry_enter_rcu(regs);
>  		instrumentation_end();
>  
>  		ret.exit_rcu = true;
> @@ -166,6 +167,7 @@ noinstr irqentry_state_t irqentry_enter(
>  	kmsan_unpoison_entry_regs(regs);
>  	rcu_irq_enter_check_tick();
>  	trace_hardirqs_off_finish();
> +	arch_irqentry_enter_rcu(regs);
>  	instrumentation_end();
>  
>  	return ret;
> @@ -225,6 +227,7 @@ noinstr void irqentry_exit(struct pt_reg
>  		 */
>  		if (state.exit_rcu) {
>  			instrumentation_begin();
> +			arch_irqentry_exit_rcu(regs);
>  			hrtimer_rearm_deferred();
>  			/* Tell the tracer that IRET will enable interrupts */
>  			trace_hardirqs_on_prepare();
> @@ -239,11 +242,13 @@ noinstr void irqentry_exit(struct pt_reg
>  		if (IS_ENABLED(CONFIG_PREEMPTION))
>  			irqentry_exit_cond_resched();
>  
> +		arch_irqentry_exit_rcu(regs);
>  		hrtimer_rearm_deferred();
>  		/* Covers both tracing and lockdep */
>  		trace_hardirqs_on();
>  		instrumentation_end();
>  	} else {
> +		arch_irqentry_exit_rcu(regs);
>  		/*
>  		 * IRQ flags state is correct already. Just tell RCU if it
>  		 * was not watching on entry.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 17:31         ` Mark Rutland
@ 2026-03-21 23:25           ` Thomas Gleixner
  2026-03-24 12:19             ` Thomas Gleixner
  2026-03-25 11:03             ` Mark Rutland
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-21 23:25 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Fri, Mar 20 2026 at 17:31, Mark Rutland wrote:
> On Fri, Mar 20, 2026 at 05:26:47PM +0100, Thomas Gleixner wrote:
>> > I realise this is crazy. I would love to delete pseudo-NMI.
>> > Unfortunately people are using it.
>> 
>> What is it used for?
>
> It's used where people would want an NMI; specifically today that's:
>
> * The PMU interrupt (so people can profile code that has IRQs off).
> * IPI_CPU_BACKTRACE (so we can get a backtrace when code has IRQs off).
> * IPI_CPU_STOP_NMI (so panic can stop secondaries more reliably).
> * IPI_KGDB_ROUNDUP (so KGDB can stop secondaries more reliably).
>
> I mostly care about the first two. People *really* want the PMU interrupt as an
> NMI.

Which makes actually sense.

> My other concern is that I'd like to backport a fix for the issue I
> mentioned in the commit message, and I'd like to have something that's
> simpler than "rewrite the entire entry code" for that -- for backporting
> it'd be easier to revert the move to generic irq entry.

I understand.

> Not really, but we might be talking past one another.
>
> I *think* you're saying that because the arch code would manage DAIF
> early during entry and late during exit, that would all be in one place.

That was my thought, see below.

> However, that doubles the number of times we have to write to DAIF: at
> entry we'd have to poke it once to unmask everything except IRQs, then
> again to unmask IRQs, and exit would need the inverse. We'd also have to
> split the inheritance logic into inherit-everything-but-interrupt and
> inherit-only-interrupt, which is doable but not ideal. With pseudo-NMI
> it's even worse, but that's largely because pseudo-NMI is
> over-complicated today.

Interrupts are not unmasked on interrupt/exception entry ever and I
don't understand your concerns at all, but as always I might be missing
something.

The current sequence on entry is:

  // interrupts are disabled by interrupt/exception entry
  enter_from_kernel_mode()
     irqentry_enter(regs);
     mte_check_tfsr_entry();
     mte_disable_tco_entry();
     daif_inherit(regs);
     // interrupts are still disabled

which then becomes:

  // interrupts are disabled by interrupt/exception entry
  irqentry_enter(regs)
     establish_state();
     // RCU is watching
     arch_irqentry_enter_rcu()
        mte_check_tfsr_entry();
        mte_disable_tco_entry();
        daif_inherit(regs);
     // interrupts are still disabled
          
Which is equivalent versus the MTE/DAIF requirements, no?

The current broken sequence vs. preemption on exit is:

  // interrupts are disabled
  exit_to_kernel_mode
     daif_mask();
     mte_check_tfsr_exit();
     irqentry_exit(regs, state);

which then becomes:

  // interrupts are disabled
  irqentry_exit(regs, state)
     // includes preemption
     prepare_for_exit();

     // RCU is still watching
     arch_irqentry_ecit_rcu()
        daif_mask();
        mte_check_tfsr_exit();

     if (state.exit_rcu)
        ct_irq_exit();

Which is equivalent versus the MTE/DAIF requirements and fixes the
preempt on exit issue too, no?

That change would be trivial enough for backporting, right?

It also prevents you from staring at the bug reports which are going to
end up in your mailbox after I merged the patch which moves the
misplaced rcu_irq_exit_check_preempt() check _before_ the
preempt_count() check where it belongs.

I fully agree that ARM64 is special vs. CPU state handling, but it's not
special enough to justify it's own semantically broken preemption logic.

Looking at those details made me also look at this magic
arch_irqentry_exit_need_resched() inline function. 

	/*
	 * DAIF.DA are cleared at the start of IRQ/FIQ handling, and when GIC
	 * priority masking is used the GIC irqchip driver will clear DAIF.IF
	 * using gic_arch_enable_irqs() for normal IRQs. If anything is set in
	 * DAIF we must have handled an NMI, so skip preemption.
	 */
	if (system_uses_irq_prio_masking() && read_sysreg(daif))
		return false;

Why is this using irqentry_enter/exit() in the first place? 

NMI delivery has to go through irqentry_nmi_enter/exit() as I explained
to you before. This hack is fundamentally wrong:

 1) It fails to indicate NMI state in preempt_count and other facilities

    There is code which cares about this state. It's truly amazing that
    it did not explode in your face yet.

 2) It uses a code path which is not NMI safe by definition

    I did not go through all the gory details there, but the lack of
    explosions might be sheer luck because most of the related code is
    written in a way that it can be used in both context.

    But the pending for v7.1 hrtimer changes are going to actually
    expose this ARM64 NMI bogosity. Assume the following scenario:

    timer interrupt
       hrtimer_interrupt()
          raw_spin_lock(&cpu_base->lock);
          ...
          cpu_base->deferred_rearm = true;
    ---> NMI
      irqentry_enter();
      handle();
      irqentry_exit()
        ...
        hrtimer_rearm_deferred()
          if (!cpu_base->deferred_rearm)
             return;
          raw_spin_lock(&cpu_base->lock)
     --->    LIVELOCK

    That's code which is not upstream yet, but it was written under the
    perfectly valid assumption that architectures actually adhere to the
    mandatory interrupt/NMI destinction, which ARM64 clearly does not.

You might argue that this assumption is wrong vs. other irqentry_enter()
bound exceptions which can be raised and handled even if interrupts are
disabled, i.e. page faults, divide by zero etc.

It's not wrong at all because any of these exceptions in the context of
a interrupt handler will cause a kernel panic.

Actually thinking about that the rearm code needs to be hardened against
this by having:

         if (WARN_ON_ONCE(in_hardirq())
         	return;
Thanks,

        tglx







^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-21 23:25           ` Thomas Gleixner
@ 2026-03-24 12:19             ` Thomas Gleixner
  2026-03-25 11:03             ` Mark Rutland
  1 sibling, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-24 12:19 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Sun, Mar 22 2026 at 00:25, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 17:31, Mark Rutland wrote:
> Looking at those details made me also look at this magic
> arch_irqentry_exit_need_resched() inline function. 
>
> 	/*
> 	 * DAIF.DA are cleared at the start of IRQ/FIQ handling, and when GIC
> 	 * priority masking is used the GIC irqchip driver will clear DAIF.IF
> 	 * using gic_arch_enable_irqs() for normal IRQs. If anything is set in
> 	 * DAIF we must have handled an NMI, so skip preemption.
> 	 */
> 	if (system_uses_irq_prio_masking() && read_sysreg(daif))
> 		return false;
>
> Why is this using irqentry_enter/exit() in the first place? 

Ah. The entry point does

          if (regs_irqs_disabled(regs))
          	do_nmi();
          else
                do_irq();

So you end up in do_irq() and eventually in the preemption path and need
that check to prevent scheduling. So that should be fine and obviously
won't hit the code path I outlined.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-21 23:25           ` Thomas Gleixner
  2026-03-24 12:19             ` Thomas Gleixner
@ 2026-03-25 11:03             ` Mark Rutland
  2026-03-25 15:46               ` Thomas Gleixner
  2026-03-26  8:52               ` Jinjie Ruan
  1 sibling, 2 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-25 11:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Sun, Mar 22, 2026 at 12:25:06AM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 17:31, Mark Rutland wrote:
> > On Fri, Mar 20, 2026 at 05:26:47PM +0100, Thomas Gleixner wrote:
> > I *think* you're saying that because the arch code would manage DAIF
> > early during entry and late during exit, that would all be in one place.
> 
> That was my thought, see below.
> 
> > However, that doubles the number of times we have to write to DAIF: at
> > entry we'd have to poke it once to unmask everything except IRQs, then
> > again to unmask IRQs, and exit would need the inverse. We'd also have to
> > split the inheritance logic into inherit-everything-but-interrupt and
> > inherit-only-interrupt, which is doable but not ideal. With pseudo-NMI
> > it's even worse, but that's largely because pseudo-NMI is
> > over-complicated today.
> 
> Interrupts are not unmasked on interrupt/exception entry ever and I
> don't understand your concerns at all, but as always I might be missing
> something.

I think one problem is that we're using the same words to describe
distinct things, because the terminology is overloaded; I've tried to
clarify some of that below.

I think another is that there are a large number of interacting
constraints, and it's easily possible to find something that for most
but not all of those. I *think* there's an approach that satisfies all
of our requirements; see below where I say "I *think* what would work
for us ...".

For context, when I said "at entry" and "at exit" I was including
*everything* we have to do before/after the "real" logic to handle an
exception, including parts that surround the generic entry code. I'm
happy to use a different term for those periods, but I can't immediately
think of something better.

For example, for faults handled by arm64's el1_abort(), I was
characterising this as:

    /* -------- "entry" begins here -------- */

    [[ entry asm ]]
    [[ early triage, branch to el1_abort() ]]

    static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr) 
    {
            unsigned long far = read_sysreg(far_el1);
            irqentry_state_t state;

            state = enter_from_kernel_mode(regs) {
                    ...
		    irqentry_enter(regs);
                    ...
	    }
            local_daif_inherit(regs); // <----------- might unmask interrupts
    /* -------- "entry" ends here ---------- */

    /* -------- "real logic" begins here --- */
            do_mem_abort(far, esr, regs);
    /* -------- "real logic" ends here ----- */

    /* -------- "exit" begins here --------- */
            local_daif_mask(); // <------------------------- masks interrupts
            exit_to_kernel_mode(regs, state) {
	            ...
		    irqentry_exit(regs);
	            ...
	    }
    }

    [[ return from el1_abort() ]]
    [[ exit asm ]]

    /* -------- "exit" ends here ----------- */

Please note, el1_abort() is indicative of what arm64 does for
(most) synchronous exceptions, but there are slight differences for
other exceptions. For anything maskable (including interupts) we DO NOT
use local_daif_inherit() and instead unmask specific higher-priority
maskable exceptions via other functions that write to DAIF, etc.

Interrupts are an odd middle ground where we use irqentry_{enter,exit}()
but do not use local_daif_inherit().

> The current sequence on entry is:
> 
>   // interrupts are disabled by interrupt/exception entry
>   enter_from_kernel_mode()
>      irqentry_enter(regs);
>      mte_check_tfsr_entry();
>      mte_disable_tco_entry();
>      daif_inherit(regs);
>      // interrupts are still disabled

That last comment isn't quite right: we CAN and WILL enable interrupts
in local_daif_inherit(), if and only if they were enabled in the context
the exception was taken from.

As mentioned above, when handling an interrupt (rather than a
synchronous exception), we don't use local_daif_inherit(), and instead
use a different DAIF function to unmask everything except interrupts.

> which then becomes:
> 
>   // interrupts are disabled by interrupt/exception entry
>   irqentry_enter(regs)
>      establish_state();
>      // RCU is watching
>      arch_irqentry_enter_rcu()
>         mte_check_tfsr_entry();
>         mte_disable_tco_entry();
>         daif_inherit(regs);
>      // interrupts are still disabled
>           
> Which is equivalent versus the MTE/DAIF requirements, no?

As above, we can't use local_daif_inherit() here because we want
different DAIF masking behavior for entry to interrupts and entry to
synchronous exceptions. While we could pass some token around to
determine the behaviour dynamically, that's less clear, more
complicated, and results in worse code being generated for something we
know at compile time.

If we can leave DAIF masked early on during irqentry_enter(), I don't
see why we can't leave all DAIF exceptions masked until the end of
irqentry_enter().

I *think* what would work for us is we could split some of the exit
handling (including involuntary preemption) into a "prepare" step, as we
have for return to userspace. That way, arm64 could handle exiting
something like:

	local_irq_disable();
	irqentry_exit_prepare(); // new, all generic logic
	local_daif_mask();
	arm64_exit_to_kernel_mode() {
		...
		irqentry_exit(); // ideally irqentry_exit_to_kernel_mode().
		...
	}

... and other architectures can use a combined exit_to_kernel_mode() (or
whatever we call that), which does both, e.g.

	// either noinstr, __always_inline, or a macro
	void irqentry_prepare_and_exit(void)
	{
		irqentry_exit_prepare();
		irqentry_exit();
	}

That way:

* There's a clear separation between the "prepare" and subsequent exit
  steps, which minimizes the risk that a change subtly breaks arm64's
  exception masking.

* This would mirror the userspace return path, and so would be more
  consistent over all.

* All of arm64's arch-specific exception masking can live in
  arch/arm64/kernel/entry-common.c, explicit in the straight line code
  rather than partially hidden behind arch_*() callbacks.

* There's no unnecessary cost to other architectures.

* There's no/minimal maintenance cost for the generic code. There are no
  arch_*() callbacks, and we'd have to enforce ordering between the
  prepare/exit steps anyhow...

If you don't see an obvious problem with that, I will go and try that
now.

> The current broken sequence vs. preemption on exit is:
> 
>   // interrupts are disabled
>   exit_to_kernel_mode
>      daif_mask();
>      mte_check_tfsr_exit();
>      irqentry_exit(regs, state);
> 
> which then becomes:
> 
>   // interrupts are disabled
>   irqentry_exit(regs, state)
>      // includes preemption
>      prepare_for_exit();
> 
>      // RCU is still watching
>      arch_irqentry_ecit_rcu()
>         daif_mask();
>         mte_check_tfsr_exit();
> 
>      if (state.exit_rcu)
>         ct_irq_exit();

As above, I'd strongly prefer if we could pull the "prepare" step out of
irqentry_exit(). Especially since for the entry path we can't push the
DAIF masking into irqentry_enter(), and I'd very strongly prefer that
the masking and unmasking occur in the same logical place, rather than
having one of those hidden behind an arch_() callback.

> Which is equivalent versus the MTE/DAIF requirements and fixes the
> preempt on exit issue too, no?
> 
> That change would be trivial enough for backporting, right?
> 
> It also prevents you from staring at the bug reports which are going to
> end up in your mailbox after I merged the patch which moves the
> misplaced rcu_irq_exit_check_preempt() check _before_ the
> preempt_count() check where it belongs.

I intend to fix that issue, so hopefully I'm not staring at those for
long.

Just to check, do you mean that you've already queued that (I didn't
spot it in tip), or that you intend to? I'll happily test/review/ack a
patch adding that, but hopefully we can fix arm64 first.

> I fully agree that ARM64 is special vs. CPU state handling, but it's not
> special enough to justify it's own semantically broken preemption logic.

Sure. To be clear, I'm not arguing for broken preemption logic. I'd
asked those initial two questions because I suspected this approach
wasn't quite right.

As above, I think we can solve this in an actually generic way by
splitting out a "prepare to exit" step, and still keep the bulk of the
logic generic.

> Looking at those details made me also look at this magic
> arch_irqentry_exit_need_resched() inline function. 

I see per your other reply that you figured out this part was ok:

  https://lore.kernel.org/linux-arm-kernel/87se9ph129.ffs@tglx/

... though I agree we can clean that up further.

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-25 11:03             ` Mark Rutland
@ 2026-03-25 15:46               ` Thomas Gleixner
  2026-03-26  8:56                 ` Jinjie Ruan
  2026-03-26 18:11                 ` Mark Rutland
  2026-03-26  8:52               ` Jinjie Ruan
  1 sibling, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-25 15:46 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Wed, Mar 25 2026 at 11:03, Mark Rutland wrote:
> On Sun, Mar 22, 2026 at 12:25:06AM +0100, Thomas Gleixner wrote:
>> The current sequence on entry is:
>> 
>>   // interrupts are disabled by interrupt/exception entry
>>   enter_from_kernel_mode()
>>      irqentry_enter(regs);
>>      mte_check_tfsr_entry();
>>      mte_disable_tco_entry();
>>      daif_inherit(regs);
>>      // interrupts are still disabled
>
> That last comment isn't quite right: we CAN and WILL enable interrupts
> in local_daif_inherit(), if and only if they were enabled in the context
> the exception was taken from.

Ok.

> As mentioned above, when handling an interrupt (rather than a
> synchronous exception), we don't use local_daif_inherit(), and instead
> use a different DAIF function to unmask everything except interrupts.
>
>> which then becomes:
>> 
>>   // interrupts are disabled by interrupt/exception entry
>>   irqentry_enter(regs)
>>      establish_state();
>>      // RCU is watching
>>      arch_irqentry_enter_rcu()
>>         mte_check_tfsr_entry();
>>         mte_disable_tco_entry();
>>         daif_inherit(regs);
>>      // interrupts are still disabled
>>           
>> Which is equivalent versus the MTE/DAIF requirements, no?
>
> As above, we can't use local_daif_inherit() here because we want
> different DAIF masking behavior for entry to interrupts and entry to
> synchronous exceptions. While we could pass some token around to
> determine the behaviour dynamically, that's less clear, more
> complicated, and results in worse code being generated for something we
> know at compile time.

I get it. Duh what a maze.

> If we can leave DAIF masked early on during irqentry_enter(), I don't
> see why we can't leave all DAIF exceptions masked until the end of
> irqentry_enter().

Yes. Entry is not an issue.

> I *think* what would work for us is we could split some of the exit
> handling (including involuntary preemption) into a "prepare" step, as we
> have for return to userspace. That way, arm64 could handle exiting
> something like:
>
> 	local_irq_disable();
> 	irqentry_exit_prepare(); // new, all generic logic
> 	local_daif_mask();
> 	arm64_exit_to_kernel_mode() {
> 		...
> 		irqentry_exit(); // ideally irqentry_exit_to_kernel_mode().
> 		...
> 	}
>
> ... and other architectures can use a combined exit_to_kernel_mode() (or
> whatever we call that), which does both, e.g.
>
> 	// either noinstr, __always_inline, or a macro
> 	void irqentry_prepare_and_exit(void)

That's a bad idea as that would require to do a full kernel rename of
all existing irqentry_exit() users.

> 	{
> 		irqentry_exit_prepare();
> 		irqentry_exit();
> 	}

Aside of the naming that should work.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-25 15:46               ` Thomas Gleixner
@ 2026-03-26  8:56                 ` Jinjie Ruan
  2026-03-26 18:11                 ` Mark Rutland
  1 sibling, 0 replies; 25+ messages in thread
From: Jinjie Ruan @ 2026-03-26  8:56 UTC (permalink / raw)
  To: Thomas Gleixner, Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, vladimir.murzin, will



On 2026/3/25 23:46, Thomas Gleixner wrote:
> On Wed, Mar 25 2026 at 11:03, Mark Rutland wrote:
>> On Sun, Mar 22, 2026 at 12:25:06AM +0100, Thomas Gleixner wrote:
>>> The current sequence on entry is:
>>>
>>>   // interrupts are disabled by interrupt/exception entry
>>>   enter_from_kernel_mode()
>>>      irqentry_enter(regs);
>>>      mte_check_tfsr_entry();
>>>      mte_disable_tco_entry();
>>>      daif_inherit(regs);
>>>      // interrupts are still disabled
>>
>> That last comment isn't quite right: we CAN and WILL enable interrupts
>> in local_daif_inherit(), if and only if they were enabled in the context
>> the exception was taken from.
> 
> Ok.
> 
>> As mentioned above, when handling an interrupt (rather than a
>> synchronous exception), we don't use local_daif_inherit(), and instead
>> use a different DAIF function to unmask everything except interrupts.
>>
>>> which then becomes:
>>>
>>>   // interrupts are disabled by interrupt/exception entry
>>>   irqentry_enter(regs)
>>>      establish_state();
>>>      // RCU is watching
>>>      arch_irqentry_enter_rcu()
>>>         mte_check_tfsr_entry();
>>>         mte_disable_tco_entry();
>>>         daif_inherit(regs);
>>>      // interrupts are still disabled
>>>           
>>> Which is equivalent versus the MTE/DAIF requirements, no?
>>
>> As above, we can't use local_daif_inherit() here because we want
>> different DAIF masking behavior for entry to interrupts and entry to
>> synchronous exceptions. While we could pass some token around to
>> determine the behaviour dynamically, that's less clear, more
>> complicated, and results in worse code being generated for something we
>> know at compile time.
> 
> I get it. Duh what a maze.
> 
>> If we can leave DAIF masked early on during irqentry_enter(), I don't
>> see why we can't leave all DAIF exceptions masked until the end of
>> irqentry_enter().
> 
> Yes. Entry is not an issue.
> 
>> I *think* what would work for us is we could split some of the exit
>> handling (including involuntary preemption) into a "prepare" step, as we
>> have for return to userspace. That way, arm64 could handle exiting
>> something like:
>>
>> 	local_irq_disable();
>> 	irqentry_exit_prepare(); // new, all generic logic
>> 	local_daif_mask();
>> 	arm64_exit_to_kernel_mode() {
>> 		...
>> 		irqentry_exit(); // ideally irqentry_exit_to_kernel_mode().
>> 		...
>> 	}
>>
>> ... and other architectures can use a combined exit_to_kernel_mode() (or
>> whatever we call that), which does both, e.g.
>>
>> 	// either noinstr, __always_inline, or a macro
>> 	void irqentry_prepare_and_exit(void)
> 
> That's a bad idea as that would require to do a full kernel rename of
> all existing irqentry_exit() users.

I see your point about the rename. However, we can avoid a tree-wide
rename by keeping the irqentry_exit() name and interface exactly as it is.
The idea is to perform an internal refactoring: split the existing logic
into two helpers (e.g., irqentry_exit_prepare() and a core helper), and
then have the original irqentry_exit() call both of them. This way,
existing users like RISC-V remain untouched, while arm64 can choose to
call the two sub-functions individually to insert the DAIF masking in
between."

> 
>> 	{
>> 		irqentry_exit_prepare();
>> 		irqentry_exit();
>> 	}
> 
> Aside of the naming that should work.
> 
> Thanks,
> 
>         tglx
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-25 15:46               ` Thomas Gleixner
  2026-03-26  8:56                 ` Jinjie Ruan
@ 2026-03-26 18:11                 ` Mark Rutland
  2026-03-26 18:32                   ` Thomas Gleixner
  2026-03-27  1:27                   ` Jinjie Ruan
  1 sibling, 2 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-26 18:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

On Wed, Mar 25, 2026 at 04:46:01PM +0100, Thomas Gleixner wrote:
> On Wed, Mar 25 2026 at 11:03, Mark Rutland wrote:
> > On Sun, Mar 22, 2026 at 12:25:06AM +0100, Thomas Gleixner wrote:
> > I *think* what would work for us is we could split some of the exit
> > handling (including involuntary preemption) into a "prepare" step, as we
> > have for return to userspace. That way, arm64 could handle exiting
> > something like:
> >
> > 	local_irq_disable();
> > 	irqentry_exit_prepare(); // new, all generic logic
> > 	local_daif_mask();
> > 	arm64_exit_to_kernel_mode() {
> > 		...
> > 		irqentry_exit(); // ideally irqentry_exit_to_kernel_mode().
> > 		...
> > 	}
> >
> > ... and other architectures can use a combined exit_to_kernel_mode() (or
> > whatever we call that), which does both, e.g.
> >
> > 	// either noinstr, __always_inline, or a macro
> > 	void irqentry_prepare_and_exit(void)
> 
> That's a bad idea as that would require to do a full kernel rename of
> all existing irqentry_exit() users.
> 
> > 	{
> > 		irqentry_exit_prepare();
> > 		irqentry_exit();
> > 	}
> 
> Aside of the naming that should work.

Thanks for confirming!

I've pushed a (very early, WIP) draft to

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/rework

... which is missing commit messages, comments, etc, but seems to work.

I'll see about getting that tested, cleaned up, and on-list.

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-26 18:11                 ` Mark Rutland
@ 2026-03-26 18:32                   ` Thomas Gleixner
  2026-03-27  1:27                   ` Jinjie Ruan
  1 sibling, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-03-26 18:32 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, ruanjinjie, vladimir.murzin, will

Mark!

On Thu, Mar 26 2026 at 18:11, Mark Rutland wrote:
>
> I've pushed a (very early, WIP) draft to
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/rework
>
> ... which is missing commit messages, comments, etc, but seems to work.
>
> I'll see about getting that tested, cleaned up, and on-list.

I've pulled it and looked at the v7.0-rc3.. diff.

That looks really good! Thanks for taking care of that.

As a related side note. Seeing that even more of the generic entry code
is moving to include/linux/, I'm pondering to move the content back into
kernel/entry and include those "local" headers from the
include/linux/... so that everything is to find in one place, but that's
a purely cosmetic problem :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-26 18:11                 ` Mark Rutland
  2026-03-26 18:32                   ` Thomas Gleixner
@ 2026-03-27  1:27                   ` Jinjie Ruan
  1 sibling, 0 replies; 25+ messages in thread
From: Jinjie Ruan @ 2026-03-27  1:27 UTC (permalink / raw)
  To: Mark Rutland, Thomas Gleixner
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, vladimir.murzin, will



On 2026/3/27 2:11, Mark Rutland wrote:
> On Wed, Mar 25, 2026 at 04:46:01PM +0100, Thomas Gleixner wrote:
>> On Wed, Mar 25 2026 at 11:03, Mark Rutland wrote:
>>> On Sun, Mar 22, 2026 at 12:25:06AM +0100, Thomas Gleixner wrote:
>>> I *think* what would work for us is we could split some of the exit
>>> handling (including involuntary preemption) into a "prepare" step, as we
>>> have for return to userspace. That way, arm64 could handle exiting
>>> something like:
>>>
>>> 	local_irq_disable();
>>> 	irqentry_exit_prepare(); // new, all generic logic
>>> 	local_daif_mask();
>>> 	arm64_exit_to_kernel_mode() {
>>> 		...
>>> 		irqentry_exit(); // ideally irqentry_exit_to_kernel_mode().
>>> 		...
>>> 	}
>>>
>>> ... and other architectures can use a combined exit_to_kernel_mode() (or
>>> whatever we call that), which does both, e.g.
>>>
>>> 	// either noinstr, __always_inline, or a macro
>>> 	void irqentry_prepare_and_exit(void)
>>
>> That's a bad idea as that would require to do a full kernel rename of
>> all existing irqentry_exit() users.
>>
>>> 	{
>>> 		irqentry_exit_prepare();
>>> 		irqentry_exit();
>>> 	}
>>
>> Aside of the naming that should work.
> 
> Thanks for confirming!
> 
> I've pushed a (very early, WIP) draft to
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/rework

The patch also looks good to me. Looking forward to seeing this move
forward.

> 
> ... which is missing commit messages, comments, etc, but seems to work.
> 
> I'll see about getting that tested, cleaned up, and on-list.
> 
> Mark.
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-25 11:03             ` Mark Rutland
  2026-03-25 15:46               ` Thomas Gleixner
@ 2026-03-26  8:52               ` Jinjie Ruan
  1 sibling, 0 replies; 25+ messages in thread
From: Jinjie Ruan @ 2026-03-26  8:52 UTC (permalink / raw)
  To: Mark Rutland, Thomas Gleixner
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, vladimir.murzin, will



On 2026/3/25 19:03, Mark Rutland wrote:
> On Sun, Mar 22, 2026 at 12:25:06AM +0100, Thomas Gleixner wrote:
>> On Fri, Mar 20 2026 at 17:31, Mark Rutland wrote:
>>> On Fri, Mar 20, 2026 at 05:26:47PM +0100, Thomas Gleixner wrote:
>>> I *think* you're saying that because the arch code would manage DAIF
>>> early during entry and late during exit, that would all be in one place.
>>
>> That was my thought, see below.
>>
>>> However, that doubles the number of times we have to write to DAIF: at
>>> entry we'd have to poke it once to unmask everything except IRQs, then
>>> again to unmask IRQs, and exit would need the inverse. We'd also have to
>>> split the inheritance logic into inherit-everything-but-interrupt and
>>> inherit-only-interrupt, which is doable but not ideal. With pseudo-NMI
>>> it's even worse, but that's largely because pseudo-NMI is
>>> over-complicated today.
>>
>> Interrupts are not unmasked on interrupt/exception entry ever and I
>> don't understand your concerns at all, but as always I might be missing
>> something.
> 
> I think one problem is that we're using the same words to describe
> distinct things, because the terminology is overloaded; I've tried to
> clarify some of that below.
> 
> I think another is that there are a large number of interacting
> constraints, and it's easily possible to find something that for most
> but not all of those. I *think* there's an approach that satisfies all
> of our requirements; see below where I say "I *think* what would work
> for us ...".
> 
> For context, when I said "at entry" and "at exit" I was including
> *everything* we have to do before/after the "real" logic to handle an
> exception, including parts that surround the generic entry code. I'm
> happy to use a different term for those periods, but I can't immediately
> think of something better.
> 
> For example, for faults handled by arm64's el1_abort(), I was
> characterising this as:
> 
>     /* -------- "entry" begins here -------- */
> 
>     [[ entry asm ]]
>     [[ early triage, branch to el1_abort() ]]
> 
>     static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr) 
>     {
>             unsigned long far = read_sysreg(far_el1);
>             irqentry_state_t state;
> 
>             state = enter_from_kernel_mode(regs) {
>                     ...
> 		    irqentry_enter(regs);
>                     ...
> 	    }
>             local_daif_inherit(regs); // <----------- might unmask interrupts
>     /* -------- "entry" ends here ---------- */
> 
> 
>     /* -------- "real logic" begins here --- */
>             do_mem_abort(far, esr, regs);
>     /* -------- "real logic" ends here ----- */
> 
> 
>     /* -------- "exit" begins here --------- */
>             local_daif_mask(); // <------------------------- masks interrupts
>             exit_to_kernel_mode(regs, state) {
> 	            ...
> 		    irqentry_exit(regs);
> 	            ...
> 	    }
>     }
> 
>     [[ return from el1_abort() ]]
>     [[ exit asm ]]
> 
>     /* -------- "exit" ends here ----------- */
> 
> Please note, el1_abort() is indicative of what arm64 does for
> (most) synchronous exceptions, but there are slight differences for
> other exceptions. For anything maskable (including interupts) we DO NOT
> use local_daif_inherit() and instead unmask specific higher-priority
> maskable exceptions via other functions that write to DAIF, etc.
> 
> Interrupts are an odd middle ground where we use irqentry_{enter,exit}()
> but do not use local_daif_inherit().
> 
>> The current sequence on entry is:
>>
>>   // interrupts are disabled by interrupt/exception entry
>>   enter_from_kernel_mode()
>>      irqentry_enter(regs);
>>      mte_check_tfsr_entry();
>>      mte_disable_tco_entry();
>>      daif_inherit(regs);
>>      // interrupts are still disabled
> 
> That last comment isn't quite right: we CAN and WILL enable interrupts
> in local_daif_inherit(), if and only if they were enabled in the context
> the exception was taken from.
> 
> As mentioned above, when handling an interrupt (rather than a
> synchronous exception), we don't use local_daif_inherit(), and instead
> use a different DAIF function to unmask everything except interrupts.
> 
>> which then becomes:
>>
>>   // interrupts are disabled by interrupt/exception entry
>>   irqentry_enter(regs)
>>      establish_state();
>>      // RCU is watching
>>      arch_irqentry_enter_rcu()
>>         mte_check_tfsr_entry();
>>         mte_disable_tco_entry();
>>         daif_inherit(regs);
>>      // interrupts are still disabled
>>           
>> Which is equivalent versus the MTE/DAIF requirements, no?
> 
> As above, we can't use local_daif_inherit() here because we want
> different DAIF masking behavior for entry to interrupts and entry to
> synchronous exceptions. While we could pass some token around to
> determine the behaviour dynamically, that's less clear, more
> complicated, and results in worse code being generated for something we
> know at compile time.
> 
> If we can leave DAIF masked early on during irqentry_enter(), I don't
> see why we can't leave all DAIF exceptions masked until the end of
> irqentry_enter().
> 
> I *think* what would work for us is we could split some of the exit
> handling (including involuntary preemption) into a "prepare" step, as we
> have for return to userspace. That way, arm64 could handle exiting
> something like:
> 
> 	local_irq_disable();
> 	irqentry_exit_prepare(); // new, all generic logic
> 	local_daif_mask();
> 	arm64_exit_to_kernel_mode() {
> 		...
> 		irqentry_exit(); // ideally irqentry_exit_to_kernel_mode().
> 		...
> 	}

I agree. Having a symmetric 'prepare' step for kernel-mode exit as we do
for userspace would be much cleaner. It effectively addresses the DAIF
masking constraints on arm64.

  arm64_exit_to_user_mode(struct pt_regs *regs)
        -> local_irq_disable();       // only mask irqs
         ^^^^^^^^^^^^^^^^^^^^^^
        -> exit_to_user_mode_prepare_legacy(regs);
           -> schedule()             // schedule if need resched

        -> local_daif_mask();        // set daif to mask all exceptions
        ^^^^^^^^^^^^^^^^^^^^^
        -> mte_check_tfsr_exit();
        -> exit_to_user_mode();

This approach can also align with generic implementations like RISC-V.
We can split irqentry_exit() into two sub-functions:
irqentry_exit_prepare() to handle scheduling-related tasks, and the
remaining logic in a simplified irqentry_exit() (or a specific helper).
This way, arm64 can call these two helpers with the DAIF masking
operation inserted in between, while architectures like RISC-V can
continue to use the full irqentry_exit() functionality as they do now.

      void do_page_fault(struct pt_regs *regs)
       -> irqentry_state_t state = irqentry_enter(regs);
       -> handle_page_fault(regs);
       -> local_irq_disable();
       -> irqentry_exit(regs, state);



> 
> ... and other architectures can use a combined exit_to_kernel_mode() (or
> whatever we call that), which does both, e.g.
> 
> 	// either noinstr, __always_inline, or a macro
> 	void irqentry_prepare_and_exit(void)
> 	{
> 		irqentry_exit_prepare();
> 		irqentry_exit();
> 	}
> 
> That way:
> 
> * There's a clear separation between the "prepare" and subsequent exit
>   steps, which minimizes the risk that a change subtly breaks arm64's
>   exception masking.
> 
> * This would mirror the userspace return path, and so would be more
>   consistent over all.
> 
> * All of arm64's arch-specific exception masking can live in
>   arch/arm64/kernel/entry-common.c, explicit in the straight line code
>   rather than partially hidden behind arch_*() callbacks.
> 
> * There's no unnecessary cost to other architectures.
> 
> * There's no/minimal maintenance cost for the generic code. There are no
>   arch_*() callbacks, and we'd have to enforce ordering between the
>   prepare/exit steps anyhow...
> 
> If you don't see an obvious problem with that, I will go and try that
> now.
> 
>> The current broken sequence vs. preemption on exit is:
>>
>>   // interrupts are disabled
>>   exit_to_kernel_mode
>>      daif_mask();
>>      mte_check_tfsr_exit();
>>      irqentry_exit(regs, state);
>>
>> which then becomes:
>>
>>   // interrupts are disabled
>>   irqentry_exit(regs, state)
>>      // includes preemption
>>      prepare_for_exit();
>>
>>      // RCU is still watching
>>      arch_irqentry_ecit_rcu()
>>         daif_mask();
>>         mte_check_tfsr_exit();
>>
>>      if (state.exit_rcu)
>>         ct_irq_exit();
> 
> As above, I'd strongly prefer if we could pull the "prepare" step out of
> irqentry_exit(). Especially since for the entry path we can't push the
> DAIF masking into irqentry_enter(), and I'd very strongly prefer that
> the masking and unmasking occur in the same logical place, rather than
> having one of those hidden behind an arch_() callback.
> 
>> Which is equivalent versus the MTE/DAIF requirements and fixes the
>> preempt on exit issue too, no?
>>
>> That change would be trivial enough for backporting, right?
>>
>> It also prevents you from staring at the bug reports which are going to
>> end up in your mailbox after I merged the patch which moves the
>> misplaced rcu_irq_exit_check_preempt() check _before_ the
>> preempt_count() check where it belongs.
> 
> I intend to fix that issue, so hopefully I'm not staring at those for
> long.
> 
> Just to check, do you mean that you've already queued that (I didn't
> spot it in tip), or that you intend to? I'll happily test/review/ack a
> patch adding that, but hopefully we can fix arm64 first.
> 
>> I fully agree that ARM64 is special vs. CPU state handling, but it's not
>> special enough to justify it's own semantically broken preemption logic.
> 
> Sure. To be clear, I'm not arguing for broken preemption logic. I'd
> asked those initial two questions because I suspected this approach
> wasn't quite right.
> 
> As above, I think we can solve this in an actually generic way by
> splitting out a "prepare to exit" step, and still keep the bulk of the
> logic generic.
> 
>> Looking at those details made me also look at this magic
>> arch_irqentry_exit_need_resched() inline function. 
> 
> I see per your other reply that you figured out this part was ok:
> 
>   https://lore.kernel.org/linux-arm-kernel/87se9ph129.ffs@tglx/
> 
> ... though I agree we can clean that up further.
> 
> Mark.
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-20 11:30 ` [PATCH 1/2] " Mark Rutland
  2026-03-20 13:04   ` Peter Zijlstra
  2026-03-20 14:59   ` Thomas Gleixner
@ 2026-03-24  3:14   ` Jinjie Ruan
  2026-03-24 10:51     ` Mark Rutland
  2 siblings, 1 reply; 25+ messages in thread
From: Jinjie Ruan @ 2026-03-24  3:14 UTC (permalink / raw)
  To: Mark Rutland, linux-arm-kernel
  Cc: ada.coupriediaz, catalin.marinas, linux-kernel, luto, peterz,
	tglx, vladimir.murzin, will



On 2026/3/20 19:30, Mark Rutland wrote:
> On arm64, involuntary kernel preemption has been subtly broken since the
> move to the generic irq entry code. When preemption occurs, the new task
> may run with SError and Debug exceptions masked unexpectedly, leading to
> a loss of RAS events, breakpoints, watchpoints, and single-step
> exceptions.

We can also add a check in arch_irqentry_exit_need_resched to prevent
schedule-out when the DA bit is set.

> 
> We can fix this relatively simply by moving the preemption logic out of
> irqentry_exit(), which is desirable for a number of other reasons on
> arm64. Context and rationale below:
> 
> 1) Architecturally, several groups of exceptions can be masked
>    independently, including 'Debug', 'SError', 'IRQ', and 'FIQ', whose
>    mask bits can be read/written via the 'DAIF' register.
> 
>    Other mask bits exist, including 'PM' and 'AllInt', which we will
>    need to use in future (e.g. for architectural NMI support).
> 
>    The entry code needs to manipulate all of these, but the generic
>    entry code only knows about interrupts (which means both IRQ and FIQ
>    on arm64), and the other exception masks aren't generic.
> 
> 2) Architecturally, all maskable exceptions MUST be masked during
>    exception entry and exception return.
> 
>    Upon exception entry, hardware places exception context into
>    exception registers (e.g. the PC is saved into ELR_ELx). Upon
>    exception return, hardware restores exception context from those
>    exception registers (e.g. the PC is restored from ELR_ELx).
> 
>    To ensure the exception registers aren't clobbered by recursive
>    exceptions, all maskable exceptions must be masked early during entry
>    and late during exit. Hardware masks all maskable exceptions
>    automatically at exception entry. Software must unmask these as
>    required, and must mask them prior to exception return.
> 
> 3) Architecturally, hardware masks all maskable exceptions upon any
>    exception entry. A synchronous exception (e.g. a fault on a memory
>    access) can be taken from any context (e.g. where IRQ+FIQ might be
>    masked), and the entry code must explicitly 'inherit' the unmasking
>    from the original context by reading the exception registers (e.g.
>    SPSR_ELx) and writing to DAIF, etc.
> 
> 4) When 'pseudo-NMI' is used, Linux masks interrupts via a combination
>    of DAIF and the 'PMR' priority mask register. At entry and exit,
>    interrupts must be masked via DAIF, but most kernel code will
>    mask/unmask regular interrupts using PMR (e.g. in local_irq_save()
>    and local_irq_restore()).
> 
>    This requires more complicated transitions at entry and exit. Early
>    during entry or late during return, interrupts are masked via DAIF,
>    and kernel code which manipulates PMR to mask/unmask interrupts will
>    not function correctly in this state.
> 
>    This also requires fairly complicated management of DAIF and PMR when
>    handling interrupts, and arm64 has special logic to avoid preempting
>    from pseudo-NMIs which currently lives in
>    arch_irqentry_exit_need_resched().
> 
> 5) Most kernel code runs with all exceptions unmasked. When scheduling,
>    only interrupts should be masked (by PMR pseudo-NMI is used, and by
>    DAIF otherwise).
> 
> For most exceptions, arm64's entry code has a sequence similar to that
> of el1_abort(), which is used for faults:
> 
> | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> | {
> |         unsigned long far = read_sysreg(far_el1);
> |         irqentry_state_t state;
> |
> |         state = enter_from_kernel_mode(regs);
> |         local_daif_inherit(regs);
> |         do_mem_abort(far, esr, regs);
> |         local_daif_mask();
> |         exit_to_kernel_mode(regs, state);
> | }
> 
> ... where enter_from_kernel_mode() and exit_to_kernel_mode() are
> wrappers around irqentry_enter() and irqentry_exit() which perform
> additional arm64-specific entry/exit logic.
> 
> Currently, the generic irq entry code will attempt to preempt from any
> exception under irqentry_exit() where interrupts were unmasked in the
> original context. As arm64's entry code will have already masked
> exceptions via DAIF, this results in the problems described above.
> 
> Fix this by opting out of preemption in irqentry_exit(), and restoring
> arm64's old behaivour of explicitly preempting when returning from IRQ
> or FIQ, before calling exit_to_kernel_mode() / irqentry_exit(). This
> ensures that preemption occurs when only interrupts are masked, and
> where that masking is compatible with most kernel code (e.g. using PMR
> when pseudo-NMI is in use).
> 
> Fixes: 99eb057ccd67 ("arm64: entry: Move arm64_preempt_schedule_irq() into __exit_to_kernel_mode()")
> Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Jinjie Ruan <ruanjinjie@huawei.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/Kconfig                     | 3 +++
>  arch/arm64/Kconfig               | 1 +
>  arch/arm64/kernel/entry-common.c | 2 ++
>  kernel/entry/common.c            | 4 +++-
>  4 files changed, 9 insertions(+), 1 deletion(-)
> 
> Thomas, Peter, I have a couple of things I'd like to check:
> 
> (1) The generic irq entry code will preempt from any exception (e.g. a
>     synchronous fault) where interrupts were unmasked in the original
>     context. Is that intentional/necessary, or was that just the way the
>     x86 code happened to be implemented?
> 
>     I assume that it'd be fine if arm64 only preempted from true
>     interrupts, but if that was intentional/necessary I can go rework
>     this.
> 
> (2) The generic irq entry code only preempts when RCU was watching in
>     the original context. IIUC that's just to avoid preempting from the
>     idle thread. Is it functionally necessary to avoid that, or is that
>     just an optimization?
> 
>     I'm asking because historically arm64 didn't check that, and I
>     haven't bothered checking here. I don't know whether we have a
>     latent functional bug.
> 
> Mark.
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 102ddbd4298ef..c8c99cd955281 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -102,6 +102,9 @@ config HOTPLUG_PARALLEL
>  	bool
>  	select HOTPLUG_SPLIT_STARTUP
>  
> +config ARCH_HAS_OWN_IRQ_PREEMPTION
> +	bool
> +
>  config GENERIC_IRQ_ENTRY
>  	bool
>  
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 38dba5f7e4d2d..bf0ec8237de45 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -42,6 +42,7 @@ config ARM64
>  	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
>  	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
>  	select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT
> +	select ARCH_HAS_OWN_IRQ_PREEMPTION
>  	select ARCH_HAS_PREEMPT_LAZY
>  	select ARCH_HAS_PTDUMP
>  	select ARCH_HAS_PTE_SPECIAL
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 3625797e9ee8f..1aedadf09eb4d 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -497,6 +497,8 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
>  	do_interrupt_handler(regs, handler);
>  	irq_exit_rcu();
>  
> +	irqentry_exit_cond_resched();
> +
>  	exit_to_kernel_mode(regs, state);
>  }
>  static void noinstr el1_interrupt(struct pt_regs *regs,
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 9ef63e4147913..af9cae1f225e3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -235,8 +235,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  		}
>  
>  		instrumentation_begin();
> -		if (IS_ENABLED(CONFIG_PREEMPTION))
> +		if (IS_ENABLED(CONFIG_PREEMPTION) &&
> +		    !IS_ENABLED(CONFIG_ARCH_HAS_OWN_IRQ_PREEMPTION)) {
>  			irqentry_exit_cond_resched();
> +		}
>  
>  		/* Covers both tracing and lockdep */
>  		trace_hardirqs_on();

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
  2026-03-24  3:14   ` Jinjie Ruan
@ 2026-03-24 10:51     ` Mark Rutland
  0 siblings, 0 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-24 10:51 UTC (permalink / raw)
  To: Jinjie Ruan
  Cc: linux-arm-kernel, ada.coupriediaz, catalin.marinas, linux-kernel,
	luto, peterz, tglx, vladimir.murzin, will

On Tue, Mar 24, 2026 at 11:14:28AM +0800, Jinjie Ruan wrote:
> On 2026/3/20 19:30, Mark Rutland wrote:
> > On arm64, involuntary kernel preemption has been subtly broken since the
> > move to the generic irq entry code. When preemption occurs, the new task
> > may run with SError and Debug exceptions masked unexpectedly, leading to
> > a loss of RAS events, breakpoints, watchpoints, and single-step
> > exceptions.
> 
> We can also add a check in arch_irqentry_exit_need_resched to prevent
> schedule-out when the DA bit is set.

That *might* be good enough for a backport to stable, but that's not the
right fix going forwards.

Checking the DAIF.DA bits will prevent preemption from anything other
than real interrupts, and Thomas said we should preempt wherever
possible:

  https://lore.kernel.org/linux-arm-kernel/87h5qak2uv.ffs@tglx/

The interrupt handling path is also inconsistent w.r.t. masking DAIF
upon return, and with that cleaned up it'll look the same as other
exceptions, and we can't rely on DAIF.DA.

I think there are alternative options here; I have a half-written reply
to Thomas's other message:

  https://lore.kernel.org/linux-arm-kernel/87fr5six4d.ffs@tglx/

... and I'll try to finish that up and get it out shortly.

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/2] arm64/entry: Remove arch_irqentry_exit_need_resched()
  2026-03-20 11:30 [PATCH 0/2] arm64/entry: Fix involuntary preemption exception masking Mark Rutland
  2026-03-20 11:30 ` [PATCH 1/2] " Mark Rutland
@ 2026-03-20 11:30 ` Mark Rutland
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Rutland @ 2026-03-20 11:30 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: ada.coupriediaz, catalin.marinas, linux-kernel, luto,
	mark.rutland, peterz, ruanjinjie, tglx, vladimir.murzin, will

The only user of arch_irqentry_exit_need_resched() is arm64. As arm64
provides its own preemption logic, there's no need to indirect some of
this via the generic irq entry code.

Remove arch_irqentry_exit_need_resched(), and fold its logic directly
into arm64's entry code.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jinjie Ruan <ruanjinjie@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/entry-common.h | 27 ---------------------------
 arch/arm64/kernel/entry-common.c      | 27 ++++++++++++++++++++++++++-
 kernel/entry/common.c                 | 16 +---------------
 3 files changed, 27 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/include/asm/entry-common.h b/arch/arm64/include/asm/entry-common.h
index cab8cd78f6938..2b8335ea2a390 100644
--- a/arch/arm64/include/asm/entry-common.h
+++ b/arch/arm64/include/asm/entry-common.h
@@ -27,31 +27,4 @@ static __always_inline void arch_exit_to_user_mode_work(struct pt_regs *regs,
 
 #define arch_exit_to_user_mode_work arch_exit_to_user_mode_work
 
-static inline bool arch_irqentry_exit_need_resched(void)
-{
-	/*
-	 * DAIF.DA are cleared at the start of IRQ/FIQ handling, and when GIC
-	 * priority masking is used the GIC irqchip driver will clear DAIF.IF
-	 * using gic_arch_enable_irqs() for normal IRQs. If anything is set in
-	 * DAIF we must have handled an NMI, so skip preemption.
-	 */
-	if (system_uses_irq_prio_masking() && read_sysreg(daif))
-		return false;
-
-	/*
-	 * Preempting a task from an IRQ means we leave copies of PSTATE
-	 * on the stack. cpufeature's enable calls may modify PSTATE, but
-	 * resuming one of these preempted tasks would undo those changes.
-	 *
-	 * Only allow a task to be preempted once cpufeatures have been
-	 * enabled.
-	 */
-	if (!system_capabilities_finalized())
-		return false;
-
-	return true;
-}
-
-#define arch_irqentry_exit_need_resched arch_irqentry_exit_need_resched
-
 #endif /* _ASM_ARM64_ENTRY_COMMON_H */
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 1aedadf09eb4d..c4481e0e326a7 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -486,6 +486,31 @@ static __always_inline void __el1_pnmi(struct pt_regs *regs,
 	irqentry_nmi_exit(regs, state);
 }
 
+static void arm64_irqentry_exit_cond_resched(void)
+{
+	/*
+	 * DAIF.DA are cleared at the start of IRQ/FIQ handling, and when GIC
+	 * priority masking is used the GIC irqchip driver will clear DAIF.IF
+	 * using gic_arch_enable_irqs() for normal IRQs. If anything is set in
+	 * DAIF we must have handled an NMI, so skip preemption.
+	 */
+	if (system_uses_irq_prio_masking() && read_sysreg(daif))
+		return;
+
+	/*
+	 * Preempting a task from an IRQ means we leave copies of PSTATE
+	 * on the stack. cpufeature's enable calls may modify PSTATE, but
+	 * resuming one of these preempted tasks would undo those changes.
+	 *
+	 * Only allow a task to be preempted once cpufeatures have been
+	 * enabled.
+	 */
+	if (!system_capabilities_finalized())
+		return;
+
+	irqentry_exit_cond_resched();
+}
+
 static __always_inline void __el1_irq(struct pt_regs *regs,
 				      void (*handler)(struct pt_regs *))
 {
@@ -497,7 +522,7 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
 	do_interrupt_handler(regs, handler);
 	irq_exit_rcu();
 
-	irqentry_exit_cond_resched();
+	arm64_irqentry_exit_cond_resched();
 
 	exit_to_kernel_mode(regs, state);
 }
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index af9cae1f225e3..28351d76cfeb3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -171,20 +171,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	return ret;
 }
 
-/**
- * arch_irqentry_exit_need_resched - Architecture specific need resched function
- *
- * Invoked from raw_irqentry_exit_cond_resched() to check if resched is needed.
- * Defaults return true.
- *
- * The main purpose is to permit arch to avoid preemption of a task from an IRQ.
- */
-static inline bool arch_irqentry_exit_need_resched(void);
-
-#ifndef arch_irqentry_exit_need_resched
-static inline bool arch_irqentry_exit_need_resched(void) { return true; }
-#endif
-
 void raw_irqentry_exit_cond_resched(void)
 {
 	if (!preempt_count()) {
@@ -192,7 +178,7 @@ void raw_irqentry_exit_cond_resched(void)
 		rcu_irq_exit_check_preempt();
 		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
 			WARN_ON_ONCE(!on_thread_stack());
-		if (need_resched() && arch_irqentry_exit_need_resched())
+		if (need_resched())
 			preempt_schedule_irq();
 	}
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2026-03-27  1:27 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-20 11:30 [PATCH 0/2] arm64/entry: Fix involuntary preemption exception masking Mark Rutland
2026-03-20 11:30 ` [PATCH 1/2] " Mark Rutland
2026-03-20 13:04   ` Peter Zijlstra
2026-03-20 14:11     ` Thomas Gleixner
2026-03-20 14:57       ` Mark Rutland
2026-03-20 15:34         ` Peter Zijlstra
2026-03-20 16:16           ` Mark Rutland
2026-03-20 15:50         ` Thomas Gleixner
2026-03-23 17:21           ` Mark Rutland
2026-03-20 14:59   ` Thomas Gleixner
2026-03-20 15:37     ` Mark Rutland
2026-03-20 16:26       ` Thomas Gleixner
2026-03-20 17:31         ` Mark Rutland
2026-03-21 23:25           ` Thomas Gleixner
2026-03-24 12:19             ` Thomas Gleixner
2026-03-25 11:03             ` Mark Rutland
2026-03-25 15:46               ` Thomas Gleixner
2026-03-26  8:56                 ` Jinjie Ruan
2026-03-26 18:11                 ` Mark Rutland
2026-03-26 18:32                   ` Thomas Gleixner
2026-03-27  1:27                   ` Jinjie Ruan
2026-03-26  8:52               ` Jinjie Ruan
2026-03-24  3:14   ` Jinjie Ruan
2026-03-24 10:51     ` Mark Rutland
2026-03-20 11:30 ` [PATCH 2/2] arm64/entry: Remove arch_irqentry_exit_need_resched() Mark Rutland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox