From: Thomas Gleixner <tglx@kernel.org>
To: Mark Rutland <mark.rutland@arm.com>,
linux-arm-kernel@lists.infradead.org
Cc: ada.coupriediaz@arm.com, catalin.marinas@arm.com,
linux-kernel@vger.kernel.org, luto@kernel.org,
mark.rutland@arm.com, peterz@infradead.org,
ruanjinjie@huawei.com, vladimir.murzin@arm.com, will@kernel.org
Subject: Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
Date: Fri, 20 Mar 2026 15:59:40 +0100 [thread overview]
Message-ID: <87eclek0mb.ffs@tglx> (raw)
In-Reply-To: <20260320113026.3219620-2-mark.rutland@arm.com>
On Fri, Mar 20 2026 at 11:30, Mark Rutland wrote:
> We can fix this relatively simply by moving the preemption logic out of
> irqentry_exit(), which is desirable for a number of other reasons on
> arm64. Context and rationale below:
>
> 1) Architecturally, several groups of exceptions can be masked
> independently, including 'Debug', 'SError', 'IRQ', and 'FIQ', whose
> mask bits can be read/written via the 'DAIF' register.
>
> Other mask bits exist, including 'PM' and 'AllInt', which we will
> need to use in future (e.g. for architectural NMI support).
>
> The entry code needs to manipulate all of these, but the generic
> entry code only knows about interrupts (which means both IRQ and FIQ
> on arm64), and the other exception masks aren't generic.
Right, but that's what the architecture specific parts are for.
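For reference, the DAIF layout in point 1 can be modelled in a few lines. This is a userspace sketch, not kernel code: the bit positions (D=9, A=8, I=7, F=6) match the architecture, but the helper is made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* PSTATE/SPSR_ELx mask-bit positions per the Arm ARM. */
#define PSR_F_BIT (1UL << 6)  /* FIQ mask */
#define PSR_I_BIT (1UL << 7)  /* IRQ mask */
#define PSR_A_BIT (1UL << 8)  /* SError mask */
#define PSR_D_BIT (1UL << 9)  /* Debug mask */

#define DAIF_MASK (PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT)

/* "Interrupts" in the generic-entry sense covers both IRQ and FIQ
 * on arm64, i.e. two of the four DAIF bits. */
static inline bool daif_irqs_masked(uint64_t pstate)
{
	return pstate & (PSR_I_BIT | PSR_F_BIT);
}
```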
> 2) Architecturally, all maskable exceptions MUST be masked during
> exception entry and exception return.
>
> Upon exception entry, hardware places exception context into
> exception registers (e.g. the PC is saved into ELR_ELx). Upon
> exception return, hardware restores exception context from those
> exception registers (e.g. the PC is restored from ELR_ELx).
>
> To ensure the exception registers aren't clobbered by recursive
> exceptions, all maskable exceptions must be masked early during entry
> and late during exit. Hardware masks all maskable exceptions
> automatically at exception entry. Software must unmask these as
> required, and must mask them prior to exception return.
That's not much different from any other architecture.
> 3) Architecturally, hardware masks all maskable exceptions upon any
> exception entry. A synchronous exception (e.g. a fault on a memory
> access) can be taken from any context (e.g. where IRQ+FIQ might be
> masked), and the entry code must explicitly 'inherit' the unmasking
> from the original context by reading the exception registers (e.g.
> SPSR_ELx) and writing to DAIF, etc.
The amount of mask bits/registers is obviously architecture specific,
but conceptually it's the same everywhere.
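The "inherit" step from point 3 reduces to a read-modify-write of the saved PSTATE. A toy model (struct and function names are illustrative, not the arm64 code):

```c
#include <stdint.h>

#define PSR_F_BIT (1UL << 6)
#define PSR_I_BIT (1UL << 7)
#define PSR_A_BIT (1UL << 8)
#define PSR_D_BIT (1UL << 9)
#define DAIF_MASK (PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT)

struct cpu_state {
	uint64_t daif;	/* current mask bits */
	uint64_t spsr;	/* saved PSTATE of the interrupted context */
};

/* Hardware behaviour at exception entry: save the interrupted
 * context's PSTATE and mask every maskable exception. */
static void take_exception(struct cpu_state *s, uint64_t old_pstate)
{
	s->spsr = old_pstate;
	s->daif = DAIF_MASK;
}

/* local_daif_inherit()-style unmask: restore the mask bits the
 * interrupted context was running with. */
static void daif_inherit(struct cpu_state *s)
{
	s->daif = s->spsr & DAIF_MASK;
}
```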
> 4) When 'pseudo-NMI' is used, Linux masks interrupts via a combination
> of DAIF and the 'PMR' priority mask register. At entry and exit,
> interrupts must be masked via DAIF, but most kernel code will
> mask/unmask regular interrupts using PMR (e.g. in local_irq_save()
> and local_irq_restore()).
>
> This requires more complicated transitions at entry and exit. Early
> during entry or late during return, interrupts are masked via DAIF,
> and kernel code which manipulates PMR to mask/unmask interrupts will
> not function correctly in this state.
>
> This also requires fairly complicated management of DAIF and PMR when
> handling interrupts, and arm64 has special logic to avoid preempting
> from pseudo-NMIs which currently lives in
> arch_irqentry_exit_need_resched().
Why are you routing NMI-like exceptions through irqentry_enter() and
irqentry_exit() in the first place? That's just wrong.
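The PMR/DAIF split from point 4 can be modelled like this. The priority values are illustrative stand-ins, not the kernel's GIC_PRIO_* constants; in the GIC, a lower priority value means higher priority, and an interrupt is delivered only when its priority is below the mask:

```c
#include <stdint.h>
#include <stdbool.h>

#define PRIO_IRQON	0xe0	/* normal IRQs allowed (assumed value) */
#define PRIO_IRQOFF	0x60	/* normal IRQs masked, NMIs still allowed */
#define PRIO_NMI	0x20	/* priority of a pseudo-NMI source */

struct pnmi_state {
	uint8_t	pmr;		/* priority mask: lower == more masked */
	bool	pstate_i;	/* PSTATE.I: masks everything, NMIs too */
};

/* local_irq_disable()-style masking under pseudo-NMI: touch PMR
 * only, leave PSTATE.I clear so pseudo-NMIs still arrive. */
static void pnmi_irq_disable(struct pnmi_state *s) { s->pmr = PRIO_IRQOFF; }
static void pnmi_irq_enable(struct pnmi_state *s)  { s->pmr = PRIO_IRQON; }

/* An interrupt of priority p is delivered iff p < PMR and I is clear. */
static bool deliverable(const struct pnmi_state *s, uint8_t p)
{
	return !s->pstate_i && p < s->pmr;
}
```

This is why entry/exit are awkward: early in entry and late in exit PSTATE.I is set, and code that only manipulates PMR cannot unmask anything in that window.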
> 5) Most kernel code runs with all exceptions unmasked. When scheduling,
> only interrupts should be masked (by PMR if pseudo-NMI is used, and by
> DAIF otherwise).
>
> For most exceptions, arm64's entry code has a sequence similar to that
> of el1_abort(), which is used for faults:
>
> | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> | {
> | unsigned long far = read_sysreg(far_el1);
> | irqentry_state_t state;
> |
> | state = enter_from_kernel_mode(regs);
> | local_daif_inherit(regs);
> | do_mem_abort(far, esr, regs);
> | local_daif_mask();
> | exit_to_kernel_mode(regs, state);
> | }
>
> ... where enter_from_kernel_mode() and exit_to_kernel_mode() are
> wrappers around irqentry_enter() and irqentry_exit() which perform
> additional arm64-specific entry/exit logic.
>
> Currently, the generic irq entry code will attempt to preempt from any
> exception under irqentry_exit() where interrupts were unmasked in the
> original context. As arm64's entry code will have already masked
> exceptions via DAIF, this results in the problems described above.
See below.
> Fix this by opting out of preemption in irqentry_exit(), and restoring
> arm64's old behaviour of explicitly preempting when returning from IRQ
> or FIQ, before calling exit_to_kernel_mode() / irqentry_exit(). This
> ensures that preemption occurs when only interrupts are masked, and
> where that masking is compatible with most kernel code (e.g. using PMR
> when pseudo-NMI is in use).
My gut feeling tells me that there is a fundamental design flaw
somewhere and the below is papering over it.
> @@ -497,6 +497,8 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
> do_interrupt_handler(regs, handler);
> irq_exit_rcu();
>
> + irqentry_exit_cond_resched();
> +
> exit_to_kernel_mode(regs, state);
> }
>
> static void noinstr el1_interrupt(struct pt_regs *regs,
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 9ef63e4147913..af9cae1f225e3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -235,8 +235,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> }
>
> instrumentation_begin();
> - if (IS_ENABLED(CONFIG_PREEMPTION))
> + if (IS_ENABLED(CONFIG_PREEMPTION) &&
> + !IS_ENABLED(CONFIG_ARCH_HAS_OWN_IRQ_PREEMPTION)) {
These 'But my architecture is sooo special' switches cause immediate review
nausea and just confirm that there is a fundamental flaw somewhere else.
> irqentry_exit_cond_resched();
Let's look at how this is supposed to work. I'm just looking at
irqentry_enter()/exit() and not the NMI variant.
Interrupt/exception is raised
1) low level architecture specific entry code does all the magic state
saving, setup etc.
2) irqentry_enter() is invoked
- checks for user mode or kernel mode entry
- handles RCU on enter from user and if kernel entry hits the idle
task
- Sets up lockdep, tracing, KMSAN
3) the interrupt/exception handler is invoked
4) irqentry_exit() is invoked
- handles exit to user and exit to kernel
- exit to user handles TIF flags and other pending work, which can
schedule, and then prepares for the return
- exit to kernel
When interrupts were disabled on entry, it just handles RCU and
returns.
When enabled on entry, it checks whether RCU was watching on
entry or not. If not, it tells RCU that the interrupt nesting is
done and returns. When RCU was watching, it can schedule.
5) Undoes #1 so that it can return to the originally interrupted
context.
That means at the point where irqentry_enter() is invoked, the
architecture side should have made sure that everything is set up for
the kernel to operate until irqentry_exit() returns.
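The exit-to-kernel leg of step 4 boils down to a small decision tree. A rough model (the enum and helper are illustrative, not the generic entry code):

```c
#include <stdbool.h>

enum exit_action {
	EXIT_RCU_ONLY,		/* irqs disabled on entry: just handle RCU */
	EXIT_IRQ_DONE,		/* RCU wasn't watching: end the nesting */
	EXIT_MAY_RESCHED,	/* RCU was watching: preemption is allowed */
};

/* Model of irqentry_exit()'s exit-to-kernel path as described above. */
static enum exit_action exit_to_kernel(bool irqs_enabled_on_entry,
				       bool rcu_was_watching)
{
	if (!irqs_enabled_on_entry)
		return EXIT_RCU_ONLY;
	if (!rcu_was_watching)
		return EXIT_IRQ_DONE;
	return EXIT_MAY_RESCHED;
}
```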
Looking at your example:
> | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> | {
> | unsigned long far = read_sysreg(far_el1);
> | irqentry_state_t state;
> |
> | state = enter_from_kernel_mode(regs);
> | local_daif_inherit(regs);
> | do_mem_abort(far, esr, regs);
> | local_daif_mask();
> | exit_to_kernel_mode(regs, state);
and the paragraph right below that:
> Currently, the generic irq entry code will attempt to preempt from any
> exception under irqentry_exit() where interrupts were unmasked in the
> original context. As arm64's entry code will have already masked
> exceptions via DAIF, this results in the problems described above.
To me this looks like your ordering is wrong. Why are you doing the DAIF
inherit _after_ irqentry_enter() and the mask _before_ irqentry_exit()?
I might be missing something, but this smells more than fishy.
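The ordering complaint can be made mechanical. In the quoted el1_abort(), local_daif_mask() runs before the generic exit, so any preemption attempt inside irqentry_exit() would happen with every exception class masked. A toy model of that sequence (names and the recorded flag are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define DAIF_ALL 0x3c0UL	/* D, A, I, F all set (bits 9:6) */

struct ctx {
	uint64_t daif;
	bool preempt_with_all_masked;
};

static void daif_mask(struct ctx *c)    { c->daif = DAIF_ALL; }
static void daif_inherit(struct ctx *c) { c->daif = 0; }

/* Model of the cond_resched inside irqentry_exit(): record whether it
 * would run while every exception class is masked. */
static void model_irqentry_exit(struct ctx *c)
{
	if (c->daif == DAIF_ALL)
		c->preempt_with_all_masked = true;
}

/* The ordering in the quoted el1_abort(): inherit, handle, mask,
 * then run the generic exit. */
static bool current_ordering_broken(void)
{
	struct ctx c = { DAIF_ALL, false };

	daif_inherit(&c);		/* handler runs unmasked */
	daif_mask(&c);			/* masked again ... */
	model_irqentry_exit(&c);	/* ... before the generic exit */
	return c.preempt_with_all_masked;
}
```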
As no other architecture has that problem I'm pretty sure that the
problem is not in the way how the generic code was designed. Why?
Because your architecture is _not_ sooo special! :)
Thanks,
tglx