From: Thomas Gleixner
To: Mark Rutland, linux-arm-kernel@lists.infradead.org
Cc: ada.coupriediaz@arm.com, catalin.marinas@arm.com, linux-kernel@vger.kernel.org, luto@kernel.org, mark.rutland@arm.com, peterz@infradead.org, ruanjinjie@huawei.com, vladimir.murzin@arm.com, will@kernel.org
Subject: Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
In-Reply-To: <20260320113026.3219620-2-mark.rutland@arm.com>
References: <20260320113026.3219620-1-mark.rutland@arm.com> <20260320113026.3219620-2-mark.rutland@arm.com>
Date: Fri, 20 Mar 2026 15:59:40 +0100
Message-ID: <87eclek0mb.ffs@tglx>

On Fri, Mar 20 2026 at 11:30, Mark Rutland wrote:
> We can fix this relatively simply by moving the preemption logic out of
> irqentry_exit(), which is desirable for a number of other reasons on
> arm64. Context and rationale below:
>
> 1) Architecturally, several groups of exceptions can be masked
>    independently, including 'Debug', 'SError', 'IRQ', and 'FIQ', whose
>    mask bits can be read/written via the 'DAIF' register.
>
>    Other mask bits exist, including 'PM' and 'AllInt', which we will
>    need to use in future (e.g. for architectural NMI support).
>
>    The entry code needs to manipulate all of these, but the generic
>    entry code only knows about interrupts (which means both IRQ and FIQ
>    on arm64), and the other exception masks aren't generic.

Right, but that's what the architecture specific parts are for.

> 2) Architecturally, all maskable exceptions MUST be masked during
>    exception entry and exception return.
>
>    Upon exception entry, hardware places exception context into
>    exception registers (e.g. the PC is saved into ELR_ELx).
>    Upon exception return, hardware restores exception context from
>    those exception registers (e.g. the PC is restored from ELR_ELx).
>
>    To ensure the exception registers aren't clobbered by recursive
>    exceptions, all maskable exceptions must be masked early during
>    entry and late during exit. Hardware masks all maskable exceptions
>    automatically at exception entry. Software must unmask these as
>    required, and must mask them prior to exception return.

That's not much different from any other architecture.

> 3) Architecturally, hardware masks all maskable exceptions upon any
>    exception entry. A synchronous exception (e.g. a fault on a memory
>    access) can be taken from any context (e.g. where IRQ+FIQ might be
>    masked), and the entry code must explicitly 'inherit' the unmasking
>    from the original context by reading the exception registers (e.g.
>    SPSR_ELx) and writing to DAIF, etc.

The amount of mask bits/registers is obviously architecture specific,
but conceptually it's the same everywhere.

> 4) When 'pseudo-NMI' is used, Linux masks interrupts via a combination
>    of DAIF and the 'PMR' priority mask register. At entry and exit,
>    interrupts must be masked via DAIF, but most kernel code will
>    mask/unmask regular interrupts using PMR (e.g. in local_irq_save()
>    and local_irq_restore()).
>
>    This requires more complicated transitions at entry and exit. Early
>    during entry or late during return, interrupts are masked via DAIF,
>    and kernel code which manipulates PMR to mask/unmask interrupts will
>    not function correctly in this state.
>
>    This also requires fairly complicated management of DAIF and PMR
>    when handling interrupts, and arm64 has special logic to avoid
>    preempting from pseudo-NMIs which currently lives in
>    arch_irqentry_exit_need_resched().

Why are you routing NMI like exceptions through irqentry_enter() and
irqentry_exit() in the first place? That's just wrong.

> 5) Most kernel code runs with all exceptions unmasked.
>    When scheduling, only interrupts should be masked (by PMR if
>    pseudo-NMI is used, and by DAIF otherwise).
>
> For most exceptions, arm64's entry code has a sequence similar to that
> of el1_abort(), which is used for faults:
>
> | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> | {
> | 	unsigned long far = read_sysreg(far_el1);
> | 	irqentry_state_t state;
> |
> | 	state = enter_from_kernel_mode(regs);
> | 	local_daif_inherit(regs);
> | 	do_mem_abort(far, esr, regs);
> | 	local_daif_mask();
> | 	exit_to_kernel_mode(regs, state);
> | }
>
> ... where enter_from_kernel_mode() and exit_to_kernel_mode() are
> wrappers around irqentry_enter() and irqentry_exit() which perform
> additional arm64-specific entry/exit logic.
>
> Currently, the generic irq entry code will attempt to preempt from any
> exception under irqentry_exit() where interrupts were unmasked in the
> original context. As arm64's entry code will have already masked
> exceptions via DAIF, this results in the problems described above.

See below.

> Fix this by opting out of preemption in irqentry_exit(), and restoring
> arm64's old behaviour of explicitly preempting when returning from IRQ
> or FIQ, before calling exit_to_kernel_mode() / irqentry_exit(). This
> ensures that preemption occurs when only interrupts are masked, and
> where that masking is compatible with most kernel code (e.g. using PMR
> when pseudo-NMI is in use).

My gut feeling tells me that there is a fundamental design flaw
somewhere and the below is papering over it.
> @@ -497,6 +497,8 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
>  	do_interrupt_handler(regs, handler);
>  	irq_exit_rcu();
>  
> +	irqentry_exit_cond_resched();
> +
>  	exit_to_kernel_mode(regs, state);
>  }
>  
>  static void noinstr el1_interrupt(struct pt_regs *regs,
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 9ef63e4147913..af9cae1f225e3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -235,8 +235,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  	}
>  
>  	instrumentation_begin();
> -	if (IS_ENABLED(CONFIG_PREEMPTION))
> +	if (IS_ENABLED(CONFIG_PREEMPTION) &&
> +	    !IS_ENABLED(CONFIG_ARCH_HAS_OWN_IRQ_PREEMPTION)) {

These 'But my architecture is sooo special' switches cause immediate
review nausea and just confirm that there is a fundamental flaw
somewhere else.

>  		irqentry_exit_cond_resched();

Let's look at how this is supposed to work. I'm just looking at
irqentry_enter()/exit() and not the NMI variant.

Interrupt/exception is raised

  1) Low level architecture specific entry code does all the magic state
     saving, setup etc.

  2) irqentry_enter() is invoked

     - Checks for user mode or kernel mode entry

     - Handles RCU on enter from user and if kernel entry hits the idle
       task

     - Sets up lockdep, tracing, kminsanity

  3) The interrupt/exception handler is invoked

  4) irqentry_exit() is invoked

     - Handles exit to user and exit to kernel

       Exit to user handles the TIF and other pending work, which can
       schedule, and then prepares for return.

       Exit to kernel:

         When interrupts were disabled on entry, it just handles RCU and
         returns.

         When enabled on entry, it checks whether RCU was watching on
         entry or not. If not, it tells RCU that the interrupt nesting
         is done and returns. When RCU was watching, it can schedule.

  5) Undoes #1 so that it can return to the originally interrupted
     context.
That means at the point where irqentry_enter() is invoked, the
architecture side should have made sure that everything is set up for
the kernel to operate until irqentry_exit() returns.

Looking at your example:

> | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> | {
> | 	unsigned long far = read_sysreg(far_el1);
> | 	irqentry_state_t state;
> |
> | 	state = enter_from_kernel_mode(regs);
> | 	local_daif_inherit(regs);
> | 	do_mem_abort(far, esr, regs);
> | 	local_daif_mask();
> | 	exit_to_kernel_mode(regs, state);

and the paragraph right below that:

> Currently, the generic irq entry code will attempt to preempt from any
> exception under irqentry_exit() where interrupts were unmasked in the
> original context. As arm64's entry code will have already masked
> exceptions via DAIF, this results in the problems described above.

To me this looks like your ordering is wrong. Why are you doing the
DAIF inherit _after_ irqentry_enter() and the mask _before_
irqentry_exit()?

I might be missing something, but this smells more than fishy. As no
other architecture has that problem, I'm pretty sure that the problem is
not in the way the generic code was designed.

Why? Because your architecture is _not_ sooo special! :)

Thanks,

        tglx