[PATCH v2 0/2] KVM: arm64: nv: Fix permission checks for S1PTW faults

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/2] KVM: arm64: nv: Fix permission checks for S1PTW faults
@ 2026-06-24 20:24 Oliver Upton
  2026-06-24 20:24 ` [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set Oliver Upton
  2026-06-24 20:24 ` [PATCH v2 2/2] KVM: arm64: nv: Treat S1PTW permission faults specially Oliver Upton
  0 siblings, 2 replies; 11+ messages in thread
From: Oliver Upton @ 2026-06-24 20:24 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Wei-Lin Chang, Steffen Eiden, Oliver Upton

Small series that papers over arch ambiguity around S1PTW permission
faults.

As kvm_s2_handle_perm_fault() wasn't checking for S1PTW instruction
aborts, it was incorrectly evaluating the execute permissions to decide
where to send the fault.

Fixing that uncovers another issue in that kvm_is_write_fault() assumes
any S1PTW permission fault was due to write. Nested screws this up since
an L1 hypervisor could use write-only permissions at stage-2.

We end up papering over architecture ambiguity by potentially evaluating
*both* read and write permissions for S1PTW, assuming any fault with HA
set to require write permission (in addition to read).

Applies to kvmarm/fixes.

v1: https://lore.kernel.org/kvmarm/20260623211310.1529760-1-oupton@kernel.org/

Oliver Upton (2):
  KVM: arm64: Only consider S1PTW a write fault if HA is set
  KVM: arm64: nv: Treat S1PTW permission faults specially

 arch/arm64/include/asm/kvm_emulate.h | 22 +++++----------
 arch/arm64/include/asm/kvm_nested.h  |  2 ++
 arch/arm64/kvm/at.c                  | 42 +++++++++++++++++++++-------
 arch/arm64/kvm/nested.c              | 20 +++++++++++--
 4 files changed, 58 insertions(+), 28 deletions(-)

base-commit: d098bb75d14fde2f12155f1a95ec0168160867ce
-- 
2.47.3

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set
  2026-06-24 20:24 [PATCH v2 0/2] KVM: arm64: nv: Fix permission checks for S1PTW faults Oliver Upton
@ 2026-06-24 20:24 ` Oliver Upton
  2026-06-24 20:40   ` sashiko-bot
  2026-06-24 20:24 ` [PATCH v2 2/2] KVM: arm64: nv: Treat S1PTW permission faults specially Oliver Upton
  1 sibling, 1 reply; 11+ messages in thread
From: Oliver Upton @ 2026-06-24 20:24 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Wei-Lin Chang, Steffen Eiden, Oliver Upton, Sashiko

In yet another example where the architecture is awesome, S1PTW faults
may not have a valid ESR_ELx.WnR. kvm_is_write_fault() worked around
this by relying on a KVM implementation detail that canonical stage-2
translations must have at least read-only permissions.

That assumption no longer holds for nested virt where an L1 hypervisor
could construct write-only mappings that propagate to the shadow
stage-2.

Since there's no exact science to this, assume that the S1PTW fault was
for write if HA is enabled at stage-1.

Fixes: fd276e71d1e7 ("KVM: arm64: nv: Handle shadow stage 2 page faults")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/kvmarm/20260623213225.A89CF1F000E9@smtp.kernel.org/
Signed-off-by: Oliver Upton <oupton@kernel.org>
---
 arch/arm64/include/asm/kvm_emulate.h | 22 +++++----------
 arch/arm64/include/asm/kvm_nested.h  |  2 ++
 arch/arm64/kvm/at.c                  | 42 +++++++++++++++++++++-------
 3 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 5bf3d7e1d92c..8e208ce2597e 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -479,21 +479,13 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
 
 static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
 {
-	if (kvm_vcpu_abt_iss1tw(vcpu)) {
-		/*
-		 * Only a permission fault on a S1PTW should be
-		 * considered as a write. Otherwise, page tables baked
-		 * in a read-only memslot will result in an exception
-		 * being delivered in the guest.
-		 *
-		 * The drawback is that we end-up faulting twice if the
-		 * guest is using any of HW AF/DB: a translation fault
-		 * to map the page containing the PT (read only at
-		 * first), then a permission fault to allow the flags
-		 * to be set.
-		 */
-		return kvm_vcpu_trap_is_permission_fault(vcpu);
-	}
+	/*
+	 * The architecture sucks; assume that the S1PTW fetched for write if
+	 * HA is enabled at stage-1. Note that hardware updates to dirty state
+	 * and table AF are predicated on HA=1 (DDI0487 M.a D24.2.194; R_SNVTX).
+	 */
+	if (kvm_vcpu_abt_iss1tw(vcpu))
+		return effective_tcr_ha(vcpu);
 
 	if (kvm_vcpu_trap_is_iabt(vcpu))
 		return false;
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index cbdaaa2a2903..e9f48f94a77f 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -417,4 +417,6 @@ u16 get_asid_by_regime(struct kvm_vcpu *vcpu, enum trans_regime regime);
 
 int __kvm_at_swap_desc(struct kvm *kvm, gpa_t ipa, u64 old, u64 new);
 
+bool effective_tcr_ha(struct kvm_vcpu *vcpu);
+
 #endif /* __ARM64_KVM_NESTED_H */
diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index 8263c648207b..91154654210e 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -219,6 +219,36 @@ static unsigned int tcr_tg_pgshift(struct kvm *kvm, u64 tcr, bool upper_range)
 	return shift;
 }
 
+static bool __effective_tcr_ha(struct kvm_vcpu *vcpu, enum trans_regime regime)
+{
+	if (!kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HAFDBS, AF))
+		return false;
+
+	switch (regime) {
+	case TR_EL10:
+		return vcpu_read_sys_reg(vcpu, TCR_EL1) & TCR_HA;
+	case TR_EL20:
+		return vcpu_read_sys_reg(vcpu, TCR_EL2) & TCR_HA;
+	case TR_EL2:
+		return vcpu_read_sys_reg(vcpu, TCR_EL2) & TCR_EL2_HA;
+	default:
+		BUG();
+	}
+}
+
+static enum trans_regime vcpu_trans_regime(struct kvm_vcpu *vcpu)
+{
+	if (!is_hyp_ctxt(vcpu))
+		return TR_EL10;
+
+	return vcpu_el2_e2h_is_set(vcpu) ? TR_EL20 : TR_EL2;
+}
+
+bool effective_tcr_ha(struct kvm_vcpu *vcpu)
+{
+	return __effective_tcr_ha(vcpu, vcpu_trans_regime(vcpu));
+}
+
 static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 			 struct s1_walk_result *wr, u64 va)
 {
@@ -407,12 +437,7 @@ static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 		goto addrsz;
 
 	wi->baddr &= GENMASK_ULL(wi->max_oa_bits - 1, x);
-
-	wi->ha  = kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HAFDBS, AF);
-	wi->ha &= (wi->regime == TR_EL2 ?
-		  FIELD_GET(TCR_EL2_HA, tcr) :
-		  FIELD_GET(TCR_HA, tcr));
-
+	wi->ha = __effective_tcr_ha(vcpu, wi->regime);
 	return 0;
 
 addrsz:
@@ -1723,10 +1748,7 @@ int __kvm_find_s1_desc_level(struct kvm_vcpu *vcpu, u64 va, u64 ipa, int *level)
 	struct s1_walk_result wr = {};
 	int ret;
 
-	if (is_hyp_ctxt(vcpu))
-		wi.regime = vcpu_el2_e2h_is_set(vcpu) ? TR_EL20 : TR_EL2;
-	else
-		wi.regime = TR_EL10;
+	wi.regime = vcpu_trans_regime(vcpu);
 
 	ret = setup_s1_walk(vcpu, &wi, &wr, va);
 	if (ret)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] KVM: arm64: nv: Treat S1PTW permission faults specially
  2026-06-24 20:24 [PATCH v2 0/2] KVM: arm64: nv: Fix permission checks for S1PTW faults Oliver Upton
  2026-06-24 20:24 ` [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set Oliver Upton
@ 2026-06-24 20:24 ` Oliver Upton
  2026-06-24 20:35   ` sashiko-bot
  1 sibling, 1 reply; 11+ messages in thread
From: Oliver Upton @ 2026-06-24 20:24 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Wei-Lin Chang, Steffen Eiden, Oliver Upton, Sashiko

Permission faults reported as instruction aborts with S1PTW set are due
to missing read/write permissions for the table walk, not execute
permissions on the output of translation.

As there's no way to directly tell which of the two permissions failed,
evaluate both and forward the fault to the L1 hypervisor if either
fails.

Fixes: fd276e71d1e7 ("KVM: arm64: nv: Handle shadow stage 2 page faults")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/kvmarm/20260623190607.7106B1F000E9@smtp.kernel.org/
Signed-off-by: Oliver Upton <oupton@kernel.org>
---
 arch/arm64/kvm/nested.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 94df26de6990..4c9123cb2e1c 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -911,6 +911,7 @@ void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
  */
 int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu, struct kvm_s2_trans *trans)
 {
+	bool write_fault = kvm_is_write_fault(vcpu);
 	bool forward_fault = false;
 
 	trans->esr = 0;
@@ -918,14 +919,27 @@ int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu, struct kvm_s2_trans *trans)
 	if (!kvm_vcpu_trap_is_permission_fault(vcpu))
 		return 0;
 
-	if (kvm_vcpu_trap_is_iabt(vcpu)) {
+	/*
+	 * S1PTW permission faults do not provide sufficient syndrome information
+	 * to determine if the fault was for read or write permissions. Perform a
+	 * read permission check and an optional write permission check, relying
+	 * on the fact that:
+	 *
+	 *  - The table walker at minimum requires read permission
+	 *
+	 *  - The L1 hypervisor also needs to deal with the architecture and
+	 *    cannot directly infer the failing permission from the fault context
+	 */
+	if (kvm_vcpu_abt_iss1tw(vcpu)) {
+		forward_fault = !trans->readable;
+		if (write_fault)
+			forward_fault |= !trans->writable;
+	} else if (kvm_vcpu_trap_is_iabt(vcpu)) {
 		if (vcpu_mode_priv(vcpu))
 			forward_fault = !kvm_s2_trans_exec_el1(vcpu->kvm, trans);
 		else
 			forward_fault = !kvm_s2_trans_exec_el0(vcpu->kvm, trans);
 	} else {
-		bool write_fault = kvm_is_write_fault(vcpu);
-
 		forward_fault = ((write_fault && !trans->writable) ||
 				 (!write_fault && !trans->readable));
 	}
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] KVM: arm64: nv: Treat S1PTW permission faults specially
  2026-06-24 20:24 ` [PATCH v2 2/2] KVM: arm64: nv: Treat S1PTW permission faults specially Oliver Upton
@ 2026-06-24 20:35   ` sashiko-bot
  2026-06-24 21:22     ` Oliver Upton
  0 siblings, 1 reply; 11+ messages in thread
From: sashiko-bot @ 2026-06-24 20:35 UTC (permalink / raw)
  To: Oliver Upton; +Cc: kvmarm, Marc Zyngier

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] An L1 hypervisor can cause an infinite page fault loop in the L0 hypervisor by creating an L1 S2 PTE with XN=0 (execution allowed) but AP=00 (read/write denied). L0 fails to forward the resulting instruction abort to L1, repeatedly mapping the shadow PTE without read permissions, which the hardware continually rejects as non-executable.
--

commit f64f1ef4967d6c0954790535ee60d6e909805163
Author: Oliver Upton <oupton@kernel.org>

KVM: arm64: nv: Treat S1PTW permission faults specially

Permission faults reported as instruction aborts with S1PTW set are due
to missing read/write permissions for the table walk, not execute
permissions on the output of translation. This patch adds logic to evaluate
both and forward the fault to the L1 hypervisor if either fails.

Closes: https://lore.kernel.org/kvmarm/20260623190607.7106B1F000E9@smtp.kernel.org/

> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 94df26de6990c..4c9123cb2e1c5 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
[ ... ]
> @@ -918,14 +919,27 @@ int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu, struct kvm_s2_trans *trans)
>  	if (!kvm_vcpu_trap_is_permission_fault(vcpu))
>  		return 0;
>  
> -	if (kvm_vcpu_trap_is_iabt(vcpu)) {
> +	/*
> +	 * S1PTW permission faults do not provide sufficient syndrome information
> +	 * to determine if the fault was for read or write permissions. Perform a
> +	 * read permission check and an optional write permission check, relying
> +	 * on the fact that:
> +	 *
> +	 *  - The table walker at minimum requires read permission
> +	 *
> +	 *  - The L1 hypervisor also needs to deal with the architecture and
> +	 *    cannot directly infer the failing permission from the fault context
> +	 */
> +	if (kvm_vcpu_abt_iss1tw(vcpu)) {
> +		forward_fault = !trans->readable;
> +		if (write_fault)
> +			forward_fault |= !trans->writable;
> +	} else if (kvm_vcpu_trap_is_iabt(vcpu)) {
>  		if (vcpu_mode_priv(vcpu))
>  			forward_fault = !kvm_s2_trans_exec_el1(vcpu->kvm, trans);
>  		else
>  			forward_fault = !kvm_s2_trans_exec_el0(vcpu->kvm, trans);

[Severity: High]
This is a pre-existing issue, but does this logic miss a read-permission
dependency check for normal instruction aborts?

If an L1 hypervisor maps a page for L2 with Stage-2 PTE AP=00 (no read/write)
and XN=0 (executable), and L2 attempts to fetch an instruction, the hardware
generates a Stage-2 Instruction Abort (Permission Fault) because execution
requires read permission.

Here, kvm_s2_handle_perm_fault() intercepts the IABT and checks only the XN
bits via kvm_s2_trans_exec_el1() and kvm_s2_trans_exec_el0(). Since they
only check the XN bits and ignore trans->readable, forward_fault is set to
false.

L0 then handles the fault by mapping the shadow S2 PTE. The shadow fault
handler will clear the read protection bit, leaving the execute protection
bit intact. The hardware will reject this shadow mapping as non-executable
due to missing read permissions, triggering another IABT upon guest
resumption and causing an infinite page fault loop in the L0 hypervisor.

>  	} else {
> -		bool write_fault = kvm_is_write_fault(vcpu);
> -
>  		forward_fault = ((write_fault && !trans->writable) ||
>  				 (!write_fault && !trans->readable));
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624202446.1698535-1-oupton@kernel.org?part=2

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set
  2026-06-24 20:24 ` [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set Oliver Upton
@ 2026-06-24 20:40   ` sashiko-bot
  2026-06-24 21:00     ` Oliver Upton
  0 siblings, 1 reply; 11+ messages in thread
From: sashiko-bot @ 2026-06-24 20:40 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Marc Zyngier, kvmarm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Unconditionally treating all S1PTW faults as write faults when HA=1 breaks pre-populated guest page tables in read-only memslots.
- [High] `effective_tcr_ha()` incorrectly evaluates the Hardware Access (HA) bit for unprivileged memory accesses (e.g., `LDTR`) at EL2, leading to potential infinite fault loops.
--

commit 05ab0c07bd3e6099e1182b29afb0219565246eb7
Author: Oliver Upton <oupton@kernel.org>

KVM: arm64: Only consider S1PTW a write fault if HA is set

This commit works around the hardware not always setting the WnR bit for 
stage-1 page table walk (S1PTW) faults. It introduces effective_tcr_ha() 
to assume an S1PTW fault is a write if the Hardware Access (HA) flag is 
enabled at stage-1.

Link: https://lore.kernel.org/kvmarm/20260623213225.A89CF1F000E9@smtp.kernel.org/

> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 5bf3d7e1d92c7..8e208ce2597e5 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h

[ ... ]

> @@ -479,21 +479,13 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
>  
>  static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
>  {
> -	if (kvm_vcpu_abt_iss1tw(vcpu)) {
> -		/*
> -		 * Only a permission fault on a S1PTW should be
> -		 * considered as a write. Otherwise, page tables baked
> -		 * in a read-only memslot will result in an exception
> -		 * being delivered in the guest.
> -		 *
> -		 * The drawback is that we end-up faulting twice if the
> -		 * guest is using any of HW AF/DB: a translation fault
> -		 * to map the page containing the PT (read only at
> -		 * first), then a permission fault to allow the flags
> -		 * to be set.
> -		 */
> -		return kvm_vcpu_trap_is_permission_fault(vcpu);
> -	}
> +	/*
> +	 * The architecture sucks; assume that the S1PTW fetched for write if
> +	 * HA is enabled at stage-1. Note that hardware updates to dirty state
> +	 * and table AF are predicated on HA=1 (DDI0487 M.a D24.2.194; R_SNVTX).
> +	 */
> +	if (kvm_vcpu_abt_iss1tw(vcpu))
> +		return effective_tcr_ha(vcpu);

[Severity: High]
Does unconditionally treating S1PTW faults as writes when HA is enabled break
guests that use pre-populated page tables in read-only memslots?

If a guest populates its page tables with the Access Flag already set, places
them in an RO memslot, and enables HA in its TCR, the hardware only needs read
access during a walk. However, if a translation fault occurs,
kvm_is_write_fault() will unconditionally evaluate to true because of
effective_tcr_ha().

This will cause kvm_handle_guest_abort() to see a write fault on an RO memslot
and inject a spurious Synchronous External Abort (SEA) into the guest, even
though the hardware only needed to read the page table. Could we retain the
conditional check for permission faults to avoid this?

> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index 8263c648207b0..91154654210ec 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c

[ ... ]

> +static bool __effective_tcr_ha(struct kvm_vcpu *vcpu, enum trans_regime regime)
> +{
> +	if (!kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HAFDBS, AF))
> +		return false;
> +
> +	switch (regime) {
> +	case TR_EL10:
> +		return vcpu_read_sys_reg(vcpu, TCR_EL1) & TCR_HA;
> +	case TR_EL20:
> +		return vcpu_read_sys_reg(vcpu, TCR_EL2) & TCR_HA;
> +	case TR_EL2:
> +		return vcpu_read_sys_reg(vcpu, TCR_EL2) & TCR_EL2_HA;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +static enum trans_regime vcpu_trans_regime(struct kvm_vcpu *vcpu)
> +{
> +	if (!is_hyp_ctxt(vcpu))
> +		return TR_EL10;
> +
> +	return vcpu_el2_e2h_is_set(vcpu) ? TR_EL20 : TR_EL2;
> +}

[Severity: High]
Does vcpu_trans_regime() correctly handle unprivileged memory accesses 
like LDTR/STTR executed from EL2?

By returning TR_EL2 based strictly on the vCPU's execution context, KVM
disregards that unprivileged instructions at EL2 use the EL1 and 0 translation
regime (TCR_EL1) when HCR_EL2.E2H is 0 or TGE is 0.

Since KVM cannot distinguish an LDTR fault from an LDR fault via the ESR
syndrome, it will read TCR_EL2.HA in __effective_tcr_ha() instead of
TCR_EL1.HA. If TCR_EL1.HA is 1 but TCR_EL2.HA is 0, KVM incorrectly evaluates
HA as disabled.

This causes KVM to treat a legitimate S1PTW permission fault (generated to
update the Access Flag) as a read fault, refusing to grant write permissions.
The hardware will then fault again on retry, creating an infinite stage-2 fault
loop that hangs the guest vCPU.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624202446.1698535-1-oupton@kernel.org?part=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set
  2026-06-24 20:40   ` sashiko-bot
@ 2026-06-24 21:00     ` Oliver Upton
  2026-06-25 15:43       ` Marc Zyngier
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Upton @ 2026-06-24 21:00 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Marc Zyngier, kvmarm

On Wed, Jun 24, 2026 at 08:40:24PM +0000, sashiko-bot@kernel.org wrote:
> > +	/*
> > +	 * The architecture sucks; assume that the S1PTW fetched for write if
> > +	 * HA is enabled at stage-1. Note that hardware updates to dirty state
> > +	 * and table AF are predicated on HA=1 (DDI0487 M.a D24.2.194; R_SNVTX).
> > +	 */
> > +	if (kvm_vcpu_abt_iss1tw(vcpu))
> > +		return effective_tcr_ha(vcpu);
> 
> [Severity: High]
> Does unconditionally treating S1PTW faults as writes when HA is enabled break
> guests that use pre-populated page tables in read-only memslots?
> 
> If a guest populates its page tables with the Access Flag already set, places
> them in an RO memslot, and enables HA in its TCR, the hardware only needs read
> access during a walk. However, if a translation fault occurs,
> kvm_is_write_fault() will unconditionally evaluate to true because of
> effective_tcr_ha().
> 
> This will cause kvm_handle_guest_abort() to see a write fault on an RO memslot
> and inject a spurious Synchronous External Abort (SEA) into the guest, even
> though the hardware only needed to read the page table. Could we retain the
> conditional check for permission faults to avoid this?

Gah, I tested with QEMU+EDK2 for this exact purpose but did so on an
implementation without HAFDBS.

> > +static enum trans_regime vcpu_trans_regime(struct kvm_vcpu *vcpu)
> > +{
> > +	if (!is_hyp_ctxt(vcpu))
> > +		return TR_EL10;
> > +
> > +	return vcpu_el2_e2h_is_set(vcpu) ? TR_EL20 : TR_EL2;
> > +}
> 
> [Severity: High]
> Does vcpu_trans_regime() correctly handle unprivileged memory accesses 
> like LDTR/STTR executed from EL2?
> 
> By returning TR_EL2 based strictly on the vCPU's execution context, KVM
> disregards that unprivileged instructions at EL2 use the EL1 and 0 translation
> regime (TCR_EL1) when HCR_EL2.E2H is 0 or TGE is 0.
> 
> Since KVM cannot distinguish an LDTR fault from an LDR fault via the ESR
> syndrome, it will read TCR_EL2.HA in __effective_tcr_ha() instead of
> TCR_EL1.HA. If TCR_EL1.HA is 1 but TCR_EL2.HA is 0, KVM incorrectly evaluates
> HA as disabled.
> 
> This causes KVM to treat a legitimate S1PTW permission fault (generated to
> update the Access Flag) as a read fault, refusing to grant write permissions.
> The hardware will then fault again on retry, creating an infinite stage-2 fault
> loop that hangs the guest vCPU.

This is wrong, the LDTR instruction does not use an out-of-context
translation regime.

If HCR_EL2.{E2H,TGE} != {1,1}, LDTR silently upgrades to a privileged
access.

"""
func AArch64_IsUnprivAccessPriv() => boolean
begin
    var privileged : boolean;

    case PSTATE.EL of
        when EL0 =>
            privileged = FALSE;
        when EL1 => privileged = EffectiveHCR_EL2_NVx()[1:0] == '11';
        when EL2 => privileged = !ELIsInHost(EL0);
        when EL3 =>
            privileged = TRUE;
    end;

    if IsFeatureImplemented(FEAT_UAO) && PSTATE.UAO == '1' then
        privileged = PSTATE.EL != EL0;
    end;

    return privileged;
end;
"""

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] KVM: arm64: nv: Treat S1PTW permission faults specially
  2026-06-24 20:35   ` sashiko-bot
@ 2026-06-24 21:22     ` Oliver Upton
  0 siblings, 0 replies; 11+ messages in thread
From: Oliver Upton @ 2026-06-24 21:22 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: kvmarm, Marc Zyngier

On Wed, Jun 24, 2026 at 08:35:16PM +0000, sashiko-bot@kernel.org wrote:
> > +	/*
> > +	 * S1PTW permission faults do not provide sufficient syndrome information
> > +	 * to determine if the fault was for read or write permissions. Perform a
> > +	 * read permission check and an optional write permission check, relying
> > +	 * on the fact that:
> > +	 *
> > +	 *  - The table walker at minimum requires read permission
> > +	 *
> > +	 *  - The L1 hypervisor also needs to deal with the architecture and
> > +	 *    cannot directly infer the failing permission from the fault context
> > +	 */
> > +	if (kvm_vcpu_abt_iss1tw(vcpu)) {
> > +		forward_fault = !trans->readable;
> > +		if (write_fault)
> > +			forward_fault |= !trans->writable;
> > +	} else if (kvm_vcpu_trap_is_iabt(vcpu)) {
> >  		if (vcpu_mode_priv(vcpu))
> >  			forward_fault = !kvm_s2_trans_exec_el1(vcpu->kvm, trans);
> >  		else
> >  			forward_fault = !kvm_s2_trans_exec_el0(vcpu->kvm, trans);
> 
> [Severity: High]
> This is a pre-existing issue, but does this logic miss a read-permission
> dependency check for normal instruction aborts?
> 
> If an L1 hypervisor maps a page for L2 with Stage-2 PTE AP=00 (no read/write)
> and XN=0 (executable), and L2 attempts to fetch an instruction, the hardware
> generates a Stage-2 Instruction Abort (Permission Fault) because execution
> requires read permission.

This does not match my understanding of stage-2 permissions. S2AP
controls data permissions, whereas the XN field controls instruction
fetches. The pseudocode also supports this view, where execute
permissions are computed separately from S2AP:

AArch64.S2DirectBasePermissions()
"""
    S2AccessControls s2perms;
    bit w;
    constant bit r = permissions.s2ap<0>;
    if permissions.s2ap<1> == '1' then
        w = '1';
    // Descriptors marked with DBM set have the effective value of S2AP[1] set.
    // This implies no Permission faults caused by lack of write permissions are
    // reported, and the Dirty bit can be set.
    elsif permissions.dbm == '1' && walkparams.hd == '1' then
        // An update occurs here, conditional to being able to append to HDBSS
        if walkparams.hdbss == '1' then
            w = if CanAppendToHDBSS() then '1' else '0';
        else
            w = '1';
    else
        w = '0';

    bit px, ux;
    case (permissions.s2xn:permissions.s2xnx) of
        when '00' (px,ux) = ('1','1');
        when '01' (px,ux) = ('0','1');
        when '10' (px,ux) = ('0','0');
        when '11' (px,ux) = ('1','0');

    x = if accdesc.el == EL0 then ux else px;
    s2perms.r = r;
    s2perms.w = w;
    s2perms.x = x;
    s2perms.r_rcw = r;
    s2perms.w_rcw = w;
    s2perms.r_mmu = r;
    s2perms.w_mmu = w;
    s2perms.toplevel0 = '0';
    s2perms.toplevel1 = '0';
    s2perms.overlay = FALSE;

    return s2perms;
"""

AArch64.S2CheckPermissions()
"""
    elsif accdesc.acctype == AccessType_IFETCH then
        if s2perms.overlay && s2perms.ox == '0' then
            fault.statuscode = Fault_Permission;
            fault.overlay    = TRUE;
        elsif (memtype == MemType_Device &&
                ConstrainUnpredictable(Unpredictable_INSTRDEVICE) == Constraint_FAULT) then
            fault.statuscode = Fault_Permission;

        // Prevent execution from Non-secure space by Realm state
        elsif accdesc.ss == SS_Realm && walkstate.baseaddress.paspace != PAS_Realm then
            fault.statuscode = Fault_Permission;
        elsif s2perms.x == '0' then
            fault.statuscode = Fault_Permission;
"""

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set
  2026-06-24 21:00     ` Oliver Upton
@ 2026-06-25 15:43       ` Marc Zyngier
  2026-06-25 19:34         ` Oliver Upton
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Zyngier @ 2026-06-25 15:43 UTC (permalink / raw)
  To: Oliver Upton; +Cc: sashiko-reviews, kvmarm

On Wed, 24 Jun 2026 22:00:43 +0100,
Oliver Upton <oupton@kernel.org> wrote:
> 
> On Wed, Jun 24, 2026 at 08:40:24PM +0000, sashiko-bot@kernel.org wrote:
> > > +	/*
> > > +	 * The architecture sucks; assume that the S1PTW fetched for write if
> > > +	 * HA is enabled at stage-1. Note that hardware updates to dirty state
> > > +	 * and table AF are predicated on HA=1 (DDI0487 M.a D24.2.194; R_SNVTX).
> > > +	 */
> > > +	if (kvm_vcpu_abt_iss1tw(vcpu))
> > > +		return effective_tcr_ha(vcpu);
> > 
> > [Severity: High]
> > Does unconditionally treating S1PTW faults as writes when HA is enabled break
> > guests that use pre-populated page tables in read-only memslots?
> > 
> > If a guest populates its page tables with the Access Flag already set, places
> > them in an RO memslot, and enables HA in its TCR, the hardware only needs read
> > access during a walk.

Correct, this is a consequence of R_HDTRB.

> > However, if a translation fault occurs,
> > kvm_is_write_fault() will unconditionally evaluate to true because of
> > effective_tcr_ha().
> > 
> > This will cause kvm_handle_guest_abort() to see a write fault on an RO memslot
> > and inject a spurious Synchronous External Abort (SEA) into the guest, even
> > though the hardware only needed to read the page table. Could we retain the
> > conditional check for permission faults to avoid this?
> 
> Gah, I tested with QEMU+EDK2 for this exact purpose but did so on an
> implementation without HAFDBS.

I've been chewing on that one for a bit, and came up with the
following argument:

- We're missing one of Read or Write because of L1's doing, and L1
  needs to do *something* about it. We don't need to find out about HA
  in the guest, we just need to forward the fault (and it is probably
  enough to check that we're in a nested context).

- We have checked that HA==1, derived from that that we need R+W, and
  we're missing the Write permission because the page is marked RO
  (assuming that KVM still maps with Read permission by default):

  - either it is "opportunistically" RO (dirty tracking, page never
    written to), and we flip it to RW, rince, repeat.

  - or it is hard-wired to RO at the memslot level, and the only
    possibility is that the PTW is trying to perform an atomic access
    (as per R_HDTRB), which we should be able to reject with an
    "Unsupported atomic hardware update fault".

Thoughts?

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set
  2026-06-25 15:43       ` Marc Zyngier
@ 2026-06-25 19:34         ` Oliver Upton
  2026-06-25 20:43           ` Marc Zyngier
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Upton @ 2026-06-25 19:34 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: sashiko-reviews, kvmarm

On Thu, Jun 25, 2026 at 04:43:34PM +0100, Marc Zyngier wrote:
> On Wed, 24 Jun 2026 22:00:43 +0100,
> Oliver Upton <oupton@kernel.org> wrote:
> > 
> > On Wed, Jun 24, 2026 at 08:40:24PM +0000, sashiko-bot@kernel.org wrote:
> > > > +	/*
> > > > +	 * The architecture sucks; assume that the S1PTW fetched for write if
> > > > +	 * HA is enabled at stage-1. Note that hardware updates to dirty state
> > > > +	 * and table AF are predicated on HA=1 (DDI0487 M.a D24.2.194; R_SNVTX).
> > > > +	 */
> > > > +	if (kvm_vcpu_abt_iss1tw(vcpu))
> > > > +		return effective_tcr_ha(vcpu);
> > > 
> > > [Severity: High]
> > > Does unconditionally treating S1PTW faults as writes when HA is enabled break
> > > guests that use pre-populated page tables in read-only memslots?
> > > 
> > > If a guest populates its page tables with the Access Flag already set, places
> > > them in an RO memslot, and enables HA in its TCR, the hardware only needs read
> > > access during a walk.
> 
> Correct, this is a consequence of R_HDTRB.

While not _directly_ related, this does seem at odds with the
implications of R_JCXVS.

I.e. when HD is set, the PTW can speculatively fetch the S2 translation for
write before knowing if the S1 descriptor is actually subject to an
update per R_HDTRB.

It obviously all still fits together (permission checks are later down
the line), just weird is all. Anyway...

> I've been chewing on that one for a bit, and came up with the
> following argument:
> 
> - We're missing one of Read or Write because of L1's doing, and L1
>   needs to do *something* about it. We don't need to find out about HA
>   in the guest, we just need to forward the fault (and it is probably
>   enough to check that we're in a nested context).

We still need to account for host-induced permission faults, e.g. dirty
tracking or an RO memslot getting mapped into the L2. So I think we
still need to evaluate R+W before forwarding to the L1.

Looking ahead to HAFDBS, for this to work we will need to use a liberal
interpretation of R_JCXVS at the time of the initial translation fault
and always walk with intent for write.

Basically, there seems to be a subtle difference arising between writes
as observed from the nested MMU and writes as observed at the virtual
endpoint (memslot). Funny how something as straightforward as the access
flag can be so headache inducing :)

> - We have checked that HA==1, derived from that that we need R+W, and
>   we're missing the Write permission because the page is marked RO
>   (assuming that KVM still maps with Read permission by default):
> 
>   - either it is "opportunistically" RO (dirty tracking, page never
>     written to), and we flip it to RW, rince, repeat.
> 
>   - or it is hard-wired to RO at the memslot level, and the only
>     possibility is that the PTW is trying to perform an atomic access
>     (as per R_HDTRB), which we should be able to reject with an
>     "Unsupported atomic hardware update fault".
> 
> Thoughts?

All aboard :) I've just been staring at dirty state crap for too long. I
really like the unsupported atomic fault over our current behavior of
injecting an SEA.

I'll fold all of this into v3.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set
  2026-06-25 19:34         ` Oliver Upton
@ 2026-06-25 20:43           ` Marc Zyngier
  2026-06-25 22:18             ` Oliver Upton
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Zyngier @ 2026-06-25 20:43 UTC (permalink / raw)
  To: Oliver Upton; +Cc: sashiko-reviews, kvmarm

On Thu, 25 Jun 2026 20:34:46 +0100,
Oliver Upton <oupton@kernel.org> wrote:
> 
> On Thu, Jun 25, 2026 at 04:43:34PM +0100, Marc Zyngier wrote:
> > On Wed, 24 Jun 2026 22:00:43 +0100,
> > Oliver Upton <oupton@kernel.org> wrote:
> > > 
> > > On Wed, Jun 24, 2026 at 08:40:24PM +0000, sashiko-bot@kernel.org wrote:
> > > > > +	/*
> > > > > +	 * The architecture sucks; assume that the S1PTW fetched for write if
> > > > > +	 * HA is enabled at stage-1. Note that hardware updates to dirty state
> > > > > +	 * and table AF are predicated on HA=1 (DDI0487 M.a D24.2.194; R_SNVTX).
> > > > > +	 */
> > > > > +	if (kvm_vcpu_abt_iss1tw(vcpu))
> > > > > +		return effective_tcr_ha(vcpu);
> > > > 
> > > > [Severity: High]
> > > > Does unconditionally treating S1PTW faults as writes when HA is enabled break
> > > > guests that use pre-populated page tables in read-only memslots?
> > > > 
> > > > If a guest populates its page tables with the Access Flag already set, places
> > > > them in an RO memslot, and enables HA in its TCR, the hardware only needs read
> > > > access during a walk.
> > 
> > Correct, this is a consequence of R_HDTRB.
> 
> While not _directly_ related, this does seem at odds with the
> implications of R_JCXVS.
> 
> I.e. when HD is set, the PTW can speculatively fetch the S2 translation for
> write before knowing if the S1 descriptor is actually subject to an
> update per R_HDTRB.

I agree that this is a bit odd, and has a taste of speculative writes
(always a good idea... not!).

> 
> It obviously all still fits together (permission checks are later down
> the line), just weird is all. Anyway...
> 
> > I've been chewing on that one for a bit, and came up with the
> > following argument:
> > 
> > - We're missing one of Read or Write because of L1's doing, and L1
> >   needs to do *something* about it. We don't need to find out about HA
> >   in the guest, we just need to forward the fault (and it is probably
> >   enough to check that we're in a nested context).
> 
> We still need to account for host-induced permission faults, e.g. dirty
> tracking or an RO memslot getting mapped into the L2. So I think we
> still need to evaluate R+W before forwarding to the L1.

Hmm. I had forgotten about this indeed. Ultimately, we need to keep
track of why a S2 entry is RO in the L1 IPA space. We can either use
more SW bits in the PTE (not that many left), or wait until Wei-Lin is
done with his reverse + direct map tracking structure.

> Looking ahead to HAFDBS, for this to work we will need to use a liberal
> interpretation of R_JCXVS at the time of the initial translation fault
> and always walk with intent for write.

Yeah, I'm not too precious about that, and we might as well take
advantage of the architecture.

> 
> Basically, there seems to be a subtle difference arising between writes
> as observed from the nested MMU and writes as observed at the virtual
> endpoint (memslot). Funny how something as straightforward as the access
> flag can be so headache inducing :)

Well, you knew NV was just a sorry hack, didn't you? ;-) It's just
another case of "SW will sort it out eventually...".

> 
> > - We have checked that HA==1, derived from that that we need R+W, and
> >   we're missing the Write permission because the page is marked RO
> >   (assuming that KVM still maps with Read permission by default):
> > 
> >   - either it is "opportunistically" RO (dirty tracking, page never
> >     written to), and we flip it to RW, rince, repeat.
> > 
> >   - or it is hard-wired to RO at the memslot level, and the only
> >     possibility is that the PTW is trying to perform an atomic access
> >     (as per R_HDTRB), which we should be able to reject with an
> >     "Unsupported atomic hardware update fault".
> > 
> > Thoughts?
> 
> All aboard :) I've just been staring at dirty state crap for too long. I
> really like the unsupported atomic fault over our current behavior of
> injecting an SEA.
> 
> I'll fold all of this into v3.

Sounds good. Thank you!

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set
  2026-06-25 20:43           ` Marc Zyngier
@ 2026-06-25 22:18             ` Oliver Upton
  0 siblings, 0 replies; 11+ messages in thread
From: Oliver Upton @ 2026-06-25 22:18 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: sashiko-reviews, kvmarm

On Thu, Jun 25, 2026 at 09:43:54PM +0100, Marc Zyngier wrote:
> On Thu, 25 Jun 2026 20:34:46 +0100, Oliver Upton <oupton@kernel.org> wrote:
> > We still need to account for host-induced permission faults, e.g. dirty
> > tracking or an RO memslot getting mapped into the L2. So I think we
> > still need to evaluate R+W before forwarding to the L1.
> 
> Hmm. I had forgotten about this indeed. Ultimately, we need to keep
> track of why a S2 entry is RO in the L1 IPA space. We can either use
> more SW bits in the PTE (not that many left), or wait until Wei-Lin is
> done with his reverse + direct map tracking structure.

Pretty sure we can avoid the additional state tracking so long as we
ensure the nested S2 permission checks are ordered before anything
happening 'downstream' of the MMU.

Pairing an R+W check in kvm_s2_handle_perm_fault() along with a change
to kvm_is_write_fault() to only consider HA=1 permission faults to be
writes should do the trick. I've got the following blurb locally:

@@ -918,14 +919,39 @@ int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu, struct kvm_s2_trans *trans)
 	if (!kvm_vcpu_trap_is_permission_fault(vcpu))
 		return 0;
 
-	if (kvm_vcpu_trap_is_iabt(vcpu)) {
+	/*
+	 * S1PTW permission faults are a pain to deal with, owing to the fact that
+	 * the architecture sucks and there's insufficient syndrome information to
+	 * determine if the access failed for read or write permissions. We can still
+	 * infer it based on the behavior of our pseudo-TLB (i.e. the KVM MMU):
+	 *
+	 *  - S1PTW translation faults are treated as read accesses, meaning that
+	 *    the most relaxed resulting translation is read-only. The L1 hypervisor
+	 *    could prevent read accesses in the nested stage-2. Since all TTW accesses
+	 *    require at least read permission, we can detect this by unconditionally
+	 *    checking read permission in the nested stage-2.
+	 *
+	 *  - After establishing that this vCPU observed a read-only translation, we
+	 *    can infer that the access failed for lacking write permission due to
+	 *    either the nested stage-2 or KVM. Evaluate the write permission of the
+	 *    nested stage-2.
+	 *
+	 *  - Once the nested stage-2 permission checks have passed the permission
+	 *    fault must've been due to something downstream; The rest of KVM's
+	 *    fault handling can safely short-circuit to a write access at this point
+	 *    and potentially treat the access as unsupported at the virtual endpoint
+	 *    (e.g. unsupported atomic access to RO memslot).
+	 */
+	if (kvm_vcpu_abt_iss1tw(vcpu)) {
+		forward_fault = !trans->readable;
+		if (write_fault)
+			forward_fault |= !trans->writable;
+	} else if (kvm_vcpu_trap_is_iabt(vcpu)) {
 		if (vcpu_mode_priv(vcpu))
 			forward_fault = !kvm_s2_trans_exec_el1(vcpu->kvm, trans);
 		else
 			forward_fault = !kvm_s2_trans_exec_el0(vcpu->kvm, trans);
 	} else {
-		bool write_fault = kvm_is_write_fault(vcpu);
-
 		forward_fault = ((write_fault && !trans->writable) ||
 				 (!write_fault && !trans->readable));
 	}

I will admit, I have an extreme distaste for how subtle this is + the
reliance on implementation detail.

> > Looking ahead to HAFDBS, for this to work we will need to use a liberal
> > interpretation of R_JCXVS at the time of the initial translation fault
> > and always walk with intent for write.
> 
> Yeah, I'm not too precious about that, and we might as well take
> advantage of the architecture.
> 
> > 
> > Basically, there seems to be a subtle difference arising between writes
> > as observed from the nested MMU and writes as observed at the virtual
> > endpoint (memslot). Funny how something as straightforward as the access
> > flag can be so headache inducing :)
> 
> Well, you knew NV was just a sorry hack, didn't you? ;-) It's just
> another case of "SW will sort it out eventually...".

Given the present company on NV, I'm not entirely sure SW will actually sort
itself out :)

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-06-25 22:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24 20:24 [PATCH v2 0/2] KVM: arm64: nv: Fix permission checks for S1PTW faults Oliver Upton
2026-06-24 20:24 ` [PATCH v2 1/2] KVM: arm64: Only consider S1PTW a write fault if HA is set Oliver Upton
2026-06-24 20:40   ` sashiko-bot
2026-06-24 21:00     ` Oliver Upton
2026-06-25 15:43       ` Marc Zyngier
2026-06-25 19:34         ` Oliver Upton
2026-06-25 20:43           ` Marc Zyngier
2026-06-25 22:18             ` Oliver Upton
2026-06-24 20:24 ` [PATCH v2 2/2] KVM: arm64: nv: Treat S1PTW permission faults specially Oliver Upton
2026-06-24 20:35   ` sashiko-bot
2026-06-24 21:22     ` Oliver Upton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.