* [PATCH v1 01/13] KVM: arm64: Extract VMA size resolution in user_mem_abort()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-17 15:07 ` Joey Gouly
2026-03-06 14:02 ` [PATCH v1 02/13] KVM: arm64: Introduce struct kvm_s2_fault to user_mem_abort() Fuad Tabba
` (12 subsequent siblings)
13 siblings, 1 reply; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
As part of an effort to refactor user_mem_abort() into smaller, more
focused helper functions, extract the logic responsible for determining
the VMA shift and page size into a new static helper,
kvm_s2_resolve_vma_size().
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 130 ++++++++++++++++++++++++-------------------
1 file changed, 73 insertions(+), 57 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 17d64a1e11e5..f8064b2d3204 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1639,6 +1639,77 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return ret != -EAGAIN ? ret : 0;
}
+static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma,
+ unsigned long hva,
+ struct kvm_memory_slot *memslot,
+ struct kvm_s2_trans *nested,
+ bool *force_pte, phys_addr_t *ipa)
+{
+ short vma_shift;
+ long vma_pagesize;
+
+ if (*force_pte)
+ vma_shift = PAGE_SHIFT;
+ else
+ vma_shift = get_vma_page_shift(vma, hva);
+
+ switch (vma_shift) {
+#ifndef __PAGETABLE_PMD_FOLDED
+ case PUD_SHIFT:
+ if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+ break;
+ fallthrough;
+#endif
+ case CONT_PMD_SHIFT:
+ vma_shift = PMD_SHIFT;
+ fallthrough;
+ case PMD_SHIFT:
+ if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+ break;
+ fallthrough;
+ case CONT_PTE_SHIFT:
+ vma_shift = PAGE_SHIFT;
+ *force_pte = true;
+ fallthrough;
+ case PAGE_SHIFT:
+ break;
+ default:
+ WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
+ }
+
+ vma_pagesize = 1UL << vma_shift;
+
+ if (nested) {
+ unsigned long max_map_size;
+
+ max_map_size = *force_pte ? PAGE_SIZE : PUD_SIZE;
+
+ *ipa = kvm_s2_trans_output(nested);
+
+ /*
+ * If we're about to create a shadow stage 2 entry, then we
+ * can only create a block mapping if the guest stage 2 page
+ * table uses at least as big a mapping.
+ */
+ max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
+
+ /*
+ * Be careful that if the mapping size falls between
+ * two host sizes, take the smallest of the two.
+ */
+ if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE)
+ max_map_size = PMD_SIZE;
+ else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
+ max_map_size = PAGE_SIZE;
+
+ *force_pte = (max_map_size == PAGE_SIZE);
+ vma_pagesize = min_t(long, vma_pagesize, max_map_size);
+ vma_shift = __ffs(vma_pagesize);
+ }
+
+ return vma_shift;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1695,65 +1766,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
}
- if (force_pte)
- vma_shift = PAGE_SHIFT;
- else
- vma_shift = get_vma_page_shift(vma, hva);
-
- switch (vma_shift) {
-#ifndef __PAGETABLE_PMD_FOLDED
- case PUD_SHIFT:
- if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
- break;
- fallthrough;
-#endif
- case CONT_PMD_SHIFT:
- vma_shift = PMD_SHIFT;
- fallthrough;
- case PMD_SHIFT:
- if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
- break;
- fallthrough;
- case CONT_PTE_SHIFT:
- vma_shift = PAGE_SHIFT;
- force_pte = true;
- fallthrough;
- case PAGE_SHIFT:
- break;
- default:
- WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
- }
-
+ vma_shift = kvm_s2_resolve_vma_size(vma, hva, memslot, nested,
+ &force_pte, &ipa);
vma_pagesize = 1UL << vma_shift;
- if (nested) {
- unsigned long max_map_size;
-
- max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE;
-
- ipa = kvm_s2_trans_output(nested);
-
- /*
- * If we're about to create a shadow stage 2 entry, then we
- * can only create a block mapping if the guest stage 2 page
- * table uses at least as big a mapping.
- */
- max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
-
- /*
- * Be careful that if the mapping size falls between
- * two host sizes, take the smallest of the two.
- */
- if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE)
- max_map_size = PMD_SIZE;
- else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
- max_map_size = PAGE_SIZE;
-
- force_pte = (max_map_size == PAGE_SIZE);
- vma_pagesize = min_t(long, vma_pagesize, max_map_size);
- vma_shift = __ffs(vma_pagesize);
- }
-
/*
* Both the canonical IPA and fault IPA must be aligned to the
* mapping size to ensure we find the right PFN and lay down the
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 01/13] KVM: arm64: Extract VMA size resolution in user_mem_abort()
2026-03-06 14:02 ` [PATCH v1 01/13] KVM: arm64: Extract VMA size resolution in user_mem_abort() Fuad Tabba
@ 2026-03-17 15:07 ` Joey Gouly
0 siblings, 0 replies; 20+ messages in thread
From: Joey Gouly @ 2026-03-17 15:07 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, kvmarm, linux-arm-kernel, maz, oliver.upton, suzuki.poulose,
yuzenghui, catalin.marinas, will, qperret, vdonnefort
On Fri, Mar 06, 2026 at 02:02:20PM +0000, Fuad Tabba wrote:
> As part of an effort to refactor user_mem_abort() into smaller, more
> focused helper functions, extract the logic responsible for determining
> the VMA shift and page size into a new static helper,
> kvm_s2_resolve_vma_size().
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
> ---
> arch/arm64/kvm/mmu.c | 130 ++++++++++++++++++++++++-------------------
> 1 file changed, 73 insertions(+), 57 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 17d64a1e11e5..f8064b2d3204 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1639,6 +1639,77 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return ret != -EAGAIN ? ret : 0;
> }
>
> +static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma,
> + unsigned long hva,
> + struct kvm_memory_slot *memslot,
> + struct kvm_s2_trans *nested,
> + bool *force_pte, phys_addr_t *ipa)
> +{
> + short vma_shift;
> + long vma_pagesize;
> +
> + if (*force_pte)
> + vma_shift = PAGE_SHIFT;
> + else
> + vma_shift = get_vma_page_shift(vma, hva);
> +
> + switch (vma_shift) {
> +#ifndef __PAGETABLE_PMD_FOLDED
> + case PUD_SHIFT:
> + if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> + break;
> + fallthrough;
> +#endif
> + case CONT_PMD_SHIFT:
> + vma_shift = PMD_SHIFT;
> + fallthrough;
> + case PMD_SHIFT:
> + if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
> + break;
> + fallthrough;
> + case CONT_PTE_SHIFT:
> + vma_shift = PAGE_SHIFT;
> + *force_pte = true;
> + fallthrough;
> + case PAGE_SHIFT:
> + break;
> + default:
> + WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
> + }
> +
> + vma_pagesize = 1UL << vma_shift;
> +
> + if (nested) {
> + unsigned long max_map_size;
> +
> + max_map_size = *force_pte ? PAGE_SIZE : PUD_SIZE;
> +
> + *ipa = kvm_s2_trans_output(nested);
> +
> + /*
> + * If we're about to create a shadow stage 2 entry, then we
> + * can only create a block mapping if the guest stage 2 page
> + * table uses at least as big a mapping.
> + */
> + max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
> +
> + /*
> + * Be careful that if the mapping size falls between
> + * two host sizes, take the smallest of the two.
> + */
> + if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE)
> + max_map_size = PMD_SIZE;
> + else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
> + max_map_size = PAGE_SIZE;
> +
> + *force_pte = (max_map_size == PAGE_SIZE);
> + vma_pagesize = min_t(long, vma_pagesize, max_map_size);
> + vma_shift = __ffs(vma_pagesize);
> + }
> +
> + return vma_shift;
> +}
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_s2_trans *nested,
> struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1695,65 +1766,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
> }
>
> - if (force_pte)
> - vma_shift = PAGE_SHIFT;
> - else
> - vma_shift = get_vma_page_shift(vma, hva);
> -
> - switch (vma_shift) {
> -#ifndef __PAGETABLE_PMD_FOLDED
> - case PUD_SHIFT:
> - if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> - break;
> - fallthrough;
> -#endif
> - case CONT_PMD_SHIFT:
> - vma_shift = PMD_SHIFT;
> - fallthrough;
> - case PMD_SHIFT:
> - if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
> - break;
> - fallthrough;
> - case CONT_PTE_SHIFT:
> - vma_shift = PAGE_SHIFT;
> - force_pte = true;
> - fallthrough;
> - case PAGE_SHIFT:
> - break;
> - default:
> - WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
> - }
> -
> + vma_shift = kvm_s2_resolve_vma_size(vma, hva, memslot, nested,
> + &force_pte, &ipa);
> vma_pagesize = 1UL << vma_shift;
>
> - if (nested) {
> - unsigned long max_map_size;
> -
> - max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE;
> -
> - ipa = kvm_s2_trans_output(nested);
> -
> - /*
> - * If we're about to create a shadow stage 2 entry, then we
> - * can only create a block mapping if the guest stage 2 page
> - * table uses at least as big a mapping.
> - */
> - max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
> -
> - /*
> - * Be careful that if the mapping size falls between
> - * two host sizes, take the smallest of the two.
> - */
> - if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE)
> - max_map_size = PMD_SIZE;
> - else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
> - max_map_size = PAGE_SIZE;
> -
> - force_pte = (max_map_size == PAGE_SIZE);
> - vma_pagesize = min_t(long, vma_pagesize, max_map_size);
> - vma_shift = __ffs(vma_pagesize);
> - }
> -
> /*
> * Both the canonical IPA and fault IPA must be aligned to the
> * mapping size to ensure we find the right PFN and lay down the
> --
> 2.53.0.473.g4a7958ca14-goog
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v1 02/13] KVM: arm64: Introduce struct kvm_s2_fault to user_mem_abort()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 01/13] KVM: arm64: Extract VMA size resolution in user_mem_abort() Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-17 16:00 ` Joey Gouly
2026-03-06 14:02 ` [PATCH v1 03/13] KVM: arm64: Extract PFN resolution in user_mem_abort() Fuad Tabba
` (11 subsequent siblings)
13 siblings, 1 reply; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
The user_mem_abort() function takes many arguments and defines a large
number of local variables. Passing all these variables around to helper
functions would result in functions with too many arguments.
Introduce struct kvm_s2_fault to encapsulate the stage-2 fault state.
This structure holds both the input parameters and the intermediate
state required during the fault handling process.
Update user_mem_abort() to initialize this structure and replace the
usage of local variables with fields from the new structure.
This prepares the ground for further extracting parts of
user_mem_abort() into smaller helper functions that can simply take a
pointer to the fault state structure.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 222 +++++++++++++++++++++++++------------------
1 file changed, 128 insertions(+), 94 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f8064b2d3204..419b793c5891 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1710,38 +1710,68 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma,
return vma_shift;
}
+struct kvm_s2_fault {
+ struct kvm_vcpu *vcpu;
+ phys_addr_t fault_ipa;
+ struct kvm_s2_trans *nested;
+ struct kvm_memory_slot *memslot;
+ unsigned long hva;
+ bool fault_is_perm;
+
+ bool write_fault;
+ bool exec_fault;
+ bool writable;
+ bool topup_memcache;
+ bool mte_allowed;
+ bool is_vma_cacheable;
+ bool s2_force_noncacheable;
+ bool vfio_allow_any_uc;
+ unsigned long mmu_seq;
+ phys_addr_t ipa;
+ short vma_shift;
+ gfn_t gfn;
+ kvm_pfn_t pfn;
+ bool logging_active;
+ bool force_pte;
+ long vma_pagesize;
+ long fault_granule;
+ enum kvm_pgtable_prot prot;
+ struct page *page;
+ vm_flags_t vm_flags;
+};
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
bool fault_is_perm)
{
int ret = 0;
- bool topup_memcache;
- bool write_fault, writable;
- bool exec_fault, mte_allowed, is_vma_cacheable;
- bool s2_force_noncacheable = false, vfio_allow_any_uc = false;
- unsigned long mmu_seq;
- phys_addr_t ipa = fault_ipa;
+ struct kvm_s2_fault fault_data = {
+ .vcpu = vcpu,
+ .fault_ipa = fault_ipa,
+ .nested = nested,
+ .memslot = memslot,
+ .hva = hva,
+ .fault_is_perm = fault_is_perm,
+ .ipa = fault_ipa,
+ .logging_active = memslot_is_logging(memslot),
+ .force_pte = memslot_is_logging(memslot),
+ .s2_force_noncacheable = false,
+ .vfio_allow_any_uc = false,
+ .prot = KVM_PGTABLE_PROT_R,
+ };
+ struct kvm_s2_fault *fault = &fault_data;
struct kvm *kvm = vcpu->kvm;
struct vm_area_struct *vma;
- short vma_shift;
void *memcache;
- gfn_t gfn;
- kvm_pfn_t pfn;
- bool logging_active = memslot_is_logging(memslot);
- bool force_pte = logging_active;
- long vma_pagesize, fault_granule;
- enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
- struct page *page;
- vm_flags_t vm_flags;
enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
- if (fault_is_perm)
- fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
- write_fault = kvm_is_write_fault(vcpu);
- exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
- VM_WARN_ON_ONCE(write_fault && exec_fault);
+ if (fault->fault_is_perm)
+ fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu);
+ fault->write_fault = kvm_is_write_fault(fault->vcpu);
+ fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu);
+ VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault);
/*
* Permission faults just need to update the existing leaf entry,
@@ -1749,43 +1779,44 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* only exception to this is when dirty logging is enabled at runtime
* and a write fault needs to collapse a block entry into a table.
*/
- topup_memcache = !fault_is_perm || (logging_active && write_fault);
- ret = prepare_mmu_memcache(vcpu, topup_memcache, &memcache);
+ fault->topup_memcache = !fault->fault_is_perm ||
+ (fault->logging_active && fault->write_fault);
+ ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache);
if (ret)
return ret;
/*
- * Let's check if we will get back a huge page backed by hugetlbfs, or
+ * Let's check if we will get back a huge fault->page backed by hugetlbfs, or
* get block mapping for device MMIO region.
*/
mmap_read_lock(current->mm);
- vma = vma_lookup(current->mm, hva);
+ vma = vma_lookup(current->mm, fault->hva);
if (unlikely(!vma)) {
- kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+ kvm_err("Failed to find VMA for fault->hva 0x%lx\n", fault->hva);
mmap_read_unlock(current->mm);
return -EFAULT;
}
- vma_shift = kvm_s2_resolve_vma_size(vma, hva, memslot, nested,
- &force_pte, &ipa);
- vma_pagesize = 1UL << vma_shift;
+ fault->vma_shift = kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, fault->nested,
+ &fault->force_pte, &fault->ipa);
+ fault->vma_pagesize = 1UL << fault->vma_shift;
/*
* Both the canonical IPA and fault IPA must be aligned to the
* mapping size to ensure we find the right PFN and lay down the
* mapping in the right place.
*/
- fault_ipa = ALIGN_DOWN(fault_ipa, vma_pagesize);
- ipa = ALIGN_DOWN(ipa, vma_pagesize);
+ fault->fault_ipa = ALIGN_DOWN(fault->fault_ipa, fault->vma_pagesize);
+ fault->ipa = ALIGN_DOWN(fault->ipa, fault->vma_pagesize);
- gfn = ipa >> PAGE_SHIFT;
- mte_allowed = kvm_vma_mte_allowed(vma);
+ fault->gfn = fault->ipa >> PAGE_SHIFT;
+ fault->mte_allowed = kvm_vma_mte_allowed(vma);
- vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+ fault->vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
- vm_flags = vma->vm_flags;
+ fault->vm_flags = vma->vm_flags;
- is_vma_cacheable = kvm_vma_is_cacheable(vma);
+ fault->is_vma_cacheable = kvm_vma_is_cacheable(vma);
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -1798,24 +1829,25 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
* with the smp_wmb() in kvm_mmu_invalidate_end().
*/
- mmu_seq = kvm->mmu_invalidate_seq;
+ fault->mmu_seq = kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);
- pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
- &writable, &page);
- if (pfn == KVM_PFN_ERR_HWPOISON) {
- kvm_send_hwpoison_signal(hva, vma_shift);
+ fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn,
+ fault->write_fault ? FOLL_WRITE : 0,
+ &fault->writable, &fault->page);
+ if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
+ kvm_send_hwpoison_signal(fault->hva, fault->vma_shift);
return 0;
}
- if (is_error_noslot_pfn(pfn))
+ if (is_error_noslot_pfn(fault->pfn))
return -EFAULT;
/*
- * Check if this is non-struct page memory PFN, and cannot support
+ * Check if this is non-struct fault->page memory PFN, and cannot support
* CMOs. It could potentially be unsafe to access as cacheable.
*/
- if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
- if (is_vma_cacheable) {
+ if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) {
+ if (fault->is_vma_cacheable) {
/*
* Whilst the VMA owner expects cacheable mapping to this
* PFN, hardware also has to support the FWB and CACHE DIC
@@ -1832,26 +1864,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
ret = -EFAULT;
} else {
/*
- * If the page was identified as device early by looking at
- * the VMA flags, vma_pagesize is already representing the
+ * If the fault->page was identified as device early by looking at
+ * the VMA flags, fault->vma_pagesize is already representing the
* largest quantity we can map. If instead it was mapped
- * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+ * via __kvm_faultin_pfn(), fault->vma_pagesize is set to PAGE_SIZE
* and must not be upgraded.
*
* In both cases, we don't let transparent_hugepage_adjust()
* change things at the last minute.
*/
- s2_force_noncacheable = true;
+ fault->s2_force_noncacheable = true;
}
- } else if (logging_active && !write_fault) {
+ } else if (fault->logging_active && !fault->write_fault) {
/*
- * Only actually map the page as writable if this was a write
+ * Only actually map the fault->page as fault->writable if this was a write
* fault.
*/
- writable = false;
+ fault->writable = false;
}
- if (exec_fault && s2_force_noncacheable)
+ if (fault->exec_fault && fault->s2_force_noncacheable)
ret = -ENOEXEC;
if (ret)
@@ -1860,101 +1892,103 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
/*
* Guest performs atomic/exclusive operations on memory with unsupported
* attributes (e.g. ld64b/st64b on normal memory when no FEAT_LS64WB)
- * and trigger the exception here. Since the memslot is valid, inject
+ * and trigger the exception here. Since the fault->memslot is valid, inject
* the fault back to the guest.
*/
- if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(vcpu))) {
- kvm_inject_dabt_excl_atomic(vcpu, kvm_vcpu_get_hfar(vcpu));
+ if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) {
+ kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu));
ret = 1;
goto out_put_page;
}
- if (nested)
- adjust_nested_fault_perms(nested, &prot, &writable);
+ if (fault->nested)
+ adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable);
kvm_fault_lock(kvm);
- pgt = vcpu->arch.hw_mmu->pgt;
- if (mmu_invalidate_retry(kvm, mmu_seq)) {
+ pgt = fault->vcpu->arch.hw_mmu->pgt;
+ if (mmu_invalidate_retry(kvm, fault->mmu_seq)) {
ret = -EAGAIN;
goto out_unlock;
}
/*
- * If we are not forced to use page mapping, check if we are
+ * If we are not forced to use fault->page mapping, check if we are
* backed by a THP and thus use block mapping if possible.
*/
- if (vma_pagesize == PAGE_SIZE && !(force_pte || s2_force_noncacheable)) {
- if (fault_is_perm && fault_granule > PAGE_SIZE)
- vma_pagesize = fault_granule;
+ if (fault->vma_pagesize == PAGE_SIZE &&
+ !(fault->force_pte || fault->s2_force_noncacheable)) {
+ if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE)
+ fault->vma_pagesize = fault->fault_granule;
else
- vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
- hva, &pfn,
- &fault_ipa);
+ fault->vma_pagesize = transparent_hugepage_adjust(kvm, fault->memslot,
+ fault->hva, &fault->pfn,
+ &fault->fault_ipa);
- if (vma_pagesize < 0) {
- ret = vma_pagesize;
+ if (fault->vma_pagesize < 0) {
+ ret = fault->vma_pagesize;
goto out_unlock;
}
}
- if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) {
+ if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) {
/* Check the VMM hasn't introduced a new disallowed VMA */
- if (mte_allowed) {
- sanitise_mte_tags(kvm, pfn, vma_pagesize);
+ if (fault->mte_allowed) {
+ sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize);
} else {
ret = -EFAULT;
goto out_unlock;
}
}
- if (writable)
- prot |= KVM_PGTABLE_PROT_W;
+ if (fault->writable)
+ fault->prot |= KVM_PGTABLE_PROT_W;
- if (exec_fault)
- prot |= KVM_PGTABLE_PROT_X;
+ if (fault->exec_fault)
+ fault->prot |= KVM_PGTABLE_PROT_X;
- if (s2_force_noncacheable) {
- if (vfio_allow_any_uc)
- prot |= KVM_PGTABLE_PROT_NORMAL_NC;
+ if (fault->s2_force_noncacheable) {
+ if (fault->vfio_allow_any_uc)
+ fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC;
else
- prot |= KVM_PGTABLE_PROT_DEVICE;
+ fault->prot |= KVM_PGTABLE_PROT_DEVICE;
} else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) {
- prot |= KVM_PGTABLE_PROT_X;
+ fault->prot |= KVM_PGTABLE_PROT_X;
}
- if (nested)
- adjust_nested_exec_perms(kvm, nested, &prot);
+ if (fault->nested)
+ adjust_nested_exec_perms(kvm, fault->nested, &fault->prot);
/*
* Under the premise of getting a FSC_PERM fault, we just need to relax
- * permissions only if vma_pagesize equals fault_granule. Otherwise,
+ * permissions only if fault->vma_pagesize equals fault->fault_granule. Otherwise,
* kvm_pgtable_stage2_map() should be called to change block size.
*/
- if (fault_is_perm && vma_pagesize == fault_granule) {
+ if (fault->fault_is_perm && fault->vma_pagesize == fault->fault_granule) {
/*
* Drop the SW bits in favour of those stored in the
* PTE, which will be preserved.
*/
- prot &= ~KVM_NV_GUEST_MAP_SZ;
- ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags);
+ fault->prot &= ~KVM_NV_GUEST_MAP_SZ;
+ ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa, fault->prot,
+ flags);
} else {
- ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize,
- __pfn_to_phys(pfn), prot,
- memcache, flags);
+ ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault->fault_ipa, fault->vma_pagesize,
+ __pfn_to_phys(fault->pfn), fault->prot,
+ memcache, flags);
}
out_unlock:
- kvm_release_faultin_page(kvm, page, !!ret, writable);
+ kvm_release_faultin_page(kvm, fault->page, !!ret, fault->writable);
kvm_fault_unlock(kvm);
- /* Mark the page dirty only if the fault is handled successfully */
- if (writable && !ret)
- mark_page_dirty_in_slot(kvm, memslot, gfn);
+ /* Mark the fault->page dirty only if the fault is handled successfully */
+ if (fault->writable && !ret)
+ mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn);
return ret != -EAGAIN ? ret : 0;
out_put_page:
- kvm_release_page_unused(page);
+ kvm_release_page_unused(fault->page);
return ret;
}
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 02/13] KVM: arm64: Introduce struct kvm_s2_fault to user_mem_abort()
2026-03-06 14:02 ` [PATCH v1 02/13] KVM: arm64: Introduce struct kvm_s2_fault to user_mem_abort() Fuad Tabba
@ 2026-03-17 16:00 ` Joey Gouly
0 siblings, 0 replies; 20+ messages in thread
From: Joey Gouly @ 2026-03-17 16:00 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, kvmarm, linux-arm-kernel, maz, oliver.upton, suzuki.poulose,
yuzenghui, catalin.marinas, will, qperret, vdonnefort
On Fri, Mar 06, 2026 at 02:02:21PM +0000, Fuad Tabba wrote:
> The user_mem_abort() function takes many arguments and defines a large
> number of local variables. Passing all these variables around to helper
> functions would result in functions with too many arguments.
>
> Introduce struct kvm_s2_fault to encapsulate the stage-2 fault state.
> This structure holds both the input parameters and the intermediate
> state required during the fault handling process.
>
> Update user_mem_abort() to initialize this structure and replace the
> usage of local variables with fields from the new structure.
>
> This prepares the ground for further extracting parts of
> user_mem_abort() into smaller helper functions that can simply take a
> pointer to the fault state structure.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
A few nits later on, that can be changed or ignored!
maz's branch changes a lot of this again, but still:
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
> ---
> arch/arm64/kvm/mmu.c | 222 +++++++++++++++++++++++++------------------
> 1 file changed, 128 insertions(+), 94 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index f8064b2d3204..419b793c5891 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1710,38 +1710,68 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma,
> return vma_shift;
> }
>
> +struct kvm_s2_fault {
> + struct kvm_vcpu *vcpu;
> + phys_addr_t fault_ipa;
> + struct kvm_s2_trans *nested;
> + struct kvm_memory_slot *memslot;
> + unsigned long hva;
> + bool fault_is_perm;
> +
> + bool write_fault;
> + bool exec_fault;
> + bool writable;
> + bool topup_memcache;
> + bool mte_allowed;
> + bool is_vma_cacheable;
> + bool s2_force_noncacheable;
> + bool vfio_allow_any_uc;
> + unsigned long mmu_seq;
> + phys_addr_t ipa;
> + short vma_shift;
> + gfn_t gfn;
> + kvm_pfn_t pfn;
> + bool logging_active;
> + bool force_pte;
> + long vma_pagesize;
> + long fault_granule;
> + enum kvm_pgtable_prot prot;
> + struct page *page;
> + vm_flags_t vm_flags;
> +};
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_s2_trans *nested,
> struct kvm_memory_slot *memslot, unsigned long hva,
> bool fault_is_perm)
> {
> int ret = 0;
> - bool topup_memcache;
> - bool write_fault, writable;
> - bool exec_fault, mte_allowed, is_vma_cacheable;
> - bool s2_force_noncacheable = false, vfio_allow_any_uc = false;
> - unsigned long mmu_seq;
> - phys_addr_t ipa = fault_ipa;
> + struct kvm_s2_fault fault_data = {
> + .vcpu = vcpu,
> + .fault_ipa = fault_ipa,
> + .nested = nested,
> + .memslot = memslot,
> + .hva = hva,
> + .fault_is_perm = fault_is_perm,
> + .ipa = fault_ipa,
> + .logging_active = memslot_is_logging(memslot),
> + .force_pte = memslot_is_logging(memslot),
> + .s2_force_noncacheable = false,
> + .vfio_allow_any_uc = false,
> + .prot = KVM_PGTABLE_PROT_R,
> + };
> + struct kvm_s2_fault *fault = &fault_data;
> struct kvm *kvm = vcpu->kvm;
> struct vm_area_struct *vma;
> - short vma_shift;
> void *memcache;
> - gfn_t gfn;
> - kvm_pfn_t pfn;
> - bool logging_active = memslot_is_logging(memslot);
> - bool force_pte = logging_active;
> - long vma_pagesize, fault_granule;
> - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> struct kvm_pgtable *pgt;
> - struct page *page;
> - vm_flags_t vm_flags;
> enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>
> - if (fault_is_perm)
> - fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
> - write_fault = kvm_is_write_fault(vcpu);
> - exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> - VM_WARN_ON_ONCE(write_fault && exec_fault);
> + if (fault->fault_is_perm)
> + fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu);
> + fault->write_fault = kvm_is_write_fault(fault->vcpu);
> + fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu);
> + VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault);
>
> /*
> * Permission faults just need to update the existing leaf entry,
> @@ -1749,43 +1779,44 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> * only exception to this is when dirty logging is enabled at runtime
> * and a write fault needs to collapse a block entry into a table.
> */
> - topup_memcache = !fault_is_perm || (logging_active && write_fault);
> - ret = prepare_mmu_memcache(vcpu, topup_memcache, &memcache);
> + fault->topup_memcache = !fault->fault_is_perm ||
> + (fault->logging_active && fault->write_fault);
> + ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache);
> if (ret)
> return ret;
>
> /*
> - * Let's check if we will get back a huge page backed by hugetlbfs, or
> + * Let's check if we will get back a huge fault->page backed by hugetlbfs, or
Seems like a find/replace issue. This one (and the others below) seem to be
talking more generically about 'page'.
> * get block mapping for device MMIO region.
> */
> mmap_read_lock(current->mm);
> - vma = vma_lookup(current->mm, hva);
> + vma = vma_lookup(current->mm, fault->hva);
> if (unlikely(!vma)) {
> - kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> + kvm_err("Failed to find VMA for fault->hva 0x%lx\n", fault->hva);
> mmap_read_unlock(current->mm);
> return -EFAULT;
> }
>
> - vma_shift = kvm_s2_resolve_vma_size(vma, hva, memslot, nested,
> - &force_pte, &ipa);
> - vma_pagesize = 1UL << vma_shift;
> + fault->vma_shift = kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, fault->nested,
> + &fault->force_pte, &fault->ipa);
> + fault->vma_pagesize = 1UL << fault->vma_shift;
>
> /*
> * Both the canonical IPA and fault IPA must be aligned to the
> * mapping size to ensure we find the right PFN and lay down the
> * mapping in the right place.
> */
> - fault_ipa = ALIGN_DOWN(fault_ipa, vma_pagesize);
> - ipa = ALIGN_DOWN(ipa, vma_pagesize);
> + fault->fault_ipa = ALIGN_DOWN(fault->fault_ipa, fault->vma_pagesize);
> + fault->ipa = ALIGN_DOWN(fault->ipa, fault->vma_pagesize);
>
> - gfn = ipa >> PAGE_SHIFT;
> - mte_allowed = kvm_vma_mte_allowed(vma);
> + fault->gfn = fault->ipa >> PAGE_SHIFT;
> + fault->mte_allowed = kvm_vma_mte_allowed(vma);
>
> - vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> + fault->vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>
> - vm_flags = vma->vm_flags;
> + fault->vm_flags = vma->vm_flags;
>
> - is_vma_cacheable = kvm_vma_is_cacheable(vma);
> + fault->is_vma_cacheable = kvm_vma_is_cacheable(vma);
>
> /* Don't use the VMA after the unlock -- it may have vanished */
> vma = NULL;
> @@ -1798,24 +1829,25 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
> * with the smp_wmb() in kvm_mmu_invalidate_end().
> */
> - mmu_seq = kvm->mmu_invalidate_seq;
> + fault->mmu_seq = kvm->mmu_invalidate_seq;
> mmap_read_unlock(current->mm);
>
> - pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> - &writable, &page);
> - if (pfn == KVM_PFN_ERR_HWPOISON) {
> - kvm_send_hwpoison_signal(hva, vma_shift);
> + fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn,
> + fault->write_fault ? FOLL_WRITE : 0,
> + &fault->writable, &fault->page);
> + if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> + kvm_send_hwpoison_signal(fault->hva, fault->vma_shift);
> return 0;
> }
> - if (is_error_noslot_pfn(pfn))
> + if (is_error_noslot_pfn(fault->pfn))
> return -EFAULT;
>
> /*
> - * Check if this is non-struct page memory PFN, and cannot support
> + * Check if this is non-struct fault->page memory PFN, and cannot support
Here too: the mechanical rename has leaked into the comment text ("non-struct page" shouldn't have become "non-struct fault->page").
> * CMOs. It could potentially be unsafe to access as cacheable.
> */
> - if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
> - if (is_vma_cacheable) {
> + if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) {
> + if (fault->is_vma_cacheable) {
> /*
> * Whilst the VMA owner expects cacheable mapping to this
> * PFN, hardware also has to support the FWB and CACHE DIC
> @@ -1832,26 +1864,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> ret = -EFAULT;
> } else {
> /*
> - * If the page was identified as device early by looking at
> - * the VMA flags, vma_pagesize is already representing the
> + * If the fault->page was identified as device early by looking at
Here too: "the page" in the comment shouldn't have been renamed to "the fault->page".
> + * the VMA flags, fault->vma_pagesize is already representing the
> * largest quantity we can map. If instead it was mapped
> - * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> + * via __kvm_faultin_pfn(), fault->vma_pagesize is set to PAGE_SIZE
> * and must not be upgraded.
> *
> * In both cases, we don't let transparent_hugepage_adjust()
> * change things at the last minute.
> */
> - s2_force_noncacheable = true;
> + fault->s2_force_noncacheable = true;
> }
> - } else if (logging_active && !write_fault) {
> + } else if (fault->logging_active && !fault->write_fault) {
> /*
> - * Only actually map the page as writable if this was a write
> + * Only actually map the fault->page as fault->writable if this was a write
> * fault.
> */
> - writable = false;
> + fault->writable = false;
> }
>
> - if (exec_fault && s2_force_noncacheable)
> + if (fault->exec_fault && fault->s2_force_noncacheable)
> ret = -ENOEXEC;
>
> if (ret)
> @@ -1860,101 +1892,103 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> /*
> * Guest performs atomic/exclusive operations on memory with unsupported
> * attributes (e.g. ld64b/st64b on normal memory when no FEAT_LS64WB)
> - * and trigger the exception here. Since the memslot is valid, inject
> + * and trigger the exception here. Since the fault->memslot is valid, inject
Here too: "the memslot" in the comment shouldn't have become "the fault->memslot".
> * the fault back to the guest.
> */
> - if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(vcpu))) {
> - kvm_inject_dabt_excl_atomic(vcpu, kvm_vcpu_get_hfar(vcpu));
> + if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) {
> + kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu));
> ret = 1;
> goto out_put_page;
> }
>
> - if (nested)
> - adjust_nested_fault_perms(nested, &prot, &writable);
> + if (fault->nested)
> + adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable);
>
> kvm_fault_lock(kvm);
> - pgt = vcpu->arch.hw_mmu->pgt;
> - if (mmu_invalidate_retry(kvm, mmu_seq)) {
> + pgt = fault->vcpu->arch.hw_mmu->pgt;
> + if (mmu_invalidate_retry(kvm, fault->mmu_seq)) {
> ret = -EAGAIN;
> goto out_unlock;
> }
>
> /*
> - * If we are not forced to use page mapping, check if we are
> + * If we are not forced to use fault->page mapping, check if we are
Here too: "page mapping" in the comment shouldn't have become "fault->page mapping".
> * backed by a THP and thus use block mapping if possible.
> */
> - if (vma_pagesize == PAGE_SIZE && !(force_pte || s2_force_noncacheable)) {
> - if (fault_is_perm && fault_granule > PAGE_SIZE)
> - vma_pagesize = fault_granule;
> + if (fault->vma_pagesize == PAGE_SIZE &&
> + !(fault->force_pte || fault->s2_force_noncacheable)) {
> + if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE)
> + fault->vma_pagesize = fault->fault_granule;
> else
> - vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
> - hva, &pfn,
> - &fault_ipa);
> + fault->vma_pagesize = transparent_hugepage_adjust(kvm, fault->memslot,
> + fault->hva, &fault->pfn,
> + &fault->fault_ipa);
>
> - if (vma_pagesize < 0) {
> - ret = vma_pagesize;
> + if (fault->vma_pagesize < 0) {
> + ret = fault->vma_pagesize;
> goto out_unlock;
> }
> }
>
> - if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) {
> + if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) {
> /* Check the VMM hasn't introduced a new disallowed VMA */
> - if (mte_allowed) {
> - sanitise_mte_tags(kvm, pfn, vma_pagesize);
> + if (fault->mte_allowed) {
> + sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize);
> } else {
> ret = -EFAULT;
> goto out_unlock;
> }
> }
>
> - if (writable)
> - prot |= KVM_PGTABLE_PROT_W;
> + if (fault->writable)
> + fault->prot |= KVM_PGTABLE_PROT_W;
>
> - if (exec_fault)
> - prot |= KVM_PGTABLE_PROT_X;
> + if (fault->exec_fault)
> + fault->prot |= KVM_PGTABLE_PROT_X;
>
> - if (s2_force_noncacheable) {
> - if (vfio_allow_any_uc)
> - prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> + if (fault->s2_force_noncacheable) {
> + if (fault->vfio_allow_any_uc)
> + fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> else
> - prot |= KVM_PGTABLE_PROT_DEVICE;
> + fault->prot |= KVM_PGTABLE_PROT_DEVICE;
> } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) {
> - prot |= KVM_PGTABLE_PROT_X;
> + fault->prot |= KVM_PGTABLE_PROT_X;
> }
>
> - if (nested)
> - adjust_nested_exec_perms(kvm, nested, &prot);
> + if (fault->nested)
> + adjust_nested_exec_perms(kvm, fault->nested, &fault->prot);
>
> /*
> * Under the premise of getting a FSC_PERM fault, we just need to relax
> - * permissions only if vma_pagesize equals fault_granule. Otherwise,
> + * permissions only if fault->vma_pagesize equals fault->fault_granule. Otherwise,
> * kvm_pgtable_stage2_map() should be called to change block size.
> */
> - if (fault_is_perm && vma_pagesize == fault_granule) {
> + if (fault->fault_is_perm && fault->vma_pagesize == fault->fault_granule) {
> /*
> * Drop the SW bits in favour of those stored in the
> * PTE, which will be preserved.
> */
> - prot &= ~KVM_NV_GUEST_MAP_SZ;
> - ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags);
> + fault->prot &= ~KVM_NV_GUEST_MAP_SZ;
> + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa, fault->prot,
> + flags);
> } else {
> - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize,
> - __pfn_to_phys(pfn), prot,
> - memcache, flags);
> + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault->fault_ipa, fault->vma_pagesize,
> + __pfn_to_phys(fault->pfn), fault->prot,
> + memcache, flags);
> }
>
> out_unlock:
> - kvm_release_faultin_page(kvm, page, !!ret, writable);
> + kvm_release_faultin_page(kvm, fault->page, !!ret, fault->writable);
> kvm_fault_unlock(kvm);
>
> - /* Mark the page dirty only if the fault is handled successfully */
> - if (writable && !ret)
> - mark_page_dirty_in_slot(kvm, memslot, gfn);
> + /* Mark the fault->page dirty only if the fault is handled successfully */
> + if (fault->writable && !ret)
> + mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn);
>
> return ret != -EAGAIN ? ret : 0;
>
> out_put_page:
> - kvm_release_page_unused(page);
> + kvm_release_page_unused(fault->page);
> return ret;
> }
>
> --
> 2.53.0.473.g4a7958ca14-goog
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v1 03/13] KVM: arm64: Extract PFN resolution in user_mem_abort()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 01/13] KVM: arm64: Extract VMA size resolution in user_mem_abort() Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 02/13] KVM: arm64: Introduce struct kvm_s2_fault to user_mem_abort() Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 04/13] KVM: arm64: Isolate mmap_read_lock inside new kvm_s2_fault_get_vma_info() helper Fuad Tabba
` (10 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Extract the section of code responsible for pinning the physical page
frame number (PFN) backing the faulting IPA into a new helper,
kvm_s2_fault_pin_pfn().
This helper encapsulates the critical section where the mmap_read_lock
is held, the VMA is looked up, the MMU invalidation sequence is sampled,
and the PFN is ultimately resolved via __kvm_faultin_pfn(). It also
handles the early exits for hardware-poisoned pages and noslot PFNs.
By isolating this region, we can begin to organize the state variables
required for PFN resolution into the kvm_s2_fault struct, clearing out
a significant amount of local variable clutter from user_mem_abort().
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 105 ++++++++++++++++++++++++-------------------
1 file changed, 59 insertions(+), 46 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 419b793c5891..d56c6422ca5f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1740,55 +1740,11 @@ struct kvm_s2_fault {
vm_flags_t vm_flags;
};
-static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
- struct kvm_s2_trans *nested,
- struct kvm_memory_slot *memslot, unsigned long hva,
- bool fault_is_perm)
+static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
{
- int ret = 0;
- struct kvm_s2_fault fault_data = {
- .vcpu = vcpu,
- .fault_ipa = fault_ipa,
- .nested = nested,
- .memslot = memslot,
- .hva = hva,
- .fault_is_perm = fault_is_perm,
- .ipa = fault_ipa,
- .logging_active = memslot_is_logging(memslot),
- .force_pte = memslot_is_logging(memslot),
- .s2_force_noncacheable = false,
- .vfio_allow_any_uc = false,
- .prot = KVM_PGTABLE_PROT_R,
- };
- struct kvm_s2_fault *fault = &fault_data;
- struct kvm *kvm = vcpu->kvm;
struct vm_area_struct *vma;
- void *memcache;
- struct kvm_pgtable *pgt;
- enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
+ struct kvm *kvm = fault->vcpu->kvm;
- if (fault->fault_is_perm)
- fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu);
- fault->write_fault = kvm_is_write_fault(fault->vcpu);
- fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu);
- VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault);
-
- /*
- * Permission faults just need to update the existing leaf entry,
- * and so normally don't require allocations from the memcache. The
- * only exception to this is when dirty logging is enabled at runtime
- * and a write fault needs to collapse a block entry into a table.
- */
- fault->topup_memcache = !fault->fault_is_perm ||
- (fault->logging_active && fault->write_fault);
- ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache);
- if (ret)
- return ret;
-
- /*
- * Let's check if we will get back a huge fault->page backed by hugetlbfs, or
- * get block mapping for device MMIO region.
- */
mmap_read_lock(current->mm);
vma = vma_lookup(current->mm, fault->hva);
if (unlikely(!vma)) {
@@ -1842,6 +1798,63 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (is_error_noslot_pfn(fault->pfn))
return -EFAULT;
+ return 1;
+}
+
+static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+ struct kvm_s2_trans *nested,
+ struct kvm_memory_slot *memslot, unsigned long hva,
+ bool fault_is_perm)
+{
+ int ret = 0;
+ struct kvm_s2_fault fault_data = {
+ .vcpu = vcpu,
+ .fault_ipa = fault_ipa,
+ .nested = nested,
+ .memslot = memslot,
+ .hva = hva,
+ .fault_is_perm = fault_is_perm,
+ .ipa = fault_ipa,
+ .logging_active = memslot_is_logging(memslot),
+ .force_pte = memslot_is_logging(memslot),
+ .s2_force_noncacheable = false,
+ .vfio_allow_any_uc = false,
+ .prot = KVM_PGTABLE_PROT_R,
+ };
+ struct kvm_s2_fault *fault = &fault_data;
+ struct kvm *kvm = vcpu->kvm;
+ void *memcache;
+ struct kvm_pgtable *pgt;
+ enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
+
+ if (fault->fault_is_perm)
+ fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu);
+ fault->write_fault = kvm_is_write_fault(fault->vcpu);
+ fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu);
+ VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault);
+
+ /*
+ * Permission faults just need to update the existing leaf entry,
+ * and so normally don't require allocations from the memcache. The
+ * only exception to this is when dirty logging is enabled at runtime
+ * and a write fault needs to collapse a block entry into a table.
+ */
+ fault->topup_memcache = !fault->fault_is_perm ||
+ (fault->logging_active && fault->write_fault);
+ ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache);
+ if (ret)
+ return ret;
+
+ /*
+ * Let's check if we will get back a huge fault->page backed by hugetlbfs, or
+ * get block mapping for device MMIO region.
+ */
+ ret = kvm_s2_fault_pin_pfn(fault);
+ if (ret != 1)
+ return ret;
+
+ ret = 0;
+
/*
* Check if this is non-struct fault->page memory PFN, and cannot support
* CMOs. It could potentially be unsafe to access as cacheable.
--
2.53.0.473.g4a7958ca14-goog
* [PATCH v1 04/13] KVM: arm64: Isolate mmap_read_lock inside new kvm_s2_fault_get_vma_info() helper
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (2 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 03/13] KVM: arm64: Extract PFN resolution in user_mem_abort() Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 05/13] KVM: arm64: Extract stage-2 permission logic in user_mem_abort() Fuad Tabba
` (9 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Extract the VMA lookup and metadata snapshotting logic from
kvm_s2_fault_pin_pfn() into a tightly-scoped sub-helper.
This refactoring structurally eliminates a TOCTOU (Time-Of-Check to
Time-Of-Use) hazard and latent use-after-free risk involving the vma
pointer. In the previous layout, the mmap_read_lock was taken, the vma
was looked up, and the lock was dropped before the function continued on
to map the PFN. While an explicit vma = NULL safeguard was present, the
vma variable remained lexically in scope for the remainder of the
function.
By isolating the locked region into kvm_s2_fault_get_vma_info(), the vma
pointer becomes a local variable strictly confined to that sub-helper.
Because the pointer's scope ends when the sub-helper returns, the
subsequent page fault logic in kvm_s2_fault_pin_pfn() cannot
accidentally access the vanished VMA, eliminating this bug class by
design.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d56c6422ca5f..344a477e1bff 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1740,7 +1740,7 @@ struct kvm_s2_fault {
vm_flags_t vm_flags;
};
-static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
+static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault)
{
struct vm_area_struct *vma;
struct kvm *kvm = fault->vcpu->kvm;
@@ -1774,9 +1774,6 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
fault->is_vma_cacheable = kvm_vma_is_cacheable(vma);
- /* Don't use the VMA after the unlock -- it may have vanished */
- vma = NULL;
-
/*
* Read mmu_invalidate_seq so that KVM can detect if the results of
* vma_lookup() or __kvm_faultin_pfn() become stale prior to
@@ -1788,6 +1785,17 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
fault->mmu_seq = kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);
+ return 0;
+}
+
+static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
+{
+ int ret;
+
+ ret = kvm_s2_fault_get_vma_info(fault);
+ if (ret)
+ return ret;
+
fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn,
fault->write_fault ? FOLL_WRITE : 0,
&fault->writable, &fault->page);
--
2.53.0.473.g4a7958ca14-goog
* [PATCH v1 05/13] KVM: arm64: Extract stage-2 permission logic in user_mem_abort()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (3 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 04/13] KVM: arm64: Isolate mmap_read_lock inside new kvm_s2_fault_get_vma_info() helper Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 06/13] KVM: arm64: Extract page table mapping " Fuad Tabba
` (8 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Extract the logic that computes the stage-2 protections and rejects
disallowed accesses (e.g., execution faults on memory forced
non-cacheable) into a new helper function, kvm_s2_fault_compute_prot().
The helper also handles injecting atomic/exclusive faults back into the
guest when necessary.
This refactoring step separates the permission computation from the
mapping logic, making the main fault handler flow clearer.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 163 +++++++++++++++++++++++--------------------
1 file changed, 87 insertions(+), 76 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 344a477e1bff..b328299cc0f5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1809,6 +1809,89 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
return 1;
}
+static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault)
+{
+ struct kvm *kvm = fault->vcpu->kvm;
+
+ /*
+ * Check if this is non-struct page memory PFN, and cannot support
+ * CMOs. It could potentially be unsafe to access as cacheable.
+ */
+ if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) {
+ if (fault->is_vma_cacheable) {
+ /*
+ * Whilst the VMA owner expects cacheable mapping to this
+ * PFN, hardware also has to support the FWB and CACHE DIC
+ * features.
+ *
+ * ARM64 KVM relies on kernel VA mapping to the PFN to
+ * perform cache maintenance as the CMO instructions work on
+ * virtual addresses. VM_PFNMAP region are not necessarily
+ * mapped to a KVA and hence the presence of hardware features
+ * S2FWB and CACHE DIC are mandatory to avoid the need for
+ * cache maintenance.
+ */
+ if (!kvm_supports_cacheable_pfnmap())
+ return -EFAULT;
+ } else {
+ /*
+ * If the page was identified as device early by looking at
+ * the VMA flags, vma_pagesize is already representing the
+ * largest quantity we can map. If instead it was mapped
+ * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+ * and must not be upgraded.
+ *
+ * In both cases, we don't let transparent_hugepage_adjust()
+ * change things at the last minute.
+ */
+ fault->s2_force_noncacheable = true;
+ }
+ } else if (fault->logging_active && !fault->write_fault) {
+ /*
+ * Only actually map the page as writable if this was a write
+ * fault.
+ */
+ fault->writable = false;
+ }
+
+ if (fault->exec_fault && fault->s2_force_noncacheable)
+ return -ENOEXEC;
+
+ /*
+ * Guest performs atomic/exclusive operations on memory with unsupported
+ * attributes (e.g. ld64b/st64b on normal memory when no FEAT_LS64WB)
+ * and trigger the exception here. Since the memslot is valid, inject
+ * the fault back to the guest.
+ */
+ if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) {
+ kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu));
+ return 1;
+ }
+
+ if (fault->nested)
+ adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable);
+
+ if (fault->writable)
+ fault->prot |= KVM_PGTABLE_PROT_W;
+
+ if (fault->exec_fault)
+ fault->prot |= KVM_PGTABLE_PROT_X;
+
+ if (fault->s2_force_noncacheable) {
+ if (fault->vfio_allow_any_uc)
+ fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC;
+ else
+ fault->prot |= KVM_PGTABLE_PROT_DEVICE;
+ } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) {
+ fault->prot |= KVM_PGTABLE_PROT_X;
+ }
+
+ if (fault->nested)
+ adjust_nested_exec_perms(kvm, fault->nested, &fault->prot);
+
+ return 0;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1863,68 +1946,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
ret = 0;
- /*
- * Check if this is non-struct fault->page memory PFN, and cannot support
- * CMOs. It could potentially be unsafe to access as cacheable.
- */
- if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) {
- if (fault->is_vma_cacheable) {
- /*
- * Whilst the VMA owner expects cacheable mapping to this
- * PFN, hardware also has to support the FWB and CACHE DIC
- * features.
- *
- * ARM64 KVM relies on kernel VA mapping to the PFN to
- * perform cache maintenance as the CMO instructions work on
- * virtual addresses. VM_PFNMAP region are not necessarily
- * mapped to a KVA and hence the presence of hardware features
- * S2FWB and CACHE DIC are mandatory to avoid the need for
- * cache maintenance.
- */
- if (!kvm_supports_cacheable_pfnmap())
- ret = -EFAULT;
- } else {
- /*
- * If the fault->page was identified as device early by looking at
- * the VMA flags, fault->vma_pagesize is already representing the
- * largest quantity we can map. If instead it was mapped
- * via __kvm_faultin_pfn(), fault->vma_pagesize is set to PAGE_SIZE
- * and must not be upgraded.
- *
- * In both cases, we don't let transparent_hugepage_adjust()
- * change things at the last minute.
- */
- fault->s2_force_noncacheable = true;
- }
- } else if (fault->logging_active && !fault->write_fault) {
- /*
- * Only actually map the fault->page as fault->writable if this was a write
- * fault.
- */
- fault->writable = false;
+ ret = kvm_s2_fault_compute_prot(fault);
+ if (ret == 1) {
+ ret = 1; /* fault injected */
+ goto out_put_page;
}
-
- if (fault->exec_fault && fault->s2_force_noncacheable)
- ret = -ENOEXEC;
-
if (ret)
goto out_put_page;
- /*
- * Guest performs atomic/exclusive operations on memory with unsupported
- * attributes (e.g. ld64b/st64b on normal memory when no FEAT_LS64WB)
- * and trigger the exception here. Since the fault->memslot is valid, inject
- * the fault back to the guest.
- */
- if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) {
- kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu));
- ret = 1;
- goto out_put_page;
- }
-
- if (fault->nested)
- adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable);
-
kvm_fault_lock(kvm);
pgt = fault->vcpu->arch.hw_mmu->pgt;
if (mmu_invalidate_retry(kvm, fault->mmu_seq)) {
@@ -1961,24 +1990,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}
}
- if (fault->writable)
- fault->prot |= KVM_PGTABLE_PROT_W;
-
- if (fault->exec_fault)
- fault->prot |= KVM_PGTABLE_PROT_X;
-
- if (fault->s2_force_noncacheable) {
- if (fault->vfio_allow_any_uc)
- fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC;
- else
- fault->prot |= KVM_PGTABLE_PROT_DEVICE;
- } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) {
- fault->prot |= KVM_PGTABLE_PROT_X;
- }
-
- if (fault->nested)
- adjust_nested_exec_perms(kvm, fault->nested, &fault->prot);
-
/*
* Under the premise of getting a FSC_PERM fault, we just need to relax
* permissions only if fault->vma_pagesize equals fault->fault_granule. Otherwise,
--
2.53.0.473.g4a7958ca14-goog
* [PATCH v1 06/13] KVM: arm64: Extract page table mapping in user_mem_abort()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (4 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 05/13] KVM: arm64: Extract stage-2 permission logic in user_mem_abort() Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 07/13] KVM: arm64: Simplify nested VMA shift calculation Fuad Tabba
` (7 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Extract the code responsible for locking the KVM MMU and mapping the PFN
into the stage-2 page tables into a new helper, kvm_s2_fault_map().
This helper manages the kvm_fault_lock, checks for MMU invalidation
retries, attempts to adjust for transparent huge pages (THP), handles
MTE sanitization if needed, and finally maps or relaxes permissions on
the stage-2 entries.
With this change, the main user_mem_abort() function is now a sequential
dispatcher that delegates to specialized helper functions.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 128 +++++++++++++++++++++++--------------------
1 file changed, 68 insertions(+), 60 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b328299cc0f5..833a7f769467 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1892,68 +1892,13 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault)
return 0;
}
-static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
- struct kvm_s2_trans *nested,
- struct kvm_memory_slot *memslot, unsigned long hva,
- bool fault_is_perm)
+static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache)
{
- int ret = 0;
- struct kvm_s2_fault fault_data = {
- .vcpu = vcpu,
- .fault_ipa = fault_ipa,
- .nested = nested,
- .memslot = memslot,
- .hva = hva,
- .fault_is_perm = fault_is_perm,
- .ipa = fault_ipa,
- .logging_active = memslot_is_logging(memslot),
- .force_pte = memslot_is_logging(memslot),
- .s2_force_noncacheable = false,
- .vfio_allow_any_uc = false,
- .prot = KVM_PGTABLE_PROT_R,
- };
- struct kvm_s2_fault *fault = &fault_data;
- struct kvm *kvm = vcpu->kvm;
- void *memcache;
+ struct kvm *kvm = fault->vcpu->kvm;
struct kvm_pgtable *pgt;
+ int ret;
enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
- if (fault->fault_is_perm)
- fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu);
- fault->write_fault = kvm_is_write_fault(fault->vcpu);
- fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu);
- VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault);
-
- /*
- * Permission faults just need to update the existing leaf entry,
- * and so normally don't require allocations from the memcache. The
- * only exception to this is when dirty logging is enabled at runtime
- * and a write fault needs to collapse a block entry into a table.
- */
- fault->topup_memcache = !fault->fault_is_perm ||
- (fault->logging_active && fault->write_fault);
- ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache);
- if (ret)
- return ret;
-
- /*
- * Let's check if we will get back a huge fault->page backed by hugetlbfs, or
- * get block mapping for device MMIO region.
- */
- ret = kvm_s2_fault_pin_pfn(fault);
- if (ret != 1)
- return ret;
-
- ret = 0;
-
- ret = kvm_s2_fault_compute_prot(fault);
- if (ret == 1) {
- ret = 1; /* fault injected */
- goto out_put_page;
- }
- if (ret)
- goto out_put_page;
-
kvm_fault_lock(kvm);
pgt = fault->vcpu->arch.hw_mmu->pgt;
if (mmu_invalidate_retry(kvm, fault->mmu_seq)) {
@@ -2001,8 +1946,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* PTE, which will be preserved.
*/
fault->prot &= ~KVM_NV_GUEST_MAP_SZ;
- ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa, fault->prot,
- flags);
+ ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa,
+ fault->prot, flags);
} else {
ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault->fault_ipa, fault->vma_pagesize,
__pfn_to_phys(fault->pfn), fault->prot,
@@ -2018,6 +1963,69 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn);
return ret != -EAGAIN ? ret : 0;
+}
+
+static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+ struct kvm_s2_trans *nested,
+ struct kvm_memory_slot *memslot, unsigned long hva,
+ bool fault_is_perm)
+{
+ int ret = 0;
+ struct kvm_s2_fault fault_data = {
+ .vcpu = vcpu,
+ .fault_ipa = fault_ipa,
+ .nested = nested,
+ .memslot = memslot,
+ .hva = hva,
+ .fault_is_perm = fault_is_perm,
+ .ipa = fault_ipa,
+ .logging_active = memslot_is_logging(memslot),
+ .force_pte = memslot_is_logging(memslot),
+ .s2_force_noncacheable = false,
+ .vfio_allow_any_uc = false,
+ .prot = KVM_PGTABLE_PROT_R,
+ };
+ struct kvm_s2_fault *fault = &fault_data;
+ void *memcache;
+
+ if (fault->fault_is_perm)
+ fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu);
+ fault->write_fault = kvm_is_write_fault(fault->vcpu);
+ fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu);
+ VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault);
+
+ /*
+ * Permission faults just need to update the existing leaf entry,
+ * and so normally don't require allocations from the memcache. The
+ * only exception to this is when dirty logging is enabled at runtime
+ * and a write fault needs to collapse a block entry into a table.
+ */
+ fault->topup_memcache = !fault->fault_is_perm ||
+ (fault->logging_active && fault->write_fault);
+ ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache);
+ if (ret)
+ return ret;
+
+ /*
+ * Let's check if we will get back a huge fault->page backed by hugetlbfs, or
+ * get block mapping for device MMIO region.
+ */
+ ret = kvm_s2_fault_pin_pfn(fault);
+ if (ret != 1)
+ return ret;
+
+ ret = 0;
+
+ ret = kvm_s2_fault_compute_prot(fault);
+ if (ret == 1) {
+ ret = 1; /* fault injected */
+ goto out_put_page;
+ }
+ if (ret)
+ goto out_put_page;
+
+ ret = kvm_s2_fault_map(fault, memcache);
+ return ret;
out_put_page:
kvm_release_page_unused(fault->page);
--
2.53.0.473.g4a7958ca14-goog
* [PATCH v1 07/13] KVM: arm64: Simplify nested VMA shift calculation
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (5 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 06/13] KVM: arm64: Extract page table mapping " Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 08/13] KVM: arm64: Remove redundant state variables from struct kvm_s2_fault Fuad Tabba
` (6 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
In the kvm_s2_resolve_vma_size() helper, the local variable vma_pagesize
is derived from vma_shift only so that it can be bounded by max_map_size
and then converted back to a shift via __ffs().
Because vma_pagesize and max_map_size are both powers of two, the logic
can be simplified by omitting vma_pagesize entirely and bounding
vma_shift directly by the shift of max_map_size. This achieves the same
result while keeping the size-to-shift round trip out of the helper.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 833a7f769467..090a906d3a12 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1646,7 +1646,6 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma,
bool *force_pte, phys_addr_t *ipa)
{
short vma_shift;
- long vma_pagesize;
if (*force_pte)
vma_shift = PAGE_SHIFT;
@@ -1677,8 +1676,6 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma,
WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
}
- vma_pagesize = 1UL << vma_shift;
-
if (nested) {
unsigned long max_map_size;
@@ -1703,8 +1700,7 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma,
max_map_size = PAGE_SIZE;
*force_pte = (max_map_size == PAGE_SIZE);
- vma_pagesize = min_t(long, vma_pagesize, max_map_size);
- vma_shift = __ffs(vma_pagesize);
+ vma_shift = min_t(short, vma_shift, __ffs(max_map_size));
}
return vma_shift;
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related [flat|nested] 20+ messages in thread

* [PATCH v1 08/13] KVM: arm64: Remove redundant state variables from struct kvm_s2_fault
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (6 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 07/13] KVM: arm64: Simplify nested VMA shift calculation Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 09/13] KVM: arm64: Simplify return logic in user_mem_abort() Fuad Tabba
` (5 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Remove redundant variables vma_shift and vfio_allow_any_uc from struct
kvm_s2_fault as they are easily derived or checked when needed.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 090a906d3a12..01f4f4bee155 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1721,10 +1721,8 @@ struct kvm_s2_fault {
bool mte_allowed;
bool is_vma_cacheable;
bool s2_force_noncacheable;
- bool vfio_allow_any_uc;
unsigned long mmu_seq;
phys_addr_t ipa;
- short vma_shift;
gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active;
@@ -1749,9 +1747,9 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault)
return -EFAULT;
}
- fault->vma_shift = kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, fault->nested,
- &fault->force_pte, &fault->ipa);
- fault->vma_pagesize = 1UL << fault->vma_shift;
+ fault->vma_pagesize = 1UL << kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot,
+ fault->nested, &fault->force_pte,
+ &fault->ipa);
/*
* Both the canonical IPA and fault IPA must be aligned to the
@@ -1764,8 +1762,6 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault)
fault->gfn = fault->ipa >> PAGE_SHIFT;
fault->mte_allowed = kvm_vma_mte_allowed(vma);
- fault->vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
-
fault->vm_flags = vma->vm_flags;
fault->is_vma_cacheable = kvm_vma_is_cacheable(vma);
@@ -1796,7 +1792,7 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
fault->write_fault ? FOLL_WRITE : 0,
&fault->writable, &fault->page);
if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
- kvm_send_hwpoison_signal(fault->hva, fault->vma_shift);
+ kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize));
return 0;
}
if (is_error_noslot_pfn(fault->pfn))
@@ -1874,7 +1870,7 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault)
fault->prot |= KVM_PGTABLE_PROT_X;
if (fault->s2_force_noncacheable) {
- if (fault->vfio_allow_any_uc)
+ if (fault->vm_flags & VM_ALLOW_ANY_UNCACHED)
fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC;
else
fault->prot |= KVM_PGTABLE_PROT_DEVICE;
@@ -1978,7 +1974,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
.logging_active = memslot_is_logging(memslot),
.force_pte = memslot_is_logging(memslot),
.s2_force_noncacheable = false,
- .vfio_allow_any_uc = false,
.prot = KVM_PGTABLE_PROT_R,
};
struct kvm_s2_fault *fault = &fault_data;
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related [flat|nested] 20+ messages in thread

* [PATCH v1 09/13] KVM: arm64: Simplify return logic in user_mem_abort()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (7 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 08/13] KVM: arm64: Remove redundant state variables from struct kvm_s2_fault Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 10/13] KVM: arm64: Initialize struct kvm_s2_fault completely at declaration Fuad Tabba
` (4 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
With the refactoring done, the final return block of user_mem_abort()
can be tidied up a bit more.
Clean up the trailing edge by dropping the unnecessary assignment,
collapsing the error handling after kvm_s2_fault_compute_prot(), and
tail-calling kvm_s2_fault_map() directly.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 17 ++++-------------
1 file changed, 4 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 01f4f4bee155..35bcacba5800 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2005,22 +2005,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (ret != 1)
return ret;
- ret = 0;
-
ret = kvm_s2_fault_compute_prot(fault);
- if (ret == 1) {
- ret = 1; /* fault injected */
- goto out_put_page;
+ if (ret) {
+ kvm_release_page_unused(fault->page);
+ return ret;
}
- if (ret)
- goto out_put_page;
- ret = kvm_s2_fault_map(fault, memcache);
- return ret;
-
-out_put_page:
- kvm_release_page_unused(fault->page);
- return ret;
+ return kvm_s2_fault_map(fault, memcache);
}
/* Resolve the access fault by making the page young again. */
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related [flat|nested] 20+ messages in thread

* [PATCH v1 10/13] KVM: arm64: Initialize struct kvm_s2_fault completely at declaration
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (8 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 09/13] KVM: arm64: Simplify return logic in user_mem_abort() Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 11/13] KVM: arm64: Optimize early exit checks in kvm_s2_fault_pin_pfn() Fuad Tabba
` (3 subsequent siblings)
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Simplify the initialization of struct kvm_s2_fault in user_mem_abort().
Instead of partially initializing the struct via designated initializers
and then sequentially assigning the remaining fields (like write_fault
and topup_memcache) further down the function, evaluate those
dependencies upfront.
This allows the entire struct to be fully initialized at declaration. It
also eliminates the need for the intermediate fault_data variable and
its associated fault pointer, reducing boilerplate.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 34 ++++++++++++++++------------------
1 file changed, 16 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 35bcacba5800..2d6e749c1756 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1962,8 +1962,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot, unsigned long hva,
bool fault_is_perm)
{
- int ret = 0;
- struct kvm_s2_fault fault_data = {
+ bool write_fault = kvm_is_write_fault(vcpu);
+ bool logging_active = memslot_is_logging(memslot);
+ struct kvm_s2_fault fault = {
.vcpu = vcpu,
.fault_ipa = fault_ipa,
.nested = nested,
@@ -1971,19 +1972,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
.hva = hva,
.fault_is_perm = fault_is_perm,
.ipa = fault_ipa,
- .logging_active = memslot_is_logging(memslot),
- .force_pte = memslot_is_logging(memslot),
- .s2_force_noncacheable = false,
+ .logging_active = logging_active,
+ .force_pte = logging_active,
.prot = KVM_PGTABLE_PROT_R,
+ .fault_granule = fault_is_perm ? kvm_vcpu_trap_get_perm_fault_granule(vcpu) : 0,
+ .write_fault = write_fault,
+ .exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu),
+ .topup_memcache = !fault_is_perm || (logging_active && write_fault),
};
- struct kvm_s2_fault *fault = &fault_data;
void *memcache;
+ int ret;
- if (fault->fault_is_perm)
- fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu);
- fault->write_fault = kvm_is_write_fault(fault->vcpu);
- fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu);
- VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault);
+ VM_WARN_ON_ONCE(fault.write_fault && fault.exec_fault);
/*
* Permission faults just need to update the existing leaf entry,
@@ -1991,9 +1991,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* only exception to this is when dirty logging is enabled at runtime
* and a write fault needs to collapse a block entry into a table.
*/
- fault->topup_memcache = !fault->fault_is_perm ||
- (fault->logging_active && fault->write_fault);
- ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache);
+ ret = prepare_mmu_memcache(vcpu, fault.topup_memcache, &memcache);
if (ret)
return ret;
@@ -2001,17 +1999,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* Let's check if we will get back a huge fault->page backed by hugetlbfs, or
* get block mapping for device MMIO region.
*/
- ret = kvm_s2_fault_pin_pfn(fault);
+ ret = kvm_s2_fault_pin_pfn(&fault);
if (ret != 1)
return ret;
- ret = kvm_s2_fault_compute_prot(fault);
+ ret = kvm_s2_fault_compute_prot(&fault);
if (ret) {
- kvm_release_page_unused(fault->page);
+ kvm_release_page_unused(fault.page);
return ret;
}
- return kvm_s2_fault_map(fault, memcache);
+ return kvm_s2_fault_map(&fault, memcache);
}
/* Resolve the access fault by making the page young again. */
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related [flat|nested] 20+ messages in thread

* [PATCH v1 11/13] KVM: arm64: Optimize early exit checks in kvm_s2_fault_pin_pfn()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (9 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 10/13] KVM: arm64: Initialize struct kvm_s2_fault completely at declaration Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-17 17:10 ` Joey Gouly
2026-03-06 14:02 ` [PATCH v1 12/13] KVM: arm64: Hoist MTE validation check out of MMU lock path Fuad Tabba
` (2 subsequent siblings)
13 siblings, 1 reply; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Optimize the early exit checks in kvm_s2_fault_pin_pfn by grouping all
error responses under the generic is_error_noslot_pfn check first,
avoiding unnecessary branches in the hot path.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2d6e749c1756..9265a7fc43f7 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1791,12 +1791,13 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn,
fault->write_fault ? FOLL_WRITE : 0,
&fault->writable, &fault->page);
- if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
- kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize));
- return 0;
- }
- if (is_error_noslot_pfn(fault->pfn))
+ if (unlikely(is_error_noslot_pfn(fault->pfn))) {
+ if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
+ kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize));
+ return 0;
+ }
return -EFAULT;
+ }
return 1;
}
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related [flat|nested] 20+ messages in thread

* Re: [PATCH v1 11/13] KVM: arm64: Optimize early exit checks in kvm_s2_fault_pin_pfn()
2026-03-06 14:02 ` [PATCH v1 11/13] KVM: arm64: Optimize early exit checks in kvm_s2_fault_pin_pfn() Fuad Tabba
@ 2026-03-17 17:10 ` Joey Gouly
0 siblings, 0 replies; 20+ messages in thread
From: Joey Gouly @ 2026-03-17 17:10 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, kvmarm, linux-arm-kernel, maz, oliver.upton, suzuki.poulose,
yuzenghui, catalin.marinas, will, qperret, vdonnefort
On Fri, Mar 06, 2026 at 02:02:30PM +0000, Fuad Tabba wrote:
> Optimize the early exit checks in kvm_s2_fault_pin_pfn by grouping all
> error responses under the generic is_error_noslot_pfn check first,
> avoiding unnecessary branches in the hot path.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
> ---
> arch/arm64/kvm/mmu.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 2d6e749c1756..9265a7fc43f7 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1791,12 +1791,13 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault)
> fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn,
> fault->write_fault ? FOLL_WRITE : 0,
> &fault->writable, &fault->page);
> - if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> - kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize));
> - return 0;
> - }
> - if (is_error_noslot_pfn(fault->pfn))
> + if (unlikely(is_error_noslot_pfn(fault->pfn))) {
> + if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> + kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize));
> + return 0;
> + }
> return -EFAULT;
> + }
>
> return 1;
> }
> --
> 2.53.0.473.g4a7958ca14-goog
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v1 12/13] KVM: arm64: Hoist MTE validation check out of MMU lock path
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (10 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 11/13] KVM: arm64: Optimize early exit checks in kvm_s2_fault_pin_pfn() Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 14:02 ` [PATCH v1 13/13] KVM: arm64: Clean up control flow in kvm_s2_fault_map() Fuad Tabba
2026-03-06 15:34 ` [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Marc Zyngier
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Simplify the non-cacheable attributes assignment by using a ternary
operator. Additionally, hoist the MTE validation check (mte_allowed) out
of kvm_s2_fault_map() and into kvm_s2_fault_compute_prot(). This allows
us to fail faster and avoid acquiring the KVM MMU lock unnecessarily
when the VMM introduces a disallowed VMA for an MTE-enabled guest.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 28 ++++++++++++----------------
1 file changed, 12 insertions(+), 16 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 9265a7fc43f7..cc6b35efcee5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1870,18 +1870,21 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault)
if (fault->exec_fault)
fault->prot |= KVM_PGTABLE_PROT_X;
- if (fault->s2_force_noncacheable) {
- if (fault->vm_flags & VM_ALLOW_ANY_UNCACHED)
- fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC;
- else
- fault->prot |= KVM_PGTABLE_PROT_DEVICE;
- } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) {
+ if (fault->s2_force_noncacheable)
+ fault->prot |= (fault->vm_flags & VM_ALLOW_ANY_UNCACHED) ?
+ KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE;
+ else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
fault->prot |= KVM_PGTABLE_PROT_X;
- }
if (fault->nested)
adjust_nested_exec_perms(kvm, fault->nested, &fault->prot);
+ if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) {
+ /* Check the VMM hasn't introduced a new disallowed VMA */
+ if (!fault->mte_allowed)
+ return -EFAULT;
+ }
+
return 0;
}
@@ -1918,15 +1921,8 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache)
}
}
- if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) {
- /* Check the VMM hasn't introduced a new disallowed VMA */
- if (fault->mte_allowed) {
- sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize);
- } else {
- ret = -EFAULT;
- goto out_unlock;
- }
- }
+ if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm))
+ sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize);
/*
* Under the premise of getting a FSC_PERM fault, we just need to relax
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related [flat|nested] 20+ messages in thread

* [PATCH v1 13/13] KVM: arm64: Clean up control flow in kvm_s2_fault_map()
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (11 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 12/13] KVM: arm64: Hoist MTE validation check out of MMU lock path Fuad Tabba
@ 2026-03-06 14:02 ` Fuad Tabba
2026-03-06 15:34 ` [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Marc Zyngier
13 siblings, 0 replies; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 14:02 UTC (permalink / raw)
To: kvm, kvmarm, linux-arm-kernel
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, qperret, vdonnefort, tabba
Clean up the KVM MMU lock retry loop by pre-assigning the error code.
Add braces around the THP adjustment branches for readability, and move
the transparent_hugepage_adjust() error check into the branch where the
adjustment is actually performed.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index cc6b35efcee5..5542a50dc8a6 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1897,10 +1897,9 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache)
kvm_fault_lock(kvm);
pgt = fault->vcpu->arch.hw_mmu->pgt;
- if (mmu_invalidate_retry(kvm, fault->mmu_seq)) {
- ret = -EAGAIN;
+ ret = -EAGAIN;
+ if (mmu_invalidate_retry(kvm, fault->mmu_seq))
goto out_unlock;
- }
/*
* If we are not forced to use fault->page mapping, check if we are
@@ -1908,16 +1907,17 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache)
*/
if (fault->vma_pagesize == PAGE_SIZE &&
!(fault->force_pte || fault->s2_force_noncacheable)) {
- if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE)
+ if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE) {
fault->vma_pagesize = fault->fault_granule;
- else
+ } else {
fault->vma_pagesize = transparent_hugepage_adjust(kvm, fault->memslot,
fault->hva, &fault->pfn,
&fault->fault_ipa);
- if (fault->vma_pagesize < 0) {
- ret = fault->vma_pagesize;
- goto out_unlock;
+ if (fault->vma_pagesize < 0) {
+ ret = fault->vma_pagesize;
+ goto out_unlock;
+ }
}
}
@@ -1951,7 +1951,9 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache)
if (fault->writable && !ret)
mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn);
- return ret != -EAGAIN ? ret : 0;
+ if (ret != -EAGAIN)
+ return ret;
+ return 0;
}
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
--
2.53.0.473.g4a7958ca14-goog
^ permalink raw reply related [flat|nested] 20+ messages in thread

* Re: [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model
2026-03-06 14:02 [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Fuad Tabba
` (12 preceding siblings ...)
2026-03-06 14:02 ` [PATCH v1 13/13] KVM: arm64: Clean up control flow in kvm_s2_fault_map() Fuad Tabba
@ 2026-03-06 15:34 ` Marc Zyngier
2026-03-06 15:44 ` Fuad Tabba
13 siblings, 1 reply; 20+ messages in thread
From: Marc Zyngier @ 2026-03-06 15:34 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, kvmarm, linux-arm-kernel, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, qperret,
vdonnefort
Hi Fuad,
On Fri, 06 Mar 2026 14:02:19 +0000,
Fuad Tabba <tabba@google.com> wrote:
>
> As promised in my recent patch series fixing a couple of urgent bugs in
> user_mem_abort() [1], here is the actual refactoring to finally clean up this
> monolith.
>
> If you look through the Fixes: history of user_mem_abort(), you will start to
> see a very clear pattern of whack-a-mole caused by the sheer size and
> complexity of the function. For example:
> - We keep leaking struct page references on early error returns because the
> cleanup logic is hard to track (e.g., 5f9466b50c1b and the atomic fault leak
> I just fixed in the previous series).
> - We have had uninitialized memcache pointers (157dbc4a321f) because the
> initialization flow jumps around unpredictably.
> - We have had subtle TOCTOU and locking boundary bugs (like 13ec9308a857 and
> f587661f21eb) because we drop the mmap_read_lock midway through the function
> but leave the vma pointer and mmu_seq floating around in the same lexical
> scope, tempting people to use them.
>
> The bulk of the work is in the first 6 patches, which perform a strict,
> no-logic-change structural refactoring of user_mem_abort() into a clean,
> sequential dispatcher.
>
> We introduce a state object, struct kvm_s2_fault, which encapsulates
> both the input parameters and the intermediate state. Then,
> user_mem_abort() is broken down into focused, standalone helpers:
> - kvm_s2_resolve_vma_size(): Determines the VMA shift and page size.
> - kvm_s2_fault_pin_pfn(): Handles faulting in the physical page.
> - kvm_s2_fault_get_vma_info(): A tightly-scoped sub-helper that isolates the
> mmap_read_lock, VMA lookup, and metadata snapshotting.
> - kvm_s2_fault_compute_prot(): Computes stage-2 protections and evaluates
> permission/execution constraints.
> - kvm_s2_fault_map(): Manages the KVM MMU lock, mmu_seq retry loops, MTE, and
> the final stage-2 mapping.
>
> This structural change makes the "danger zone" foolproof. By isolating
> the mmap_read_lock region inside a tightly-scoped sub-helper
> (kvm_s2_fault_get_vma_info), the vma pointer is confined. It snapshots
> the required metadata into the kvm_s2_fault structure before dropping
> the lock. Because the pointers scope ends when the sub-helper returns,
> accessing a stale VMA in the mapping phase is not possible by design.
>
> The remaining patches are localized cleanup patches. With the logic
> finally extracted into digestible helpers, these patches take the
> opportunity to streamline struct initialization, drop redundant struct
> variables, simplify nested math, and hoist validation checks (like MTE)
> out of the lock-heavy mapping phase.
>
> I think that there are still more opportunities to tidy things up some
> more, but I'll stop here to see what you think.
Thanks a lot for going through this, this looks like a very valuable
starting point. From a high-level perspective, the end result is even
more shocking:
<quote>
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
bool fault_is_perm)
{
bool write_fault = kvm_is_write_fault(vcpu);
bool logging_active = memslot_is_logging(memslot);
struct kvm_s2_fault fault = {
.vcpu = vcpu,
.fault_ipa = fault_ipa,
.nested = nested,
.memslot = memslot,
.hva = hva,
.fault_is_perm = fault_is_perm,
.ipa = fault_ipa,
.logging_active = logging_active,
.force_pte = logging_active,
.prot = KVM_PGTABLE_PROT_R,
.fault_granule = fault_is_perm ? kvm_vcpu_trap_get_perm_fault_granule(vcpu) : 0,
.write_fault = write_fault,
.exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu),
.topup_memcache = !fault_is_perm || (logging_active && write_fault),
};
</quote>
This kvm_s2_fault structure is a mess of architectural state (the
fault itself), hypervisor context (logging_active, force_pte...), and
implementation details (topup_memcache...). Effectively, everything
that was a variable in u_m_a() is now part of the structure that is
passed around.
The task at hand is now to completely nuke this monster. The
architectural fault information should be further split, and directly
passed as a parameter to user_mem_abort(). We have a lot of redundant
state that is computed, recomputed, derived, and that should probably be
moved into the exact place where it matters.
I'll have a play with it,
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model
2026-03-06 15:34 ` [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model Marc Zyngier
@ 2026-03-06 15:44 ` Fuad Tabba
2026-03-16 18:13 ` Marc Zyngier
0 siblings, 1 reply; 20+ messages in thread
From: Fuad Tabba @ 2026-03-06 15:44 UTC (permalink / raw)
To: Marc Zyngier
Cc: kvm, kvmarm, linux-arm-kernel, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, qperret,
vdonnefort
On Fri, 6 Mar 2026 at 15:34, Marc Zyngier <maz@kernel.org> wrote:
>
> Hi Fuad,
>
> On Fri, 06 Mar 2026 14:02:19 +0000,
> Fuad Tabba <tabba@google.com> wrote:
> >
> > As promised in my recent patch series fixing a couple of urgent bugs in
> > user_mem_abort() [1], here is the actual refactoring to finally clean up this
> > monolith.
> >
> > If you look through the Fixes: history of user_mem_abort(), you will start to
> > see a very clear pattern of whack-a-mole caused by the sheer size and
> > complexity of the function. For example:
> > - We keep leaking struct page references on early error returns because the
> > cleanup logic is hard to track (e.g., 5f9466b50c1b and the atomic fault leak
> > I just fixed in the previous series).
> > - We have had uninitialized memcache pointers (157dbc4a321f) because the
> > initialization flow jumps around unpredictably.
> > - We have had subtle TOCTOU and locking boundary bugs (like 13ec9308a857 and
> > f587661f21eb) because we drop the mmap_read_lock midway through the function
> > but leave the vma pointer and mmu_seq floating around in the same lexical
> > scope, tempting people to use them.
> >
> > The bulk of the work is in the first 6 patches, which perform a strict,
> > no-logic-change structural refactoring of user_mem_abort() into a clean,
> > sequential dispatcher.
> >
> > We introduce a state object, struct kvm_s2_fault, which encapsulates
> > both the input parameters and the intermediate state. Then,
> > user_mem_abort() is broken down into focused, standalone helpers:
> > - kvm_s2_resolve_vma_size(): Determines the VMA shift and page size.
> > - kvm_s2_fault_pin_pfn(): Handles faulting in the physical page.
> > - kvm_s2_fault_get_vma_info(): A tightly-scoped sub-helper that isolates the
> > mmap_read_lock, VMA lookup, and metadata snapshotting.
> > - kvm_s2_fault_compute_prot(): Computes stage-2 protections and evaluates
> > permission/execution constraints.
> > - kvm_s2_fault_map(): Manages the KVM MMU lock, mmu_seq retry loops, MTE, and
> > the final stage-2 mapping.
> >
> > This structural change makes the "danger zone" foolproof. By isolating
> > the mmap_read_lock region inside a tightly-scoped sub-helper
> > (kvm_s2_fault_get_vma_info), the vma pointer is confined. It snapshots
> > the required metadata into the kvm_s2_fault structure before dropping
> > the lock. Because the pointer's scope ends when the sub-helper returns,
> > accessing a stale VMA in the mapping phase is not possible by design.
> >
> > The remaining patches are localized cleanup patches. With the logic
> > finally extracted into digestible helpers, these patches take the
> > opportunity to streamline struct initialization, drop redundant struct
> > variables, simplify nested math, and hoist validation checks (like MTE)
> > out of the lock-heavy mapping phase.
> >
> > I think that there are still more opportunities to tidy things up some
> > more, but I'll stop here to see what you think.
>
> Thanks a lot for going through this, this looks like a very valuable
> starting point. From a high-level perspective, the end result is even
> more shocking:
>
> <quote>
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_s2_trans *nested,
> struct kvm_memory_slot *memslot, unsigned long hva,
> bool fault_is_perm)
> {
> bool write_fault = kvm_is_write_fault(vcpu);
> bool logging_active = memslot_is_logging(memslot);
> struct kvm_s2_fault fault = {
> .vcpu = vcpu,
> .fault_ipa = fault_ipa,
> .nested = nested,
> .memslot = memslot,
> .hva = hva,
> .fault_is_perm = fault_is_perm,
> .ipa = fault_ipa,
> .logging_active = logging_active,
> .force_pte = logging_active,
> .prot = KVM_PGTABLE_PROT_R,
> .fault_granule = fault_is_perm ? kvm_vcpu_trap_get_perm_fault_granule(vcpu) : 0,
> .write_fault = write_fault,
> .exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu),
> .topup_memcache = !fault_is_perm || (logging_active && write_fault),
> };
> </quote>
>
> This kvm_s2_fault structure is a mess of architectural state (the
> fault itself), hypervisor context (logging_active, force_pte...), and
> implementation details (topup_memcache...). Effectively, that's the
> container for *everything* that was a variable in u_m_a() is now part
> of the structure that is passed around.
>
> The task at hand is now to completely nuke this monster. The
> architectural fault information should be further split, and directly
> passed as a parameter to user_mem_abort(). We have a lot of redundant
> state that is computed, recomputed, derived, and that should probably
> moved into the exact place where it matters.
Thanks Marc! Like I said, I stopped after 13 patches to see what you
think, and if this is the right approach. But yes, more can be done
here.
>
> I'll have a play with it,
Sounds good, and if you'd like me to tackle any particular part of
this, let me know.
I have to admit, breaking this down into pieces and seeing how much
tidier and easier to understand it became felt oddly satisfying.
Cheers,
/fuad
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model
2026-03-06 15:44 ` Fuad Tabba
@ 2026-03-16 18:13 ` Marc Zyngier
0 siblings, 0 replies; 20+ messages in thread
From: Marc Zyngier @ 2026-03-16 18:13 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, kvmarm, linux-arm-kernel, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, qperret,
vdonnefort
On Fri, 06 Mar 2026 15:44:44 +0000,
Fuad Tabba <tabba@google.com> wrote:
>
> I have to admit, breaking this down into pieces and seeing how much
> tidier and easier to understand it became felt oddly satisfying.
I can only agree.
Having played with this for a few days, I came up with a set of
additional changes that get rid of this kvm_s2_fault entirely, and
sandbox the state in funny ways.
You'll find the result in your inbox...
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 20+ messages in thread