Date: Thu, 15 Jan 2026 09:51:23 +0000
Message-ID: <86jyxjkxus.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Jack Thomson
Cc: oliver.upton@linux.dev, pbonzini@redhat.com, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com, catalin.marinas@arm.com,
	will@kernel.org, shuah@kernel.org, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, isaku.yamahata@intel.com,
	xmarcalx@amazon.co.uk, kalyazin@amazon.co.uk, jackabt@amazon.com,
	Vladimir Murzin <vladimir.murzin@arm.com>
Subject: Re: [PATCH v4 1/3] KVM: arm64: Add pre_fault_memory implementation
In-Reply-To: <20260113152643.18858-2-jackabt.amazon@gmail.com>
References: <20260113152643.18858-1-jackabt.amazon@gmail.com>
	<20260113152643.18858-2-jackabt.amazon@gmail.com>
[+ Vladimir, who was also looking at this patch]

On Tue, 13 Jan 2026 15:26:40 +0000,
Jack Thomson wrote:
> 
> From: Jack Thomson
> 
> Add kvm_arch_vcpu_pre_fault_memory() for arm64. The implementation hands
> off the stage-2 faulting logic to either gmem_abort() or
> user_mem_abort().
> 
> Add an optional page_size output parameter to user_mem_abort() to
> return the VMA page size, which is needed when pre-faulting.
> 
> Update the documentation to clarify x86 specific behaviour.
> 
> Signed-off-by: Jack Thomson
> ---
>  Documentation/virt/kvm/api.rst |  3 +-
>  arch/arm64/kvm/Kconfig         |  1 +
>  arch/arm64/kvm/arm.c           |  1 +
>  arch/arm64/kvm/mmu.c           | 79 ++++++++++++++++++++++++++++++++--
>  4 files changed, 79 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 01a3abef8abb..44cfd9e736bb 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6493,7 +6493,8 @@ Errors:
>  KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
>  for the current vCPU state. KVM maps memory as if the vCPU generated a
>  stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
> -CoW. However, KVM does not mark any newly created stage-2 PTE as Accessed.
> +CoW. However, on x86, KVM does not mark any newly created stage-2 PTE as
> +Accessed.
> 
>  In the case of confidential VM types where there is an initial set up of
>  private guest memory before the guest is 'finalized'/measured, this ioctl
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 4f803fd1c99a..6872aaabe16c 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -25,6 +25,7 @@ menuconfig KVM
>  	select HAVE_KVM_CPU_RELAX_INTERCEPT
>  	select KVM_MMIO
>  	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> +	select KVM_GENERIC_PRE_FAULT_MEMORY
>  	select VIRT_XFER_TO_GUEST_WORK
>  	select KVM_VFIO
>  	select HAVE_KVM_DIRTY_RING_ACQ_REL
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 4f80da0c0d1d..19bac68f737f 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -332,6 +332,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_COUNTER_OFFSET:
>  	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
>  	case KVM_CAP_ARM_SEA_TO_USER:
> +	case KVM_CAP_PRE_FAULT_MEMORY:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 48d7c372a4cd..499b131f794e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1642,8 +1642,8 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> 
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_s2_trans *nested,
> -			  struct kvm_memory_slot *memslot, unsigned long hva,
> -			  bool fault_is_perm)
> +			  struct kvm_memory_slot *memslot, unsigned long *page_size,
> +			  unsigned long hva, bool fault_is_perm)
>  {
>  	int ret = 0;
>  	bool topup_memcache;
> @@ -1923,6 +1923,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	kvm_release_faultin_page(kvm, page, !!ret, writable);
>  	kvm_fault_unlock(kvm);
> 
> +	if (page_size)
> +		*page_size = vma_pagesize;
> +
>  	/* Mark the page dirty only if the fault is handled successfully */
>  	if (writable && !ret)
>  		mark_page_dirty_in_slot(kvm, memslot, gfn);
> @@ -2196,8 +2199,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>  		ret = gmem_abort(vcpu, fault_ipa, nested, memslot,
>  				 esr_fsc_is_permission_fault(esr));
>  	else
> -		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> -				     esr_fsc_is_permission_fault(esr));
> +		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, NULL,
> +				     hva, esr_fsc_is_permission_fault(esr));
>  	if (ret == 0)
>  		ret = 1;
>  out:
> @@ -2573,3 +2576,71 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
> 
>  	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
>  }
> +
> +long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> +				    struct kvm_pre_fault_memory *range)
> +{
> +	struct kvm_vcpu_fault_info *fault_info = &vcpu->arch.fault;
> +	struct kvm_s2_trans nested_trans, *nested = NULL;
> +	unsigned long page_size = PAGE_SIZE;
> +	struct kvm_memory_slot *memslot;
> +	phys_addr_t ipa = range->gpa;
> +	phys_addr_t end;
> +	hva_t hva;
> +	gfn_t gfn;
> +	int ret;
> +
> +	if (vcpu_is_protected(vcpu))
> +		return -EOPNOTSUPP;

This feels pretty odd. If you have advertised the capability, then
saying "not supported" at this stage is not on.

> +
> +	/*
> +	 * We may prefault on a shadow stage 2 page table if we are
> +	 * running a nested guest. In this case, we have to resolve the L2
> +	 * IPA to the L1 IPA first, before knowing what kind of memory should
> +	 * back the L1 IPA.
> +	 *
> +	 * If the shadow stage 2 page table walk faults, then we return
> +	 * -EFAULT
> +	 */
> +	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu) &&
> +	    vcpu->arch.hw_mmu->nested_stage2_enabled) {
> +		ret = kvm_walk_nested_s2(vcpu, ipa, &nested_trans);
> +		if (ret)
> +			return -EFAULT;

And then what? Userspace is completely screwed here, with no way to
make any forward progress, because the L1 is in charge of that S2,
and L1 is not running. What's the outcome? Light a candle and pray?

Also, the IPA you are passing as a parameter means absolutely nothing
in the context of L2. Userspace doesn't have the faintest clue about
the memory map presented to L2, as that's L1 business. L1 can
absolutely present to L2 a memory map that doesn't have a single
address in common with its own.

So this really doesn't work at all.

> +
> +		ipa = kvm_s2_trans_output(&nested_trans);
> +		nested = &nested_trans;
> +	}
> +
> +	if (ipa >= kvm_phys_size(vcpu->arch.hw_mmu))
> +		return -ENOENT;
> +
> +	/* Generate a synthetic abort for the pre-fault address */
> +	fault_info->esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) |
> +			      ESR_ELx_FSC_FAULT_L(KVM_PGTABLE_LAST_LEVEL);

Why level 3? You must present a fault that matches the level at which
the emulated fault would actually occur, because the rest of the
infrastructure relies on that (at least on the permission path, and
more to come).

Taking a step back on all this, 90% of the problems are there because
you are trying to support prefaulting a guest that is already
running. If you limited this to actually *pre*-faulting the guest, it
would be the easiest thing ever, and wouldn't suffer from any of the
above (you can't be in a nested context if you haven't run).

What prevents you from doing so? I'm perfectly happy to make this a
separate API if this contradicts other implementations. Or are you
relying on other side effects of the "already running" state?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.