From: Marc Zyngier <maz@kernel.org>
To: Fuad Tabba <tabba@google.com>
Cc: "Roy, Patrick" <roypat@amazon.co.uk>,
"ackerleytng@google.com" <ackerleytng@google.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"amoorthy@google.com" <amoorthy@google.com>,
"anup@brainfault.org" <anup@brainfault.org>,
"aou@eecs.berkeley.edu" <aou@eecs.berkeley.edu>,
"brauner@kernel.org" <brauner@kernel.org>,
"catalin.marinas@arm.com" <catalin.marinas@arm.com>,
"chao.p.peng@linux.intel.com" <chao.p.peng@linux.intel.com>,
"chenhuacai@kernel.org" <chenhuacai@kernel.org>,
"david@redhat.com" <david@redhat.com>,
"dmatlack@google.com" <dmatlack@google.com>,
"fvdl@google.com" <fvdl@google.com>,
"hch@infradead.org" <hch@infradead.org>,
"hughd@google.com" <hughd@google.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"isaku.yamahata@gmail.com" <isaku.yamahata@gmail.com>,
"isaku.yamahata@intel.com" <isaku.yamahata@intel.com>,
"james.morse@arm.com" <james.morse@arm.com>,
"jarkko@kernel.org" <jarkko@kernel.org>,
"jgg@nvidia.com" <jgg@nvidia.com>,
"jhubbard@nvidia.com" <jhubbard@nvidia.com>,
"jthoughton@google.com" <jthoughton@google.com>,
"keirf@google.com" <keirf@google.com>,
"kirill.shutemov@linux.intel.com"
<kirill.shutemov@linux.intel.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"kvmarm@lists.linux.dev" <kvmarm@lists.linux.dev>,
"liam.merwick@oracle.com" <liam.merwick@oracle.com>,
"linux-arm-msm@vger.kernel.org" <linux-arm-msm@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"mail@maciej.szmigiero.name" <mail@maciej.szmigiero.name>,
"mic@digikod.net" <mic@digikod.net>,
"michael.roth@amd.com" <michael.roth@amd.com>,
"mpe@ellerman.id.au" <mpe@ellerman.id.au>,
"oliver.upton@linux.dev" <oliver.upton@linux.dev>,
"palmer@dabbelt.com" <palmer@dabbelt.com>,
"pankaj.gupta@amd.com" <pankaj.gupta@amd.com>,
"paul.walmsley@sifive.com" <paul.walmsley@sifive.com>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"peterx@redhat.com" <peterx@redhat.com>,
"qperret@google.com" <qperret@google.com>,
"quic_cvanscha@quicinc.com" <quic_cvanscha@quicinc.com>,
"quic_eberman@quicinc.com" <quic_eberman@quicinc.com>,
"quic_mnalajal@quicinc.com" <quic_mnalajal@quicinc.com>,
"quic_pderrin@quicinc.com" <quic_pderrin@quicinc.com>,
"quic_pheragu@quicinc.com" <quic_pheragu@quicinc.com>,
"quic_svaddagi@quicinc.com" <quic_svaddagi@quicinc.com>,
"quic_tsoni@quicinc.com" <quic_tsoni@quicinc.com>,
"rientjes@google.com" <rientjes@google.com>,
"seanjc@google.com" <seanjc@google.com>,
"shuah@kernel.org" <shuah@kernel.org>,
"steven.price@arm.com" <steven.price@arm.com>,
"suzuki.poulose@arm.com" <suzuki.poulose@arm.com>,
"vannapurve@google.com" <vannapurve@google.com>,
"vbabka@suse.cz" <vbabka@suse.cz>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"wei.w.wang@intel.com" <wei.w.wang@intel.com>,
"will@kernel.org" <will@kernel.org>,
"willy@infradead.org" <willy@infradead.org>,
"xiaoyao.li@intel.com" <xiaoyao.li@intel.com>,
"yilun.xu@intel.com" <yilun.xu@intel.com>,
"yuzenghui@huawei.com" <yuzenghui@huawei.com>
Subject: Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
Date: Fri, 11 Jul 2025 16:48:35 +0100
Message-ID: <867c0eafu4.wl-maz@kernel.org>
In-Reply-To: <CA+EHjTz-MWYUKA6dbcZGvt=rRXnorrpJHbNLq-Kng5q7yaLERA@mail.gmail.com>
On Fri, 11 Jul 2025 15:17:46 +0100,
Fuad Tabba <tabba@google.com> wrote:
>
> Hi Marc,
>
> On Fri, 11 Jul 2025 at 14:50, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 11 Jul 2025 10:59:39 +0100,
> > "Roy, Patrick" <roypat@amazon.co.uk> wrote:
> > >
> > >
> > > Hi Fuad,
> > >
> > > On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:
> > > > -snip-
> > > > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > > > +
> > > > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > +		      struct kvm_s2_trans *nested,
> > > > +		      struct kvm_memory_slot *memslot, bool is_perm)
> > > > +{
> > > > +	bool write_fault, exec_fault, writable;
> > > > +	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > > > +	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > > +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > > +	struct page *page;
> > > > +	struct kvm *kvm = vcpu->kvm;
> > > > +	void *memcache;
> > > > +	kvm_pfn_t pfn;
> > > > +	gfn_t gfn;
> > > > +	int ret;
> > > > +
> > > > +	ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (nested)
> > > > +		gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > > > +	else
> > > > +		gfn = fault_ipa >> PAGE_SHIFT;
> > > > +
> > > > +	write_fault = kvm_is_write_fault(vcpu);
> > > > +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > > +
> > > > +	if (write_fault && exec_fault) {
> > > > +		kvm_err("Simultaneous write and execution fault\n");
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	if (is_perm && !write_fault && !exec_fault) {
> > > > +		kvm_err("Unexpected L2 read permission error\n");
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
> > > > +	if (ret) {
> > > > +		kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> > > > +					      write_fault, exec_fault, false);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	writable = !(memslot->flags & KVM_MEM_READONLY);
> > > > +
> > > > +	if (nested)
> > > > +		adjust_nested_fault_perms(nested, &prot, &writable);
> > > > +
> > > > +	if (writable)
> > > > +		prot |= KVM_PGTABLE_PROT_W;
> > > > +
> > > > +	if (exec_fault ||
> > > > +	    (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
> > > > +	     (!nested || kvm_s2_trans_executable(nested))))
> > > > +		prot |= KVM_PGTABLE_PROT_X;
> > > > +
> > > > +	kvm_fault_lock(kvm);
> > >
> > > Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))?
> > > E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a
> > > gmem invalidation occurs, don't we end up with stage-2 page tables
> > > referring to a stale host page? In user_mem_abort() there's the "grab
> > > mmu_invalidate_seq before dropping mmap_lock and check it hasn't changed
> > > after grabbing mmu_lock" which prevents this, but I don't really see an
> > > equivalent here.
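
The pattern Patrick describes (snapshot the invalidation sequence count
early, recheck it under mmu_lock before installing the mapping) can be
sketched as a standalone model; the names mirror KVM's
mmu_invalidate_seq convention, but this is an illustrative userspace
sketch, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of KVM's mmu_invalidate_seq pattern: the fault
 * handler snapshots the sequence count before resolving the pfn, and
 * rechecks it under mmu_lock before installing the stage-2 mapping.
 * Any invalidation in between bumps the count and forces a replay.
 */
struct mmu_model {
	unsigned long invalidate_seq;	/* bumped by every invalidation */
	bool mapping_installed;
};

static void invalidate(struct mmu_model *m)
{
	m->invalidate_seq++;		/* e.g. fallocate(PUNCH_HOLE) on gmem */
}

/*
 * Returns true if the mapping was installed, false if the caller must
 * replay the fault because an invalidation raced with the pfn lookup.
 */
static bool try_install_mapping(struct mmu_model *m, unsigned long snap_seq)
{
	/* "take mmu_lock"... then recheck the snapshot */
	if (m->invalidate_seq != snap_seq)
		return false;		/* stale pfn: drop it and retry */
	m->mapping_installed = true;
	return true;
}
```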
> >
> > Indeed. We have a similar construct in kvm_translate_vncr() as well,
> > and I'd definitely expect something of the sort 'round here. If for
> > some reason this is not needed, then a comment explaining why would be
> > welcome.
> >
> > But this brings me to another interesting bit: kvm_translate_vncr() is
> > another path that deals with a guest translation fault (despite being
> > caught as an EL2 S1 fault), and calls kvm_faultin_pfn(). What happens
> > when the backing store is gmem? Probably nothing good.
>
> I'll add guest_memfd handling logic to kvm_translate_vncr().
>
> > I don't immediately see why NV and gmem should be incompatible, so
> > something must be done on that front too (including the return to
> > userspace if the page is gone).
>
> Should it return to userspace or go back to the guest?
> user_mem_abort() returns to the guest if the page disappears (I don't
> quite understand the rationale behind that, but it was a deliberate
> change [1]): on mmu_invalidate_retry() it sets ret to -EAGAIN [2],
> which gets flipped to 0 on returning from user_mem_abort() [3].
Outside of gmem, racing with an invalidation (resulting in -EAGAIN) is
never a problem. We just replay the faulting instruction. Also,
kvm_faultin_pfn() never fails outside of error cases (guest accessing
non-memory, or writing to RO memory). So returning to the guest is
always the right thing to do, and userspace never needs to see any of
that (I ignore userfaultfd here, as that's a different matter).
With gmem, you don't really have a choice. Whoever is in charge of the
memory told you it can't get to it, and it's only fair to go back to
userspace for it to sort it out (if at all possible).
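
That policy can be condensed into a toy decision function (the helper
name is hypothetical; the real error set comes from kvm_gmem_get_pfn()
and friends):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the return-value policy discussed above:
 * outside gmem, -EAGAIN (we lost a race with an invalidation) is
 * folded to 0 so the vcpu simply replays the faulting instruction;
 * with gmem, the error is propagated so the run loop exits to
 * userspace, which gets a chance to sort the memory out.
 */
static int fault_handler_exit_code(int ret, bool gmem)
{
	if (ret == -EAGAIN && !gmem)
		return 0;	/* back to the guest, replay the access */
	return ret;		/* surface the error, to userspace for gmem */
}
```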
So when it comes to VNCR faults, the behaviour should be the same,
given that the faulting page *is* a guest page, even if this isn't a
stage-2 mapping that we are dealing with.
I'd expect something along the lines of the hack below (completely
untested, as usual).
Thanks,
M.
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 5b191f4dc5668..98b1d6d4688a6 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1172,8 +1172,9 @@ static u64 read_vncr_el2(struct kvm_vcpu *vcpu)
 	return (u64)sign_extend64(__vcpu_sys_reg(vcpu, VNCR_EL2), 48);
 }
 
-static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
+static int kvm_translate_vncr(struct kvm_vcpu *vcpu, bool *gmem)
 {
+	struct kvm_memory_slot *memslot;
 	bool write_fault, writable;
 	unsigned long mmu_seq;
 	struct vncr_tlb *vt;
@@ -1216,9 +1217,21 @@ static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
 	smp_rmb();
 
 	gfn = vt->wr.pa >> PAGE_SHIFT;
-	pfn = kvm_faultin_pfn(vcpu, gfn, write_fault, &writable, &page);
-	if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
-		return -EFAULT;
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	*gmem = kvm_slot_has_gmem(memslot);
+	if (!*gmem) {
+		pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
+					&writable, &page);
+		if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
+			return -EFAULT;
+	} else {
+		ret = kvm_gmem_get_pfn(vcpu->kvm, memslot, gfn, &pfn, &page, NULL);
+		if (ret) {
+			kvm_prepare_memory_fault_exit(vcpu, vt->wr.pa, PAGE_SIZE,
+						      write_fault, false, false);
+			return ret;
+		}
+	}
 
 	scoped_guard(write_lock, &vcpu->kvm->mmu_lock) {
 		if (mmu_invalidate_retry(vcpu->kvm, mmu_seq))
@@ -1292,14 +1305,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
 	if (esr_fsc_is_permission_fault(esr)) {
 		inject_vncr_perm(vcpu);
 	} else if (esr_fsc_is_translation_fault(esr)) {
-		bool valid;
+		bool valid, gmem = false;
 		int ret;
 
 		scoped_guard(read_lock, &vcpu->kvm->mmu_lock)
 			valid = kvm_vncr_tlb_lookup(vcpu);
 
 		if (!valid)
-			ret = kvm_translate_vncr(vcpu);
+			ret = kvm_translate_vncr(vcpu, &gmem);
 		else
 			ret = -EPERM;
 
@@ -1309,6 +1322,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
 			/* Let's try again... */
 			break;
 		case -EFAULT:
+		case -EIO:
+			/*
+			 * FIXME: Add whatever other error cases the
+			 * GMEM stuff can spit out.
+			 */
+			if (gmem)
+				return 0;
+			fallthrough;
 		case -EINVAL:
 		case -ENOENT:
 		case -EACCES:
--
Without deviation from the norm, progress is not possible.