From: Fuad Tabba <tabba@google.com>
Date: Mon, 14 Jul 2025 07:35:40 +0100
Subject: Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
To: Marc Zyngier
Cc: "Roy, Patrick", ackerleytng@google.com, akpm@linux-foundation.org,
    amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu,
    brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@linux.intel.com,
    chenhuacai@kernel.org, david@redhat.com, dmatlack@google.com,
    fvdl@google.com, hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
    isaku.yamahata@gmail.com, isaku.yamahata@intel.com, james.morse@arm.com,
    jarkko@kernel.org, jgg@nvidia.com, jhubbard@nvidia.com,
    jthoughton@google.com, keirf@google.com, kirill.shutemov@linux.intel.com,
    kvm@vger.kernel.org, kvmarm@lists.linux.dev, liam.merwick@oracle.com,
    linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
    mail@maciej.szmigiero.name, mic@digikod.net, michael.roth@amd.com,
    mpe@ellerman.id.au, oliver.upton@linux.dev, palmer@dabbelt.com,
    pankaj.gupta@amd.com, paul.walmsley@sifive.com, pbonzini@redhat.com,
    peterx@redhat.com, qperret@google.com, quic_cvanscha@quicinc.com,
    quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
    quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
    quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com, rientjes@google.com,
    seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
    suzuki.poulose@arm.com, vannapurve@google.com, vbabka@suse.cz,
    viro@zeniv.linux.org.uk, wei.w.wang@intel.com, will@kernel.org,
    willy@infradead.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
    yuzenghui@huawei.com
In-Reply-To: <867c0eafu4.wl-maz@kernel.org>
References: <20250709105946.4009897-17-tabba@google.com>
    <20250711095937.22365-1-roypat@amazon.co.uk>
    <86a55aalbv.wl-maz@kernel.org>
    <867c0eafu4.wl-maz@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Hi Marc,

On Fri, 11 Jul 2025 at 16:48, Marc Zyngier wrote:
>
> On Fri, 11 Jul 2025 15:17:46 +0100,
> Fuad Tabba wrote:
> >
> > Hi Marc,
> >
> > On Fri, 11 Jul 2025 at 14:50, Marc Zyngier wrote:
> > >
> > > On Fri, 11 Jul 2025 10:59:39 +0100,
> > > "Roy, Patrick" wrote:
> > > >
> > > > Hi Fuad,
> > > >
> > > > On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:
> > > > > -snip-
> > > > > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > > > > +
> > > > > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > > +                     struct kvm_s2_trans *nested,
> > > > > +                     struct kvm_memory_slot *memslot, bool is_perm)
> > > > > +{
> > > > > +       bool write_fault, exec_fault, writable;
> > > > > +       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > > > > +       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > > > +       struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > > > +       struct page *page;
> > > > > +       struct kvm *kvm = vcpu->kvm;
> > > > > +       void *memcache;
> > > > > +       kvm_pfn_t pfn;
> > > > > +       gfn_t gfn;
> > > > > +       int ret;
> > > > > +
> > > > > +       ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > > > > +       if (ret)
> > > > > +               return ret;
> > > > > +
> > > > > +       if (nested)
> > > > > +               gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > > > > +       else
> > > > > +               gfn = fault_ipa >> PAGE_SHIFT;
> > > > > +
> > > > > +       write_fault = kvm_is_write_fault(vcpu);
> > > > > +       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > > > +
> > > > > +       if (write_fault && exec_fault) {
> > > > > +               kvm_err("Simultaneous write and execution fault\n");
> > > > > +               return -EFAULT;
> > > > > +       }
> > > > > +
> > > > > +       if (is_perm && !write_fault && !exec_fault) {
kvm_err("Unexpected L2 read permission error\n"); > > > > > + return -EFAULT; > > > > > + } > > > > > + > > > > > + ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL); > > > > > + if (ret) { > > > > > + kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE, > > > > > + write_fault, exec_fault, false); > > > > > + return ret; > > > > > + } > > > > > + > > > > > + writable = !(memslot->flags & KVM_MEM_READONLY); > > > > > + > > > > > + if (nested) > > > > > + adjust_nested_fault_perms(nested, &prot, &writable); > > > > > + > > > > > + if (writable) > > > > > + prot |= KVM_PGTABLE_PROT_W; > > > > > + > > > > > + if (exec_fault || > > > > > + (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) && > > > > > + (!nested || kvm_s2_trans_executable(nested)))) > > > > > + prot |= KVM_PGTABLE_PROT_X; > > > > > + > > > > > + kvm_fault_lock(kvm); > > > > > > > > Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))? > > > > E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a > > > > gmem invalidation occurs, don't we end up with stage-2 page tables > > > > refering to a stale host page? In user_mem_abort() there's the "grab > > > > mmu_invalidate_seq before dropping mmap_lock and check it hasnt changed > > > > after grabbing mmu_lock" which prevents this, but I don't really see an > > > > equivalent here. > > > > > > Indeed. We have a similar construct in kvm_translate_vncr() as well, > > > and I'd definitely expect something of the sort 'round here. If for > > > some reason this is not needed, then a comment explaining why would be > > > welcome. > > > > > > But this brings me to another interesting bit: kvm_translate_vncr() is > > > another path that deals with a guest translation fault (despite being > > > caught as an EL2 S1 fault), and calls kvm_faultin_pfn(). What happens > > > when the backing store is gmem? Probably nothin > > > > I'll add guest_memfd handling logic to kvm_translate_vncr(). > > > > > I don't immediately see why NV and gmem should be incompatible, so > > > something must be done on that front too (including the return to > > > userspace if the page is gone). > > > > Should it return to userspace or go back to the guest? > > user_mem_abort() returns to the guest if the page disappears (I don't > > quite understand the rationale behind that, but it was a deliberate > > change [1]): on mmu_invalidate_retry() it sets ret to -EAGAIN [2], > > which gets flipped to 0 on returning from user_mem_abort() [3]. > > Outside of gmem, racing with an invalidation (resulting in -EAGAIN) is > never a problem. We just replay the faulting instruction. Also, > kvm_faultin_pfn() never fails outside of error cases (guest accessing > non-memory, or writing to RO memory). So returning to the guest is > always the right thing to do, and userspace never needs to see any of > that (I ignore userfaultfd here, as that's a different matter). > > With gmem, you don't really have a choice. Whoever is in charge of the > memory told you it can't get to it, and it's only fair to go back to > userspace for it to sort it out (if at all possible). Makes sense. > So when it comes to VNCR faults, the behaviour should be the same, > given that the faulting page *is* a guest page, even if this isn't a > stage-2 mapping that we are dealing with. > > I'd expect something along the lines of the hack below, (completely > untested, as usual). Thanks! /fuad > Thanks, > > M. 
>
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 5b191f4dc5668..98b1d6d4688a6 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -1172,8 +1172,9 @@ static u64 read_vncr_el2(struct kvm_vcpu *vcpu)
>         return (u64)sign_extend64(__vcpu_sys_reg(vcpu, VNCR_EL2), 48);
>  }
>
> -static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
> +static int kvm_translate_vncr(struct kvm_vcpu *vcpu, bool *gmem)
>  {
> +       struct kvm_memory_slot *memslot;
>         bool write_fault, writable;
>         unsigned long mmu_seq;
>         struct vncr_tlb *vt;
> @@ -1216,9 +1217,21 @@ static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
>         smp_rmb();
>
>         gfn = vt->wr.pa >> PAGE_SHIFT;
> -       pfn = kvm_faultin_pfn(vcpu, gfn, write_fault, &writable, &page);
> -       if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
> -               return -EFAULT;
> +       memslot = gfn_to_memslot(vcpu->kvm, gfn);
> +       *gmem = kvm_slot_has_gmem(memslot);
> +       if (!*gmem) {
> +               pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> +                                       &writable, &page);
> +               if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
> +                       return -EFAULT;
> +       } else {
> +               ret = kvm_gmem_get_pfn(vcpu->kvm, memslot, gfn, &pfn, &page, NULL);
> +               if (ret) {
> +                       kvm_prepare_memory_fault_exit(vcpu, vt->wr.pa, PAGE_SIZE,
> +                                                     write_fault, false, false);
> +                       return ret;
> +               }
> +       }
>
>         scoped_guard(write_lock, &vcpu->kvm->mmu_lock) {
>                 if (mmu_invalidate_retry(vcpu->kvm, mmu_seq))
> @@ -1292,14 +1305,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
>         if (esr_fsc_is_permission_fault(esr)) {
>                 inject_vncr_perm(vcpu);
>         } else if (esr_fsc_is_translation_fault(esr)) {
> -               bool valid;
> +               bool valid, gmem = false;
>                 int ret;
>
>                 scoped_guard(read_lock, &vcpu->kvm->mmu_lock)
>                         valid = kvm_vncr_tlb_lookup(vcpu);
>
>                 if (!valid)
> -                       ret = kvm_translate_vncr(vcpu);
> +                       ret = kvm_translate_vncr(vcpu, &gmem);
>                 else
>                         ret = -EPERM;
>
> @@ -1309,6 +1322,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
>                         /* Let's try again... */
>                         break;
>                 case -EFAULT:
> +               case -EIO:
> +                       /*
> +                        * FIXME: Add whatever other error cases the
> +                        * GMEM stuff can spit out.
> +                        */
> +                       if (gmem)
> +                               return 0;
> +                       fallthrough;
>                 case -EINVAL:
>                 case -ENOENT:
>                 case -EACCES:
>
> --
> Without deviation from the norm, progress is not possible.
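(Aside, for context on what "go back to userspace" means concretely in the
discussion above: when the gmem path fails, kvm_prepare_memory_fault_exit()
fills in a KVM_EXIT_MEMORY_FAULT exit and KVM_RUN returns an error to the VMM.
The snippet below is a rough sketch of the userspace side following the
documented KVM API, not code from this series; how the VMM repairs the range
is left as a comment.)

#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Run one vCPU iteration and report a guest_memfd memory-fault exit. */
static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	/*
	 * KVM_EXIT_MEMORY_FAULT is unusual: KVM_RUN fails (typically with
	 * EFAULT or EHWPOISON), but kvm_run::exit_reason is still valid.
	 */
	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		fprintf(stderr, "memory fault: gpa=%llx size=%llx flags=%llx\n",
			(unsigned long long)run->memory_fault.gpa,
			(unsigned long long)run->memory_fault.size,
			(unsigned long long)run->memory_fault.flags);
		/*
		 * Repair or convert the guest_memfd range backing
		 * [gpa, gpa + size), then call KVM_RUN again so the vCPU
		 * replays the faulting access.
		 */
		return 0;
	}

	return ret;
}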