From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9CCAB1A23BE; Sun, 14 Sep 2025 07:45:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757835912; cv=none; b=JDnYndnlgXCEQ1m2ZfB8f69WjWS4372nyPuU1nJXI2vwqsZucv62WUesv03nwkU0EpmrYmAy6mFPXBs+ZTiUPgWeopf97LBjBIfyKv1Io3rC6LzysGnHxjumwuI50oNLjoxZKz5dnCO1H5yjOTTXFe4rs3XoOpJEdnl1YA2MYt4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757835912; c=relaxed/simple; bh=ll3A1mV/yYnF5rQfgiisRFaxkVuSejJaeersKDWiT7w=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=bBD4wTrW/WePbo4tUjtGFY5XrlEcp5tBSA/+PNtcTgCovJC7OBYvNbK/4WCrrw1L8ABuVtQ0lmwQftfwv1Zr/l5994kBl5waZweXbj2/97WrqoBY+Eu48aeMLDbvZco+0GjOMCWhgZnTfpbLObtzqvpN11nQFqnu60zBeld+5M4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=fUYb7nO6; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="fUYb7nO6" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B2208C4CEF0; Sun, 14 Sep 2025 07:44:45 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1757835912; bh=ll3A1mV/yYnF5rQfgiisRFaxkVuSejJaeersKDWiT7w=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=fUYb7nO6KxjWLuQIdwhu6X1LPYHLoTJ/4cmUkEmPVpZaKrxM6SrzN31vDYyJ2XEQz LedFqqqu33W2LGAtTO5UJ1M2drFT3RegKj9wb+nnQF0nopYQN7rpg2SyooKvsQ/WxG X8YekHGfml2KIKwPR0AEq81x0hQhORjCBVJrEtVvp3JSDC6bfVU/EikgXRvkzvI1yL 8D3LuWHAlgWQlstJ8mmkdWucVHj94CCg9mTrw/ZX0xoxOR1Cpp7WtqP/Ps6cw9rNJQ Hjwz0Wy8pqnyq9JFwGW3j79iP8C/ND1idBjoQMJuM2X+Yev6Io/lDeaxUuACOah7Wk NnpJJe7DBniYw== Date: Sun, 14 Sep 2025 10:44:41 +0300 From: Mike Rapoport To: "Roy, Patrick" Cc: "Thomson, Jack" , "Kalyazin, Nikita" , "Cali, Marco" , "derekmn@amazon.co.uk" , "willy@infradead.org" , "corbet@lwn.net" , "pbonzini@redhat.com" , "maz@kernel.org" , "oliver.upton@linux.dev" , "joey.gouly@arm.com" , "suzuki.poulose@arm.com" , "yuzenghui@huawei.com" , "catalin.marinas@arm.com" , "will@kernel.org" , "chenhuacai@kernel.org" , "kernel@xen0n.name" , "paul.walmsley@sifive.com" , "palmer@dabbelt.com" , "aou@eecs.berkeley.edu" , "alex@ghiti.fr" , "agordeev@linux.ibm.com" , "gerald.schaefer@linux.ibm.com" , "hca@linux.ibm.com" , "gor@linux.ibm.com" , "borntraeger@linux.ibm.com" , "svens@linux.ibm.com" , "dave.hansen@linux.intel.com" , "luto@kernel.org" , "peterz@infradead.org" , "tglx@linutronix.de" , "mingo@redhat.com" , "bp@alien8.de" , "x86@kernel.org" , "hpa@zytor.com" , "trondmy@kernel.org" , "anna@kernel.org" , "hubcap@omnibond.com" , "martin@omnibond.com" , "viro@zeniv.linux.org.uk" , "brauner@kernel.org" , "jack@suse.cz" , "akpm@linux-foundation.org" , "david@redhat.com" , "lorenzo.stoakes@oracle.com" , "Liam.Howlett@oracle.com" , "vbabka@suse.cz" , "surenb@google.com" , "mhocko@suse.com" , "ast@kernel.org" , "daniel@iogearbox.net" , "andrii@kernel.org" , "martin.lau@linux.dev" , "eddyz87@gmail.com" , "song@kernel.org" , "yonghong.song@linux.dev" , "john.fastabend@gmail.com" , "kpsingh@kernel.org" , "sdf@fomichev.me" , "haoluo@google.com" , "jolsa@kernel.org" , "jgg@ziepe.ca" , "jhubbard@nvidia.com" , "peterx@redhat.com" , "jannh@google.com" , "pfalcato@suse.de" , "axelrasmussen@google.com" , "yuanchu@google.com" , "weixugc@google.com" , "hannes@cmpxchg.org" , "zhengqi.arch@bytedance.com" , "shakeel.butt@linux.dev" , "shuah@kernel.org" , "seanjc@google.com" , "linux-fsdevel@vger.kernel.org" , "linux-doc@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "kvm@vger.kernel.org" , "linux-arm-kernel@lists.infradead.org" , "kvmarm@lists.linux.dev" , "loongarch@lists.linux.dev" , "linux-riscv@lists.infradead.org" , "linux-s390@vger.kernel.org" , "linux-nfs@vger.kernel.org" , "devel@lists.orangefs.org" , "linux-mm@kvack.org" , "bpf@vger.kernel.org" , "linux-kselftest@vger.kernel.org" Subject: Re: [PATCH v6 05/11] KVM: guest_memfd: Add flag to remove from direct map Message-ID: References: <20250912091708.17502-1-roypat@amazon.co.uk> <20250912091708.17502-6-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-doc@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250912091708.17502-6-roypat@amazon.co.uk> On Fri, Sep 12, 2025 at 09:17:37AM +0000, Roy, Patrick wrote: > Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD() > ioctl. When set, guest_memfd folios will be removed from the direct map > after preparation, with direct map entries only restored when the folios > are freed. > > To ensure these folios do not end up in places where the kernel cannot > deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct > address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested. > > Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether > guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on > guest_memfd itself being supported, but also on whether linux supports > manipulatomg the direct map at page granularity at all (possible most of > the time, outliers being arm64 where its impossible if the direct map > has been setup using hugepages, as arm64 cannot break these apart due to > break-before-make semantics, and powerpc, which does not select > ARCH_HAS_SET_DIRECT_MAP, which also doesn't support guest_memfd anyway > though). > > Note that this flag causes removal of direct map entries for all > guest_memfd folios independent of whether they are "shared" or "private" > (although current guest_memfd only supports either all folios in the > "shared" state, or all folios in the "private" state if > GUEST_MEMFD_FLAG_MMAP is not set). The usecase for removing direct map > entries of also the shared parts of guest_memfd are a special type of > non-CoCo VM where, host userspace is trusted to have access to all of > guest memory, but where Spectre-style transient execution attacks > through the host kernel's direct map should still be mitigated. In this > setup, KVM retains access to guest memory via userspace mappings of > guest_memfd, which are reflected back into KVM's memslots via > userspace_addr. This is needed for things like MMIO emulation on x86_64 > to work. > > Do not perform TLB flushes after direct map manipulations. This is > because TLB flushes resulted in a up to 40x elongation of page faults in > guest_memfd (scaling with the number of CPU cores), or a 5x elongation > of memory population. TLB flushes are not needed for functional > correctness (the virt->phys mapping technically stays "correct", the > kernel should simply not use it for a while). On the other hand, it means > that the desired protection from Spectre-style attacks is not perfect, > as an attacker could try to prevent a stale TLB entry from getting > evicted, keeping it alive until the page it refers to is used by the > guest for some sensitive data, and then targeting it using a > spectre-gadget. > > Signed-off-by: Patrick Roy With a few nits below Acked-by: Mike Rapoport (Microsoft) > --- > Documentation/virt/kvm/api.rst | 5 ++++ > arch/arm64/include/asm/kvm_host.h | 12 ++++++++ > include/linux/kvm_host.h | 7 +++++ > include/uapi/linux/kvm.h | 2 ++ > virt/kvm/guest_memfd.c | 49 +++++++++++++++++++++++++++---- > virt/kvm/kvm_main.c | 5 ++++ > 6 files changed, 75 insertions(+), 5 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index c17a87a0a5ac..b52c14d58798 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field > supports GUEST_MEMFD_FLAG_MMAP. Setting this flag on guest_memfd creation > enables mmap() and faulting of guest_memfd memory to host userspace. > > +When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field > +supports GUEST_MEMFG_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd > +instance behave similarly to memfd_secret, and unmaps the memory backing it from > +the kernel's address space after allocation. > + > When the KVM MMU performs a PFN lookup to service a guest fault and the backing > guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be > consumed from guest_memfd, regardless of whether it is a shared or a private > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index 2f2394cce24e..0bfd8e5fd9de 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -19,6 +19,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt); > void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1); > void check_feature_map(void); > > +#ifdef CONFIG_KVM_GUEST_MEMFD > +static inline bool kvm_arch_gmem_supports_no_direct_map(void) > +{ > + /* > + * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(), > + * as it calls dcache_clean_inval_poc(). > + */ > + return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB); > +} > +#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map > +#endif /* CONFIG_KVM_GUEST_MEMFD */ > > #endif /* __ARM64_KVM_HOST_H__ */ > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 1d0585616aa3..a9468bce55f2 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -36,6 +36,7 @@ > #include > #include > #include > +#include The set_memory APIs are not used in the header, no need to include it here. > #include > #include > @@ -731,6 +732,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm) > bool kvm_arch_supports_gmem_mmap(struct kvm *kvm); > #endif > > +#ifdef CONFIG_KVM_GUEST_MEMFD > +#ifndef kvm_arch_gmem_supports_no_direct_map > +#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map > +#endif > +#endif /* CONFIG_KVM_GUEST_MEMFD */ > + > #ifndef kvm_arch_has_readonly_mem > static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm) > { > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 6efa98a57ec1..33c8e8946019 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -963,6 +963,7 @@ struct kvm_enable_cap { > #define KVM_CAP_RISCV_MP_STATE_RESET 242 > #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243 > #define KVM_CAP_GUEST_MEMFD_MMAP 244 > +#define KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP 245 > > struct kvm_irq_routing_irqchip { > __u32 irqchip; > @@ -1600,6 +1601,7 @@ struct kvm_memory_attributes { > > #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd) > #define GUEST_MEMFD_FLAG_MMAP (1ULL << 0) > +#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 1) > > struct kvm_create_guest_memfd { > __u64 size; > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 81028984ff89..3c64099fc98a 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -4,6 +4,7 @@ > #include > #include > #include > +#include > > #include "kvm_mm.h" > > @@ -42,9 +43,24 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo > return 0; > } > > -static inline void kvm_gmem_mark_prepared(struct folio *folio) > +static bool kvm_gmem_test_no_direct_map(struct inode *inode) > { > - folio_mark_uptodate(folio); > + return ((unsigned long) inode->i_private) & GUEST_MEMFD_FLAG_NO_DIRECT_MAP; > +} > + > +static inline int kvm_gmem_mark_prepared(struct folio *folio) > +{ > + struct inode *inode = folio_inode(folio); > + int r = 0; > + > + if (kvm_gmem_test_no_direct_map(inode)) > + r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio), > + false); > + > + if (!r) > + folio_mark_uptodate(folio); > + > + return r; > } > > /* > @@ -82,7 +98,7 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, > index = ALIGN_DOWN(index, 1 << folio_order(folio)); > r = __kvm_gmem_prepare_folio(kvm, slot, index, folio); > if (!r) > - kvm_gmem_mark_prepared(folio); > + r = kvm_gmem_mark_prepared(folio); If this fails, shouldn't we undo __kvm_gmem_prepare_folio()? > > return r; > } > @@ -344,8 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf) > } > > if (!folio_test_uptodate(folio)) { > + int err = 0; > + > clear_highpage(folio_page(folio, 0)); > - kvm_gmem_mark_prepared(folio); > + err = kvm_gmem_mark_prepared(folio); > + > + if (err) { > + ret = vmf_error(err); > + goto out_folio; > + } > } > > vmf->page = folio_file_page(folio, vmf->pgoff); > @@ -436,6 +459,16 @@ static void kvm_gmem_free_folio(struct address_space *mapping, > kvm_pfn_t pfn = page_to_pfn(page); > int order = folio_order(folio); > > + /* > + * Direct map restoration cannot fail, as the only error condition > + * for direct map manipulation is failure to allocate page tables > + * when splitting huge pages, but this split would have already > + * happened in set_direct_map_invalid_noflush() in kvm_gmem_mark_prepared(). > + * Thus set_direct_map_valid_noflush() here only updates prot bits. > + */ > + if (kvm_gmem_test_no_direct_map(mapping->host)) > + set_direct_map_valid_noflush(page, folio_nr_pages(folio), true); > + > kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order)); > } > > @@ -500,6 +533,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) > /* Unmovable mappings are supposed to be marked unevictable as well. */ > WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); > > + if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) > + mapping_set_no_direct_map(inode->i_mapping); > + > kvm_get_kvm(kvm); > gmem->kvm = kvm; > xa_init(&gmem->bindings); > @@ -524,6 +560,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > if (kvm_arch_supports_gmem_mmap(kvm)) > valid_flags |= GUEST_MEMFD_FLAG_MMAP; > > + if (kvm_arch_gmem_supports_no_direct_map()) > + valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP; > + > if (flags & ~valid_flags) > return -EINVAL; > > @@ -768,7 +807,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > p = src ? src + i * PAGE_SIZE : NULL; > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > if (!ret) > - kvm_gmem_mark_prepared(folio); > + ret = kvm_gmem_mark_prepared(folio); > > put_folio_and_exit: > folio_put(folio); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 18f29ef93543..b5e702d95230 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -65,6 +65,7 @@ > #include > > #include > +#include > > > /* Worst case buffer size needed for holding an integer. */ > @@ -4916,6 +4917,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) > return kvm_supported_mem_attributes(kvm); > #endif > #ifdef CONFIG_KVM_GUEST_MEMFD > + case KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP: > + if (!kvm_arch_gmem_supports_no_direct_map()) > + return 0; > + fallthrough; > case KVM_CAP_GUEST_MEMFD: > return 1; > case KVM_CAP_GUEST_MEMFD_MMAP: > -- > 2.50.1 > -- Sincerely yours, Mike.