From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3DFA183A19; Wed, 10 Jul 2024 07:34:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720596861; cv=none; b=L02f1Sm7lcIq7D9MW4q30dkBSASvQ9PUu27oq8dO8HyCcjHD+k8TRtdSTfaefNNHrfVU3W8Yn8iJfWKyOAKQ9XShm4gqBfbWyKct5/Yb2Xed3XtPBjxdRF3Wo/TW2917n7M8b3B5OO8y8+30rKcT4z4j+aRYtCN6aAQVwOQtk0w= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720596861; c=relaxed/simple; bh=dvoKNXwrD28MkkI8y466iNL0VlvuCYSk/fPFFvZKz3w=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=RgofaytIANv1dpFUhg9C2F0muyg/DoeLlHQpS2sfeM8j3VzWOLz8X5Y7Oem96EYL05BNJjyCGN90k6cUmOPBrRghgW4WE6O4k9w3OlTdwXPcocSVgRxHvl2ezKkO0aFy07Xu89Nxqvx9lYwhGSYldVPqSGu1kkw5m4ibvFYM/gk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=LRVvovir; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="LRVvovir" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 75F9CC32782; Wed, 10 Jul 2024 07:34:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720596861; bh=dvoKNXwrD28MkkI8y466iNL0VlvuCYSk/fPFFvZKz3w=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=LRVvovireNfQcRKQxb/uK2FuFMyJjw3TkEkJ22O/546QnRRV2zF7JWMfct3nzE+pd iu4WoPH27IVKO9aIdF5W6mx5AJgNXnQlImXYZ2/2vRBRfs4R71WfQU2nzevFuhPqnF TKKqwdfacz/bphBPOxDi0xD/k65vyQB0f+56uxW7A98Sq9jbnHoo6YuLaNO73gcuSH cbsB4K96xVc+RBmY0Qjj4u//FkFW1Z6G9r0XLNEMXUUOAvfE13MLwkgQW5HMcJq7S7 3jJDBzQoeHsLcy2+VZ/lfvQA6qTD4yc6Cx9Fi3rjJMFD58cOk7fuwpyVec1qQT8rQp 3/UutjGYb5zSg== Date: Wed, 10 Jul 2024 10:31:35 +0300 From: Mike Rapoport To: Patrick Roy Cc: seanjc@google.com, pbonzini@redhat.com, akpm@linux-foundation.org, dwmw@amazon.co.uk, david@redhat.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, willy@infradead.org, graf@amazon.com, derekmn@amazon.com, kalyazin@amazon.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, dmatlack@google.com, tabba@google.com, chao.p.peng@linux.intel.com, xmarcalx@amazon.co.uk Subject: Re: [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map Message-ID: References: <20240709132041.3625501-1-roypat@amazon.co.uk> <20240709132041.3625501-6-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20240709132041.3625501-6-roypat@amazon.co.uk> On Tue, Jul 09, 2024 at 02:20:33PM +0100, Patrick Roy wrote: > While guest_memfd is not available to be mapped by userspace, it is > still accessible through the kernel's direct map. This means that in > scenarios where guest-private memory is not hardware protected, it can > be speculatively read and its contents potentially leaked through > hardware side-channels. Removing guest-private memory from the direct > map, thus mitigates a large class of speculative execution issues > [1, Table 1]. > > This patch adds a flag to the `KVM_CREATE_GUEST_MEMFD` which, if set, removes the > struct pages backing guest-private memory from the direct map. Should > `CONFIG_HAVE_KVM_GMEM_{INVALIDATE, PREPARE}` be set, pages are removed > after preparation and before invalidation, so that the > prepare/invalidate routines do not have to worry about potentially > absent direct map entries. > > Direct map removal do not reuse the `KVM_GMEM_PREPARE` machinery, since `prepare` can be > called multiple time, and it is the responsibility of the preparation > routine to not "prepare" the same folio twice [2]. Thus, instead > explicitly check if `filemap_grab_folio` allocated a new folio, and > remove the returned folio from the direct map only if this was the case. > > The patch uses release_folio instead of free_folio to reinsert pages > back into the direct map as by the time free_folio is called, > folio->mapping can already be NULL. This means that a call to > folio_inode inside free_folio might deference a NULL pointer, leaving no > way to access the inode which stores the flags that allow determining > whether the page was removed from the direct map in the first place. > > Lastly, the patch uses set_direct_map_{invalid,default}_noflush instead > of `set_memory_[n]p` to avoid expensive flushes of TLBs and the L*-cache > hierarchy. This is especially important once KVM restores direct map > entries on-demand in later patches, where simple FIO benchmarks of a > virtio-blk device have shown that TLB flushes on a Intel(R) Xeon(R) > Platinum 8375C CPU @ 2.90GHz resulted in 80% degradation in throughput > compared to a non-flushing solution. > > Not flushing the TLB means that until TLB entries for temporarily > restored direct map entries get naturally evicted, they can be used > during speculative execution, and effectively "unhide" the memory for > longer than intended. We consider this acceptable, as the only pages > that are temporarily reinserted into the direct map like this will > either hold PV data structures (kvm-clock, asyncpf, etc), or pages > containing privileged instructions inside the guest kernel image (in the > MMIO emulation case). > > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf > > Signed-off-by: Patrick Roy > --- > include/uapi/linux/kvm.h | 2 ++ > virt/kvm/guest_memfd.c | 52 ++++++++++++++++++++++++++++++++++------ > 2 files changed, 47 insertions(+), 7 deletions(-) > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index e065d9fe7ab2..409116aa23c9 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -1563,4 +1563,6 @@ struct kvm_create_guest_memfd { > __u64 reserved[6]; > }; > > +#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0) > + > #endif /* __LINUX_KVM_H */ > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 9148b9679bb1..dc9b0c2d0b0e 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -4,6 +4,7 @@ > #include > #include > #include > +#include > > #include "kvm_mm.h" > > @@ -49,9 +50,16 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol > return 0; > } > > +static bool kvm_gmem_not_present(struct inode *inode) > +{ > + return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0; > +} > + > static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare) > { > struct folio *folio; > + bool zap_direct_map = false; > + int r; > > /* TODO: Support huge pages. */ > folio = filemap_grab_folio(inode->i_mapping, index); > @@ -74,16 +82,30 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool > for (i = 0; i < nr_pages; i++) > clear_highpage(folio_page(folio, i)); > > + // We need to clear the folio before calling kvm_gmem_prepare_folio, > + // but can only remove it from the direct map _after_ preparation is done. No C++ comments please > + zap_direct_map = kvm_gmem_not_present(inode); > + > folio_mark_uptodate(folio); > } > > if (prepare) { > - int r = kvm_gmem_prepare_folio(inode, index, folio); > - if (r < 0) { > - folio_unlock(folio); > - folio_put(folio); > - return ERR_PTR(r); > - } > + r = kvm_gmem_prepare_folio(inode, index, folio); > + if (r < 0) > + goto out_err; > + } > + > + if (zap_direct_map) { > + r = set_direct_map_invalid_noflush(&folio->page); It's not future proof to presume that folio is a single page here. You should loop over folio pages and add a TLB flush after the loop. > + if (r < 0) > + goto out_err; > + > + // We use the private flag to track whether the folio has been removed > + // from the direct map. This is because inside of ->free_folio, > + // we do not have access to the address_space anymore, meaning we > + // cannot check folio_inode(folio)->i_private to determine whether > + // KVM_GMEM_NO_DIRECT_MAP was set. > + folio_set_private(folio); > } > > /* > @@ -91,6 +113,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool > * unevictable and there is no storage to write back to. > */ > return folio; > +out_err: > + folio_unlock(folio); > + folio_put(folio); > + return ERR_PTR(r); > } > > static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start, > @@ -354,10 +380,22 @@ static void kvm_gmem_free_folio(struct folio *folio) > } > #endif > > +static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end) > +{ > + if (start == 0 && end == PAGE_SIZE) { > + // We only get here if PG_private is set, which only happens if kvm_gmem_not_present > + // returned true in kvm_gmem_get_folio. Thus no need to do that check again. > + BUG_ON(set_direct_map_default_noflush(&folio->page)); Ditto. > + > + folio_clear_private(folio); > + } > +} > + > static const struct address_space_operations kvm_gmem_aops = { > .dirty_folio = noop_dirty_folio, > .migrate_folio = kvm_gmem_migrate_folio, > .error_remove_folio = kvm_gmem_error_folio, > + .invalidate_folio = kvm_gmem_invalidate_folio, > #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE > .free_folio = kvm_gmem_free_folio, > #endif > @@ -443,7 +481,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > { > loff_t size = args->size; > u64 flags = args->flags; > - u64 valid_flags = 0; > + u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP; > > if (flags & ~valid_flags) > return -EINVAL; > -- > 2.45.2 > -- Sincerely yours, Mike.