All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Patrick Roy <roypat@amazon.co.uk>,
	seanjc@google.com, pbonzini@redhat.com,
	akpm@linux-foundation.org, dwmw@amazon.co.uk, rppt@kernel.org,
	david@redhat.com
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	willy@infradead.org, graf@amazon.com, derekmn@amazon.com,
	kalyazin@amazon.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	dmatlack@google.com, tabba@google.com,
	chao.p.peng@linux.intel.com, xmarcalx@amazon.co.uk
Subject: Re: [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
Date: Mon, 22 Jul 2024 14:28:00 +0200	[thread overview]
Message-ID: <e12b91ef-ca0c-4b77-840b-dcfb2c76a984@kernel.org> (raw)
In-Reply-To: <20240709132041.3625501-1-roypat@amazon.co.uk>

On 7/9/24 3:20 PM, Patrick Roy wrote:
> Hey all,
> 
> This RFC series is a rough draft adding support for running
> non-confidential compute VMs in guest_memfd, based on prior discussions
> with Sean [1]. Our specific usecase for this is the ability to unmap
> guest memory from the host kernel's direct map, as a mitigation against
> a large class of speculative execution issues.
> 
> === Implementation ===
> 
> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD`
> to remove its pages from the direct map when they are allocated. When
> trying to run a guest from such a VM, we now face the problem that
> without either userspace or kernelspace mappings of guest_memfd, KVM
> cannot access guest memory to, for example, do MMIO emulation of access
> memory used to guest/host communication. We have multiple options for
> solving this when running non-CoCo VMs: (1) implement a TDX-light
> solution, where the guest shares memory that KVM needs to access, and
> relies on paravirtual solutions where this is not possible (e.g. MMIO),
> (2) have KVM use userspace mappings of guest_memfd (e.g. a
> memfd_secret-style solution), or (3) dynamically reinsert pages into the
> direct map whenever KVM wants to access them.
> 
> This RFC goes for option (3). Option (1) is a lot of overhead for very
> little gain, since we are not actually constrained by a physical
> inability to access guest memory (e.g. we are not in a TDX context where
> accesses to guest memory cause a #MC). Option (2) has previously been
> rejected [1].

Do the pages have to have the same address when they are temporarily mapped?
Wouldn't it be easier to do something similar to kmap_local_page() used for
HIMEM? I.e. you get a temporary kernel mapping to do what's needed, but it
doesn't have to alter the shared directmap.

Maybe that was already discussed somewhere as unsuitable but didn't spot it
here.

> In this patch series, we make sufficient parts of KVM gmem-aware to be
> able to boot a Linux initrd from private memory on x86. These include
> KVM's MMIO emulation (including guest page table walking) and kvm-clock.
> For VM types which do not allow accessing gmem, we return -EFAULT and
> attempt to prepare a KVM_EXIT_MEMORY_FAULT.
> 
> Additionally, this patch series adds support for "restricted" userspace
> mappings of guest_memfd, which work similar to memfd_secret (e.g.
> disallow get_user_pages), which allows handling I/O and loading the
> guest kernel in a simple way. Support for this is completely independent
> of the rest of the functionality introduced in this patch series.
> However, it is required to build a minimal hypervisor PoC that actually
> allows booting a VM from a disk.
> 
> === Performance ===
> 
> We have run some preliminary performance benchmarks to assess the impact
> of on-the-fly direct map manipulations. We were mainly interested in the
> impact of manipulating the direct map for MMIO emulation on virtio-mmio.
> Particularly, we were worried about the impact of the TLB and L1/2/3
> Cache flushes that set_memory_[n]p entails.
> 
> In our setup, we have taken a modified Firecracker VMM, spawned a Linux
> guest with 1 vCPU, and used fio to stress a virtio_blk device. We found
> that the cache flushes caused throughput to drop from around 600MB/s to
> ~50MB/s (~90%) for both reads and writes (on a Intel(R) Xeon(R) Platinum
> 8375C CPU with 64 cores). We then converted our prototype to use
> set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p and
> found that without cache flushes the pure impact of the direct map
> manipulation is indistinguishable from noise. This is why we use
> set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p in
> this RFC.
> 
> Note that in this comparison, both the baseline, as well as the
> guest_memfd-supporting version of Firecracker were made to bounce I/O
> buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
> the virtio stack cannot directly pass guest buffers to read/write
> syscalls.
> 
> === Security ===
> 
> We want to use unmapping guest memory from the host kernel as a security
> mitigation against transient execution attacks. Temporarily restoring
> direct map entries whenever KVM requires access to guest memory leaves a
> gap in this mitigation. We believe this to be acceptable for the above
> cases, since pages used for paravirtual guest/host communication (e.g.
> kvm-clock) and guest page tables do not contain sensitive data. MMIO
> emulation will only end up reading pages containing privileged
> instructions (e.g. guest kernel code).
> 
> === Summary ===
> 
> Patches 1-4 are about hot-patching various points inside of KVM that
> access guest memory to correctly handle the case where memory happens to
> be guest-private. This means either handling the access as a memory
> error, or simply accessing the memslot's guest_memfd instead of looking
> at the userspace provided VMA if the VM type allows these kind of
> accesses. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will
> make it remove its pages from the kernel's direct map. Whenever KVM
> wants to access guest-private memory, it will temporarily re-insert the
> relevant pages. Patches 7-8 allow for restricted userspace mappings
> (e.g. get_user_pages paths are disabled like for memfd_secret) of
> guest_memfd, so that userspace has an easy path for loading the guest
> kernel and handling I/O-buffers.
> 
> === ToDos / Limitations ===
> 
> There are still a few rough edges that need to be addressed before
> dropping the "RFC" tag, e.g.
> 
> * Handle errors of set_direct_map_default_not_flush in
>   kvm_gmem_invalidate_folio instead of calling BUG_ON
> * Lift the limitation of "at most one gfn_to_pfn_cache for each
>   gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map
>   entries when needed"). It currently means that guests with more than 1
>   vcpu fail to boot, because multiple vcpus can put their kvm-clock PV
>   structures into the same page (gfn)
> * Write selftests, particularly around hole punching, direct map removal,
>   and mmap.
> 
> Lastly, there's the question of nested virtualization which Sean brought
> up in previous discussions, which runs into similar problems as MMIO. I
> have looked at it very briefly. On Intel, KVM uses various gfn->uhva
> caches, which run in similar problems as the gfn_to_hva_caches dealt
> with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
> is private"). However, previous attempts at just converting this to
> gfn_to_pfn_cache (which would make them work with guest_memfd) proved
> complicated [2]. I suppose initially, we should probably disallow nested
> virtualization in VMs that have their memory removed from the direct
> map.
> 
> Best,
> Patrick
> 
> [1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com/
> [2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@google.com/
> 
> Patrick Roy (8):
>   kvm: Allow reading/writing gmem using kvm_{read,write}_guest
>   kvm: use slowpath in gfn_to_hva_cache if memory is private
>   kvm: pfncache: enlighten about gmem
>   kvm: x86: support walking guest page tables in gmem
>   kvm: gmem: add option to remove guest private memory from direct map
>   kvm: gmem: Temporarily restore direct map entries when needed
>   mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
>   kvm: gmem: Allow restricted userspace mappings
> 
>  arch/x86/kvm/mmu/paging_tmpl.h |  94 +++++++++++++++++++-----
>  include/linux/kvm_host.h       |   5 ++
>  include/linux/kvm_types.h      |   1 +
>  include/linux/secretmem.h      |  13 +++-
>  include/uapi/linux/kvm.h       |   2 +
>  mm/secretmem.c                 |   6 +-
>  virt/kvm/guest_memfd.c         |  83 +++++++++++++++++++--
>  virt/kvm/kvm_main.c            | 112 +++++++++++++++++++++++++++-
>  virt/kvm/pfncache.c            | 130 +++++++++++++++++++++++++++++----
>  9 files changed, 399 insertions(+), 47 deletions(-)
> 
> 
> base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84


  parent reply	other threads:[~2024-07-22 12:28 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 1/8] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 2/8] kvm: use slowpath in gfn_to_hva_cache if memory is private Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem Patrick Roy
2024-07-09 14:36   ` David Woodhouse
2024-07-10  9:49     ` Patrick Roy
2024-07-10 10:20       ` David Woodhouse
2024-07-10 10:46         ` Patrick Roy
2024-07-10 10:50           ` David Woodhouse
2024-07-09 13:20 ` [RFC PATCH 4/8] kvm: x86: support walking guest page tables in gmem Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map Patrick Roy
2024-07-10  7:31   ` Mike Rapoport
2024-07-10  9:50     ` Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed Patrick Roy
2024-07-11  6:25   ` Paolo Bonzini
2024-07-09 13:20 ` [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP Patrick Roy
2024-07-09 21:09   ` David Hildenbrand
2024-07-10  7:32     ` Mike Rapoport
2024-07-10  9:50       ` Patrick Roy
2024-07-10 21:14         ` David Hildenbrand
2024-07-09 13:20 ` [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings Patrick Roy
2024-07-09 14:48   ` Fuad Tabba
2024-07-09 21:13     ` David Hildenbrand
2024-07-10  9:51       ` Patrick Roy
2024-07-10 21:12         ` David Hildenbrand
2024-07-10 21:53           ` Sean Christopherson
2024-07-10 21:56             ` David Hildenbrand
2024-07-12 15:59           ` Patrick Roy
2024-07-30 10:15             ` David Hildenbrand
2024-08-01 10:30               ` Patrick Roy
2024-07-22 12:28 ` Vlastimil Babka (SUSE) [this message]
2024-07-26  6:55   ` [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
2024-07-30 10:17     ` David Hildenbrand
2024-07-26 16:44 ` Yosry Ahmed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e12b91ef-ca0c-4b77-840b-dcfb2c76a984@kernel.org \
    --to=vbabka@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=chao.p.peng@linux.intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=derekmn@amazon.com \
    --cc=dmatlack@google.com \
    --cc=dwmw@amazon.co.uk \
    --cc=graf@amazon.com \
    --cc=hpa@zytor.com \
    --cc=kalyazin@amazon.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=roypat@amazon.co.uk \
    --cc=rppt@kernel.org \
    --cc=seanjc@google.com \
    --cc=tabba@google.com \
    --cc=tglx@linutronix.de \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=xmarcalx@amazon.co.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.