qemu-devel.nongnu.org archive mirror
From: Chenyi Qiang <chenyi.qiang@intel.com>
To: "Paolo Bonzini" <pbonzini@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Michael Roth" <michael.roth@amd.com>
Cc: <qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
	Williams Dan J <dan.j.williams@intel.com>,
	Edgecombe Rick P <rick.p.edgecombe@intel.com>,
	Wang Wei W <wei.w.wang@intel.com>,
	Peng Chao P <chao.p.peng@intel.com>,
	"Gao Chao" <chao.gao@intel.com>, Wu Hao <hao.wu@intel.com>,
	Xu Yilun <yilun.xu@intel.com>
Subject: Re: [RFC PATCH 0/6] Enable shared device assignment
Date: Fri, 16 Aug 2024 11:02:23 +0800
Message-ID: <b7197241-7826-49b7-8dfc-04ffecb8a54b@intel.com>
In-Reply-To: <20240725072118.358923-1-chenyi.qiang@intel.com>

Hi Paolo,

I'd like to draw your attention to this series. TEE I/O depends on shared
device assignment, and this series introduces a RamDiscardManager (RDM)
based solution for it in QEMU. Given the in-place private/shared
conversion option mentioned by David, do you think we should continue to
add pass-through support for the current in-QEMU page conversion method?
Or should we wait for that discussion to conclude and see whether
conversion moves into the kernel?

Thanks
Chenyi

On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") effectively disables device assignment with guest_memfd.
> guest_memfd is required for confidential guests, so device assignment to
> confidential guests is disabled. A supporting assumption for disabling
> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
> etc...) solves the confidential-guest device-assignment problem [1].
> That turns out not to be the case because TEE I/O depends on being able
> to operate devices against "shared"/untrusted memory for device
> initialization and error recovery scenarios.
> 
> This series utilizes an existing framework named RamDiscardManager to
> notify VFIO of page conversions. However, there's still one concern
> about the semantics of RamDiscardManager, which manages the memory
> plug/unplug state. That differs slightly from the shared/private
> memory state this series needs to track. See the "Open" section below
> for more details.
> 
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
> 
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. The key differences between guest_memfd and normal memfd
> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
> cannot be mapped, read or written by userspace.
> 
> In QEMU's implementation, shared memory is allocated with normal methods
> (e.g. mmap or fallocate) while private memory is allocated from
> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
> allocates new pages from the other side.
> 
> Problem
> =======
> Device assignment in QEMU is implemented via the VFIO subsystem. In a
> normal VM, guest memory is pinned up front by VFIO. In a
> confidential VM, the guest can convert memory, and when that happens
> nothing currently tells VFIO that its mappings are stale. This means
> that page conversion leaks memory and leaves stale IOMMU mappings. For
> example, a sequence like the following can result in stale IOMMU mappings:
> 
> 1. allocate shared page
> 2. convert page shared->private
> 3. discard shared page
> 4. convert page private->shared
> 5. allocate shared page
> 6. issue DMA operations against that shared page
> 
> After step 3, VFIO still pins the page allocated in step 1. As a
> result, DMA operations in step 6 hit the old mapping established in
> step 1, so the device accesses stale data rather than the new page.
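
A minimal sketch of the sequence above, as a toy model only. The
dictionaries, the page labels and the notify_vfio flag are hypothetical
and do not correspond to QEMU or VFIO code. It shows that without a
conversion notification the IOMMU entry keeps pointing at the page from
step 1, while the guest's shared memory is now backed by a different page.

```python
def run_sequence(notify_vfio):
    """Replay steps 1-6 and return (page the device sees, page backing the GPA)."""
    iommu = {}    # GPA -> host page, as programmed through VFIO (toy model)
    backing = {}  # GPA -> host page currently backing shared memory
    gpa = 0x1000

    # Step 1: allocate a shared page; VFIO pins and maps it.
    backing[gpa] = "page-A"
    iommu[gpa] = backing[gpa]

    # Steps 2-3: convert shared->private and discard the shared page.
    del backing[gpa]
    if notify_vfio:
        del iommu[gpa]  # hypothetical unmap on conversion to private

    # Steps 4-5: convert private->shared and allocate a new shared page.
    backing[gpa] = "page-B"
    if notify_vfio:
        iommu[gpa] = backing[gpa]  # hypothetical map on conversion to shared

    # Step 6: device DMA is translated by the IOMMU mapping.
    return iommu[gpa], backing[gpa]


if __name__ == "__main__":
    print("without notification:", run_sequence(False))
    print("with notification:   ", run_sequence(True))
```

In the first case the two returned pages differ, which is the stale
mapping described above. In the second case they match.
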
> 
> Currently, commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") blocks device assignment with guest_memfd to
> avoid this problem.
> 
> Solution
> ========
> The key to enabling shared device assignment is solving the stale IOMMU
> mappings problem.
> 
> Given these constraints and assumptions, here is a solution that
> satisfies the use cases. RamDiscardManager, an existing interface
> currently used by virtio-mem, offers a means to modify IOMMU mappings
> in accordance with VM page assignment. Page conversion is similar to
> hot-removing a page in one mode and adding it back in the other.
> 
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions.
> 
> Another possible attempt [2] was to not discard shared pages in step 3
> above. This was an incomplete band-aid because guests would consume
> twice the memory since shared pages wouldn't be freed even after they
> were converted to private.
> 
> Open
> ====
> Implementing a RamDiscardManager to notify VFIO of page conversions
> causes changes in semantics: private memory is treated as discarded (or
> hot-removed) memory. This isn't aligned with the expectations of current
> RamDiscardManager users (e.g. VFIO or live migration), which
> expect that discarded memory is hot-removed and can therefore be skipped
> when they process guest memory. Treating private memory as discarded
> won't work in the future if VFIO or live migration needs to handle
> private memory. For example, VFIO may need to map private memory to
> support Trusted IO, and live migration for confidential VMs needs to
> migrate private memory.
> 
> There are two possible ways to mitigate the semantics changes.
> 1. Develop a new mechanism to notify users of page conversions between
> private and shared. For example, utilize the notifier_list in QEMU. VFIO
> registers its own handler and gets notified upon page conversions. This
> is a clean approach which only touches the notifier workflow. A
> challenge is that for device hotplug, existing shared memory must be
> mapped in the IOMMU. This will need additional changes.
> 
> 2. Extend the existing RamDiscardManager interface to manage not only
> the discarded/populated status of guest memory but also the
> shared/private status. RamDiscardManager users like VFIO will be
> notified with one more argument indicating what change is happening and
> can take action accordingly. This also has challenges. For example, QEMU
> allows only one RamDiscardManager, so supporting virtio-mem for
> confidential VMs would be a problem. Some APIs exposed by
> RamDiscardManager, such as .is_populated(), are also meaningless for
> shared/private memory and may need adjustments.
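
A rough sketch of the notification idea behind both options, written as a
toy model. The names ConversionNotifier, ToyVfioListener and Change are
hypothetical and are not QEMU APIs. Listeners receive an extra argument
naming the kind of change (as in option 2), and a listener that registers
late gets the existing shared pages replayed so that a hot-plugged device
ends up with them mapped (the hotplug challenge noted under option 1).

```python
from enum import Enum


class Change(Enum):
    TO_SHARED = "to_shared"
    TO_PRIVATE = "to_private"


class ConversionNotifier:
    """Hypothetical tracker of shared pages with listener callbacks."""

    def __init__(self):
        self.shared = set()   # page indexes that are currently shared
        self.listeners = []

    def register(self, listener):
        # Replay current shared pages so a late listener catches up.
        self.listeners.append(listener)
        for page in sorted(self.shared):
            listener(page, Change.TO_SHARED)

    def convert(self, page, change):
        if change is Change.TO_SHARED:
            self.shared.add(page)
        else:
            self.shared.discard(page)
        for listener in self.listeners:
            listener(page, change)


class ToyVfioListener:
    """Stands in for a VFIO handler: maps shared pages, unmaps private ones."""

    def __init__(self):
        self.mapped = set()

    def __call__(self, page, change):
        if change is Change.TO_SHARED:
            self.mapped.add(page)
        else:
            self.mapped.discard(page)


if __name__ == "__main__":
    notifier = ConversionNotifier()
    notifier.convert(1, Change.TO_SHARED)
    notifier.convert(2, Change.TO_SHARED)
    notifier.convert(1, Change.TO_PRIVATE)

    hotplugged = ToyVfioListener()
    notifier.register(hotplugged)   # replay maps page 2 only
    notifier.convert(3, Change.TO_SHARED)
    print(sorted(hotplugged.mapped))
```
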
> 
> Testing
> =======
> This patch series was tested on top of the internal TDX KVM/QEMU tree.
> 
> To exercise shared device assignment with a NIC using the legacy
> type1 VFIO backend, use the QEMU command:
> 
> qemu-system-x86_64 [...] \
>     -device vfio-pci,host=XX:XX.X
> 
> The dma_entry_limit parameter needs to be adjusted. For example, a
> 16GB guest needs a setting such as
> vfio_iommu_type1.dma_entry_limit=4194304.
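
The example value is consistent with one DMA mapping entry per 4 KiB page
of guest memory. The page size is an assumption here, since the text does
not state it, and dma_entries_needed is a hypothetical helper used only
to show the arithmetic:

```python
GIB = 1 << 30


def dma_entries_needed(guest_bytes, page_bytes=4096):
    """Entries needed if every page is mapped separately (assumed 4 KiB pages)."""
    return guest_bytes // page_bytes


if __name__ == "__main__":
    print(dma_entries_needed(16 * GIB))  # 4194304
```
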
> 
> With the iommufd-backed VFIO, use the QEMU command:
> 
> qemu-system-x86_64 [...] \
>     -object iommufd,id=iommufd0 \
>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
> 
> In this case no additional adjustment is required.
> 
> After the TD guest boots, its IP address becomes visible and iperf
> successfully sends and receives data.
> 
> Related links
> =============
> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
> 
> Chenyi Qiang (6):
>   guest_memfd: Introduce an object to manage the guest-memfd with
>     RamDiscardManager
>   guest_memfd: Introduce a helper to notify the shared/private state
>     change
>   KVM: Notify the state change via RamDiscardManager helper during
>     shared/private conversion
>   memory: Register the RamDiscardManager instance upon guest_memfd
>     creation
>   guest-memfd: Default to discarded (private) in guest_memfd_manager
>   RAMBlock: make guest_memfd require coordinate discard
> 
>  accel/kvm/kvm-all.c                  |   7 +
>  include/sysemu/guest-memfd-manager.h |  49 +++
>  system/guest-memfd-manager.c         | 425 +++++++++++++++++++++++++++
>  system/meson.build                   |   1 +
>  system/physmem.c                     |  11 +-
>  5 files changed, 492 insertions(+), 1 deletion(-)
>  create mode 100644 include/sysemu/guest-memfd-manager.h
>  create mode 100644 system/guest-memfd-manager.c
> 
> 
> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819

