From: "Cédric Le Goater" <clg@kaod.org>
To: "Chenyi Qiang" <chenyi.qiang@intel.com>,
"David Hildenbrand" <david@redhat.com>,
"Alexey Kardashevskiy" <aik@amd.com>,
"Peter Xu" <peterx@redhat.com>,
"Gupta Pankaj" <pankaj.gupta@amd.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>,
"Michael Roth" <michael.roth@amd.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org,
Williams Dan J <dan.j.williams@intel.com>,
Zhao Liu <zhao1.liu@intel.com>,
Baolu Lu <baolu.lu@linux.intel.com>,
Gao Chao <chao.gao@intel.com>, Xu Yilun <yilun.xu@intel.com>,
Li Xiaoyao <xiaoyao.li@intel.com>,
'Alex Williamson' <alex.williamson@redhat.com>
Subject: Re: [PATCH v5 00/10] Enable shared device assignment
Date: Mon, 26 May 2025 13:37:07 +0200
Message-ID: <7283f8f2-a9d9-4e7d-bfbd-3854b3d1736e@kaod.org>
In-Reply-To: <20250520102856.132417-1-chenyi.qiang@intel.com>
On 5/20/25 12:28, Chenyi Qiang wrote:
> This is the v5 series of the shared device assignment support.
>
> As discussed in the v4 series [1], the GenericStateManager parent class
> and PrivateSharedManager child interface were deemed to be the wrong
> direction. This series reverts to the original single
> RamDiscardManager interface and leaves the co-existence of multiple
> pairs of state management as future work. For example, if we want
> virtio-mem to co-exist with guest_memfd, a new framework will be
> needed to combine the private/shared/discard states [2].
>
> Another change since the last version is the error handling of memory
> conversion. Currently, the failure of kvm_convert_memory() causes QEMU
> to quit instead of resuming the guest. The complex rollback operation
> doesn't add value and merely adds code that is difficult to test.
> In the future, however, conversion paths are likely to hit more
> errors, such as an unmap failure during shared-to-private in-place
> conversion. This series keeps complex error handling out of the picture
> for now and attaches related handling at the end of the series for
> future extension.
>
> Apart from the two items of future work above, there is also some
> optimization work left for the future, i.e., using a more
> memory-efficient mechanism than a bitmap to track ranges of
> contiguous states [3]. This series still uses a bitmap for simplicity.
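
A minimal sketch of the bitmap approach mentioned above, assuming one bit
per 4K block with "set" meaning shared. The names SharedBitmap,
shared_bitmap_set_range() and shared_bitmap_run_length() are hypothetical
and are not taken from the series; the fixed MAX_BLOCKS cap only keeps the
sketch free of allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096u   /* one bit per 4K block */
#define MAX_BLOCKS 1024u   /* hypothetical cap so the sketch needs no allocation */

/* Hypothetical tracker: bit set = shared, bit clear = private. */
typedef struct {
    uint8_t bits[MAX_BLOCKS / 8];
    size_t nr_blocks;
} SharedBitmap;

static void shared_bitmap_init(SharedBitmap *bm, size_t nr_blocks)
{
    assert(nr_blocks <= MAX_BLOCKS);
    memset(bm->bits, 0, sizeof(bm->bits));   /* everything starts private */
    bm->nr_blocks = nr_blocks;
}

static bool block_is_shared(const SharedBitmap *bm, size_t idx)
{
    assert(idx < bm->nr_blocks);
    return (bm->bits[idx / 8] >> (idx % 8)) & 1u;
}

/* Mark [offset, offset + size) shared or private; both must be 4K aligned. */
static void shared_bitmap_set_range(SharedBitmap *bm, uint64_t offset,
                                    uint64_t size, bool shared)
{
    assert(offset % BLOCK_SIZE == 0 && size % BLOCK_SIZE == 0);
    size_t first = (size_t)(offset / BLOCK_SIZE);
    size_t last = first + (size_t)(size / BLOCK_SIZE);
    assert(last <= bm->nr_blocks);

    for (size_t i = first; i < last; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (shared) {
            bm->bits[i / 8] |= mask;
        } else {
            bm->bits[i / 8] &= (uint8_t)~mask;
        }
    }
}

/* Number of consecutive blocks, starting at idx, that share idx's state. */
static size_t shared_bitmap_run_length(const SharedBitmap *bm, size_t idx)
{
    bool state = block_is_shared(bm, idx);
    size_t end = idx + 1;

    while (end < bm->nr_blocks && block_is_shared(bm, end) == state) {
        end++;
    }
    return end - idx;
}
```

The run-length helper is what a replay or notify path would use to walk
contiguous shared or private ranges. A range-based structure would store
those runs directly instead of one bit per block, which is the
memory-efficiency trade-off referred to in [3].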
>
> The overview of this series:
> - Patch 1-3: Preparation patches. These include function exposure and
> some definition changes to return values.
> - Patch 4-5: Introduce a new object to implement RamDiscardManager
> interface and a helper to notify the shared/private state change.
> - Patch 6: Store the new object including guest_memfd information in
> RAMBlock. Register the RamDiscardManager instance to the target
> RAMBlock's MemoryRegion so that the RamDiscardManager users can run in
> the specific path.
> - Patch 7: Unlock the coordinated discard so that the shared device
> assignment (VFIO) can work with guest_memfd. After this patch, the
> basic device assignment functionality can work properly.
> - Patch 8-9: Some cleanup work. Move the state change handling into a
> RamDiscardListener so that it can be invoked together with the VFIO
> listener by the state_change() call. This series dropped the priority
> support in v4 which is required by in-place conversions, because the
> conversion path will likely change.
> - Patch 10: More complex error handling, including rollback and the
> mixed-state conversion case.
>
> More small changes or details can be found in the individual patches.
>
> ---
> Original cover letter:
>
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
>
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. In the current implementation, shared memory is allocated
> with normal methods (e.g. mmap or fallocate) while private memory is
> allocated from guest_memfd. When a VM performs memory conversions, QEMU
> frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
> one side, and allocates new pages from the other side. This will cause a
> stale IOMMU mapping issue mentioned in [4] when we try to enable shared
> device assignment in confidential VMs.
>
> Solution
> ========
> The key to enabling shared device assignment is to update the IOMMU mappings
> on page conversion. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Although the required operations in
> VFIO for page conversion are similar to memory plug/unplug, the states of
> private/shared are different from discard/populated. We want a mechanism
> similar to RamDiscardManager, but one that manages the private and
> shared states.
>
> This series introduces a new parent abstract class to manage a pair of
> opposite states with RamDiscardManager as its child to manage
> populate/discard states, and introduce a new child class,
> PrivateSharedManager, which can also utilize the same infrastructure to
> notify VFIO of page conversions.
>
> Relationship with in-place page conversion
> ==========================================
> To support 1G pages for guest_memfd [5], the current direction is to
> allow mmap() of guest_memfd to userspace so that both private and shared
> memory can use the same physical pages as the backend. This in-place page
> conversion design eliminates the need to discard pages during shared/private
> conversions. However, device assignment will still be blocked because the
> in-place page conversion will reject the conversion when the page is pinned
> by VFIO.
>
> The key difference in addressing this lies in the ordering of VFIO
> map/unmap operations relative to the page conversion. The ordering can
> be adjusted to achieve unmap-before-conversion-to-private and
> map-after-conversion-to-shared, ensuring compatibility with guest_memfd.
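
The ordering described above could be sketched roughly as follows. This is
only an illustration: EventLog, record() and convert_range() are
hypothetical stand-ins that log which step runs when, and they are not the
QEMU or VFIO functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical steps involved in one conversion. */
typedef enum {
    EV_VFIO_UNMAP,    /* drop the DMA mapping (and the page pin) */
    EV_VFIO_MAP,      /* establish the DMA mapping */
    EV_SET_PRIVATE,   /* flip the page attribute to private */
    EV_SET_SHARED     /* flip the page attribute to shared */
} Event;

typedef struct {
    Event log[8];
    size_t len;
} EventLog;

static void record(EventLog *l, Event e)
{
    assert(l->len < sizeof(l->log) / sizeof(l->log[0]));
    l->log[l->len++] = e;
}

/*
 * Order the VFIO operation around the attribute change:
 *   shared -> private: unmap first, so the page is no longer pinned
 *                      when the in-place conversion happens;
 *   private -> shared: convert first, so VFIO only ever maps memory
 *                      that is already shared.
 */
static void convert_range(EventLog *l, bool to_private)
{
    if (to_private) {
        record(l, EV_VFIO_UNMAP);
        record(l, EV_SET_PRIVATE);
    } else {
        record(l, EV_SET_SHARED);
        record(l, EV_VFIO_MAP);
    }
}
```

With this ordering the page is never pinned by VFIO at the moment an
in-place conversion to private is attempted.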
>
> Limitation
> ==========
> One limitation is that VFIO expects the DMA mapping for a specific IOVA
> to be mapped and unmapped with the same granularity. The guest may
> perform partial conversions, such as converting a small region within a
> larger region. To prevent such invalid cases, all operations are
> performed with 4K granularity. This could be optimized after the
> cut_mapping operation [6] is introduced in the future. We can then
> always perform a split-before-unmap if partial conversions happen. If
> the split succeeds,
> the unmap will succeed and be atomic. If the split fails, the unmap
> process fails.
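
To illustrate why 4K granularity sidesteps the limitation, here is a
deliberately simplified model in which an unmap only succeeds when it
matches an existing mapping exactly. DmaTable, dma_map(), dma_unmap() and
the *_4k() helpers are hypothetical and only model the "same granularity"
rule stated above; they are not the VFIO interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE 4096u
#define MAX_MAPPINGS 64

typedef struct {
    uint64_t iova;
    uint64_t size;
    bool used;
} DmaMapping;

typedef struct {
    DmaMapping slot[MAX_MAPPINGS];
} DmaTable;

/* Record one mapping; fails only when the table is full. */
static bool dma_map(DmaTable *t, uint64_t iova, uint64_t size)
{
    for (size_t i = 0; i < MAX_MAPPINGS; i++) {
        if (!t->slot[i].used) {
            t->slot[i] = (DmaMapping){ .iova = iova, .size = size, .used = true };
            return true;
        }
    }
    return false;
}

/* Simplified rule: an unmap must match an existing mapping exactly. */
static bool dma_unmap(DmaTable *t, uint64_t iova, uint64_t size)
{
    for (size_t i = 0; i < MAX_MAPPINGS; i++) {
        DmaMapping *m = &t->slot[i];
        if (m->used && m->iova == iova && m->size == size) {
            m->used = false;
            return true;
        }
    }
    return false;
}

/* Map a range as individual 4K mappings. */
static bool map_range_4k(DmaTable *t, uint64_t iova, uint64_t size)
{
    assert(iova % GRANULE == 0 && size % GRANULE == 0);
    for (uint64_t off = 0; off < size; off += GRANULE) {
        if (!dma_map(t, iova + off, GRANULE)) {
            return false;
        }
    }
    return true;
}

/* Unmap a range 4K at a time; any sub-range of a 4K-mapped area works. */
static bool unmap_range_4k(DmaTable *t, uint64_t iova, uint64_t size)
{
    assert(iova % GRANULE == 0 && size % GRANULE == 0);
    for (uint64_t off = 0; off < size; off += GRANULE) {
        if (!dma_unmap(t, iova + off, GRANULE)) {
            return false;
        }
    }
    return true;
}
```

In this model a single 2M mapping cannot be partially unmapped, while a
range mapped through map_range_4k() can have any 4K-aligned sub-range
removed, at the cost of many more mappings.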
>
> Testing
> =======
> This patch series is tested based on TDX patches available at:
> KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-coco-queue-snapshot-20250408
> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-20
>
> Because new features such as the cut_mapping operation will only be
> supported in iommufd, it is recommended to use iommufd-backed VFIO
> with the following QEMU command:
Is it recommended or required? If the VFIO IOMMU type1 backend is not
supported for confidential VMs, QEMU should fail to start.

Please add Alex Williamson and me to the Cc: list.

Thanks,

C.
> qemu-system-x86_64 [...]
> -object iommufd,id=iommufd0 \
> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>
> Following the bootup of the TD guest, the guest's IP address becomes
> visible, and iperf is able to successfully send and receive data.
>
> Related link
> ============
> [1] https://lore.kernel.org/qemu-devel/20250407074939.18657-1-chenyi.qiang@intel.com/
> [2] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/
> [3] https://lore.kernel.org/qemu-devel/96ab7fa9-bd7a-444d-aef8-8c9c30439044@redhat.com/
> [4] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
> [5] https://lore.kernel.org/kvm/cover.1747264138.git.ackerleytng@google.com/
> [6] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com/
>
>
> Chenyi Qiang (10):
> memory: Export a helper to get intersection of a MemoryRegionSection
> with a given range
> memory: Change memory_region_set_ram_discard_manager() to return the
> result
> memory: Unify the definiton of ReplayRamPopulate() and
> ReplayRamDiscard()
> ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock
> with guest_memfd
> ram-block-attribute: Introduce a helper to notify shared/private state
> changes
> memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks
> RAMBlock: Make guest_memfd require coordinate discard
> memory: Change NotifyRamDiscard() definition to return the result
> KVM: Introduce RamDiscardListener for attribute changes during memory
> conversions
> ram-block-attribute: Add more error handling during state changes
>
> MAINTAINERS | 1 +
> accel/kvm/kvm-all.c | 79 ++-
> hw/vfio/listener.c | 6 +-
> hw/virtio/virtio-mem.c | 83 ++--
> include/system/confidential-guest-support.h | 9 +
> include/system/memory.h | 76 ++-
> include/system/ramblock.h | 22 +
> migration/ram.c | 33 +-
> system/memory.c | 22 +-
> system/meson.build | 1 +
> system/physmem.c | 18 +-
> system/ram-block-attribute.c | 514 ++++++++++++++++++++
> target/i386/kvm/tdx.c | 1 +
> target/i386/sev.c | 1 +
> 14 files changed, 770 insertions(+), 96 deletions(-)
> create mode 100644 system/ram-block-attribute.c
>