From: Chenyi Qiang <chenyi.qiang@intel.com>
To: "David Hildenbrand" <david@redhat.com>,
"Alexey Kardashevskiy" <aik@amd.com>,
"Peter Xu" <peterx@redhat.com>,
"Gupta Pankaj" <pankaj.gupta@amd.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>,
"Michael Roth" <michael.roth@amd.com>
Cc: Chenyi Qiang <chenyi.qiang@intel.com>,
qemu-devel@nongnu.org, kvm@vger.kernel.org,
Williams Dan J <dan.j.williams@intel.com>,
Peng Chao P <chao.p.peng@intel.com>,
Gao Chao <chao.gao@intel.com>, Xu Yilun <yilun.xu@intel.com>,
Li Xiaoyao <xiaoyao.li@intel.com>
Subject: [PATCH v4 00/13] Enable shared device assignment
Date: Mon, 7 Apr 2025 15:49:20 +0800 [thread overview]
Message-ID: <20250407074939.18657-1-chenyi.qiang@intel.com> (raw)
This is the v4 series of the shared device assignment support.
Compared with v3 series, the main changes are:
- Introduced a new GenericStateManager parent class, so that the existing
RamDiscardManager and new PrivateSharedManager can be its child class
and manage different states.
- Changed the name of MemoryAttributeManager to RamBlockAttribute to
distinguish from the XXXManager interface and still use it to manage
guest_memfd information. Meanwhile, Use it to implement
PrivateSharedManager instead of RamDiscardManager to distinguish the
states of populate/discard and shared/private.
- Moved the attribute change operations into a listener so that both the
attribute change and IOMMU pins can be invoked in listener callbacks.
- Added priority listener support in PrivateSharedListener so that the
attribute change listener and VFIO listener can be triggered in
expected order to comply with in-place conversin requirement.
- v3: https://lore.kernel.org/qemu-devel/20250310081837.13123-1-chenyi.qiang@intel.com/
The overview of this series:
- Patch 1-3: preparation patches. These include function exposure and
some definition changes to return values.
- Patch 4: Introduce a generic state change parent class with
RamDiscardManager as its child class. This paves the way to introduce
new child classes to manage other memory states.
- Patch 5-6: Introduce a new child class, PrivateSharedManager, to
manage the private and shared states. Also adds VFIO support for this
new interface to coordinate RAM discard support.
- Patch 7-9: Introduce a new object to implement the
PrivateSharedManager interface and a callback to notify the
shared/private state change. Stores it in RAMBlocks and register it in
the target MemoryRegion so that the object can notify page conversion
events to other systems.
- Patch 10-11: Moves the state change handling into a
PrivateSharedListener so that it can be invoked together with the VFIO
listener by the state_change() call.
- Patch 12: To comply with in-place conversion, introduces the priority
listener support so that the attribute change and IOMMU pin can follow
the expected order.
- Patch 13: Unlocks the coordinate discard so that the shared device
assignment (VFIO) can work with guest_memfd.
More small changes or details can be found in the individual patches.
---
Original cover letter with minor changes related to new parent class:
Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.
"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. In current implementation, shared memory is allocated
with normal methods (e.g. mmap or fallocate) while private memory is
allocated from guest_memfd. When a VM performs memory conversions, QEMU
frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
one side, and allocates new pages from the other side. This will cause a
stale IOMMU mapping issue mentioned in [1] when we try to enable shared
device assignment in confidential VMs.
Solution
========
The key to enable shared device assignment is to update the IOMMU mappings
on page conversion. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Although the required operations in
VFIO for page conversion are similar to memory plug/unplug, the states of
private/shared are different from discard/populated. We want a similar
mechanism with RamDiscardManager but used to manage the state of private
and shared.
This series introduce a new parent abstract class to manage a pair of
opposite states with RamDiscardManager as its child to manage
populate/discard states, and introduce a new child class,
PrivateSharedManager, which can also utilize the same infrastructure to
notify VFIO of page conversions.
Relationship with in-place page conversion
==========================================
To support 1G page support for guest_memfd [2], the current direction is to
allow mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages as the backend. This in-place page
conversion design eliminates the need to discard pages during shared/private
conversions. However, device assignment will still be blocked because the
in-place page conversion will reject the conversion when the page is pinned
by VFIO.
To address this, the key difference lies in the sequence of VFIO map/unmap
operations and the page conversion. It can be adjusted to achieve
unmap-before-conversion-to-private and map-after-conversion-to-shared,
ensuring compatibility with guest_memfd.
Limitation
==========
One limitation is that VFIO expects the DMA mapping for a specific IOVA
to be mapped and unmapped with the same granularity. The guest may
perform partial conversions, such as converting a small region within a
larger region. To prevent such invalid cases, all operations are
performed with 4K granularity. This could be optimized after the
cut_mapping operation [3] is introduced in future. We can alway perform a
split-before-unmap if partial conversions happen. If the split succeeds,
the unmap will succeed and be atomic. If the split fails, the unmap
process fails.
Testing
=======
This patch series is tested based on TDX patches available at:
KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-coco-queue-snapshot-20250322
(With the revert of HEAD commit)
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-04-07
To facilitate shared device assignment with the NIC, employ the legacy
type1 VFIO with the QEMU command:
qemu-system-x86_64 [...]
-device vfio-pci,host=XX:XX.X
The parameter of dma_entry_limit needs to be adjusted. For example, a
16GB guest needs to adjust the parameter like
vfio_iommu_type1.dma_entry_limit=4194304.
If use the iommufd-backed VFIO with the qemu command:
qemu-system-x86_64 [...]
-object iommufd,id=iommufd0 \
-device vfio-pci,host=XX:XX.X,iommufd=iommufd0
No additional adjustment required.
Following the bootup of the TD guest, the guest's IP address becomes
visible, and iperf is able to successfully send and receive data.
Related link
============
[1] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
[2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
[3] https://lore.kernel.org/linux-iommu/7-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
Chenyi Qiang (13):
memory: Export a helper to get intersection of a MemoryRegionSection
with a given range
memory: Change memory_region_set_ram_discard_manager() to return the
result
memory: Unify the definiton of ReplayRamPopulate() and
ReplayRamDiscard()
memory: Introduce generic state change parent class for
RamDiscardManager
memory: Introduce PrivateSharedManager Interface as child of
GenericStateManager
vfio: Add the support for PrivateSharedManager Interface
ram-block-attribute: Introduce RamBlockAttribute to manage RAMBLock
with guest_memfd
ram-block-attribute: Introduce a callback to notify shared/private
state changes
memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks
memory: Change NotifyStateClear() definition to return the result
KVM: Introduce CVMPrivateSharedListener for attribute changes during
page conversions
ram-block-attribute: Add priority listener support for
PrivateSharedListener
RAMBlock: Make guest_memfd require coordinate discard
accel/kvm/kvm-all.c | 81 +++-
hw/vfio/common.c | 131 +++++-
hw/vfio/container-base.c | 1 +
hw/virtio/virtio-mem.c | 168 +++----
include/exec/memory.h | 407 ++++++++++------
include/exec/ramblock.h | 25 +
include/hw/vfio/vfio-container-base.h | 10 +
include/system/confidential-guest-support.h | 10 +
migration/ram.c | 21 +-
system/memory.c | 137 ++++--
system/memory_mapping.c | 6 +-
system/meson.build | 1 +
system/physmem.c | 20 +-
system/ram-block-attribute.c | 495 ++++++++++++++++++++
target/i386/kvm/tdx.c | 1 +
target/i386/sev.c | 1 +
16 files changed, 1192 insertions(+), 323 deletions(-)
create mode 100644 system/ram-block-attribute.c
--
2.43.5
next reply other threads:[~2025-04-07 7:51 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-07 7:49 Chenyi Qiang [this message]
2025-04-07 7:49 ` [PATCH v4 01/13] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-04-09 2:47 ` Alexey Kardashevskiy
2025-04-09 6:26 ` Chenyi Qiang
2025-04-09 6:45 ` Alexey Kardashevskiy
2025-04-09 7:38 ` Chenyi Qiang
2025-05-12 3:24 ` Zhao Liu
2025-04-07 7:49 ` [PATCH v4 02/13] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
2025-04-07 9:53 ` Xiaoyao Li
2025-04-08 0:50 ` Chenyi Qiang
2025-04-09 5:35 ` Alexey Kardashevskiy
2025-04-09 5:52 ` Chenyi Qiang
2025-04-25 12:35 ` David Hildenbrand
2025-04-07 7:49 ` [PATCH v4 03/13] memory: Unify the definiton of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
2025-04-09 5:43 ` Alexey Kardashevskiy
2025-04-09 6:56 ` Chenyi Qiang
2025-04-25 12:44 ` David Hildenbrand
2025-04-25 12:42 ` David Hildenbrand
2025-04-27 2:13 ` Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 04/13] memory: Introduce generic state change parent class for RamDiscardManager Chenyi Qiang
2025-04-09 9:56 ` Alexey Kardashevskiy
2025-04-09 12:57 ` Chenyi Qiang
2025-04-10 0:11 ` Alexey Kardashevskiy
2025-04-10 1:44 ` Chenyi Qiang
2025-04-16 3:32 ` Chenyi Qiang
2025-04-17 23:10 ` Alexey Kardashevskiy
2025-04-18 3:49 ` Chenyi Qiang
2025-04-25 12:54 ` David Hildenbrand
2025-04-25 12:49 ` David Hildenbrand
2025-04-27 1:33 ` Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 05/13] memory: Introduce PrivateSharedManager Interface as child of GenericStateManager Chenyi Qiang
2025-04-09 9:56 ` Alexey Kardashevskiy
2025-04-10 3:47 ` Chenyi Qiang
2025-04-25 12:57 ` David Hildenbrand
2025-04-27 1:40 ` Chenyi Qiang
2025-04-29 10:01 ` David Hildenbrand
2025-04-07 7:49 ` [PATCH v4 06/13] vfio: Add the support for PrivateSharedManager Interface Chenyi Qiang
2025-04-09 9:58 ` Alexey Kardashevskiy
2025-04-10 5:53 ` Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 07/13] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBLock with guest_memfd Chenyi Qiang
2025-04-09 9:57 ` Alexey Kardashevskiy
2025-04-10 7:37 ` Chenyi Qiang
2025-05-09 6:41 ` Baolu Lu
2025-05-09 7:55 ` Chenyi Qiang
2025-05-09 8:18 ` David Hildenbrand
2025-05-09 10:37 ` Chenyi Qiang
2025-05-12 8:07 ` Zhao Liu
2025-05-12 9:43 ` Chenyi Qiang
2025-05-13 8:31 ` Zhao Liu
2025-05-14 1:39 ` Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 08/13] ram-block-attribute: Introduce a callback to notify shared/private state changes Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 09/13] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 10/13] memory: Change NotifyStateClear() definition to return the result Chenyi Qiang
2025-04-27 2:26 ` Chenyi Qiang
2025-05-09 2:38 ` Chao Gao
2025-05-09 8:20 ` David Hildenbrand
2025-05-09 9:19 ` Chenyi Qiang
2025-05-09 8:22 ` Baolu Lu
2025-05-09 10:04 ` Chenyi Qiang
2025-05-12 7:54 ` David Hildenbrand
2025-04-07 7:49 ` [PATCH v4 11/13] KVM: Introduce CVMPrivateSharedListener for attribute changes during page conversions Chenyi Qiang
2025-05-09 9:03 ` Baolu Lu
2025-05-12 3:18 ` Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 12/13] ram-block-attribute: Add priority listener support for PrivateSharedListener Chenyi Qiang
2025-05-09 9:23 ` Baolu Lu
2025-05-09 9:39 ` Chenyi Qiang
2025-04-07 7:49 ` [PATCH v4 13/13] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250407074939.18657-1-chenyi.qiang@intel.com \
--to=chenyi.qiang@intel.com \
--cc=aik@amd.com \
--cc=chao.gao@intel.com \
--cc=chao.p.peng@intel.com \
--cc=dan.j.williams@intel.com \
--cc=david@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=michael.roth@amd.com \
--cc=pankaj.gupta@amd.com \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=philmd@linaro.org \
--cc=qemu-devel@nongnu.org \
--cc=xiaoyao.li@intel.com \
--cc=yilun.xu@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).