* [RFC PATCH 0/6] Enable shared device assignment
@ 2024-07-25 7:21 Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 1/6] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
` (7 more replies)
0 siblings, 8 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-25 7:21 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") effectively disables device assignment with guest_memfd.
guest_memfd is required for confidential guests, so device assignment to
confidential guests is disabled. A supporting assumption for disabling
device assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO,
etc.) solves the confidential-guest device-assignment problem [1].
That turns out not to be the case because TEE I/O depends on being able
to operate devices against "shared"/untrusted memory for device
initialization and error recovery scenarios.
This series utilizes an existing framework named RamDiscardManager to
notify VFIO of page conversions. However, one concern remains about the
semantics of RamDiscardManager, which is designed to manage the memory
plug/unplug state; that is a little different from the shared/private
state we need to track. See the "Open" section below for more details.
Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.
"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. The key differences between guest_memfd and normal memfd
are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
cannot be mapped, read or written by userspace.
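For illustration only (not code from this series, and assuming a kernel
with guest_memfd support): creating a guest_memfd boils down to the
KVM_CREATE_GUEST_MEMFD ioctl on the VM fd; vm_fd and the size below are
placeholders.

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Illustrative sketch: vm_fd is an already-created KVM VM fd. */
    struct kvm_create_guest_memfd args = {
        .size  = 4ULL << 30,   /* e.g. 4 GiB of guest-private memory */
        .flags = 0,
    };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
    /* gmem_fd backs private memory only; it cannot be mmap()ed, read or
     * written from userspace. */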
In QEMU's implementation, shared memory is allocated with normal methods
(e.g. mmap or fallocate) while private memory is allocated from
guest_memfd. When a VM performs memory conversions, QEMU frees pages on
one side via madvise() or PUNCH_HOLE on the memfd or guest_memfd, and
allocates new pages on the other side.
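As a rough sketch of the "free from one side" step (simplified; QEMU
wraps this in ram_block_discard_range() and related helpers, and all
names below are placeholders):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>

    /* shared -> private: drop the shared copy from the normal backing */
    static void drop_shared_copy(void *host_addr, size_t offset, size_t size)
    {
        madvise((char *)host_addr + offset, size, MADV_DONTNEED);
    }

    /* private -> shared: punch a hole in guest_memfd to free the private copy */
    static void drop_private_copy(int gmem_fd, off_t offset, off_t size)
    {
        fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, size);
    }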
Problem
=======
Device assignment in QEMU is implemented via the VFIO subsystem. For a
normal VM, all VM memory is pinned by VFIO at setup time. For a
confidential VM, the guest can convert memory at runtime, and when that
happens nothing currently tells VFIO that its mappings are stale. This
means that page conversion leaks memory and leaves stale IOMMU mappings.
For example, a sequence like the following can result in stale IOMMU
mappings:
1. allocate shared page
2. convert page shared->private
3. discard shared page
4. convert page private->shared
5. allocate shared page
6. issue DMA operations against that shared page
After step 3, VFIO is still pinning the page. However, DMA operations in
step 6 will hit the old mapping that was established in step 1, which
causes the device to access invalid data.
To avoid this problem, commit 852f0048f3 ("RAMBlock: make guest_memfd
require uncoordinated discard") currently blocks device assignment with
guest_memfd.
Solution
========
The key to enable shared device assignment is to solve the stale IOMMU
mappings problem.
Given the constraints and assumptions, here is a solution that satisfies
the use cases: RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Page conversion is similar to
hot-removing a page in one mode and adding it back in the other.
This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.
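For context, the sketch below shows the general shape of a
RamDiscardManager consumer such as VFIO. It is illustrative only: the
demo_* names are made up, and the real VFIO handlers live in
hw/vfio/common.c.

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    static int demo_notify_populate(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
    {
        /* Range became accessible (shared): establish the IOMMU mapping. */
        return 0;
    }

    static void demo_notify_discard(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
    {
        /* Range became inaccessible (private): tear down the IOMMU mapping. */
    }

    static RamDiscardListener demo_rdl;

    static void demo_register(MemoryRegionSection *section)
    {
        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);

        ram_discard_listener_init(&demo_rdl, demo_notify_populate,
                                  demo_notify_discard, true);
        /* Registration replays notify_populate() for currently populated
         * (i.e. shared) ranges. */
        ram_discard_manager_register_listener(rdm, &demo_rdl, section);
    }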
Another possible attempt [2] was to not discard shared pages in step 3
above. This was an incomplete band-aid because guests would consume
twice the memory since shared pages wouldn't be freed even after they
were converted to private.
Open
====
Implementing a RamDiscardManager to notify VFIO of page conversions
causes changes in semantics: private memory is treated as discarded (or
hot-removed) memory. This isn't aligned with the expectation of current
RamDiscardManager users (e.g. VFIO or live migration) who really
expect that discarded memory is hot-removed and thus can be skipped when
the users are processing guest memory. Treating private memory as
discarded won't work in the future if VFIO or live migration needs to
handle private memory, e.g. VFIO may need to map private memory to
support Trusted I/O, and live migration for confidential VMs needs to
migrate private memory.
There are two possible ways to mitigate the semantics changes.
1. Develop a new mechanism to notify the page conversions between
private and shared. For example, utilize the notifier_list in QEMU. VFIO
registers its own handler and gets notified upon page conversions. This
is a clean approach which only touches the notifier workflow. A
challenge is that, for device hotplug, existing shared memory would
still need to be mapped in the IOMMU, which requires additional changes.
2. Extend the existing RamDiscardManager interface to manage not only
the discarded/populated status of guest memory but also the
shared/private status. RamDiscardManager users like VFIO will be
notified with one more argument indicating what change is happening and
can take action accordingly (see the sketch below). It also has
challenges, e.g. QEMU allows only one RamDiscardManager per MemoryRegion,
so supporting virtio-mem for confidential VMs would be a problem. And
some APIs exposed by RamDiscardManager, like .is_populated(), are
meaningless for shared/private memory, so they may need some adjustments.
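To make option 2 concrete, here is a hypothetical sketch (names invented
for illustration, not part of this series) of how the listener callback
could carry the new state:

    /* Hypothetical extension, for illustration only. */
    typedef enum {
        RAM_STATE_SHARED,    /* maps to today's "populated" */
        RAM_STATE_PRIVATE,   /* maps to today's "discarded" */
    } RamState;

    typedef int (*NotifyRamStateChange)(RamDiscardListener *rdl,
                                        MemoryRegionSection *section,
                                        RamState new_state);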
Testing
=======
This patch series is tested based on the internal TDX KVM/QEMU tree.
To facilitate shared device assignment with a NIC, employ the legacy
VFIO type1 backend with the QEMU command:
qemu-system-x86_64 [...]
-device vfio-pci,host=XX:XX.X
The vfio_iommu_type1 dma_entry_limit parameter needs to be raised,
because in the worst case every 4 KiB shared page needs its own IOMMU
mapping. For example, a 16 GB guest needs roughly
16 GiB / 4 KiB = 4194304 entries, i.e.
vfio_iommu_type1.dma_entry_limit=4194304.
If using the iommufd-backed VFIO with the QEMU command:
qemu-system-x86_64 [...]
-object iommufd,id=iommufd0 \
-device vfio-pci,host=XX:XX.X,iommufd=iommufd0
No additional adjustment is required.
Following the bootup of the TD guest, the guest's IP address becomes
visible, and iperf is able to successfully send and receive data.
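For reference, a basic check of this kind could look as follows (iperf3
shown; addresses are examples):

    # on a host reachable from the guest
    iperf3 -s
    # inside the TD guest
    iperf3 -c 192.168.122.1 -t 30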
Related link
============
[1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
[2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
Chenyi Qiang (6):
guest_memfd: Introduce an object to manage the guest-memfd with
RamDiscardManager
guest_memfd: Introduce a helper to notify the shared/private state
change
KVM: Notify the state change via RamDiscardManager helper during
shared/private conversion
memory: Register the RamDiscardManager instance upon guest_memfd
creation
guest-memfd: Default to discarded (private) in guest_memfd_manager
RAMBlock: make guest_memfd require coordinate discard
accel/kvm/kvm-all.c | 7 +
include/sysemu/guest-memfd-manager.h | 49 +++
system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++
system/meson.build | 1 +
system/physmem.c | 11 +-
5 files changed, 492 insertions(+), 1 deletion(-)
create mode 100644 include/sysemu/guest-memfd-manager.h
create mode 100644 system/guest-memfd-manager.c
base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
--
2.43.5
* [RFC PATCH 1/6] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
@ 2024-07-25 7:21 ` Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 2/6] guest_memfd: Introduce a helper to notify the shared/private state change Chenyi Qiang
` (6 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-25 7:21 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
As commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated discard") highlighted, some subsystems such as VFIO may
disable RAM block discard. However, guest_memfd relies on the discard
operation to perform page conversion between private and shared memory.
This can lead to a stale IOMMU mapping issue when assigning a hardware
device to a confidential guest via shared memory (unprotected memory
pages). Blocking shared page discard can solve this problem, but it
could cause guests to consume twice the memory with VFIO, which is not
acceptable in some cases. An alternative solution is to notify other
systems, such as VFIO, so they can refresh their outdated IOMMU
mappings.
RamDiscardManager is an existing concept (used by virtio-mem) to adjust
VFIO mappings in relation to VM page assignment. Effectively, page
conversion is similar to hot-removing a page in one mode and adding it
back in the other, so the same work that needs to happen in response to
virtio-mem changes needs to happen for page conversion events. Introduce
RamDiscardManager to guest_memfd to achieve this.
However, implementing the RamDiscardManager interface poses a challenge,
as guest_memfd is not an object; instead, it is contained within a
RAMBlock and indicated by a RAM_GUEST_MEMFD flag upon creation.
One option is to implement the interface in HostMemoryBackend. Any
guest_memfd-backed host memory backend can register itself in the target
MemoryRegion. However, this solution doesn't cover the scenario where a
guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
the virtual BIOS MemoryRegion.
Thus, implement the second option, which involves defining an object type
named guest_memfd_manager with the RamDiscardManager interface. Upon
creation of guest_memfd, a new guest_memfd_manager object can be
instantiated and registered to the managed guest_memfd MemoryRegion to
handle the page conversion events.
In the context of guest_memfd, the discarded state signifies that the
page is private, while the populated state indicates that the page is
shared. The state of the memory is tracked at the granularity of the
host page size (i.e. block_size), as the minimum conversion size can be
one page per request. In addition, VFIO expects the DMA mapping for a
specific iova to be mapped and unmapped with the same granularity.
However, there is no guarantee that the confidential guest won't
partially convert pages. For instance, the confidential guest may flip a
2M page from private to shared and later flip the first 4K sub-range
from shared to private. To prevent such invalid cases, all operations
are performed at 4K granularity.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
include/sysemu/guest-memfd-manager.h | 46 +++++
system/guest-memfd-manager.c | 283 +++++++++++++++++++++++++++
system/meson.build | 1 +
3 files changed, 330 insertions(+)
create mode 100644 include/sysemu/guest-memfd-manager.h
create mode 100644 system/guest-memfd-manager.c
diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
new file mode 100644
index 0000000000..ab8c2ba362
--- /dev/null
+++ b/include/sysemu/guest-memfd-manager.h
@@ -0,0 +1,46 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ * Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
+#define SYSEMU_GUEST_MEMFD_MANAGER_H
+
+#include "sysemu/hostmem.h"
+
+#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
+
+OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
+
+struct GuestMemfdManager {
+ Object parent;
+
+ /* Managed memory region. */
+ MemoryRegion *mr;
+
+ /* bitmap used to track discard (private) memory */
+ int32_t discard_bitmap_size;
+ unsigned long *discard_bitmap;
+
+ /* block size and alignment */
+ uint64_t block_size;
+
+ /* listeners to notify on populate/discard activity. */
+ QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+struct GuestMemfdManagerClass {
+ ObjectClass parent_class;
+
+ void (*realize)(Object *gmm, MemoryRegion *mr, uint64_t region_size);
+};
+
+#endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
new file mode 100644
index 0000000000..7b90f26859
--- /dev/null
+++ b/system/guest-memfd-manager.c
@@ -0,0 +1,283 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ * Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "sysemu/guest-memfd-manager.h"
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
+ guest_memfd_manager,
+ GUEST_MEMFD_MANAGER,
+ OBJECT,
+ { TYPE_RAM_DISCARD_MANAGER },
+ { })
+
+static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
+ const MemoryRegionSection *section)
+{
+ const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ uint64_t first_bit = section->offset_within_region / gmm->block_size;
+ uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
+ unsigned long first_discard_bit;
+
+ first_discard_bit = find_next_bit(gmm->discard_bitmap, last_bit + 1, first_bit);
+ return first_discard_bit > last_bit;
+}
+
+static bool guest_memfd_rdm_intersect_memory_section(MemoryRegionSection *section,
+ uint64_t offset, uint64_t size)
+{
+ uint64_t start = MAX(section->offset_within_region, offset);
+ uint64_t end = MIN(section->offset_within_region + int128_get64(section->size),
+ offset + size);
+ if (end <= start) {
+ return false;
+ }
+
+ section->offset_within_address_space += start - section->offset_within_region;
+ section->offset_within_region = start;
+ section->size = int128_make64(end - start);
+
+ return true;
+}
+
+typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
+
+static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
+{
+ RamDiscardListener *rdl = arg;
+
+ return rdl->notify_populate(rdl, section);
+}
+
+static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
+{
+ RamDiscardListener *rdl = arg;
+
+ rdl->notify_discard(rdl, section);
+
+ return 0;
+}
+
+static int guest_memfd_for_each_populated_range(const GuestMemfdManager *gmm,
+ MemoryRegionSection *section,
+ void *arg,
+ guest_memfd_section_cb cb)
+{
+ unsigned long first_zero_bit, last_zero_bit;
+ uint64_t offset, size;
+ int ret = 0;
+
+ first_zero_bit = section->offset_within_region / gmm->block_size;
+ first_zero_bit = find_next_zero_bit(gmm->discard_bitmap, gmm->discard_bitmap_size,
+ first_zero_bit);
+
+ while (first_zero_bit < gmm->discard_bitmap_size) {
+ MemoryRegionSection tmp = *section;
+
+ offset = first_zero_bit * gmm->block_size;
+ last_zero_bit = find_next_bit(gmm->discard_bitmap, gmm->discard_bitmap_size,
+ first_zero_bit + 1) - 1;
+ size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
+
+ if (!guest_memfd_rdm_intersect_memory_section(&tmp, offset, size)) {
+ break;
+ }
+
+ ret = cb(&tmp, arg);
+ if (ret) {
+ break;
+ }
+
+ first_zero_bit = find_next_zero_bit(gmm->discard_bitmap, gmm->discard_bitmap_size,
+ last_zero_bit + 2);
+ }
+
+ return ret;
+}
+
+static int guest_memfd_for_each_discarded_range(const GuestMemfdManager *gmm,
+ MemoryRegionSection *section,
+ void *arg,
+ guest_memfd_section_cb cb)
+{
+ unsigned long first_one_bit, last_one_bit;
+ uint64_t offset, size;
+ int ret = 0;
+
+ first_one_bit = section->offset_within_region / gmm->block_size;
+ first_one_bit = find_next_bit(gmm->discard_bitmap, gmm->discard_bitmap_size,
+ first_one_bit);
+
+ while (first_one_bit < gmm->discard_bitmap_size) {
+ MemoryRegionSection tmp = *section;
+
+ offset = first_one_bit * gmm->block_size;
+ last_one_bit = find_next_zero_bit(gmm->discard_bitmap, gmm->discard_bitmap_size,
+ first_one_bit + 1) - 1;
+ size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
+
+ if (!guest_memfd_rdm_intersect_memory_section(&tmp, offset, size)) {
+ break;
+ }
+
+ ret = cb(&tmp, arg);
+ if (ret) {
+ break;
+ }
+
+ first_one_bit = find_next_bit(gmm->discard_bitmap, gmm->discard_bitmap_size,
+ last_one_bit + 2);
+ }
+
+ return ret;
+}
+
+static uint64_t guest_memfd_rdm_get_min_granularity(const RamDiscardManager *rdm,
+ const MemoryRegion *mr)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+
+ g_assert(mr == gmm->mr);
+ return gmm->block_size;
+}
+
+static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
+ RamDiscardListener *rdl,
+ MemoryRegionSection *section)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ int ret;
+
+ g_assert(section->mr == gmm->mr);
+ rdl->section = memory_region_section_new_copy(section);
+
+ QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
+
+ ret = guest_memfd_for_each_populated_range(gmm, section, rdl,
+ guest_memfd_notify_populate_cb);
+ if (ret) {
+ error_report("%s: Failed to register RAM discard listener: %s", __func__,
+ strerror(-ret));
+ }
+}
+
+static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
+ RamDiscardListener *rdl)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ int ret;
+
+ g_assert(rdl->section);
+ g_assert(rdl->section->mr == gmm->mr);
+
+ ret = guest_memfd_for_each_populated_range(gmm, rdl->section, rdl,
+ guest_memfd_notify_discard_cb);
+ if (ret) {
+ error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
+ strerror(-ret));
+ }
+
+ memory_region_section_free_copy(rdl->section);
+ rdl->section = NULL;
+ QLIST_REMOVE(rdl, next);
+
+}
+
+typedef struct GuestMemfdReplayData {
+ void *fn;
+ void *opaque;
+} GuestMemfdReplayData;
+
+static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
+{
+ struct GuestMemfdReplayData *data = arg;
+ ReplayRamPopulate replay_fn = data->fn;
+
+ return replay_fn(section, data->opaque);
+}
+
+static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
+ MemoryRegionSection *section,
+ ReplayRamPopulate replay_fn,
+ void *opaque)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+ g_assert(section->mr == gmm->mr);
+ return guest_memfd_for_each_populated_range(gmm, section, &data,
+ guest_memfd_rdm_replay_populated_cb);
+}
+
+static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
+{
+ struct GuestMemfdReplayData *data = arg;
+ ReplayRamDiscard replay_fn = data->fn;
+
+ replay_fn(section, data->opaque);
+
+ return 0;
+}
+
+static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
+ MemoryRegionSection *section,
+ ReplayRamDiscard replay_fn,
+ void *opaque)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+ g_assert(section->mr == gmm->mr);
+ guest_memfd_for_each_discarded_range(gmm, section, &data,
+ guest_memfd_rdm_replay_discarded_cb);
+}
+
+static void guest_memfd_manager_realize(Object *obj, MemoryRegion *mr,
+ uint64_t region_size)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
+ uint64_t bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm->block_size;
+
+ gmm->mr = mr;
+ gmm->discard_bitmap_size = bitmap_size;
+ gmm->discard_bitmap = bitmap_new(bitmap_size);
+}
+
+static void guest_memfd_manager_init(Object *obj)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
+
+ gmm->block_size = qemu_real_host_page_size();
+ QLIST_INIT(&gmm->rdl_list);
+}
+
+static void guest_memfd_manager_finalize(Object *obj)
+{
+ g_free(GUEST_MEMFD_MANAGER(obj)->discard_bitmap);
+}
+
+static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
+{
+ GuestMemfdManagerClass *gmmc = GUEST_MEMFD_MANAGER_CLASS(oc);
+ RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+
+ gmmc->realize = guest_memfd_manager_realize;
+
+ rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
+ rdmc->register_listener = guest_memfd_rdm_register_listener;
+ rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
+ rdmc->is_populated = guest_memfd_rdm_is_populated;
+ rdmc->replay_populated = guest_memfd_rdm_replay_populated;
+ rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
+}
diff --git a/system/meson.build b/system/meson.build
index a296270cb0..9b96d645ab 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -16,6 +16,7 @@ system_ss.add(files(
'dirtylimit.c',
'dma-helpers.c',
'globals.c',
+ 'guest-memfd-manager.c',
'memory_mapping.c',
'qdev-monitor.c',
'qtest.c',
--
2.43.5
* [RFC PATCH 2/6] guest_memfd: Introduce a helper to notify the shared/private state change
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 1/6] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
@ 2024-07-25 7:21 ` Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 3/6] KVM: Notify the state change via RamDiscardManager helper during shared/private conversion Chenyi Qiang
` (5 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-25 7:21 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
Introduce a helper function within RamDiscardManager to efficiently
notify all registered RamDiscardListeners, including VFIO listeners,
about memory conversion events between shared and private in
guest_memfd. The existing VFIO listener can dynamically DMA map/unmap
the shared pages based on the conversion type:
- For conversions from shared to private, the VFIO system ensures the
discarding of shared mapping from the IOMMU.
- For conversions from private to shared, it triggers the population of
the shared mapping into the IOMMU.
Additionally, there could be some special conversion requests:
- When a conversion request is made for a page already in the desired
state (either private or shared), the helper simply returns success.
- For requests involving a range partially in the desired state, only
the necessary segments are converted, ensuring the entire range
complies with the request efficiently.
- In scenarios where a conversion request is declined by other systems,
such as a failure from VFIO during notify_populate(), the helper will
roll back the request, maintaining consistency.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
include/sysemu/guest-memfd-manager.h | 3 +
system/guest-memfd-manager.c | 141 +++++++++++++++++++++++++++
2 files changed, 144 insertions(+)
diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
index ab8c2ba362..1cce4cde43 100644
--- a/include/sysemu/guest-memfd-manager.h
+++ b/include/sysemu/guest-memfd-manager.h
@@ -43,4 +43,7 @@ struct GuestMemfdManagerClass {
void (*realize)(Object *gmm, MemoryRegion *mr, uint64_t region_size);
};
+int guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset, uint64_t size,
+ bool shared_to_private);
+
#endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 7b90f26859..deb43db90b 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -243,6 +243,147 @@ static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
guest_memfd_rdm_replay_discarded_cb);
}
+static bool guest_memfd_is_valid_range(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ MemoryRegion *mr = gmm->mr;
+
+ g_assert(mr);
+
+ uint64_t region_size = memory_region_size(mr);
+ if (!QEMU_IS_ALIGNED(offset, gmm->block_size)) {
+ return false;
+ }
+ if (offset + size < offset || !size) {
+ return false;
+ }
+ if (offset >= region_size || offset + size > region_size) {
+ return false;
+ }
+ return true;
+}
+
+static void guest_memfd_notify_discard(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ RamDiscardListener *rdl;
+
+ QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl->section;
+
+ if (!guest_memfd_rdm_intersect_memory_section(&tmp, offset, size)) {
+ continue;
+ }
+
+ guest_memfd_for_each_populated_range(gmm, &tmp, rdl,
+ guest_memfd_notify_discard_cb);
+ }
+}
+
+
+static int guest_memfd_notify_populate(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ RamDiscardListener *rdl, *rdl2;
+ int ret = 0;
+
+ QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl->section;
+
+ if (!guest_memfd_rdm_intersect_memory_section(&tmp, offset, size)) {
+ continue;
+ }
+
+ ret = guest_memfd_for_each_discarded_range(gmm, &tmp, rdl,
+ guest_memfd_notify_populate_cb);
+ if (ret) {
+ break;
+ }
+ }
+
+ if (ret) {
+ /* Notify all already-notified listeners. */
+ QLIST_FOREACH(rdl2, &gmm->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl2->section;
+
+ if (rdl2 == rdl) {
+ break;
+ }
+ if (!guest_memfd_rdm_intersect_memory_section(&tmp, offset, size)) {
+ continue;
+ }
+
+ guest_memfd_for_each_discarded_range(gmm, &tmp, rdl2,
+ guest_memfd_notify_discard_cb);
+ }
+ }
+ return ret;
+}
+
+static bool guest_memfd_is_range_populated(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ const unsigned long first_bit = offset / gmm->block_size;
+ const unsigned long last_bit = first_bit + (size / gmm->block_size) - 1;
+ unsigned long found_bit;
+
+ /* We fake a shorter bitmap to avoid searching too far. */
+ found_bit = find_next_bit(gmm->discard_bitmap, last_bit + 1, first_bit);
+ return found_bit > last_bit;
+}
+
+static bool guest_memfd_is_range_discarded(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ const unsigned long first_bit = offset / gmm->block_size;
+ const unsigned long last_bit = first_bit + (size / gmm->block_size) - 1;
+ unsigned long found_bit;
+
+ /* We fake a shorter bitmap to avoid searching too far. */
+ found_bit = find_next_zero_bit(gmm->discard_bitmap, last_bit + 1, first_bit);
+ return found_bit > last_bit;
+}
+
+int guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset, uint64_t size,
+ bool shared_to_private)
+{
+ int ret = 0;
+
+ if (!guest_memfd_is_valid_range(gmm, offset, size)) {
+ error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+ __func__, offset, size);
+ return -1;
+ }
+
+ if ((shared_to_private && guest_memfd_is_range_discarded(gmm, offset, size)) ||
+ (!shared_to_private && guest_memfd_is_range_populated(gmm, offset, size))) {
+ return 0;
+ }
+
+ if (shared_to_private) {
+ guest_memfd_notify_discard(gmm, offset, size);
+ } else {
+ ret = guest_memfd_notify_populate(gmm, offset, size);
+ }
+
+ if (!ret) {
+ unsigned long first_bit = offset / gmm->block_size;
+ unsigned long nbits = size / gmm->block_size;
+
+ g_assert((first_bit + nbits) <= gmm->discard_bitmap_size);
+
+ if (shared_to_private) {
+ bitmap_set(gmm->discard_bitmap, first_bit, nbits);
+ } else {
+ bitmap_clear(gmm->discard_bitmap, first_bit, nbits);
+ }
+
+ return 0;
+ }
+
+ return ret;
+}
+
static void guest_memfd_manager_realize(Object *obj, MemoryRegion *mr,
uint64_t region_size)
{
--
2.43.5
* [RFC PATCH 3/6] KVM: Notify the state change via RamDiscardManager helper during shared/private conversion
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 1/6] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 2/6] guest_memfd: Introduce a helper to notify the shared/private state change Chenyi Qiang
@ 2024-07-25 7:21 ` Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 4/6] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
` (4 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-25 7:21 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
Once KVM exits to userspace to convert a page from private to shared or
vice versa at runtime, notify the state change via the
guest_memfd_state_change() helper so that other registered subsystems,
such as VFIO, can be notified.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
accel/kvm/kvm-all.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 854cb86b22..94bbbbd2de 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -48,6 +48,7 @@
#include "kvm-cpus.h"
#include "sysemu/dirtylimit.h"
#include "qemu/range.h"
+#include "sysemu/guest-memfd-manager.h"
#include "hw/boards.h"
#include "sysemu/stats.h"
@@ -2852,6 +2853,7 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
RAMBlock *rb;
void *addr;
int ret = -1;
+ GuestMemfdManager *gmm;
trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared");
@@ -2914,6 +2916,11 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
rb = qemu_ram_block_from_host(addr, false, &offset);
+ gmm = GUEST_MEMFD_MANAGER(mr->rdm);
+ if (gmm) {
+ guest_memfd_state_change(gmm, offset, size, to_private);
+ }
+
if (to_private) {
if (rb->page_size != qemu_real_host_page_size()) {
/*
--
2.43.5
* [RFC PATCH 4/6] memory: Register the RamDiscardManager instance upon guest_memfd creation
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
` (2 preceding siblings ...)
2024-07-25 7:21 ` [RFC PATCH 3/6] KVM: Notify the state change via RamDiscardManager helper during shared/private conversion Chenyi Qiang
@ 2024-07-25 7:21 ` Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 5/6] guest-memfd: Default to discarded (private) in guest_memfd_manager Chenyi Qiang
` (3 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-25 7:21 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
Instantiate a new guest_memfd_manager object and register it in the
target MemoryRegion. From this point, other subsystems such as VFIO can
register their listeners in guest_memfd_manager and receive conversion
events through RamDiscardManager.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
system/physmem.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/system/physmem.c b/system/physmem.c
index 33d09f7571..98072ae246 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -53,6 +53,7 @@
#include "sysemu/hostmem.h"
#include "sysemu/hw_accel.h"
#include "sysemu/xen-mapcache.h"
+#include "sysemu/guest-memfd-manager.h"
#include "trace/trace-root.h"
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1861,6 +1862,12 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
qemu_mutex_unlock_ramlist();
goto out_free;
}
+
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
+ GuestMemfdManagerClass *gmmc = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
+ g_assert(new_block->mr);
+ gmmc->realize(OBJECT(gmm), new_block->mr, new_block->mr->size);
+ memory_region_set_ram_discard_manager(gmm->mr, RAM_DISCARD_MANAGER(gmm));
}
new_ram_size = MAX(old_ram_size,
@@ -2118,6 +2125,8 @@ static void reclaim_ramblock(RAMBlock *block)
if (block->guest_memfd >= 0) {
close(block->guest_memfd);
+ g_assert(block->mr);
+ object_unref(OBJECT(block->mr->rdm));
ram_block_discard_require(false);
}
--
2.43.5
* [RFC PATCH 5/6] guest-memfd: Default to discarded (private) in guest_memfd_manager
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
` (3 preceding siblings ...)
2024-07-25 7:21 ` [RFC PATCH 4/6] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
@ 2024-07-25 7:21 ` Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 6/6] RAMBlock: make guest_memfd require coordinate discard Chenyi Qiang
` (2 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-25 7:21 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
guest_memfd memory was initially shared by default until commit
bd3bcf6962 ("kvm/memory: Make memory type private by default if it has
guest memfd backend"). To align with this change, the default state in
guest_memfd_manager is set to discarded.
One concern raised by this commit is the handling of the virtual BIOS.
The virtual BIOS loads its image into the shared memory of guest_memfd.
However, during the region_commit() stage, the memory attribute is
set to private while its shared memory remains valid. This mismatch
persists until the shared content is copied to the private region.
Fortunately, this interval only exists during the setup stage, and
currently only the guest_memfd_manager is concerned with the state of
the guest_memfd at that stage. For simplicity, the default bitmap in
guest_memfd_manager is set to discarded (private). This is feasible
because the shared content of the virtual BIOS will eventually be
discarded and there are no requests to DMA-access this shared part
during this period.
Additionally, setting the default to private can also reduce the
overhead of VFIO mapping shared pages into the IOMMU at the boot stage.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
system/guest-memfd-manager.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index deb43db90b..ad1a46bac4 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -393,6 +393,7 @@ static void guest_memfd_manager_realize(Object *obj, MemoryRegion *mr,
gmm->mr = mr;
gmm->discard_bitmap_size = bitmap_size;
gmm->discard_bitmap = bitmap_new(bitmap_size);
+ bitmap_fill(gmm->discard_bitmap, bitmap_size);
}
static void guest_memfd_manager_init(Object *obj)
--
2.43.5
* [RFC PATCH 6/6] RAMBlock: make guest_memfd require coordinate discard
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
` (4 preceding siblings ...)
2024-07-25 7:21 ` [RFC PATCH 5/6] guest-memfd: Default to discarded (private) in guest_memfd_manager Chenyi Qiang
@ 2024-07-25 7:21 ` Chenyi Qiang
2024-07-25 14:04 ` [RFC PATCH 0/6] Enable shared device assignment David Hildenbrand
2024-08-16 3:02 ` Chenyi Qiang
7 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-25 7:21 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
As guest_memfd is now managed by guest_memfd_manager with
RamDiscardManager, only block uncoordinated discard.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
system/physmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/system/physmem.c b/system/physmem.c
index 98072ae246..ffd68debf0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1849,7 +1849,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
assert(kvm_enabled());
assert(new_block->guest_memfd < 0);
- if (ram_block_discard_require(true) < 0) {
+ if (ram_block_coordinated_discard_require(true) < 0) {
error_setg_errno(errp, errno,
"cannot set up private guest memory: discard currently blocked");
error_append_hint(errp, "Are you using assigned devices?\n");
--
2.43.5
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
` (5 preceding siblings ...)
2024-07-25 7:21 ` [RFC PATCH 6/6] RAMBlock: make guest_memfd require coordinate discard Chenyi Qiang
@ 2024-07-25 14:04 ` David Hildenbrand
2024-07-26 5:02 ` Tian, Kevin
2024-07-26 6:20 ` Chenyi Qiang
2024-08-16 3:02 ` Chenyi Qiang
7 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-07-25 14:04 UTC (permalink / raw)
To: Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
> Open
> ====
> Implementing a RamDiscardManager to notify VFIO of page conversions
> causes changes in semantics: private memory is treated as discarded (or
> hot-removed) memory. This isn't aligned with the expectation of current
> RamDiscardManager users (e.g. VFIO or live migration) who really
> expect that discarded memory is hot-removed and thus can be skipped when
> the users are processing guest memory. Treating private memory as
> discarded won't work in future if VFIO or live migration needs to handle
> private memory. e.g. VFIO may need to map private memory to support
> Trusted IO and live migration for confidential VMs need to migrate
> private memory.
"VFIO may need to map private memory to support Trusted IO"
I've been told that the way we handle shared memory won't be the way
this is going to work with guest_memfd. KVM will coordinate directly
with VFIO or $whatever and update the IOMMU tables itself right in the
kernel; the pages are pinned/owned by guest_memfd, so that will just
work. So I don't consider that currently a concern. guest_memfd private
memory is not mapped into user page tables and as it currently seems it
never will be.
Similarly: live migration. We cannot simply migrate that memory the
traditional way. We even have to track the dirty state differently.
So IMHO, treating both memory as discarded == don't touch it the usual
way might actually be a feature not a bug ;)
>
> There are two possible ways to mitigate the semantics changes.
> 1. Develop a new mechanism to notify the page conversions between
> private and shared. For example, utilize the notifier_list in QEMU. VFIO
> registers its own handler and gets notified upon page conversions. This
> is a clean approach which only touches the notifier workflow. A
> challenge is that for device hotplug, existing shared memory should be
> mapped in IOMMU. This will need additional changes.
>
> 2. Extend the existing RamDiscardManager interface to manage not only
> the discarded/populated status of guest memory but also the
> shared/private status. RamDiscardManager users like VFIO will be
> notified with one more argument indicating what change is happening and
> can take action accordingly. It also has challenges e.g. QEMU allows
> only one RamDiscardManager, how to support virtio-mem for confidential
> VMs would be a problem. And some APIs like .is_populated() exposed by
> RamDiscardManager are meaningless to shared/private memory. So they may
> need some adjustments.
Think of all of that in terms of "shared memory is populated, private
memory is some inaccessible stuff that needs very special way and other
means for device assignment, live migration, etc.". Then it actually
quite makes sense to use of RamDiscardManager (AFAIKS :) ).
>
> Testing
> =======
> This patch series is tested based on the internal TDX KVM/QEMU tree.
>
> To facilitate shared device assignment with the NIC, employ the legacy
> type1 VFIO with the QEMU command:
>
> qemu-system-x86_64 [...]
> -device vfio-pci,host=XX:XX.X
>
> The parameter of dma_entry_limit needs to be adjusted. For example, a
> 16GB guest needs to adjust the parameter like
> vfio_iommu_type1.dma_entry_limit=4194304.
But here you note the biggest real issue I see (not related to
RAMDiscardManager, but that we have to prepare for conversion of each
possible private page to shared and back): we need a single IOMMU
mapping for each 4 KiB page.
Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
Does it even scale then?
There is the alternative of having in-place private/shared conversion
when we also let guest_memfd manage some shared memory. It has plenty of
downsides, but for the problem at hand it would mean that we don't
discard on shared/private conversion.
But whenever we want to convert memory shared->private we would
similarly have to unmap it from IOMMU page tables via VFIO. (the in-place
conversion will only be allowed if any additional references on a page
are gone -- when it is inaccessible by userspace/kernel).
Again, if IOMMU page tables would be managed by KVM in the kernel
without user space intervention/vfio this would work with device
assignment just fine. But I guess it will take a while until we actually
have that option.
--
Cheers,
David / dhildenb
* RE: [RFC PATCH 0/6] Enable shared device assignment
2024-07-25 14:04 ` [RFC PATCH 0/6] Enable shared device assignment David Hildenbrand
@ 2024-07-26 5:02 ` Tian, Kevin
2024-07-26 7:08 ` David Hildenbrand
2024-07-26 6:20 ` Chenyi Qiang
1 sibling, 1 reply; 22+ messages in thread
From: Tian, Kevin @ 2024-07-26 5:02 UTC (permalink / raw)
To: David Hildenbrand, Qiang, Chenyi, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams, Dan J,
Edgecombe, Rick P, Wang, Wei W, Peng, Chao P, Gao, Chao, Wu, Hao,
Xu, Yilun
> From: David Hildenbrand <david@redhat.com>
> Sent: Thursday, July 25, 2024 10:04 PM
>
> > Open
> > ====
> > Implementing a RamDiscardManager to notify VFIO of page conversions
> > causes changes in semantics: private memory is treated as discarded (or
> > hot-removed) memory. This isn't aligned with the expectation of current
> > RamDiscardManager users (e.g. VFIO or live migration) who really
> > expect that discarded memory is hot-removed and thus can be skipped
> when
> > the users are processing guest memory. Treating private memory as
> > discarded won't work in future if VFIO or live migration needs to handle
> > private memory. e.g. VFIO may need to map private memory to support
> > Trusted IO and live migration for confidential VMs need to migrate
> > private memory.
>
> "VFIO may need to map private memory to support Trusted IO"
>
> I've been told that the way we handle shared memory won't be the way
> this is going to work with guest_memfd. KVM will coordinate directly
> with VFIO or $whatever and update the IOMMU tables itself right in the
> kernel; the pages are pinned/owned by guest_memfd, so that will just
> work. So I don't consider that currently a concern. guest_memfd private
> memory is not mapped into user page tables and as it currently seems it
> never will be.
Or could extend MAP_DMA to accept guest_memfd+offset in place of
'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
the pinned pfn.
IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs
to manage the mapping of the private memory instead of the use of
guest_memfd.
e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP)
to check the HPA after the IOMMU walks the existing I/O page tables.
So reasonably VFIO/IOMMUFD could continue to manage those I/O
page tables including both private and shared memory, with a hint to
know where to find the pfn (host page table or guest_memfd).
But TDX Connect introduces a new I/O page table format (same as secure
EPT) for mapping the private memory and further requires sharing the
secure-EPT between CPU/IOMMU for private. Then it appears to be
a different story.
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-25 14:04 ` [RFC PATCH 0/6] Enable shared device assignment David Hildenbrand
2024-07-26 5:02 ` Tian, Kevin
@ 2024-07-26 6:20 ` Chenyi Qiang
2024-07-26 7:20 ` David Hildenbrand
1 sibling, 1 reply; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-26 6:20 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>> Open
>> ====
>> Implementing a RamDiscardManager to notify VFIO of page conversions
>> causes changes in semantics: private memory is treated as discarded (or
>> hot-removed) memory. This isn't aligned with the expectation of current
>> RamDiscardManager users (e.g. VFIO or live migration) who really
>> expect that discarded memory is hot-removed and thus can be skipped when
>> the users are processing guest memory. Treating private memory as
>> discarded won't work in future if VFIO or live migration needs to handle
>> private memory. e.g. VFIO may need to map private memory to support
>> Trusted IO and live migration for confidential VMs need to migrate
>> private memory.
>
> "VFIO may need to map private memory to support Trusted IO"
>
> I've been told that the way we handle shared memory won't be the way
> this is going to work with guest_memfd. KVM will coordinate directly
> with VFIO or $whatever and update the IOMMU tables itself right in the
> kernel; the pages are pinned/owned by guest_memfd, so that will just
> work. So I don't consider that currently a concern. guest_memfd private
> memory is not mapped into user page tables and as it currently seems it
> never will be.
That's correct. AFAIK, some TEE IO solutions like TDX Connect would let
the kernel coordinate and update private mappings in IOMMU tables. Here,
it mentions that VFIO "may" need to map private memory. I want to make
this more generic to account for potential future TEE IO solutions that
may require such functionality. :)
>
> Similarly: live migration. We cannot simply migrate that memory the
> traditional way. We even have to track the dirty state differently.
>
> So IMHO, treating both memory as discarded == don't touch it the usual
> way might actually be a feature not a bug ;)
Do you mean treating the private memory in both VFIO and live migration
as discarded? That is what this patch series does. And as you mentioned,
these RDM users cannot follow the traditional RDM way. Because of this,
we also considered whether we should use RDM or a more generic mechanism
like notifier_list below.
>
>>
>> There are two possible ways to mitigate the semantics changes.
>> 1. Develop a new mechanism to notify the page conversions between
>> private and shared. For example, utilize the notifier_list in QEMU. VFIO
>> registers its own handler and gets notified upon page conversions. This
>> is a clean approach which only touches the notifier workflow. A
>> challenge is that for device hotplug, existing shared memory should be
>> mapped in IOMMU. This will need additional changes.
>>
>> 2. Extend the existing RamDiscardManager interface to manage not only
>> the discarded/populated status of guest memory but also the
>> shared/private status. RamDiscardManager users like VFIO will be
>> notified with one more argument indicating what change is happening and
>> can take action accordingly. It also has challenges e.g. QEMU allows
>> only one RamDiscardManager, how to support virtio-mem for confidential
>> VMs would be a problem. And some APIs like .is_populated() exposed by
>> RamDiscardManager are meaningless to shared/private memory. So they may
>> need some adjustments.
>
> Think of all of that in terms of "shared memory is populated, private
> memory is some inaccessible stuff that needs very special way and other
> means for device assignment, live migration, etc.". Then it actually
> quite makes sense to use of RamDiscardManager (AFAIKS :) ).
Yes, such a notification mechanism is what we want. But for the users of
RDM, it would require additional changes accordingly. Current users just
skip inaccessible stuff, but in the private memory case, it can't simply
be skipped. Maybe renaming RamDiscardManager to RamStateManager would be
more accurate then. :)
>
>>
>> Testing
>> =======
>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>
>> To facilitate shared device assignment with the NIC, employ the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>> -device vfio-pci,host=XX:XX.X
>>
>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>> 16GB guest needs to adjust the parameter like
>> vfio_iommu_type1.dma_entry_limit=4194304.
>
> But here you note the biggest real issue I see (not related to
> RAMDiscardManager, but that we have to prepare for conversion of each
> possible private page to shared and back): we need a single IOMMU
> mapping for each 4 KiB page.
>
> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
> Does it even scale then?
The entry limit needs to be increased as the guest memory size
increases. For this issue, are you concerned that having too many
entries might bring some performance issue? Maybe we could introduce
some PV mechanism to coordinate with the guest to convert memory only at
2M granularity. This may help mitigate the problem.
>
>
> There is the alternative of having in-place private/shared conversion
> when we also let guest_memfd manage some shared memory. It has plenty of
> downsides, but for the problem at hand it would mean that we don't
> discard on shared/private conversion.
>
> But whenever we want to convert memory shared->private we would
> similarly have to unmap it from IOMMU page tables via VFIO. (the in-place
> conversion will only be allowed if any additional references on a page
> are gone -- when it is inaccessible by userspace/kernel).
I'm not clear about this in-place private/shared conversion. Can you
elaborate a little bit? It seems this alternative changes private and
shared management in current guest_memfd?
>
> Again, if IOMMU page tables would be managed by KVM in the kernel
> without user space intervention/vfio this would work with device
> assignment just fine. But I guess it will take a while until we actually
> have that option.
>
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-26 5:02 ` Tian, Kevin
@ 2024-07-26 7:08 ` David Hildenbrand
2024-07-31 7:12 ` Xu Yilun
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-07-26 7:08 UTC (permalink / raw)
To: Tian, Kevin, Qiang, Chenyi, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams, Dan J,
Edgecombe, Rick P, Wang, Wei W, Peng, Chao P, Gao, Chao, Wu, Hao,
Xu, Yilun
On 26.07.24 07:02, Tian, Kevin wrote:
>> From: David Hildenbrand <david@redhat.com>
>> Sent: Thursday, July 25, 2024 10:04 PM
>>
>>> Open
>>> ====
>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>> causes changes in semantics: private memory is treated as discarded (or
>>> hot-removed) memory. This isn't aligned with the expectation of current
>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>> expect that discarded memory is hot-removed and thus can be skipped
>> when
>>> the users are processing guest memory. Treating private memory as
>>> discarded won't work in future if VFIO or live migration needs to handle
>>> private memory. e.g. VFIO may need to map private memory to support
>>> Trusted IO and live migration for confidential VMs need to migrate
>>> private memory.
>>
>> "VFIO may need to map private memory to support Trusted IO"
>>
>> I've been told that the way we handle shared memory won't be the way
>> this is going to work with guest_memfd. KVM will coordinate directly
>> with VFIO or $whatever and update the IOMMU tables itself right in the
>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>> work. So I don't consider that currently a concern. guest_memfd private
>> memory is not mapped into user page tables and as it currently seems it
>> never will be.
>
> Or could extend MAP_DMA to accept guest_memfd+offset in place of
> 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
> the pinned pfn.
In theory yes, and I've been thinking of the same for a while. Until
people told me that it is unlikely that it will work that way in the future.
>
> IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs
> to manage the mapping of the private memory instead of the use of
> guest_memfd.
>
> e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP)
> to check the HPA after the IOMMU walks the existing I/O page tables.
> So reasonably VFIO/IOMMUFD could continue to manage those I/O
> page tables including both private and shared memory, with a hint to
> know where to find the pfn (host page table or guest_memfd).
>
> But TDX Connect introduces a new I/O page table format (same as secure
> EPT) for mapping the private memory and further requires sharing the
> secure-EPT between CPU/IOMMU for private. Then it appears to be
> a different story.
Yes. This seems to be the future and more in-line with
in-place/in-kernel conversion as e.g., pKVM wants to have it. If you
want to avoid user space altogether when doing shared<->private
conversions, then letting user space manage the IOMMUs is not going to work.
If we ever have to go down that path (MAP_DMA of guest_memfd), we could
have two RAMDiscardManager for a RAM region, just like we have two
memory backends: one for shared memory populate/discard (what this
series tries to achieve), one for private memory populate/discard.
The thing is, that private memory will always have to be special-cased
all over the place either way, unfortunately.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-26 6:20 ` Chenyi Qiang
@ 2024-07-26 7:20 ` David Hildenbrand
2024-07-26 10:56 ` Chenyi Qiang
2024-08-01 7:32 ` Yin, Fengwei
0 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-07-26 7:20 UTC (permalink / raw)
To: Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
On 26.07.24 08:20, Chenyi Qiang wrote:
>
>
> On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>>> Open
>>> ====
>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>> causes changes in semantics: private memory is treated as discarded (or
>>> hot-removed) memory. This isn't aligned with the expectation of current
>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>> expect that discarded memory is hot-removed and thus can be skipped when
>>> the users are processing guest memory. Treating private memory as
>>> discarded won't work in future if VFIO or live migration needs to handle
>>> private memory. e.g. VFIO may need to map private memory to support
>>> Trusted IO and live migration for confidential VMs need to migrate
>>> private memory.
>>
>> "VFIO may need to map private memory to support Trusted IO"
>>
>> I've been told that the way we handle shared memory won't be the way
>> this is going to work with guest_memfd. KVM will coordinate directly
>> with VFIO or $whatever and update the IOMMU tables itself right in the
>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>> work. So I don't consider that currently a concern. guest_memfd private
>> memory is not mapped into user page tables and as it currently seems it
>> never will be.
>
> That's correct. AFAIK, some TEE IO solution like TDX Connect would let
> kernel coordinate and update private mapping in IOMMU tables. Here, It
> mentions that VFIO "may" need map private memory. I want to make this
> more generic to account for potential future TEE IO solutions that may
> require such functionality. :)
Careful to not over-engineer something that is not even real or
close-to-being-real yet, though. :) Nobody really knows what that will
look like, besides that we know for Intel that we won't need that.
>
>>
>> Similarly: live migration. We cannot simply migrate that memory the
>> traditional way. We even have to track the dirty state differently.
>>
>> So IMHO, treating both memory as discarded == don't touch it the usual
>> way might actually be a feature not a bug ;)
>
> Do you mean treating the private memory in both VFIO and live migration
> as discarded? That is what this patch series does. And as you mentioned,
> these RDM users cannot follow the traditional RDM way. Because of this,
> we also considered whether we should use RDM or a more generic mechanism
> like notifier_list below.
Yes, the shared memory is logically discarded. At the same time we
*might* get private memory effectively populated. See my reply to Kevin
that there might be ways of having shared vs. private populate/discard
in the future, if required. Just some idea, though.
>
>>
>>>
>>> There are two possible ways to mitigate the semantics changes.
>>> 1. Develop a new mechanism to notify the page conversions between
>>> private and shared. For example, utilize the notifier_list in QEMU. VFIO
>>> registers its own handler and gets notified upon page conversions. This
>>> is a clean approach which only touches the notifier workflow. A
>>> challenge is that for device hotplug, existing shared memory should be
>>> mapped in IOMMU. This will need additional changes.
>>>
>>> 2. Extend the existing RamDiscardManager interface to manage not only
>>> the discarded/populated status of guest memory but also the
>>> shared/private status. RamDiscardManager users like VFIO will be
>>> notified with one more argument indicating what change is happening and
>>> can take action accordingly. It also has challenges e.g. QEMU allows
>>> only one RamDiscardManager, how to support virtio-mem for confidential
>>> VMs would be a problem. And some APIs like .is_populated() exposed by
>>> RamDiscardManager are meaningless to shared/private memory. So they may
>>> need some adjustments.
>>
>> Think of all of that in terms of "shared memory is populated, private
>> memory is some inaccessible stuff that needs a very special way and other
>> means for device assignment, live migration, etc.". Then it actually
>> makes quite a lot of sense to use RamDiscardManager (AFAIKS :) ).
>
> Yes, such a notification mechanism is what we want. But for the users of
> RDM, it would require additional changes accordingly. Current users just
> skip inaccessible stuff, but in the private memory case, it can't be simply
> skipped. Maybe renaming RamDiscardManager to RamStateManager is more
> accurate then. :)
Current users must skip it, yes. How private memory would have to be
handled, and who would handle it, is rather unclear.
Again, maybe we'd want separate RamDiscardManager for private and shared
memory (after all, these are two separate memory backends).
Not sure that "RamStateManager" terminology would be reasonable in that
approach.
>
>>
>>>
>>> Testing
>>> =======
>>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>>
>>> To facilitate shared device assignment with the NIC, employ the legacy
>>> type1 VFIO with the QEMU command:
>>>
>>> qemu-system-x86_64 [...]
>>> -device vfio-pci,host=XX:XX.X
>>>
>>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>>> 16GB guest needs to adjust the parameter like
>>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> But here you note the biggest real issue I see (not related to
>> RAMDiscardManager, but that we have to prepare for conversion of each
>> possible private page to shared and back): we need a single IOMMU
>> mapping for each 4 KiB page.
>>
>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
>> Does it even scale then?
>
> The entry limitation needs to be increased as the guest memory size
> increases. For this issue, are you concerned that having too many
> entries might bring some performance issue? Maybe we could introduce
> some PV mechanism to coordinate with guest to convert memory only in 2M
> granularity. This may help mitigate the problem.
I've had this talk with Intel, because the 4K granularity is a pain. I
was told that ship has sailed ... and we have to cope with random 4K
conversions :(
The many mappings will likely add both memory and runtime overheads in
the kernel. But we only know once we measure.
Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB
of shared memory :/
>
>>
>>
>> There is the alternative of having in-place private/shared conversion
>> when we also let guest_memfd manage some shared memory. It has plenty of
>> downsides, but for the problem at hand it would mean that we don't
>> discard on shared/private conversion.
>> But whenever we want to convert memory shared->private we would
>> similarly have to unmap it from the IOMMU page tables via VFIO. (the in-place
>> conversion will only be allowed if any additional references on a page
>> are gone -- when it is inaccessible by userspace/kernel).
>
> I'm not clear about this in-place private/shared conversion. Can you
> elaborate a little bit? It seems this alternative changes private and
> shared management in current guest_memfd?
Yes, there have been discussions about that, also in the context of
supporting huge pages while allowing for the guest to still convert
individual 4K chunks ...
A summary is here [1]. Likely more things will be covered at Linux Plumbers.
[1]
https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-26 7:20 ` David Hildenbrand
@ 2024-07-26 10:56 ` Chenyi Qiang
2024-07-31 11:18 ` David Hildenbrand
2024-08-01 7:32 ` Yin, Fengwei
1 sibling, 1 reply; 22+ messages in thread
From: Chenyi Qiang @ 2024-07-26 10:56 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
On 7/26/2024 3:20 PM, David Hildenbrand wrote:
> On 26.07.24 08:20, Chenyi Qiang wrote:
>>
>>
>> On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>>>> Open
>>>> ====
>>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>>> causes changes in semantics: private memory is treated as discarded (or
>>>> hot-removed) memory. This isn't aligned with the expectation of current
>>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>>> expect that discarded memory is hot-removed and thus can be skipped
>>>> when
>>>> the users are processing guest memory. Treating private memory as
>>>> discarded won't work in future if VFIO or live migration needs to
>>>> handle
>>>> private memory. e.g. VFIO may need to map private memory to support
>>>> Trusted IO and live migration for confidential VMs need to migrate
>>>> private memory.
>>>
>>> "VFIO may need to map private memory to support Trusted IO"
>>>
>>> I've been told that the way we handle shared memory won't be the way
>>> this is going to work with guest_memfd. KVM will coordinate directly
>>> with VFIO or $whatever and update the IOMMU tables itself right in the
>>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>>> work. So I don't consider that currently a concern. guest_memfd private
>>> memory is not mapped into user page tables and as it currently seems it
>>> never will be.
>>
>> That's correct. AFAIK, some TEE IO solutions like TDX Connect would let
>> the kernel coordinate and update private mappings in IOMMU tables. Here, it
>> mentions that VFIO "may" need to map private memory. I want to make this
>> more generic to account for potential future TEE IO solutions that may
>> require such functionality. :)
>
> Careful to not over-engineer something that is not even real or
> close-to-be-real yet, though. :) Nobody really knows what that will look
> like, besides that we know for Intel that we won't need that.
OK, Thanks for the reminder!
>
>>
>>>
>>> Similarly: live migration. We cannot simply migrate that memory the
>>> traditional way. We even have to track the dirty state differently.
>>>
>>> So IMHO, treating both memory as discarded == don't touch it the usual
>>> way might actually be a feature not a bug ;)
>>
>> Do you mean treating the private memory in both VFIO and live migration
>> as discarded? That is what this patch series does. And as you mentioned,
>> these RDM users cannot follow the traditional RDM way. Because of this,
>> we also considered whether we should use RDM or a more generic mechanism
>> like notifier_list below.
>
> Yes, the shared memory is logically discarded. At the same time we
> *might* get private memory effectively populated. See my reply to Kevin
> that there might be ways of having shared vs. private populate/discard
> in the future, if required. Just some idea, though.
>
>>
>>>
>>>>
>>>> There are two possible ways to mitigate the semantics changes.
>>>> 1. Develop a new mechanism to notify the page conversions between
>>>> private and shared. For example, utilize the notifier_list in QEMU.
>>>> VFIO
>>>> registers its own handler and gets notified upon page conversions. This
>>>> is a clean approach which only touches the notifier workflow. A
>>>> challenge is that for device hotplug, existing shared memory should be
>>>> mapped in IOMMU. This will need additional changes.
>>>>
>>>> 2. Extend the existing RamDiscardManager interface to manage not only
>>>> the discarded/populated status of guest memory but also the
>>>> shared/private status. RamDiscardManager users like VFIO will be
>>>> notified with one more argument indicating what change is happening and
>>>> can take action accordingly. It also has challenges e.g. QEMU allows
>>>> only one RamDiscardManager, how to support virtio-mem for confidential
>>>> VMs would be a problem. And some APIs like .is_populated() exposed by
>>>> RamDiscardManager are meaningless to shared/private memory. So they may
>>>> need some adjustments.
>>>
>>> Think of all of that in terms of "shared memory is populated, private
>>> memory is some inaccessible stuff that needs a very special way and other
>>> means for device assignment, live migration, etc.". Then it actually
>>> makes quite a lot of sense to use RamDiscardManager (AFAIKS :) ).
>>
>> Yes, such a notification mechanism is what we want. But for the users of
>> RDM, it would require additional changes accordingly. Current users just
>> skip inaccessible stuff, but in the private memory case, it can't be simply
>> skipped. Maybe renaming RamDiscardManager to RamStateManager is more
>> accurate then. :)
>
> Current users must skip it, yes. How private memory would have to be
> handled, and who would handle it, is rather unclear.
>
> Again, maybe we'd want separate RamDiscardManager for private and shared
> memory (after all, these are two separate memory backends).
We also considered distinguishing the populate and discard operations for
private and shared memory separately. As in method 2 above, we mentioned
adding a new argument to indicate the memory attribute to operate on.
The two ideas seem similar; a rough sketch follows below.
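A sketch of what we had in mind (the enum and callback names are made up
for illustration; this is not the existing RamDiscardManager API):

typedef struct RamDiscardListener RamDiscardListener;   /* existing QEMU type */
typedef struct MemoryRegionSection MemoryRegionSection; /* existing QEMU type */

typedef enum {
    RAM_ATTR_SHARED,
    RAM_ATTR_PRIVATE,
} RamMemoryAttribute;

/* VFIO would map on (RAM_ATTR_SHARED, populate) and unmap on
 * (RAM_ATTR_SHARED, discard), and could ignore RAM_ATTR_PRIVATE events
 * until private memory becomes something it can handle. */
typedef int (*NotifyRamStateChange)(RamDiscardListener *rdl,
                                    MemoryRegionSection *section,
                                    RamMemoryAttribute attr,
                                    bool populate);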
>
> Not sure that "RamStateManager" terminology would be reasonable in that
> approach.
>
>>
>>>
>>>>
>>>> Testing
>>>> =======
>>>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>>>
>>>> To facilitate shared device assignment with the NIC, employ the legacy
>>>> type1 VFIO with the QEMU command:
>>>>
>>>> qemu-system-x86_64 [...]
>>>> -device vfio-pci,host=XX:XX.X
>>>>
>>>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>>>> 16GB guest needs to adjust the parameter like
>>>> vfio_iommu_type1.dma_entry_limit=4194304.
>>>
>>> But here you note the biggest real issue I see (not related to
>>> RAMDiscardManager, but that we have to prepare for conversion of each
>>> possible private page to shared and back): we need a single IOMMU
>>> mapping for each 4 KiB page.
>>>
>>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
>>> Does it even scale then?
>>
>> The entry limitation needs to be increased as the guest memory size
>> increases. For this issue, are you concerned that having too many
>> entries might bring some performance issue? Maybe we could introduce
>> some PV mechanism to coordinate with guest to convert memory only in 2M
>> granularity. This may help mitigate the problem.
>
> I've had this talk with Intel, because the 4K granularity is a pain. I
> was told that ship has sailed ... and we have to cope with random 4K
> conversions :(
>
> The many mappings will likely add both memory and runtime overheads in
> the kernel. But we only know once we measure.
In the normal case, the main runtime overhead comes from
private<->shared flips in SWIOTLB, which defaults to 6% of memory with a
maximum of 1GByte. I think this overhead is acceptable. In the non-default
case, e.g. with dynamically allocated DMA buffers, the runtime overhead will
increase. As for the memory overhead, it is indeed unavoidable.
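As a rough back-of-envelope for the default case (numbers are only
illustrative): 6% of a 16GB guest is already ~1GB, so guests of roughly
17GB and larger all get the same ~1GB SWIOTLB bounce buffer by default,
i.e. the default-case conversion overhead does not keep growing with
guest size.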
Will these performance issues be a deal breaker for enabling shared
device assignment in this way?
>
> Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB
> of shared memory :/
>
>>
>>>
>>>
>>> There is the alternative of having in-place private/shared conversion
>>> when we also let guest_memfd manage some shared memory. It has plenty of
>>> downsides, but for the problem at hand it would mean that we don't
>>> discard on shared/private conversion.
>>> But whenever we want to convert memory shared->private we would
>>> similarly have to unmap it from the IOMMU page tables via VFIO. (the in-place
>>> conversion will only be allowed if any additional references on a page
>>> are gone -- when it is inaccessible by userspace/kernel).
>>
>> I'm not clear about this in-place private/shared conversion. Can you
>> elaborate a little bit? It seems this alternative changes private and
>> shared management in current guest_memfd?
>
> Yes, there have been discussions about that, also in the context of
> supporting huge pages while allowing for the guest to still convert
> individual 4K chunks ...
>
> A summary is here [1]. Likely more things will be covered at Linux
> Plumbers.
>
>
> [1]
> https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
>
Thanks for sharing.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-26 7:08 ` David Hildenbrand
@ 2024-07-31 7:12 ` Xu Yilun
2024-07-31 11:05 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: Xu Yilun @ 2024-07-31 7:12 UTC (permalink / raw)
To: David Hildenbrand
Cc: Tian, Kevin, Qiang, Chenyi, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel@nongnu.org,
kvm@vger.kernel.org, Williams, Dan J, Edgecombe, Rick P,
Wang, Wei W, Peng, Chao P, Gao, Chao, Wu, Hao, Xu, Yilun
On Fri, Jul 26, 2024 at 09:08:51AM +0200, David Hildenbrand wrote:
> On 26.07.24 07:02, Tian, Kevin wrote:
> > > From: David Hildenbrand <david@redhat.com>
> > > Sent: Thursday, July 25, 2024 10:04 PM
> > >
> > > > Open
> > > > ====
> > > > Implementing a RamDiscardManager to notify VFIO of page conversions
> > > > causes changes in semantics: private memory is treated as discarded (or
> > > > hot-removed) memory. This isn't aligned with the expectation of current
> > > > RamDiscardManager users (e.g. VFIO or live migration) who really
> > > > expect that discarded memory is hot-removed and thus can be skipped
> > > when
> > > > the users are processing guest memory. Treating private memory as
> > > > discarded won't work in future if VFIO or live migration needs to handle
> > > > private memory. e.g. VFIO may need to map private memory to support
> > > > Trusted IO and live migration for confidential VMs need to migrate
> > > > private memory.
> > >
> > > "VFIO may need to map private memory to support Trusted IO"
> > >
> > > I've been told that the way we handle shared memory won't be the way
> > > this is going to work with guest_memfd. KVM will coordinate directly
> > > with VFIO or $whatever and update the IOMMU tables itself right in the
> > > kernel; the pages are pinned/owned by guest_memfd, so that will just
> > > work. So I don't consider that currently a concern. guest_memfd private
> > > memory is not mapped into user page tables and as it currently seems it
> > > never will be.
> >
> > Or could extend MAP_DMA to accept guest_memfd+offset in place of
With TIO, I can imagine several buffer sharing requirements: KVM maps VFIO
owned private MMIO, IOMMU maps gmem owned private memory, IOMMU maps VFIO
owned private MMIO. These buffers cannot be found via the user page tables
anymore. I'm wondering whether it would be messy to have a specific PFN
finding method for each FD type. Is it possible to have a unified way for
buffer sharing and PFN finding? Is dma-buf a candidate?
> > 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
> > the pinned pfn.
>
> In theory yes, and I've been thinking of the same for a while. Until people
> told me that it is unlikely that it will work that way in the future.
Could you help specify why it won't work? As Kevin mentioned below, SEV-TIO
may still allow userspace to manage the IOMMU mapping for private. I'm
not sure how they map private memory for IOMMU without touching gmemfd.
Thanks,
Yilun
>
> >
> > IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs
> > to manage the mapping of the private memory instead of the use of
> > guest_memfd.
> >
> > e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP)
> > to check the HPA after the IOMMU walks the existing I/O page tables.
> > So reasonably VFIO/IOMMUFD could continue to manage those I/O
> > page tables including both private and shared memory, with a hint to
> > know where to find the pfn (host page table or guest_memfd).
> >
> > But TDX Connect introduces a new I/O page table format (same as secure
> > EPT) for mapping the private memory and further requires sharing the
> > secure-EPT between CPU/IOMMU for private. Then it appears to be
> > a different story.
>
> Yes. This seems to be the future and more in-line with in-place/in-kernel
> conversion as e.g., pKVM wants to have it. If you want to avoid user space
> altogether when doing shared<->private conversions, then letting user space
> manage the IOMMUs is not going to work.
>
>
> If we ever have to go down that path (MAP_DMA of guest_memfd), we could have
> two RAMDiscardManager for a RAM region, just like we have two memory
> backends: one for shared memory populate/discard (what this series tries to
> achieve), one for private memory populate/discard.
>
> The thing is, that private memory will always have to be special-cased all
> over the place either way, unfortunately.
>
> --
> Cheers,
>
> David / dhildenb
>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-31 7:12 ` Xu Yilun
@ 2024-07-31 11:05 ` David Hildenbrand
0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-07-31 11:05 UTC (permalink / raw)
To: Xu Yilun
Cc: Tian, Kevin, Qiang, Chenyi, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel@nongnu.org,
kvm@vger.kernel.org, Williams, Dan J, Edgecombe, Rick P,
Wang, Wei W, Peng, Chao P, Gao, Chao, Wu, Hao, Xu, Yilun
On 31.07.24 09:12, Xu Yilun wrote:
> On Fri, Jul 26, 2024 at 09:08:51AM +0200, David Hildenbrand wrote:
>> On 26.07.24 07:02, Tian, Kevin wrote:
>>>> From: David Hildenbrand <david@redhat.com>
>>>> Sent: Thursday, July 25, 2024 10:04 PM
>>>>
>>>>> Open
>>>>> ====
>>>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>>>> causes changes in semantics: private memory is treated as discarded (or
>>>>> hot-removed) memory. This isn't aligned with the expectation of current
>>>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>>>> expect that discarded memory is hot-removed and thus can be skipped
>>>> when
>>>>> the users are processing guest memory. Treating private memory as
>>>>> discarded won't work in future if VFIO or live migration needs to handle
>>>>> private memory. e.g. VFIO may need to map private memory to support
>>>>> Trusted IO and live migration for confidential VMs need to migrate
>>>>> private memory.
>>>>
>>>> "VFIO may need to map private memory to support Trusted IO"
>>>>
>>>> I've been told that the way we handle shared memory won't be the way
>>>> this is going to work with guest_memfd. KVM will coordinate directly
>>>> with VFIO or $whatever and update the IOMMU tables itself right in the
>>>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>>>> work. So I don't consider that currently a concern. guest_memfd private
>>>> memory is not mapped into user page tables and as it currently seems it
>>>> never will be.
>>>
>>> Or could extend MAP_DMA to accept guest_memfd+offset in place of
>
> With TIO, I can imagine several buffer sharing requirements: KVM maps VFIO
> owned private MMIO, IOMMU maps gmem owned private memory, IOMMU maps VFIO
> owned private MMIO. These buffers cannot be found via the user page tables
> anymore. I'm wondering whether it would be messy to have a specific PFN
> finding method for each FD type. Is it possible to have a unified way for
> buffer sharing and PFN finding? Is dma-buf a candidate?
No expert on that, so I'm afraid I can't help.
>
>>> 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
>>> the pinned pfn.
>>
>> In theory yes, and I've been thinking of the same for a while. Until people
>> told me that it is unlikely that it will work that way in the future.
>
> Could you help specify why it won't work? As Kevin mentioned below, SEV-TIO
> may still allow userspace to manage the IOMMU mapping for private. I'm
> not sure how they map private memory for IOMMU without touching gmemfd.
I raised that question in [1]:
"How would the device be able to grab/access "private memory", if not
via the user page tables?"
Jason summarized it as "The approaches I'm aware of require the secure
world to own the IOMMU and generate the IOMMU page tables. So we will
not use a GUP approach with VFIO today as the kernel will not have any
reason to generate a page table in the first place. Instead we will say
"this PCI device translates through the secure world" and walk away."
I think for some cVM approaches it really cannot work without letting
KVM/secure world handle the IOMMU (e.g., sharing of page tables between
IOMMU and KVM).
For your use case it *might* work, but I am wondering if this is how it
should be done, and if there are better alternatives.
[1] https://lkml.org/lkml/2024/6/20/920
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-26 10:56 ` Chenyi Qiang
@ 2024-07-31 11:18 ` David Hildenbrand
2024-08-02 7:00 ` Chenyi Qiang
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-07-31 11:18 UTC (permalink / raw)
To: Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
Sorry for the late reply!
>> Current users must skip it, yes. How private memory would have to be
>> handled, and who would handle it, is rather unclear.
>>
>> Again, maybe we'd want separate RamDiscardManager for private and shared
>> memory (after all, these are two separate memory backends).
>
>> We also considered distinguishing the populate and discard operations for
>> private and shared memory separately. As in method 2 above, we mentioned
>> adding a new argument to indicate the memory attribute to operate on.
>> The two ideas seem similar.
Yes. Likely it's just some implementation detail. I think the following
states would be possible:
* Discarded in shared + discarded in private (not populated)
* Discarded in shared + populated in private (private populated)
* Populated in shared + discarded in private (shared populated)
One could map these to states discarded/private/shared indeed.
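As a sketch of that mapping (the names are invented, just to make the
three combinations concrete):

/* Sketch only. */
typedef enum {
    RAM_PAGE_DISCARDED, /* discarded in shared + discarded in private */
    RAM_PAGE_PRIVATE,   /* discarded in shared + populated in private */
    RAM_PAGE_SHARED,    /* populated in shared + discarded in private */
} RamPageState;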
[...]
>> I've had this talk with Intel, because the 4K granularity is a pain. I
>> was told that ship has sailed ... and we have to cope with random 4K
>> conversions :(
>>
>> The many mappings will likely add both memory and runtime overheads in
>> the kernel. But we only know once we measure.
>
> In the normal case, the main runtime overhead comes from
> private<->shared flips in SWIOTLB, which defaults to 6% of memory with a
> maximum of 1GByte. I think this overhead is acceptable. In the non-default
> case, e.g. with dynamically allocated DMA buffers, the runtime overhead will
> increase. As for the memory overhead, it is indeed unavoidable.
>
> Will these performance issues be a deal breaker for enabling shared
> device assignment in this way?
I see the most problematic part being the dma_entry_limit and all of
these individual MAP/UNMAP calls on 4KiB granularity.
dma_entry_limit is "unsigned int", and defaults to U16_MAX. So the
possible maximum is 4294967295, and the default is 65535.
So we should be able to have a maximum of 16 TiB shared memory all in
4KiB chunks.
sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying
a per-page overhead of ~2.4%, excluding the actual rbtree.
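Back-of-envelope with that assumption (illustrative only):

  1 GiB shared =   262144 4KiB mappings * 96 B ~= 24 MiB of vfio_dma entries
  1 TiB shared = 268435456 4KiB mappings * 96 B ~= 24 GiB of vfio_dma entries

so the ~2.4% tracking overhead scales linearly with however much memory
the guest keeps shared.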
Tree lookup/modifications with that many nodes might also get a bit
slower, but likely still tolerable as you note.
Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable
for your use case?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-26 7:20 ` David Hildenbrand
2024-07-26 10:56 ` Chenyi Qiang
@ 2024-08-01 7:32 ` Yin, Fengwei
1 sibling, 0 replies; 22+ messages in thread
From: Yin, Fengwei @ 2024-08-01 7:32 UTC (permalink / raw)
To: David Hildenbrand, Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun, Lu, Aaron
Hi David,
On 7/26/2024 3:20 PM, David Hildenbrand wrote:
> Yes, there have been discussions about that, also in the context of
> supporting huge pages while allowing for the guest to still convert
> individual 4K chunks ...
>
> A summary is here [1]. Likely more things will be covered at Linux
> Plumbers.
>
>
> [1]
> https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
This is a very valuable link. Thanks a lot for sharing.
Aaron and I are particularly interested in the huge page (both hugetlb
and THP) support for gmem_fd (per our testing, at least a 10%+ performance
gain with it with TDX for many workloads). We will monitor linux-mm
for that kind of discussion. I am wondering whether it's possible for
you to involve Aaron and me if the discussion is still open but not on
the mailing list (I suppose you will always be included in such
discussions). Thanks.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-31 11:18 ` David Hildenbrand
@ 2024-08-02 7:00 ` Chenyi Qiang
0 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2024-08-02 7:00 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
On 7/31/2024 7:18 PM, David Hildenbrand wrote:
> Sorry for the late reply!
>
>>> Current users must skip it, yes. How private memory would have to be
>>> handled, and who would handle it, is rather unclear.
>>>
>>> Again, maybe we'd want separate RamDiscardManager for private and shared
>>> memory (after all, these are two separate memory backends).
>>
>> We also considered distinguishing the populate and discard operations for
>> private and shared memory separately. As in method 2 above, we mentioned
>> adding a new argument to indicate the memory attribute to operate on.
>> The two ideas seem similar.
>
> Yes. Likely it's just some implementation detail. I think the following
> states would be possible:
>
> * Discarded in shared + discarded in private (not populated)
> * Discarded in shared + populated in private (private populated)
> * Populated in shared + discarded in private (shared populated)
>
> One could map these to states discarded/private/shared indeed.
Makes sense. We can follow this if the RamDiscardManager mechanism is
acceptable and there are no other concerns.
>
> [...]
>
>>> I've had this talk with Intel, because the 4K granularity is a pain. I
>>> was told that ship has sailed ... and we have to cope with random 4K
>>> conversions :(
>>>
>>> The many mappings will likely add both memory and runtime overheads in
>>> the kernel. But we only know once we measure.
>>
>> In the normal case, the main runtime overhead comes from
>> private<->shared flips in SWIOTLB, which defaults to 6% of memory with a
>> maximum of 1GByte. I think this overhead is acceptable. In the non-default
>> case, e.g. with dynamically allocated DMA buffers, the runtime overhead will
>> increase. As for the memory overhead, it is indeed unavoidable.
>>
>> Will these performance issues be a deal breaker for enabling shared
>> device assignment in this way?
>
> I see the most problematic part being the dma_entry_limit and all of
> these individual MAP/UNMAP calls on 4KiB granularity.
>
> dma_entry_limit is "unsigned int", and defaults to U16_MAX. So the
> possible maximum is 4294967295, and the default is 65535.
>
> So we should be able to have a maximum of 16 TiB shared memory all in
> 4KiB chunks.
>
> sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying
> a per-page overhead of ~2.4%, excluding the actual rbtree.
>
> Tree lookup/modifications with that many nodes might also get a bit
> slower, but likely still tolerable as you note.
>
> Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable
> for your use case?
Yes. We can't guarantee the behavior of the guest, so the overhead would be
uncertain and unavoidable.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
` (6 preceding siblings ...)
2024-07-25 14:04 ` [RFC PATCH 0/6] Enable shared device assignment David Hildenbrand
@ 2024-08-16 3:02 ` Chenyi Qiang
2024-10-08 8:59 ` Chenyi Qiang
7 siblings, 1 reply; 22+ messages in thread
From: Chenyi Qiang @ 2024-08-16 3:02 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
Hi Paolo,
Hope to draw your attention. TEE I/O would depend on shared device
assignment, and we introduced this RDM solution in QEMU for it. Now,
given the in-place private/shared conversion option mentioned by David,
do you think we should continue to add pass-through support for this
in-QEMU page conversion method, or wait for that discussion to see
whether it will change to in-kernel conversion?
Thanks
Chenyi
On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") effectively disables device assignment with guest_memfd.
> guest_memfd is required for confidential guests, so device assignment to
> confidential guests is disabled. A supporting assumption for disabling
> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
> etc...) solves the confidential-guest device-assignment problem [1].
> That turns out not to be the case because TEE I/O depends on being able
> to operate devices against "shared"/untrusted memory for device
> initialization and error recovery scenarios.
>
> This series utilizes an existing framework named RamDiscardManager to
> notify VFIO of page conversions. However, there's still one concern
> related to the semantics of RamDiscardManager which is used to manage
> the memory plug/unplug state. This is a little different from the memory
> shared/private in our requirement. See the "Open" section below for more
> details.
>
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
>
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. The key differences between guest_memfd and normal memfd
> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
> cannot be mapped, read or written by userspace.
>
> In QEMU's implementation, shared memory is allocated with normal methods
> (e.g. mmap or fallocate) while private memory is allocated from
> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
> allocates new pages from the other side.
>
> Problem
> =======
> Device assignment in QEMU is implemented via VFIO system. In the normal
> VM, VM memory is pinned at the beginning of time by VFIO. In the
> confidential VM, the VM can convert memory and when that happens
> nothing currently tells VFIO that its mappings are stale. This means
> that page conversion leaks memory and leaves stale IOMMU mappings. For
> example, sequence like the following can result in stale IOMMU mappings:
>
> 1. allocate shared page
> 2. convert page shared->private
> 3. discard shared page
> 4. convert page private->shared
> 5. allocate shared page
> 6. issue DMA operations against that shared page
>
> After step 3, VFIO is still pinning the page. However, DMA operations in
> step 6 will hit the old mapping that was allocated in step 1, which
> causes the device to access the invalid data.
>
> Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") has blocked the device assignment with
> guest_memfd to avoid this problem.
>
> Solution
> ========
> The key to enable shared device assignment is to solve the stale IOMMU
> mappings problem.
>
> Given the constraints and assumptions here is a solution that satisfied
> the use cases. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Page conversion is similar to
> hot-removing a page in one mode and adding it back in the other.
>
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions.
>
> Another possible attempt [2] was to not discard shared pages in step 3
> above. This was an incomplete band-aid because guests would consume
> twice the memory since shared pages wouldn't be freed even after they
> were converted to private.
>
> Open
> ====
> Implementing a RamDiscardManager to notify VFIO of page conversions
> causes changes in semantics: private memory is treated as discarded (or
> hot-removed) memory. This isn't aligned with the expectation of current
> RamDiscardManager users (e.g. VFIO or live migration) who really
> expect that discarded memory is hot-removed and thus can be skipped when
> the users are processing guest memory. Treating private memory as
> discarded won't work in future if VFIO or live migration needs to handle
> private memory. e.g. VFIO may need to map private memory to support
> Trusted IO and live migration for confidential VMs need to migrate
> private memory.
>
> There are two possible ways to mitigate the semantics changes.
> 1. Develop a new mechanism to notify the page conversions between
> private and shared. For example, utilize the notifier_list in QEMU. VFIO
> registers its own handler and gets notified upon page conversions. This
> is a clean approach which only touches the notifier workflow. A
> challenge is that for device hotplug, existing shared memory should be
> mapped in IOMMU. This will need additional changes.
>
> 2. Extend the existing RamDiscardManager interface to manage not only
> the discarded/populated status of guest memory but also the
> shared/private status. RamDiscardManager users like VFIO will be
> notified with one more argument indicating what change is happening and
> can take action accordingly. It also has challenges e.g. QEMU allows
> only one RamDiscardManager, how to support virtio-mem for confidential
> VMs would be a problem. And some APIs like .is_populated() exposed by
> RamDiscardManager are meaningless to shared/private memory. So they may
> need some adjustments.
>
> Testing
> =======
> This patch series is tested based on the internal TDX KVM/QEMU tree.
>
> To facilitate shared device assignment with the NIC, employ the legacy
> type1 VFIO with the QEMU command:
>
> qemu-system-x86_64 [...]
> -device vfio-pci,host=XX:XX.X
>
> The parameter of dma_entry_limit needs to be adjusted. For example, a
> 16GB guest needs to adjust the parameter like
> vfio_iommu_type1.dma_entry_limit=4194304.
>
> If use the iommufd-backed VFIO with the qemu command:
>
> qemu-system-x86_64 [...]
> -object iommufd,id=iommufd0 \
> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>
> No additional adjustment required.
>
> Following the bootup of the TD guest, the guest's IP address becomes
> visible, and iperf is able to successfully send and receive data.
>
> Related link
> ============
> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
>
> Chenyi Qiang (6):
> guest_memfd: Introduce an object to manage the guest-memfd with
> RamDiscardManager
> guest_memfd: Introduce a helper to notify the shared/private state
> change
> KVM: Notify the state change via RamDiscardManager helper during
> shared/private conversion
> memory: Register the RamDiscardManager instance upon guest_memfd
> creation
> guest-memfd: Default to discarded (private) in guest_memfd_manager
> RAMBlock: make guest_memfd require coordinate discard
>
> accel/kvm/kvm-all.c | 7 +
> include/sysemu/guest-memfd-manager.h | 49 +++
> system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++
> system/meson.build | 1 +
> system/physmem.c | 11 +-
> 5 files changed, 492 insertions(+), 1 deletion(-)
> create mode 100644 include/sysemu/guest-memfd-manager.h
> create mode 100644 system/guest-memfd-manager.c
>
>
> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-08-16 3:02 ` Chenyi Qiang
@ 2024-10-08 8:59 ` Chenyi Qiang
2024-11-15 16:47 ` Rob Nertney
0 siblings, 1 reply; 22+ messages in thread
From: Chenyi Qiang @ 2024-10-08 8:59 UTC (permalink / raw)
To: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Edgecombe Rick P, Wang Wei W,
Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
Hi Paolo,
Kindly ping for this thread. The in-place page conversion was discussed
at Linux Plumbers. Does it give some direction for the shared device
assignment enabling work?
Thanks
Chenyi
On 8/16/2024 11:02 AM, Chenyi Qiang wrote:
> Hi Paolo,
>
> Hope to draw your attention. TEE I/O would depend on shared device
> assignment, and we introduced this RDM solution in QEMU for it. Now,
> given the in-place private/shared conversion option mentioned by David,
> do you think we should continue to add pass-through support for this
> in-QEMU page conversion method, or wait for that discussion to see
> whether it will change to in-kernel conversion?
>
> Thanks
> Chenyi
>
> On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment with guest_memfd.
>> guest_memfd is required for confidential guests, so device assignment to
>> confidential guests is disabled. A supporting assumption for disabling
>> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
>> etc...) solves the confidential-guest device-assignment problem [1].
>> That turns out not to be the case because TEE I/O depends on being able
>> to operate devices against "shared"/untrusted memory for device
>> initialization and error recovery scenarios.
>>
>> This series utilizes an existing framework named RamDiscardManager to
>> notify VFIO of page conversions. However, there's still one concern
>> related to the semantics of RamDiscardManager which is used to manage
>> the memory plug/unplug state. This is a little different from the memory
>> shared/private in our requirement. See the "Open" section below for more
>> details.
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>>
>> Problem
>> =======
>> Device assignment in QEMU is implemented via VFIO system. In the normal
>> VM, VM memory is pinned at the beginning of time by VFIO. In the
>> confidential VM, the VM can convert memory and when that happens
>> nothing currently tells VFIO that its mappings are stale. This means
>> that page conversion leaks memory and leaves stale IOMMU mappings. For
>> example, sequence like the following can result in stale IOMMU mappings:
>>
>> 1. allocate shared page
>> 2. convert page shared->private
>> 3. discard shared page
>> 4. convert page private->shared
>> 5. allocate shared page
>> 6. issue DMA operations against that shared page
>>
>> After step 3, VFIO is still pinning the page. However, DMA operations in
>> step 6 will hit the old mapping that was allocated in step 1, which
>> causes the device to access the invalid data.
>>
>> Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>> uncoordinated discard") has blocked the device assignment with
>> guest_memfd to avoid this problem.
>>
>> Solution
>> ========
>> The key to enable shared device assignment is to solve the stale IOMMU
>> mappings problem.
>>
>> Given the constraints and assumptions here is a solution that satisfied
>> the use cases. RamDiscardManager, an existing interface currently
>> utilized by virtio-mem, offers a means to modify IOMMU mappings in
>> accordance with VM page assignment. Page conversion is similar to
>> hot-removing a page in one mode and adding it back in the other.
>>
>> This series implements a RamDiscardManager for confidential VMs and
>> utilizes its infrastructure to notify VFIO of page conversions.
>>
>> Another possible attempt [2] was to not discard shared pages in step 3
>> above. This was an incomplete band-aid because guests would consume
>> twice the memory since shared pages wouldn't be freed even after they
>> were converted to private.
>>
>> Open
>> ====
>> Implementing a RamDiscardManager to notify VFIO of page conversions
>> causes changes in semantics: private memory is treated as discarded (or
>> hot-removed) memory. This isn't aligned with the expectation of current
>> RamDiscardManager users (e.g. VFIO or live migration) who really
>> expect that discarded memory is hot-removed and thus can be skipped when
>> the users are processing guest memory. Treating private memory as
>> discarded won't work in future if VFIO or live migration needs to handle
>> private memory. e.g. VFIO may need to map private memory to support
>> Trusted IO and live migration for confidential VMs need to migrate
>> private memory.
>>
>> There are two possible ways to mitigate the semantics changes.
>> 1. Develop a new mechanism to notify the page conversions between
>> private and shared. For example, utilize the notifier_list in QEMU. VFIO
>> registers its own handler and gets notified upon page conversions. This
>> is a clean approach which only touches the notifier workflow. A
>> challenge is that for device hotplug, existing shared memory should be
>> mapped in IOMMU. This will need additional changes.
>>
>> 2. Extend the existing RamDiscardManager interface to manage not only
>> the discarded/populated status of guest memory but also the
>> shared/private status. RamDiscardManager users like VFIO will be
>> notified with one more argument indicating what change is happening and
>> can take action accordingly. It also has challenges e.g. QEMU allows
>> only one RamDiscardManager, how to support virtio-mem for confidential
>> VMs would be a problem. And some APIs like .is_populated() exposed by
>> RamDiscardManager are meaningless to shared/private memory. So they may
>> need some adjustments.
>>
>> Testing
>> =======
>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>
>> To facilitate shared device assignment with the NIC, employ the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>> -device vfio-pci,host=XX:XX.X
>>
>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>> 16GB guest needs to adjust the parameter like
>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> If use the iommufd-backed VFIO with the qemu command:
>>
>> qemu-system-x86_64 [...]
>> -object iommufd,id=iommufd0 \
>> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> No additional adjustment required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.
>>
>> Related link
>> ============
>> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
>> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
>>
>> Chenyi Qiang (6):
>> guest_memfd: Introduce an object to manage the guest-memfd with
>> RamDiscardManager
>> guest_memfd: Introduce a helper to notify the shared/private state
>> change
>> KVM: Notify the state change via RamDiscardManager helper during
>> shared/private conversion
>> memory: Register the RamDiscardManager instance upon guest_memfd
>> creation
>> guest-memfd: Default to discarded (private) in guest_memfd_manager
>> RAMBlock: make guest_memfd require coordinate discard
>>
>> accel/kvm/kvm-all.c | 7 +
>> include/sysemu/guest-memfd-manager.h | 49 +++
>> system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++
>> system/meson.build | 1 +
>> system/physmem.c | 11 +-
>> 5 files changed, 492 insertions(+), 1 deletion(-)
>> create mode 100644 include/sysemu/guest-memfd-manager.h
>> create mode 100644 system/guest-memfd-manager.c
>>
>>
>> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-10-08 8:59 ` Chenyi Qiang
@ 2024-11-15 16:47 ` Rob Nertney
2024-11-15 17:20 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: Rob Nertney @ 2024-11-15 16:47 UTC (permalink / raw)
To: Chenyi Qiang
Cc: Paolo Bonzini, David Hildenbrand, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Edgecombe Rick P, Wang Wei W, Peng Chao P,
Gao Chao, Wu Hao, Xu Yilun
On Tue, Oct 08, 2024 at 04:59:45PM +0800, Chenyi Qiang wrote:
> Hi Paolo,
>
> Kindly ping for this thread. The in-place page conversion was discussed
> at Linux Plumbers. Does it give some direction for the shared device
> assignment enabling work?
>
Hi everybody.
Our NVIDIA GPUs currently support this shared-memory/bounce-buffer method to
provide AI acceleration within TEE CVMs. We require passing through the GPU via
VFIO stubbing, which means that we are impacted by the absence of an API to
inform VFIO about page conversions.
The CSPs have enough kernel engineers who handle this process in their own host
kernels, but we have several enterprise customers who are eager to begin using
this solution in the upstream. AMD has successfully ported enough of the
SEV-SNP support into 6.11 and our initial testing shows successful operation,
but only by disabling discard via these two QEMU patches:
- https://github.com/AMDESE/qemu/commit/0c9ae28d3e199de9a40876a492e0f03a11c6f5d8
- https://github.com/AMDESE/qemu/commit/5256c41fb3055961ea7ac368acc0b86a6632d095
This "workaround" is a bit of a hack, as it effectively requires greater than
double the amount of host memory than as to be allocated to the guest CVM. The
proposal here appears to be a promising workaround; are there other solutions
that are recommended for this use case?
This configuration is in GA right now and NVIDIA is committed to support and
test this bounce-buffer mailbox solution for many years into the future, so
we're highly invested in seeing a converged solution in the upstream.
Thanks,
Rob
> Thanks
> Chenyi
>
> On 8/16/2024 11:02 AM, Chenyi Qiang wrote:
> > Hi Paolo,
> >
> > Hope to draw your attention. TEE I/O would depend on shared device
> > assignment, and we introduced this RDM solution in QEMU for it. Now,
> > given the in-place private/shared conversion option mentioned by David,
> > do you think we should continue to add pass-through support for this
> > in-QEMU page conversion method, or wait for that discussion to see
> > whether it will change to in-kernel conversion?
> >
> > Thanks
> > Chenyi
> >
> > On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
> >> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> >> discard") effectively disables device assignment with guest_memfd.
> >> guest_memfd is required for confidential guests, so device assignment to
> >> confidential guests is disabled. A supporting assumption for disabling
> >> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
> >> etc...) solves the confidential-guest device-assignment problem [1].
> >> That turns out not to be the case because TEE I/O depends on being able
> >> to operate devices against "shared"/untrusted memory for device
> >> initialization and error recovery scenarios.
> >>
> >> This series utilizes an existing framework named RamDiscardManager to
> >> notify VFIO of page conversions. However, there's still one concern
> >> related to the semantics of RamDiscardManager which is used to manage
> >> the memory plug/unplug state. This is a little different from the memory
> >> shared/private in our requirement. See the "Open" section below for more
> >> details.
> >>
> >> Background
> >> ==========
> >> Confidential VMs have two classes of memory: shared and private memory.
> >> Shared memory is accessible from the host/VMM while private memory is
> >> not. Confidential VMs can decide which memory is shared/private and
> >> convert memory between shared/private at runtime.
> >>
> >> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> >> private memory. The key differences between guest_memfd and normal memfd
> >> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
> >> cannot be mapped, read or written by userspace.
> >>
> >> In QEMU's implementation, shared memory is allocated with normal methods
> >> (e.g. mmap or fallocate) while private memory is allocated from
> >> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
> >> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
> >> allocates new pages from the other side.
> >>
> >> Problem
> >> =======
> >> Device assignment in QEMU is implemented via VFIO system. In the normal
> >> VM, VM memory is pinned at the beginning of time by VFIO. In the
> >> confidential VM, the VM can convert memory and when that happens
> >> nothing currently tells VFIO that its mappings are stale. This means
> >> that page conversion leaks memory and leaves stale IOMMU mappings. For
> >> example, sequence like the following can result in stale IOMMU mappings:
> >>
> >> 1. allocate shared page
> >> 2. convert page shared->private
> >> 3. discard shared page
> >> 4. convert page private->shared
> >> 5. allocate shared page
> >> 6. issue DMA operations against that shared page
> >>
> >> After step 3, VFIO is still pinning the page. However, DMA operations in
> >> step 6 will hit the old mapping that was allocated in step 1, which
> >> causes the device to access the invalid data.
> >>
> >> Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> >> uncoordinated discard") has blocked the device assignment with
> >> guest_memfd to avoid this problem.
> >>
> >> Solution
> >> ========
> >> The key to enable shared device assignment is to solve the stale IOMMU
> >> mappings problem.
> >>
> >> Given the constraints and assumptions here is a solution that satisfied
> >> the use cases. RamDiscardManager, an existing interface currently
> >> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> >> accordance with VM page assignment. Page conversion is similar to
> >> hot-removing a page in one mode and adding it back in the other.
> >>
> >> This series implements a RamDiscardManager for confidential VMs and
> >> utilizes its infrastructure to notify VFIO of page conversions.
> >>
> >> Another possible attempt [2] was to not discard shared pages in step 3
> >> above. This was an incomplete band-aid because guests would consume
> >> twice the memory since shared pages wouldn't be freed even after they
> >> were converted to private.
> >>
> >> Open
> >> ====
> >> Implementing a RamDiscardManager to notify VFIO of page conversions
> >> causes changes in semantics: private memory is treated as discarded (or
> >> hot-removed) memory. This isn't aligned with the expectation of current
> >> RamDiscardManager users (e.g. VFIO or live migration) who really
> >> expect that discarded memory is hot-removed and thus can be skipped when
> >> the users are processing guest memory. Treating private memory as
> >> discarded won't work in future if VFIO or live migration needs to handle
> >> private memory. e.g. VFIO may need to map private memory to support
> >> Trusted IO and live migration for confidential VMs need to migrate
> >> private memory.
> >>
> >> There are two possible ways to mitigate the semantics changes.
> >> 1. Develop a new mechanism to notify the page conversions between
> >> private and shared. For example, utilize the notifier_list in QEMU. VFIO
> >> registers its own handler and gets notified upon page conversions. This
> >> is a clean approach which only touches the notifier workflow. A
> >> challenge is that for device hotplug, existing shared memory should be
> >> mapped in IOMMU. This will need additional changes.
> >>
> >> 2. Extend the existing RamDiscardManager interface to manage not only
> >> the discarded/populated status of guest memory but also the
> >> shared/private status. RamDiscardManager users like VFIO will be
> >> notified with one more argument indicating what change is happening and
> >> can take action accordingly. It also has challenges e.g. QEMU allows
> >> only one RamDiscardManager, how to support virtio-mem for confidential
> >> VMs would be a problem. And some APIs like .is_populated() exposed by
> >> RamDiscardManager are meaningless to shared/private memory. So they may
> >> need some adjustments.
> >>
> >> Testing
> >> =======
> >> This patch series is tested based on the internal TDX KVM/QEMU tree.
> >>
> >> To facilitate shared device assignment with the NIC, employ the legacy
> >> type1 VFIO with the QEMU command:
> >>
> >> qemu-system-x86_64 [...]
> >> -device vfio-pci,host=XX:XX.X
> >>
> >> The parameter of dma_entry_limit needs to be adjusted. For example, a
> >> 16GB guest needs to adjust the parameter like
> >> vfio_iommu_type1.dma_entry_limit=4194304.
> >>
> >> If use the iommufd-backed VFIO with the qemu command:
> >>
> >> qemu-system-x86_64 [...]
> >> -object iommufd,id=iommufd0 \
> >> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
> >>
> >> No additional adjustment required.
> >>
> >> Following the bootup of the TD guest, the guest's IP address becomes
> >> visible, and iperf is able to successfully send and receive data.
> >>
> >> Related link
> >> ============
> >> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
> >> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
> >>
> >> Chenyi Qiang (6):
> >> guest_memfd: Introduce an object to manage the guest-memfd with
> >> RamDiscardManager
> >> guest_memfd: Introduce a helper to notify the shared/private state
> >> change
> >> KVM: Notify the state change via RamDiscardManager helper during
> >> shared/private conversion
> >> memory: Register the RamDiscardManager instance upon guest_memfd
> >> creation
> >> guest-memfd: Default to discarded (private) in guest_memfd_manager
> >> RAMBlock: make guest_memfd require coordinate discard
> >>
> >> accel/kvm/kvm-all.c | 7 +
> >> include/sysemu/guest-memfd-manager.h | 49 +++
> >> system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++
> >> system/meson.build | 1 +
> >> system/physmem.c | 11 +-
> >> 5 files changed, 492 insertions(+), 1 deletion(-)
> >> create mode 100644 include/sysemu/guest-memfd-manager.h
> >> create mode 100644 system/guest-memfd-manager.c
> >>
> >>
> >> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
>
>
* Re: [RFC PATCH 0/6] Enable shared device assignment
2024-11-15 16:47 ` Rob Nertney
@ 2024-11-15 17:20 ` David Hildenbrand
0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-11-15 17:20 UTC (permalink / raw)
To: Rob Nertney, Chenyi Qiang
Cc: Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Edgecombe Rick P,
Wang Wei W, Peng Chao P, Gao Chao, Wu Hao, Xu Yilun
On 15.11.24 17:47, Rob Nertney wrote:
> On Tue, Oct 08, 2024 at 04:59:45PM +0800, Chenyi Qiang wrote:
>> Hi Paolo,
>>
>> Kindly ping for this thread. The in-place page conversion is discussed
>> at Linux Plumbers. Does it give some direction for shared device
>> assignment enabling work?
>>
> Hi everybody.
Hi,
>
> Our NVIDIA GPUs currently support this shared-memory/bounce-buffer method to
> provide AI acceleration within TEE CVMs. We require passing through the GPU via
> VFIO stubbing, which means that we are impacted by the absence of an API to
> inform VFIO about page conversions.
>
> The CSPs have enough kernel engineers who handle this process in their own host
> kernels, but we have several enterprise customers who are eager to begin using
> this solution in the upstream. AMD has successfully ported enough of the
> SEV-SNP support into 6.11 and our initial testing shows successful operation,
> but only by disabling discard via these two QEMU patches:
> - https://github.com/AMDESE/qemu/commit/0c9ae28d3e199de9a40876a492e0f03a11c6f5d8
> - https://github.com/AMDESE/qemu/commit/5256c41fb3055961ea7ac368acc0b86a6632d095
>
> This "workaround" is a bit of a hack, as it effectively requires greater than
> double the amount of host memory than as to be allocated to the guest CVM. The
> proposal here appears to be a promising workaround; are there other solutions
> that are recommended for this use case?
What people are working on is supporting private and shared memory in
guest_memfd, and allowing an in-place conversion between shared and
private: this avoids discard + reallocation and consequently any double
memory allocation.
To get stuff into VFIO, we must only map the currently shared pages
(VFIO will pin + map them), and unmap them (VFIO will unmap + unpin
them) before converting them to private.
This series should likely achieve the
unmap-before-conversion-to-private, and map-after-conversion-to-shared,
such that it could be compatible with guest_memfd.
QEMU would simply mmap the guest_memfd to obtain a user space mapping,
from which it can pass address ranges to VFIO like we already do. This
user space mapping only allows for shared pages to be faulted in.
Currently private pages cannot be faulted in (inaccessible -> SIGBUS).
So far the theory.
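A rough, purely illustrative sketch of that ordering (all function names
below are placeholders, not the actual QEMU/VFIO entry points):

    #include <inttypes.h>
    #include <stdio.h>

    /* Placeholders for the real VFIO DMA map/unmap paths (type1 or iommufd)
     * and for the eventual in-place guest_memfd conversion. Only the
     * ordering is the point here. */
    static void vfio_unmap(uint64_t gpa, uint64_t sz)  { printf("unmap+unpin 0x%" PRIx64 "\n", gpa); (void)sz; }
    static void vfio_map(uint64_t gpa, uint64_t sz)    { printf("pin+map     0x%" PRIx64 "\n", gpa); (void)sz; }
    static void to_private(uint64_t gpa, uint64_t sz)  { printf("-> private  0x%" PRIx64 "\n", gpa); (void)sz; }
    static void to_shared(uint64_t gpa, uint64_t sz)   { printf("-> shared   0x%" PRIx64 "\n", gpa); (void)sz; }

    int main(void)
    {
        uint64_t gpa = 0x100000000ULL, sz = 0x200000;

        /* shared -> private: drop the DMA mapping before the page disappears
         * from the user space (guest_memfd mmap) view */
        vfio_unmap(gpa, sz);
        to_private(gpa, sz);

        /* private -> shared: convert first, then the page can be faulted in
         * through the guest_memfd mmap and handed back to VFIO */
        to_shared(gpa, sz);
        vfio_map(gpa, sz);
        return 0;
    }
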
I'll note that this is likely not the most elegant solution, but
something that would provide one solution to the problem in a
reasonable timeframe.
Cheers!
--
Cheers,
David / dhildenb
end of thread, other threads:[~2024-11-15 18:04 UTC | newest]
Thread overview: 22+ messages
2024-07-25 7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 1/6] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 2/6] guest_memfd: Introduce a helper to notify the shared/private state change Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 3/6] KVM: Notify the state change via RamDiscardManager helper during shared/private conversion Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 4/6] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 5/6] guest-memfd: Default to discarded (private) in guest_memfd_manager Chenyi Qiang
2024-07-25 7:21 ` [RFC PATCH 6/6] RAMBlock: make guest_memfd require coordinate discard Chenyi Qiang
2024-07-25 14:04 ` [RFC PATCH 0/6] Enable shared device assignment David Hildenbrand
2024-07-26 5:02 ` Tian, Kevin
2024-07-26 7:08 ` David Hildenbrand
2024-07-31 7:12 ` Xu Yilun
2024-07-31 11:05 ` David Hildenbrand
2024-07-26 6:20 ` Chenyi Qiang
2024-07-26 7:20 ` David Hildenbrand
2024-07-26 10:56 ` Chenyi Qiang
2024-07-31 11:18 ` David Hildenbrand
2024-08-02 7:00 ` Chenyi Qiang
2024-08-01 7:32 ` Yin, Fengwei
2024-08-16 3:02 ` Chenyi Qiang
2024-10-08 8:59 ` Chenyi Qiang
2024-11-15 16:47 ` Rob Nertney
2024-11-15 17:20 ` David Hildenbrand