From: Chenyi Qiang <chenyi.qiang@intel.com>
To: "Cédric Le Goater" <clg@kaod.org>,
	"David Hildenbrand" <david@redhat.com>,
	"Alexey Kardashevskiy" <aik@amd.com>,
	"Peter Xu" <peterx@redhat.com>,
	"Gupta Pankaj" <pankaj.gupta@amd.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Michael Roth" <michael.roth@amd.com>
Cc: <qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
	Williams Dan J <dan.j.williams@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Baolu Lu <baolu.lu@linux.intel.com>,
	Gao Chao <chao.gao@intel.com>, Xu Yilun <yilun.xu@intel.com>,
	Li Xiaoyao <xiaoyao.li@intel.com>,
	'Alex Williamson' <alex.williamson@redhat.com>
Subject: Re: [PATCH v5 00/10] Enable shared device assignment
Date: Mon, 26 May 2025 20:16:55 +0800	[thread overview]
Message-ID: <faa1f415-4c12-4082-abdb-7bdd637a4d36@intel.com> (raw)
In-Reply-To: <7283f8f2-a9d9-4e7d-bfbd-3854b3d1736e@kaod.org>



On 5/26/2025 7:37 PM, Cédric Le Goater wrote:
> On 5/20/25 12:28, Chenyi Qiang wrote:
>> This is the v5 series of the shared device assignment support.
>>
>> As discussed in the v4 series [1], the GenericStateManager parent class
>> and the PrivateSharedManager child interface were deemed to be heading
>> in the wrong direction. This series reverts to the original single
>> RamDiscardManager interface and defers, as future work, allowing the
>> co-existence of multiple pairs of state management. For example, if we
>> want virtio-mem to co-exist with guest_memfd, a new framework will be
>> needed to combine the private/shared/discard states [2].
>>
>> Another change since the last version is the error handling of memory
>> conversion. Currently, the failure of kvm_convert_memory() causes QEMU
>> to quit instead of resuming the guest. The complex rollback operation
>> doesn't add value and merely adds code that is difficult to test.
>> That said, future conversion paths are likely to hit more errors, such
>> as an unmap failure on a shared-to-private in-place conversion. This
>> series keeps complex error handling out of the picture for now and
>> attaches the related handling at the end of the series for future
>> extension.
>>
>> Apart from the two areas of future work above, there is also a future
>> optimization: using a more memory-efficient mechanism to track ranges
>> of contiguous states instead of a bitmap [3]. This series still uses a
>> bitmap for simplicity.
>>
>> The overview of this series:
>> - Patch 1-3: Preparation patches. These include function exposure and
>>    some definition changes to return values.
>> - Patch 4-5: Introduce a new object to implement the RamDiscardManager
>>    interface and a helper to notify shared/private state changes.
>> - Patch 6: Store the new object including guest_memfd information in
>>    RAMBlock. Register the RamDiscardManager instance to the target
>>    RAMBlock's MemoryRegion so that the RamDiscardManager users can run in
>>    the specific path.
>> - Patch 7: Unblock coordinated discard so that shared device
>>    assignment (VFIO) can work with guest_memfd. After this patch, the
>>    basic device assignment functionality works properly.
>> - Patch 8-9: Some cleanup work. Move the state-change handling into a
>>    RamDiscardListener so that it can be invoked together with the VFIO
>>    listener by the state_change() call. This series drops the priority
>>    support from v4, which is required by in-place conversion, because
>>    the conversion path will likely change.
>> - Patch 10: More complex error handling, including rollback and the
>>    mixed-state conversion case.
>>
>> More small changes or details can be found in the individual patches.
>>
>> ---
>> Original cover letter:
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. In current implementation, shared memory is allocated
>> with normal methods (e.g. mmap or fallocate) while private memory is
>> allocated from guest_memfd. When a VM performs memory conversions, QEMU
>> frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
>> one side, and allocates new pages from the other side. This will cause a
>> stale IOMMU mapping issue mentioned in [4] when we try to enable shared
>> device assignment in confidential VMs.
>>
>> Solution
>> ========
>> The key to enabling shared device assignment is updating the IOMMU
>> mappings on page conversion. RamDiscardManager, an existing interface
>> currently used by virtio-mem, offers a means to modify IOMMU mappings
>> in accordance with VM page assignment. Although the operations required
>> in VFIO for page conversion are similar to those for memory
>> plug/unplug, the private/shared states differ from the
>> discarded/populated states. We want a mechanism similar to
>> RamDiscardManager, but one that manages the private and shared states.
>>
>> This series introduces a new abstract parent class to manage a pair of
>> opposite states, with RamDiscardManager as a child managing the
>> populated/discarded states, and introduces a new child class,
>> PrivateSharedManager, which uses the same infrastructure to notify VFIO
>> of page conversions.
>>
>> Relationship with in-place page conversion
>> ==========================================
>> To support 1G pages for guest_memfd [5], the current direction is to
>> allow mmap() of guest_memfd to userspace so that both private and
>> shared memory can use the same physical pages as the backend. This
>> in-place page conversion design eliminates the need to discard pages
>> during shared/private conversions. However, device assignment will
>> still be blocked because the in-place page conversion will reject the
>> conversion when the page is pinned by VFIO.
>>
>> To address this, the key lies in the sequence of the VFIO map/unmap
>> operations relative to the page conversion. It can be adjusted to
>> achieve unmap-before-conversion-to-private and
>> map-after-conversion-to-shared, ensuring compatibility with
>> guest_memfd.
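>>
>> A rough sketch of that ordering (illustrative pseudocode only; the
>> function names below are placeholders, not the actual QEMU call sites):
>>
>>   convert_to_private(range):
>>       vfio_unmap(range)                  # drop DMA mappings first
>>       set_memory_attributes(range, PRIVATE)
>>
>>   convert_to_shared(range):
>>       set_memory_attributes(range, SHARED)
>>       vfio_map(range)                    # re-establish DMA mappings after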
>>
>> Limitation
>> ==========
>> One limitation is that VFIO expects the DMA mapping for a specific IOVA
>> range to be mapped and unmapped with the same granularity. The guest
>> may perform partial conversions, such as converting a small region
>> within a larger one. To prevent such invalid cases, all operations are
>> performed with 4K granularity. This could be optimized once the
>> cut_mapping operation [6] is introduced in the future: we can always
>> perform a split-before-unmap if a partial conversion happens. If the
>> split succeeds, the unmap will succeed and be atomic; if the split
>> fails, the unmap fails.
>>
>> Testing
>> =======
>> This patch series is tested based on TDX patches available at:
>> KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-coco-queue-snapshot-20250408
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-20
>>
>> Because new features like the cut_mapping operation will only be
>> supported in iommufd, it is recommended to use the iommufd-backed VFIO
>> with the qemu command:
> 
> Is it recommended or required ? If the VFIO IOMMU type1 backend is not
> supported for confidential VMs, QEMU should fail to start.

The VFIO IOMMU type1 backend is also supported, but it needs the
dma_entry_limit module parameter to be increased, as this series
currently does the map/unmap with 4K granularity.
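
For example, a 4 GiB shared region mapped at 4 KiB granularity needs
4 GiB / 4 KiB = 1,048,576 DMA entries, which is above the type1 default
limit of 65535. A rough sketch of both setups (the limit value and the
device BDF below are only illustrative assumptions):

  # legacy VFIO type1: raise the per-container DMA entry limit
  modprobe vfio_iommu_type1 dma_entry_limit=2097152

  # or use the iommufd-backed VFIO recommended above
  qemu-system-x86_64 ... \
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=0000:01:00.0,iommufd=iommufd0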

> 
> Please add Alex Williamson and I to the Cc: list.

Sure, will do in next version.

> 
> Thanks,
> 
> C.
> 


Thread overview: 51+ messages
2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 01/10] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
2025-05-26  8:40   ` David Hildenbrand
2025-05-27  6:56   ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
2025-05-26  8:42   ` David Hildenbrand
2025-05-26  9:35   ` Philippe Mathieu-Daudé
2025-05-26 10:21     ` Chenyi Qiang
2025-05-27  6:56   ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd Chenyi Qiang
2025-05-26  9:01   ` David Hildenbrand
2025-05-26  9:28     ` Chenyi Qiang
2025-05-26 11:16       ` Alexey Kardashevskiy
2025-05-27  1:15         ` Chenyi Qiang
2025-05-27  1:20           ` Alexey Kardashevskiy
2025-05-27  3:14             ` Chenyi Qiang
2025-05-27  6:06               ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes Chenyi Qiang
2025-05-26  9:02   ` David Hildenbrand
2025-05-27  7:35   ` Alexey Kardashevskiy
2025-05-27  9:06     ` Chenyi Qiang
2025-05-27  9:19       ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks Chenyi Qiang
2025-05-26  9:06   ` David Hildenbrand
2025-05-26  9:46     ` Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
2025-05-26  9:08   ` David Hildenbrand
2025-05-27  5:47     ` Chenyi Qiang
2025-05-27  7:42       ` Alexey Kardashevskiy
2025-05-27  8:12         ` Chenyi Qiang
2025-05-27 11:20       ` David Hildenbrand
2025-05-28  1:57         ` Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result Chenyi Qiang
2025-05-26  9:31   ` Philippe Mathieu-Daudé
2025-05-26 10:36   ` Cédric Le Goater
2025-05-26 12:44     ` Cédric Le Goater
2025-05-27  5:29       ` Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions Chenyi Qiang
2025-05-26  9:22   ` David Hildenbrand
2025-05-27  8:01   ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes Chenyi Qiang
2025-05-26  9:17   ` David Hildenbrand
2025-05-26 10:19     ` Chenyi Qiang
2025-05-26 12:10       ` David Hildenbrand
2025-05-26 12:39         ` Chenyi Qiang
2025-05-27  9:11   ` Alexey Kardashevskiy
2025-05-27 10:18     ` Chenyi Qiang
2025-05-27 11:21       ` David Hildenbrand
2025-05-26 11:37 ` [PATCH v5 00/10] Enable shared device assignment Cédric Le Goater
2025-05-26 12:16   ` Chenyi Qiang [this message]
