public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Chenyi Qiang <chenyi.qiang@intel.com>
To: "David Hildenbrand" <david@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Michael Roth" <michael.roth@amd.com>
Cc: <qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
	Williams Dan J <dan.j.williams@intel.com>,
	Edgecombe Rick P <rick.p.edgecombe@intel.com>,
	Wang Wei W <wei.w.wang@intel.com>,
	Peng Chao P <chao.p.peng@intel.com>,
	"Gao Chao" <chao.gao@intel.com>, Wu Hao <hao.wu@intel.com>,
	Xu Yilun <yilun.xu@intel.com>
Subject: Re: [RFC PATCH 0/6] Enable shared device assignment
Date: Fri, 26 Jul 2024 18:56:00 +0800	[thread overview]
Message-ID: <0fdd0340-8daa-45b8-9e1c-bafe6f4e6a60@intel.com> (raw)
In-Reply-To: <d87a5e47-3c48-4e20-b3de-e83c2ca44606@redhat.com>



On 7/26/2024 3:20 PM, David Hildenbrand wrote:
> On 26.07.24 08:20, Chenyi Qiang wrote:
>>
>>
>> On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>>>> Open
>>>> ====
>>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>>> causes changes in semantics: private memory is treated as discarded (or
>>>> hot-removed) memory. This isn't aligned with the expectation of current
>>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>>> expect that discarded memory is hot-removed and thus can be skipped
>>>> when
>>>> the users are processing guest memory. Treating private memory as
>>>> discarded won't work in future if VFIO or live migration needs to
>>>> handle
>>>> private memory. e.g. VFIO may need to map private memory to support
>>>> Trusted IO and live migration for confidential VMs need to migrate
>>>> private memory.
>>>
>>> "VFIO may need to map private memory to support Trusted IO"
>>>
>>> I've been told that the way we handle shared memory won't be the way
>>> this is going to work with guest_memfd. KVM will coordinate directly
>>> with VFIO or $whatever and update the IOMMU tables itself right in the
>>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>>> work. So I don't consider that currently a concern. guest_memfd private
>>> memory is not mapped into user page tables and as it currently seems it
>>> never will be.
>>
>> That's correct. AFAIK, some TEE IO solution like TDX Connect would let
>> kernel coordinate and update private mapping in IOMMU tables. Here, It
>> mentions that VFIO "may" need map private memory. I want to make this
>> more generic to account for potential future TEE IO solutions that may
>> require such functionality. :)
> 
> Careful to not over-enginner something that is not even real or
> close-to-be-real yet, though. :) Nobody really knows who that will look
> like, besides that we know for Intel that we won't need that.

OK, Thanks for the reminder!

> 
>>
>>>
>>> Similarly: live migration. We cannot simply migrate that memory the
>>> traditional way. We even have to track the dirty state differently.
>>>
>>> So IMHO, treating both memory as discarded == don't touch it the usual
>>> way might actually be a feature not a bug ;)
>>
>> Do you mean treating the private memory in both VFIO and live migration
>> as discarded? That is what this patch series does. And as you mentioned,
>> these RDM users cannot follow the traditional RDM way. Because of this,
>> we also considered whether we should use RDM or a more generic mechanism
>> like notifier_list below.
> 
> Yes, the shared memory is logically discarded. At the same time we
> *might* get private memory effectively populated. See my reply to Kevin
> that there might be ways of having shared vs. private populate/discard
> in the future, if required. Just some idea, though.
> 
>>
>>>
>>>>
>>>> There are two possible ways to mitigate the semantics changes.
>>>> 1. Develop a new mechanism to notify the page conversions between
>>>> private and shared. For example, utilize the notifier_list in QEMU.
>>>> VFIO
>>>> registers its own handler and gets notified upon page conversions. This
>>>> is a clean approach which only touches the notifier workflow. A
>>>> challenge is that for device hotplug, existing shared memory should be
>>>> mapped in IOMMU. This will need additional changes.
>>>>
>>>> 2. Extend the existing RamDiscardManager interface to manage not only
>>>> the discarded/populated status of guest memory but also the
>>>> shared/private status. RamDiscardManager users like VFIO will be
>>>> notified with one more argument indicating what change is happening and
>>>> can take action accordingly. It also has challenges e.g. QEMU allows
>>>> only one RamDiscardManager, how to support virtio-mem for confidential
>>>> VMs would be a problem. And some APIs like .is_populated() exposed by
>>>> RamDiscardManager are meaningless to shared/private memory. So they may
>>>> need some adjustments.
>>>
>>> Think of all of that in terms of "shared memory is populated, private
>>> memory is some inaccessible stuff that needs very special way and other
>>> means for device assignment, live migration, etc.". Then it actually
>>> quite makes sense to use of RamDiscardManager (AFAIKS :) ).
>>
>> Yes, such notification mechanism is what we want. But for the users of
>> RDM, it would require additional change accordingly. Current users just
>> skip inaccessible stuff, but in private memory case, it can't be simply
>> skipped. Maybe renaming RamDiscardManager to RamStateManager is more
>> accurate then. :)
> 
> Current users must skip it, yes. How private memory would have to be
> handled, and who would handle it, is rather unclear.
> 
> Again, maybe we'd want separate RamDiscardManager for private and shared
> memory (after all, these are two separate memory backends).

We also considered distinguishing the populate and discard operation for
private and shared memory separately. As in method 2 above, we mentioned
to add a new argument to indicate the memory attribute to operate on.
They seem to have a similar idea.

> 
> Not sure that "RamStateManager" terminology would be reasonable in that
> approach.
> 
>>
>>>
>>>>
>>>> Testing
>>>> =======
>>>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>>>
>>>> To facilitate shared device assignment with the NIC, employ the legacy
>>>> type1 VFIO with the QEMU command:
>>>>
>>>> qemu-system-x86_64 [...]
>>>>       -device vfio-pci,host=XX:XX.X
>>>>
>>>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>>>> 16GB guest needs to adjust the parameter like
>>>> vfio_iommu_type1.dma_entry_limit=4194304.
>>>
>>> But here you note the biggest real issue I see (not related to
>>> RAMDiscardManager, but that we have to prepare for conversion of each
>>> possible private page to shared and back): we need a single IOMMU
>>> mapping for each 4 KiB page.
>>>
>>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
>>> Does it even scale then?
>>
>> The entry limitation needs to be increased as the guest memory size
>> increases. For this issue, are you concerned that having too many
>> entries might bring some performance issue? Maybe we could introduce
>> some PV mechanism to coordinate with guest to convert memory only in 2M
>> granularity. This may help mitigate the problem.
> 
> I've had this talk with Intel, because the 4K granularity is a pain. I
> was told that ship has sailed ... and we have to cope with random 4K
> conversions :(
> 
> The many mappings will likely add both memory and runtime overheads in
> the kernel. But we only know once we measure.

In the normal case, the main runtime overhead comes from
private<->shared flip in SWIOTLB, which defaults to 6% of memory with a
maximum of 1Gbyte. I think this overhead is acceptable. In non-default
case, e.g. dynamic allocated DMA buffer, the runtime overhead will
increase. As for the memory overheads, It is indeed unavoidable.

Will these performance issues be a deal breaker for enabling shared
device assignment in this way?

> 
> Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB
> of shared memory :/
> 
>>
>>>
>>>
>>> There is the alternative of having in-place private/shared conversion
>>> when we also let guest_memfd manage some shared memory. It has plenty of
>>> downsides, but for the problem at hand it would mean that we don't
>>> discard on shared/private conversion.>
>>> But whenever we want to convert memory shared->private we would
>>> similarly have to from IOMMU page tables via VFIO. (the in-place
>>> conversion will only be allowed if any additional references on a page
>>> are gone -- when it is inaccessible by userspace/kernel).
>>
>> I'm not clear about this in-place private/shared conversion. Can you
>> elaborate a little bit? It seems this alternative changes private and
>> shared management in current guest_memfd?
> 
> Yes, there have been discussions about that, also in the context of
> supporting huge pages while allowing for the guest to still convert
> individual 4K chunks ...
> 
> A summary is here [1]. Likely more things will be covered at Linux
> Plumbers.
> 
> 
> [1]
> https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
> 

Thanks for your sharing.


  reply	other threads:[~2024-07-26 10:56 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-25  7:21 [RFC PATCH 0/6] Enable shared device assignment Chenyi Qiang
2024-07-25  7:21 ` [RFC PATCH 1/6] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
2024-07-25  7:21 ` [RFC PATCH 2/6] guest_memfd: Introduce a helper to notify the shared/private state change Chenyi Qiang
2024-07-25  7:21 ` [RFC PATCH 3/6] KVM: Notify the state change via RamDiscardManager helper during shared/private conversion Chenyi Qiang
2024-07-25  7:21 ` [RFC PATCH 4/6] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
2024-07-25  7:21 ` [RFC PATCH 5/6] guest-memfd: Default to discarded (private) in guest_memfd_manager Chenyi Qiang
2024-07-25  7:21 ` [RFC PATCH 6/6] RAMBlock: make guest_memfd require coordinate discard Chenyi Qiang
2024-07-25 14:04 ` [RFC PATCH 0/6] Enable shared device assignment David Hildenbrand
2024-07-26  5:02   ` Tian, Kevin
2024-07-26  7:08     ` David Hildenbrand
2024-07-31  7:12       ` Xu Yilun
2024-07-31 11:05         ` David Hildenbrand
2024-07-26  6:20   ` Chenyi Qiang
2024-07-26  7:20     ` David Hildenbrand
2024-07-26 10:56       ` Chenyi Qiang [this message]
2024-07-31 11:18         ` David Hildenbrand
2024-08-02  7:00           ` Chenyi Qiang
2024-08-01  7:32       ` Yin, Fengwei
2024-08-16  3:02 ` Chenyi Qiang
2024-10-08  8:59   ` Chenyi Qiang
2024-11-15 16:47     ` Rob Nertney
2024-11-15 17:20       ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0fdd0340-8daa-45b8-9e1c-bafe6f4e6a60@intel.com \
    --to=chenyi.qiang@intel.com \
    --cc=chao.gao@intel.com \
    --cc=chao.p.peng@intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=hao.wu@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=michael.roth@amd.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=wei.w.wang@intel.com \
    --cc=yilun.xu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox