From: David Hildenbrand <david@redhat.com>
To: Chenyi Qiang <chenyi.qiang@intel.com>,
Jean-Philippe Brucker <jean-philippe@linaro.org>,
philmd@linaro.org, peterx@redhat.com, pbonzini@redhat.com,
peter.maydell@linaro.org, Alexey Kardashevskiy <aik@amd.com>,
Gao Chao <chao.gao@intel.com>
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
Date: Thu, 27 Feb 2025 12:27:19 +0100
Message-ID: <812768d7-a02d-4b29-95f3-fb7a125cf54e@redhat.com>
In-Reply-To: <39155512-8d71-412e-aa5c-591d7317d210@intel.com>
On 27.02.25 04:26, Chenyi Qiang wrote:
>
>
> On 2/26/2025 8:43 PM, Chenyi Qiang wrote:
>>
>>
>> On 2/25/2025 5:41 PM, David Hildenbrand wrote:
>>> On 25.02.25 03:00, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>>>> IOMMU notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>>>>> translation (even without vIOMMU).
>>>>>>>>>
>>>>>>>>> At the moment:
>>>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>>>
>>>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>>>> operation because it uses the old state bitmap to know which pages
>>>>>>>>> need the populate() notifier.
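
[ To make the ordering issue above concrete: a stripped-down, self-contained
  toy model of the flow, compilable stand-alone. All names here are simplified
  stand-ins for the real QEMU/guest_memfd functions, not the actual
  implementation. ]

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 8

static bool shared_bitmap[NBLOCKS];     /* stand-in for mgr->shared_bitmap */

/* stand-in for ram_discard_manager_is_populated() */
static bool is_populated(unsigned int block)
{
    return shared_bitmap[block];
}

/* stand-in for the VFIO/IOMMU notifier that ends up in memory_get_xlat_addr() */
static int notify_populate(unsigned int block)
{
    if (!is_populated(block)) {
        fprintf(stderr, "xlat failed: block %u is still marked discarded\n", block);
        return -1;
    }
    return 0;
}

/* current ordering: notifier first, bitmap update second -> the check above fails */
static int state_change_to_shared(unsigned int block)
{
    int ret = notify_populate(block);

    if (!ret) {
        shared_bitmap[block] = true;
    }
    return ret;
}

int main(void)
{
    /* fails, because the bitmap is only updated after the notifier has run */
    return state_change_to_shared(3) ? 1 : 0;
}
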
>>>>>>>>
>>>>>>>> I assume we're talking about this code: [1]
>>>>>>>>
>>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com
>>>>>>>>
>>>>>>>>
>>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
>>>>>>>> +                                         uint64_t size, bool shared_to_private)
>>>>>>>> +{
>>>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>>>> +    int ret = 0;
>>>>>>>> +
>>>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>>>> +                     __func__, offset, size);
>>>>>>>> +        return -1;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>>>>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if (shared_to_private) {
>>>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>>>> +    } else {
>>>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if (!ret) {
>>>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>>>> +        unsigned long nbits = size / block_size;
>>>>>>>> +
>>>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>>>> +
>>>>>>>> +        if (shared_to_private) {
>>>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>> +        } else {
>>>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return ret;
>>>>>>>> +}
>>>>>>>>
>>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>>>>>>>
>>>>>>>> Why?
>>>>>>>>
>>>>>>>> We just checked that it's all in the expected state, no?
>>>>>>>>
>>>>>>>>
>>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>>>> have to do it here?
>>>>>>>
>>>>>>> I was concerned about the case where the guest issues a request where
>>>>>>> only part of the range is in the desired state.
>>>>>>> I think the main problem is the policy for the guest conversion request.
>>>>>>> My current handling is:
>>>>>>>
>>>>>>> 1. When a conversion request is made for a range already in the desired
>>>>>>> state, the helper simply returns success.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> 2. For requests involving a range partially in the desired state, only
>>>>>>> the necessary segments are converted, so that the entire range ends up
>>>>>>> in the requested state.
>>>>>>
>>>>>>
>>>>>> Ah, now I get:
>>>>>>
>>>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +
>>>>>>
>>>>>> We're not failing if it might already partially be in the other state.
>>>>>>
>>>>>>> 3. In scenarios where a conversion request is declined by other systems,
>>>>>>> such as a failure from VFIO during notify_populate(), the helper will
>>>>>>> roll back the request, maintaining consistency.
>>>>>>>
>>>>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>>>>> blocks are in the opposite state.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> Actually, this part is still uncertain to me.
>>>>>>>
>>>>>>
>>>>>> IIUC, the problem does not exist if we only convert a single page at a
>>>>>> time.
>>>>>>
>>>>>> Is there a known use case where such partial conversions could happen?
>>>>>
>>>>> I don't see such a case yet. Actually, I'm trying to follow the behavior
>>>>> of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
>>>>> doesn't reject the request if the whole range isn't in the opposite
>>>>> state. It just uses xa_store() to update it. Also, I don't see the spec
>>>>> saying how to handle such a case. To be robust, I just allow this special
>>>>> case.
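
[ For reference, a minimal userspace sketch of that ioctl, based on the Linux
  UAPI definition (requires a kernel that provides KVM_SET_MEMORY_ATTRIBUTES,
  i.e. 6.8+); vm_fd setup and error handling are omitted, and the surrounding
  conversion policy is exactly what is being discussed here. ]

#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Flip a guest physical range to private or shared. KVM itself does not
 * require the whole range to currently be in the opposite state; it simply
 * records the new attributes for the range.
 */
static int set_range_attributes(int vm_fd, uint64_t gpa, uint64_t size,
                                bool private)
{
    struct kvm_memory_attributes attrs = {
        .address = gpa,
        .size = size,
        .attributes = private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
    };

    return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}
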
>>>>>
>>>>>>
>>>>>>> BTW, regarding the status/bitmap tracking, virtio-mem also changes the
>>>>>>> bitmap after the plug/unplug notifier. This is the same, correct?
>>>>>> Right. But because we reject these partial requests, we don't have to
>>>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>>>
>>>>> Yes, if we treat it as a guest error/bug, we can adjust it.
>>>>
>>>> Hi David, which option do you think is better? If you prefer to reject the
>>>> partial requests, I'll change it in my next version.
>>>
>>> Hi,
>>>
>>> still scratching my head. Having to work around it as in this patch here is
>>> suboptimal.
>>>
>>> Could we simplify the whole thing while still allowing for (unexpected)
>>> partial conversions?
>>>
>>> Essentially: if states are mixed, fall back to "1 block at a time"
>>> handling.
>>>
>>> The only problem is: what to do if we fail halfway through? Well, we can
>>> only have such partial completions for "populate", not for discard.
>>>
>>> Option a) Just leave it as "partially completed populate" and return the
>>> error. The bitmap and the notifiers are consistent.
>>>
>>> Option b) Just discard everything: someone tried to convert something
>>> "partially shared" to "shared". So maybe, if anything goes wrong, we can
>>> just have "all private".
>>>
>>> The question is also what the expectation from the caller is: can the
>>> caller even make progress on failure, or do we have to retry until it works?
>>
>> Yes, that's the key problem.
>>
>> For the core mm side conversion, the caller (guest) handles three cases:
>> success, failure, and retry. The guest can continue on failure but will keep
>> the memory in its original attribute and trigger a call trace. On the QEMU
>> side, a kvm_set_memory_attributes() failure would cause the VM to stop.
>>
>> As for the VFIO conversion, at present, we allow it to fail and don't
>> return an error code to the guest as long as we undo the conversion. It only
>> causes the device not to work in the guest.
>>
>> I think if we view the attribute mismatch between core mm and the IOMMU as a
>> fatal error, we can stop the VM or let the guest retry until it converts
>> successfully.
>>
>
> Thinking a bit more about the options for handling the failure case
> theoretically, as we haven't hit such a state_change() failure yet:
>
> 1. Undo + return invalid error
> Pros: The guest can make progress.
> Cons: Complicated undo operations. Option a) is not applicable, because
> it leaves things as a partially completed populate, but the guest thinks the
> operation has failed.
> We also need to add an undo for set_memory_attribute() after
> state_change() fails. Maybe we also need to apply the attribute bitmap to the
> set_memory_attribute() operation to handle the mixed request case.
>
> 2. Undo in VFIO and no undo for set_memory_attribute() + return success
> (Current approach in my series)
> Pros: The guest can make progress although the device doesn't work.
> Cons: The attribute bitmap only tracks the status in the IOMMU.
Right, we should avoid that. Bitmap + notifiers should stay in sync.
>
> 3. No undo + return retry
> Pros: It keeps the attribute bitmap aligned between core mm and the IOMMU.
> Cons: The guest doesn't know how to handle the retry. It could cause an
> infinite loop.
>
> 4. No undo + no return. Just stop the VM.
> Pros: Simple.
> Cons: Maybe overkill.
>
> Maybe option 1 or 4 is better?
Well, we can get proper undo working by using a temporary bitmap when
converting to shared and running into this "mixed" scenario.
Something like this on top of your work:
From f36e3916596ed5952f15233eb7747c65a6424949 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 25 Feb 2025 09:55:38 +0100
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand <david@redhat.com>
---
system/memory-attribute-manager.c | 95 +++++++++++++++++++++----------
1 file changed, 65 insertions(+), 30 deletions(-)
diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 17c70cf677..e98e7367c1 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
-
- memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
- memory_attribute_notify_discard_cb);
+ rdl->notify_discard(rdl, &tmp);
}
}
@@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
-
- ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
- memory_attribute_notify_populate_cb);
+ ret = rdl->notify_populate(rdl, &tmp);
if (ret) {
break;
}
@@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
-
- memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
- memory_attribute_notify_discard_cb);
+ rdl2->notify_discard(rdl2, &tmp);
}
}
return ret;
@@ -348,7 +342,12 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
uint64_t size, bool shared_to_private)
{
- int block_size = memory_attribute_manager_get_block_size(mgr);
+ const int block_size = memory_attribute_manager_get_block_size(mgr);
+ const unsigned long first_bit = offset / block_size;
+ const unsigned long nbits = size / block_size;
+ const uint64_t end = offset + size;
+ unsigned long bit;
+ uint64_t cur;
int ret = 0;
if (!memory_attribute_is_valid_range(mgr, offset, size)) {
@@ -357,32 +356,68 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o
return -1;
}
- if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
- (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
- return 0;
- }
-
if (shared_to_private) {
- memory_attribute_notify_discard(mgr, offset, size);
- } else {
- ret = memory_attribute_notify_populate(mgr, offset, size);
- }
-
- if (!ret) {
- unsigned long first_bit = offset / block_size;
- unsigned long nbits = size / block_size;
-
- g_assert((first_bit + nbits) <= mgr->bitmap_size);
-
- if (shared_to_private) {
+ if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+ /* Already private. */
+ } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+ /* Unexpected mixture: process individual blocks. */
+ for (cur = offset; cur < end; cur += block_size) {
+ bit = cur / block_size;
+ if (!test_bit(bit, mgr->shared_bitmap))
+ continue;
+ clear_bit(bit, mgr->shared_bitmap);
+ memory_attribute_notify_discard(mgr, cur, block_size);
+ }
+ } else {
+ /* Completely shared. */
bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+ memory_attribute_notify_discard(mgr, offset, size);
+ }
+ } else {
+ if (memory_attribute_is_range_populated(mgr, offset, size)) {
+ /* Already shared. */
+ } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+ /* Unexpected mixture: process individual blocks. */
+ unsigned long *modified_bitmap = bitmap_new(nbits);
+
+ for (cur = offset; cur < end; cur += block_size) {
+ bit = cur / block_size;
+ if (test_bit(bit, mgr->shared_bitmap))
+ continue;
+ set_bit(bit, mgr->shared_bitmap);
+ ret = memory_attribute_notify_populate(mgr, cur, block_size);
+ if (!ret) {
+ set_bit(bit - first_bit, modified_bitmap);
+ continue;
+ }
+ clear_bit(bit, mgr->shared_bitmap);
+ break;
+ }
+ if (ret) {
+ /*
+ * Very unexpected: something went wrong. Revert to the old
+ * state, marking only the blocks as private that we converted
+ * to shared.
+ */
+ for (cur = offset; cur < end; cur += block_size) {
+ bit = cur / block_size;
+ if (!test_bit(bit - first_bit, modified_bitmap))
+ continue;
+ assert(test_bit(bit, mgr->shared_bitmap));
+ clear_bit(bit, mgr->shared_bitmap);
+ memory_attribute_notify_discard(mgr, cur, block_size);
+ }
+ }
+ g_free(modified_bitmap);
} else {
+ /* Completely private. */
bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+ ret = memory_attribute_notify_populate(mgr, offset, size);
+ if (ret) {
+ bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+ }
}
-
- return 0;
}
-
return ret;
}
--
2.48.1
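
[ For completeness, roughly how I'd expect a conversion request to drive the
  helper; this is only a hedged sketch -- convert_range() and the
  kvm_set_memory_attributes() wrapper below are placeholder names for
  illustration, not names taken from this series or from current QEMU. ]

/* Hypothetical caller sketch -- not part of the patch above. */
static int convert_range(MemoryAttributeManager *mgr, uint64_t offset,
                         uint64_t size, bool shared_to_private)
{
    int ret;

    /* Tell KVM about the new attributes first (placeholder wrapper). */
    ret = kvm_set_memory_attributes(offset, size, shared_to_private);
    if (ret) {
        return ret;
    }

    /* Then update the shared bitmap and fire the RamDiscard notifiers. */
    ret = memory_attribute_state_change(mgr, offset, size, shared_to_private);
    if (ret) {
        /* Keep core mm and the IOMMU consistent: undo the attribute change. */
        kvm_set_memory_attributes(offset, size, !shared_to_private);
    }
    return ret;
}
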
--
Cheers,
David / dhildenb
Thread overview: 13+ messages
2025-02-20 16:13 [RFC 0/2] arm: Add DMA remapping for CCA Jean-Philippe Brucker
2025-02-20 16:13 ` [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers Jean-Philippe Brucker
2025-02-20 19:39 ` David Hildenbrand
2025-02-21 2:25 ` Chenyi Qiang
2025-02-21 8:09 ` David Hildenbrand
2025-02-21 10:04 ` Chenyi Qiang
2025-02-25 2:00 ` Chenyi Qiang
2025-02-25 9:41 ` David Hildenbrand
2025-02-26 12:43 ` Chenyi Qiang
2025-02-27 3:26 ` Chenyi Qiang
2025-02-27 11:27 ` David Hildenbrand [this message]
2025-02-28 5:39 ` Chenyi Qiang
2025-02-20 16:13 ` [RFC 2/2] target/arm/kvm-rme: Add DMA remapping for the shared memory region Jean-Philippe Brucker