[RFC 0/2] arm: Add DMA remapping for CCA

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/2] arm: Add DMA remapping for CCA
@ 2025-02-20 16:13 Jean-Philippe Brucker
  2025-02-20 16:13 ` [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers Jean-Philippe Brucker
  2025-02-20 16:13 ` [RFC 2/2] target/arm/kvm-rme: Add DMA remapping for the shared memory region Jean-Philippe Brucker
  0 siblings, 2 replies; 13+ messages in thread
From: Jean-Philippe Brucker @ 2025-02-20 16:13 UTC (permalink / raw)
  To: philmd, david, peterx, pbonzini, peter.maydell
  Cc: qemu-arm, qemu-devel, Jean-Philippe Brucker

These two patches will be included in the series that adds support for
Arm CCA guests to QEMU, which isn't ready to be merged [1], but I'm sending
them as RFC first to seek advice about the best way to implement this.

There is a breaking change to CCA guests, where DMA addresses now have
the "shared" top bit set. The VMM needs to strip the address before
accessing memory. See more details on patch 2 of this RFC and on the
Linux change:

https://lore.kernel.org/all/20250219220751.1276854-1-suzuki.poulose@arm.com/

Patch 2 inserts an IOMMUMemoryRegion on the DMA path, so all DMA
accesses get the top bit stripped. It also adds RAM discard listeners
(guest_memfd -> IOMMU notifiers) so that we can create VFIO mappings in
the top half of the guest address space. Patch 1 is a hack to support
GPA->VA translation during the discard_populate notification.

I'm testing VFIO support using Chenyi Qiang's series for shared device
assignment [2]

[1] https://lore.kernel.org/qemu-devel/20241125195626.856992-2-jean-philippe@linaro.org/
[2] https://lore.kernel.org/qemu-devel/20250217081833.21568-1-chenyi.qiang@intel.com/

Jean-Philippe Brucker (2):
  system/memory: Allow creating IOMMU mappings from RAM discard populate
    notifiers
  target/arm/kvm-rme: Add DMA remapping for the shared memory region

 include/exec/memory.h |   5 +
 target/arm/kvm_arm.h  |  15 +++
 hw/arm/virt.c         |   2 +
 system/memory.c       |   3 +-
 target/arm/kvm-rme.c  | 220 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 244 insertions(+), 1 deletion(-)

-- 
2.48.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-20 16:13 [RFC 0/2] arm: Add DMA remapping for CCA Jean-Philippe Brucker
@ 2025-02-20 16:13 ` Jean-Philippe Brucker
  2025-02-20 19:39   ` David Hildenbrand
  2025-02-20 16:13 ` [RFC 2/2] target/arm/kvm-rme: Add DMA remapping for the shared memory region Jean-Philippe Brucker
  1 sibling, 1 reply; 13+ messages in thread
From: Jean-Philippe Brucker @ 2025-02-20 16:13 UTC (permalink / raw)
  To: philmd, david, peterx, pbonzini, peter.maydell
  Cc: qemu-arm, qemu-devel, Jean-Philippe Brucker

For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
notifiers and create e.g. VFIO mappings. The default VFIO discard
notifier isn't sufficient for CCA because the DMA addresses need a
translation (even without vIOMMU).

At the moment:
* guest_memfd_state_change() calls the populate() notifier
* the populate notifier() calls IOMMU notifiers
* the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
* it calls ram_discard_manager_is_populated() which fails.

guest_memfd_state_change() only changes the section's state after
calling the populate() notifier. We can't easily invert the order of
operation because it uses the old state bitmap to know which pages need
the populate() notifier.

For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr()
that we're aware of the RAM discard manager state.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---

Definitely not the prettiest hack, any idea how to do this cleanly?
---
 include/exec/memory.h | 5 +++++
 system/memory.c       | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9f73b59867..6fcd98fe58 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -116,6 +116,11 @@ typedef enum {
     IOMMU_RO   = 1,
     IOMMU_WO   = 2,
     IOMMU_RW   = 3,
+    /*
+     * Allow mapping a discarded page, because we're in the process of
+     * populating it.
+     */
+    IOMMU_POPULATING = 4,
 } IOMMUAccessFlags;
 
 #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
diff --git a/system/memory.c b/system/memory.c
index 4c829793a0..8e884f9c15 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
          * Disallow that. vmstate priorities make sure any RamDiscardManager
          * were already restored before IOMMUs are restored.
          */
-        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
+        if (!(iotlb->perm & IOMMU_POPULATING) &&
+            !ram_discard_manager_is_populated(rdm, &tmp)) {
             error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
                          " via virtio-mem): %" HWADDR_PRIx "",
                          iotlb->translated_addr);
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-20 16:13 ` [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers Jean-Philippe Brucker
@ 2025-02-20 19:39   ` David Hildenbrand
  2025-02-21  2:25     ` Chenyi Qiang
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-02-20 19:39 UTC (permalink / raw)
  To: Jean-Philippe Brucker, philmd, peterx, pbonzini, peter.maydell,
	Chenyi Qiang
  Cc: qemu-arm, qemu-devel

On 20.02.25 17:13, Jean-Philippe Brucker wrote:
> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
> notifiers and create e.g. VFIO mappings. The default VFIO discard
> notifier isn't sufficient for CCA because the DMA addresses need a
> translation (even without vIOMMU).
> 
> At the moment:
> * guest_memfd_state_change() calls the populate() notifier
> * the populate notifier() calls IOMMU notifiers
> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
> * it calls ram_discard_manager_is_populated() which fails.
> 
> guest_memfd_state_change() only changes the section's state after
> calling the populate() notifier. We can't easily invert the order of
> operation because it uses the old state bitmap to know which pages need
> the populate() notifier.

I assume we talk about this code: [1]

[1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com


+static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+                                         uint64_t size, bool shared_to_private)
+{
+    int block_size = memory_attribute_manager_get_block_size(mgr);
+    int ret = 0;
+
+    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -1;
+    }
+
+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+        return 0;
+    }
+
+    if (shared_to_private) {
+        memory_attribute_notify_discard(mgr, offset, size);
+    } else {
+        ret = memory_attribute_notify_populate(mgr, offset, size);
+    }
+
+    if (!ret) {
+        unsigned long first_bit = offset / block_size;
+        unsigned long nbits = size / block_size;
+
+        g_assert((first_bit + nbits) <= mgr->bitmap_size);
+
+        if (shared_to_private) {
+            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+        } else {
+            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+        }
+
+        return 0;
+    }
+
+    return ret;
+}

Then, in memory_attribute_notify_populate(), we walk the bitmap again.

Why?

We just checked that it's all in the expected state, no?


virtio-mem doesn't handle it that way, so I'm curious why we would have to do it here?


> 
> For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr()
> that we're aware of the RAM discard manager state.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> 
> Definitely not the prettiest hack, any idea how to do this cleanly?
> ---
>   include/exec/memory.h | 5 +++++
>   system/memory.c       | 3 ++-
>   2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 9f73b59867..6fcd98fe58 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -116,6 +116,11 @@ typedef enum {
>       IOMMU_RO   = 1,
>       IOMMU_WO   = 2,
>       IOMMU_RW   = 3,
> +    /*
> +     * Allow mapping a discarded page, because we're in the process of
> +     * populating it.
> +     */
> +    IOMMU_POPULATING = 4,
>   } IOMMUAccessFlags;
>   
>   #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
> diff --git a/system/memory.c b/system/memory.c
> index 4c829793a0..8e884f9c15 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>            * Disallow that. vmstate priorities make sure any RamDiscardManager
>            * were already restored before IOMMUs are restored.
>            */
> -        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
> +        if (!(iotlb->perm & IOMMU_POPULATING) &&
> +            !ram_discard_manager_is_populated(rdm, &tmp)) {
>               error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
>                            " via virtio-mem): %" HWADDR_PRIx "",
>                            iotlb->translated_addr);


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-20 19:39   ` David Hildenbrand
@ 2025-02-21  2:25     ` Chenyi Qiang
  2025-02-21  8:09       ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Chenyi Qiang @ 2025-02-21  2:25 UTC (permalink / raw)
  To: David Hildenbrand, Jean-Philippe Brucker, philmd, peterx,
	pbonzini, peter.maydell
  Cc: qemu-arm, qemu-devel



On 2/21/2025 3:39 AM, David Hildenbrand wrote:
> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>> notifier isn't sufficient for CCA because the DMA addresses need a
>> translation (even without vIOMMU).
>>
>> At the moment:
>> * guest_memfd_state_change() calls the populate() notifier
>> * the populate notifier() calls IOMMU notifiers
>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>> * it calls ram_discard_manager_is_populated() which fails.
>>
>> guest_memfd_state_change() only changes the section's state after
>> calling the populate() notifier. We can't easily invert the order of
>> operation because it uses the old state bitmap to know which pages need
>> the populate() notifier.
> 
> I assume we talk about this code: [1]
> 
> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com
> 
> 
> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
> uint64_t offset,
> +                                         uint64_t size, bool
> shared_to_private)
> +{
> +    int block_size = memory_attribute_manager_get_block_size(mgr);
> +    int ret = 0;
> +
> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
> +                     __func__, offset, size);
> +        return -1;
> +    }
> +
> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
> offset, size)) ||
> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
> offset, size))) {
> +        return 0;
> +    }
> +
> +    if (shared_to_private) {
> +        memory_attribute_notify_discard(mgr, offset, size);
> +    } else {
> +        ret = memory_attribute_notify_populate(mgr, offset, size);
> +    }
> +
> +    if (!ret) {
> +        unsigned long first_bit = offset / block_size;
> +        unsigned long nbits = size / block_size;
> +
> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
> +
> +        if (shared_to_private) {
> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
> +        } else {
> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
> +        }
> +
> +        return 0;
> +    }
> +
> +    return ret;
> +}
> 
> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
> 
> Why?
> 
> We just checked that it's all in the expected state, no?
> 
> 
> virtio-mem doesn't handle it that way, so I'm curious why we would have
> to do it here?

I was concerned about the case where the guest issues a request that
only partial of the range is in the desired state.
I think the main problem is the policy for the guest conversion request.
My current handling is:

1. When a conversion request is made for a range already in the desired
  state, the helper simply returns success.
2. For requests involving a range partially in the desired state, only
  the necessary segments are converted, ensuring the entire range
  complies with the request efficiently.
3. In scenarios where a conversion request is declined by other systems,
  such as a failure from VFIO during notify_populate(), the helper will
  roll back the request, maintaining consistency.

And the policy of virtio-mem is to refuse the state change if not all
blocks are in the opposite state.

Actually, this part is still a uncertain to me.

BTW, per the status/bitmap track, the virtio-mem also changes the bitmap
after the plug/unplug notifier. This is the same, correct?

> 
> 
>>
>> For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr()
>> that we're aware of the RAM discard manager state.
>>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>> ---
>>
>> Definitely not the prettiest hack, any idea how to do this cleanly?
>> ---
>>   include/exec/memory.h | 5 +++++
>>   system/memory.c       | 3 ++-
>>   2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 9f73b59867..6fcd98fe58 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -116,6 +116,11 @@ typedef enum {
>>       IOMMU_RO   = 1,
>>       IOMMU_WO   = 2,
>>       IOMMU_RW   = 3,
>> +    /*
>> +     * Allow mapping a discarded page, because we're in the process of
>> +     * populating it.
>> +     */
>> +    IOMMU_POPULATING = 4,
>>   } IOMMUAccessFlags;
>>     #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ?
>> IOMMU_WO : 0))
>> diff --git a/system/memory.c b/system/memory.c
>> index 4c829793a0..8e884f9c15 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb,
>> void **vaddr,
>>            * Disallow that. vmstate priorities make sure any
>> RamDiscardManager
>>            * were already restored before IOMMUs are restored.
>>            */
>> -        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
>> +        if (!(iotlb->perm & IOMMU_POPULATING) &&
>> +            !ram_discard_manager_is_populated(rdm, &tmp)) {
>>               error_setg(errp, "iommu map to discarded memory (e.g.,
>> unplugged"
>>                            " via virtio-mem): %" HWADDR_PRIx "",
>>                            iotlb->translated_addr);
> 
> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-21  2:25     ` Chenyi Qiang
@ 2025-02-21  8:09       ` David Hildenbrand
  2025-02-21 10:04         ` Chenyi Qiang
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-02-21  8:09 UTC (permalink / raw)
  To: Chenyi Qiang, Jean-Philippe Brucker, philmd, peterx, pbonzini,
	peter.maydell
  Cc: qemu-arm, qemu-devel

On 21.02.25 03:25, Chenyi Qiang wrote:
> 
> 
> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>> translation (even without vIOMMU).
>>>
>>> At the moment:
>>> * guest_memfd_state_change() calls the populate() notifier
>>> * the populate notifier() calls IOMMU notifiers
>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>> * it calls ram_discard_manager_is_populated() which fails.
>>>
>>> guest_memfd_state_change() only changes the section's state after
>>> calling the populate() notifier. We can't easily invert the order of
>>> operation because it uses the old state bitmap to know which pages need
>>> the populate() notifier.
>>
>> I assume we talk about this code: [1]
>>
>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com
>>
>>
>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>> uint64_t offset,
>> +                                         uint64_t size, bool
>> shared_to_private)
>> +{
>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>> +    int ret = 0;
>> +
>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -1;
>> +    }
>> +
>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>> offset, size)) ||
>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
>> offset, size))) {
>> +        return 0;
>> +    }
>> +
>> +    if (shared_to_private) {
>> +        memory_attribute_notify_discard(mgr, offset, size);
>> +    } else {
>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>> +    }
>> +
>> +    if (!ret) {
>> +        unsigned long first_bit = offset / block_size;
>> +        unsigned long nbits = size / block_size;
>> +
>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>> +
>> +        if (shared_to_private) {
>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>> +        } else {
>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>> +        }
>> +
>> +        return 0;
>> +    }
>> +
>> +    return ret;
>> +}
>>
>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>
>> Why?
>>
>> We just checked that it's all in the expected state, no?
>>
>>
>> virtio-mem doesn't handle it that way, so I'm curious why we would have
>> to do it here?
> 
> I was concerned about the case where the guest issues a request that
> only partial of the range is in the desired state.
> I think the main problem is the policy for the guest conversion request.
> My current handling is:
> 
> 1. When a conversion request is made for a range already in the desired
>    state, the helper simply returns success.

Yes.

> 2. For requests involving a range partially in the desired state, only
>    the necessary segments are converted, ensuring the entire range
>    complies with the request efficiently.


Ah, now I get:

+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr,
offset, size))) {
+        return 0;
+    }
+

We're not failing if it might already partially be in the other state.

> 3. In scenarios where a conversion request is declined by other systems,
>    such as a failure from VFIO during notify_populate(), the helper will
>    roll back the request, maintaining consistency.
> 
> And the policy of virtio-mem is to refuse the state change if not all
> blocks are in the opposite state.

Yes.

> 
> Actually, this part is still a uncertain to me.
> 

IIUC, the problem does not exist if we only convert a single page at a time.

Is there a known use case where such partial conversions could happen?

> BTW, per the status/bitmap track, the virtio-mem also changes the bitmap
> after the plug/unplug notifier. This is the same, correct?
Right. But because we reject these partial requests, we don't have to 
traverse the bitmap and could just adjust the bitmap operations.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-21  8:09       ` David Hildenbrand
@ 2025-02-21 10:04         ` Chenyi Qiang
  2025-02-25  2:00           ` Chenyi Qiang
  0 siblings, 1 reply; 13+ messages in thread
From: Chenyi Qiang @ 2025-02-21 10:04 UTC (permalink / raw)
  To: David Hildenbrand, Jean-Philippe Brucker, philmd, peterx,
	pbonzini, peter.maydell
  Cc: qemu-arm, qemu-devel



On 2/21/2025 4:09 PM, David Hildenbrand wrote:
> On 21.02.25 03:25, Chenyi Qiang wrote:
>>
>>
>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>> IOMMU
>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>> translation (even without vIOMMU).
>>>>
>>>> At the moment:
>>>> * guest_memfd_state_change() calls the populate() notifier
>>>> * the populate notifier() calls IOMMU notifiers
>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>
>>>> guest_memfd_state_change() only changes the section's state after
>>>> calling the populate() notifier. We can't easily invert the order of
>>>> operation because it uses the old state bitmap to know which pages need
>>>> the populate() notifier.
>>>
>>> I assume we talk about this code: [1]
>>>
>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>> chenyi.qiang@intel.com
>>>
>>>
>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>>> uint64_t offset,
>>> +                                         uint64_t size, bool
>>> shared_to_private)
>>> +{
>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>> +    int ret = 0;
>>> +
>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>> +                     __func__, offset, size);
>>> +        return -1;
>>> +    }
>>> +
>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>> offset, size)) ||
>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
>>> offset, size))) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (shared_to_private) {
>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>> +    } else {
>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>> +    }
>>> +
>>> +    if (!ret) {
>>> +        unsigned long first_bit = offset / block_size;
>>> +        unsigned long nbits = size / block_size;
>>> +
>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>> +
>>> +        if (shared_to_private) {
>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>> +        } else {
>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>> +        }
>>> +
>>> +        return 0;
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>>
>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>>
>>> Why?
>>>
>>> We just checked that it's all in the expected state, no?
>>>
>>>
>>> virtio-mem doesn't handle it that way, so I'm curious why we would have
>>> to do it here?
>>
>> I was concerned about the case where the guest issues a request that
>> only partial of the range is in the desired state.
>> I think the main problem is the policy for the guest conversion request.
>> My current handling is:
>>
>> 1. When a conversion request is made for a range already in the desired
>>    state, the helper simply returns success.
> 
> Yes.
> 
>> 2. For requests involving a range partially in the desired state, only
>>    the necessary segments are converted, ensuring the entire range
>>    complies with the request efficiently.
> 
> 
> Ah, now I get:
> 
> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
> offset, size)) ||
> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
> offset, size))) {
> +        return 0;
> +    }
> +
> 
> We're not failing if it might already partially be in the other state.
> 
>> 3. In scenarios where a conversion request is declined by other systems,
>>    such as a failure from VFIO during notify_populate(), the helper will
>>    roll back the request, maintaining consistency.
>>
>> And the policy of virtio-mem is to refuse the state change if not all
>> blocks are in the opposite state.
> 
> Yes.
> 
>>
>> Actually, this part is still a uncertain to me.
>>
> 
> IIUC, the problem does not exist if we only convert a single page at a
> time.
> 
> Is there a known use case where such partial conversions could happen?

I don't see such case yet. Actually, I'm trying to follow the behavior
of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
doesn't reject the request if the whole range isn't in the opposite
state. It just uses xa_store() to update it. Also, I don't see the spec
says how to handle such case. To be robust, I just allow this special case.

> 
>> BTW, per the status/bitmap track, the virtio-mem also changes the bitmap
>> after the plug/unplug notifier. This is the same, correct?
> Right. But because we reject these partial requests, we don't have to
> traverse the bitmap and could just adjust the bitmap operations.

Yes, If we treat it as a guest error/bug, we can adjust it.

> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-21 10:04         ` Chenyi Qiang
@ 2025-02-25  2:00           ` Chenyi Qiang
  2025-02-25  9:41             ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Chenyi Qiang @ 2025-02-25  2:00 UTC (permalink / raw)
  To: David Hildenbrand, Jean-Philippe Brucker, philmd, peterx,
	pbonzini, peter.maydell, Alexey Kardashevskiy
  Cc: qemu-arm, qemu-devel



On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
> 
> 
> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>> IOMMU
>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>> translation (even without vIOMMU).
>>>>>
>>>>> At the moment:
>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>> * the populate notifier() calls IOMMU notifiers
>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>
>>>>> guest_memfd_state_change() only changes the section's state after
>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>> operation because it uses the old state bitmap to know which pages need
>>>>> the populate() notifier.
>>>>
>>>> I assume we talk about this code: [1]
>>>>
>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>>> chenyi.qiang@intel.com
>>>>
>>>>
>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>>>> uint64_t offset,
>>>> +                                         uint64_t size, bool
>>>> shared_to_private)
>>>> +{
>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>> +    int ret = 0;
>>>> +
>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>> +                     __func__, offset, size);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>>> offset, size)) ||
>>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
>>>> offset, size))) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    if (shared_to_private) {
>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>> +    } else {
>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>> +    }
>>>> +
>>>> +    if (!ret) {
>>>> +        unsigned long first_bit = offset / block_size;
>>>> +        unsigned long nbits = size / block_size;
>>>> +
>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>> +
>>>> +        if (shared_to_private) {
>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>> +        } else {
>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>> +        }
>>>> +
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>>
>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>>>
>>>> Why?
>>>>
>>>> We just checked that it's all in the expected state, no?
>>>>
>>>>
>>>> virtio-mem doesn't handle it that way, so I'm curious why we would have
>>>> to do it here?
>>>
>>> I was concerned about the case where the guest issues a request that
>>> only partial of the range is in the desired state.
>>> I think the main problem is the policy for the guest conversion request.
>>> My current handling is:
>>>
>>> 1. When a conversion request is made for a range already in the desired
>>>    state, the helper simply returns success.
>>
>> Yes.
>>
>>> 2. For requests involving a range partially in the desired state, only
>>>    the necessary segments are converted, ensuring the entire range
>>>    complies with the request efficiently.
>>
>>
>> Ah, now I get:
>>
>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>> offset, size)) ||
>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
>> offset, size))) {
>> +        return 0;
>> +    }
>> +
>>
>> We're not failing if it might already partially be in the other state.
>>
>>> 3. In scenarios where a conversion request is declined by other systems,
>>>    such as a failure from VFIO during notify_populate(), the helper will
>>>    roll back the request, maintaining consistency.
>>>
>>> And the policy of virtio-mem is to refuse the state change if not all
>>> blocks are in the opposite state.
>>
>> Yes.
>>
>>>
>>> Actually, this part is still a uncertain to me.
>>>
>>
>> IIUC, the problem does not exist if we only convert a single page at a
>> time.
>>
>> Is there a known use case where such partial conversions could happen?
> 
> I don't see such case yet. Actually, I'm trying to follow the behavior
> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
> doesn't reject the request if the whole range isn't in the opposite
> state. It just uses xa_store() to update it. Also, I don't see the spec
> says how to handle such case. To be robust, I just allow this special case.
> 
>>
>>> BTW, per the status/bitmap track, the virtio-mem also changes the bitmap
>>> after the plug/unplug notifier. This is the same, correct?
>> Right. But because we reject these partial requests, we don't have to
>> traverse the bitmap and could just adjust the bitmap operations.
> 
> Yes, If we treat it as a guest error/bug, we can adjust it.

Hi David, do you think which option is better? If prefer to reject the
partial requests, I'll change it in my next version.

> 
>>
> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-25  2:00           ` Chenyi Qiang
@ 2025-02-25  9:41             ` David Hildenbrand
  2025-02-26 12:43               ` Chenyi Qiang
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-02-25  9:41 UTC (permalink / raw)
  To: Chenyi Qiang, Jean-Philippe Brucker, philmd, peterx, pbonzini,
	peter.maydell, Alexey Kardashevskiy
  Cc: qemu-arm, qemu-devel

On 25.02.25 03:00, Chenyi Qiang wrote:
> 
> 
> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>
>>
>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>> IOMMU
>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>> translation (even without vIOMMU).
>>>>>>
>>>>>> At the moment:
>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>
>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>> operation because it uses the old state bitmap to know which pages need
>>>>>> the populate() notifier.
>>>>>
>>>>> I assume we talk about this code: [1]
>>>>>
>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>>>> chenyi.qiang@intel.com
>>>>>
>>>>>
>>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>>>>> uint64_t offset,
>>>>> +                                         uint64_t size, bool
>>>>> shared_to_private)
>>>>> +{
>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>> +                     __func__, offset, size);
>>>>> +        return -1;
>>>>> +    }
>>>>> +
>>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>>>> offset, size)) ||
>>>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
>>>>> offset, size))) {
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    if (shared_to_private) {
>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>> +    } else {
>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>> +    }
>>>>> +
>>>>> +    if (!ret) {
>>>>> +        unsigned long first_bit = offset / block_size;
>>>>> +        unsigned long nbits = size / block_size;
>>>>> +
>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>> +
>>>>> +        if (shared_to_private) {
>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>> +        } else {
>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>> +        }
>>>>> +
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>>
>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>>>>
>>>>> Why?
>>>>>
>>>>> We just checked that it's all in the expected state, no?
>>>>>
>>>>>
>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would have
>>>>> to do it here?
>>>>
>>>> I was concerned about the case where the guest issues a request that
>>>> only partial of the range is in the desired state.
>>>> I think the main problem is the policy for the guest conversion request.
>>>> My current handling is:
>>>>
>>>> 1. When a conversion request is made for a range already in the desired
>>>>     state, the helper simply returns success.
>>>
>>> Yes.
>>>
>>>> 2. For requests involving a range partially in the desired state, only
>>>>     the necessary segments are converted, ensuring the entire range
>>>>     complies with the request efficiently.
>>>
>>>
>>> Ah, now I get:
>>>
>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>> offset, size)) ||
>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr,
>>> offset, size))) {
>>> +        return 0;
>>> +    }
>>> +
>>>
>>> We're not failing if it might already partially be in the other state.
>>>
>>>> 3. In scenarios where a conversion request is declined by other systems,
>>>>     such as a failure from VFIO during notify_populate(), the helper will
>>>>     roll back the request, maintaining consistency.
>>>>
>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>> blocks are in the opposite state.
>>>
>>> Yes.
>>>
>>>>
>>>> Actually, this part is still a uncertain to me.
>>>>
>>>
>>> IIUC, the problem does not exist if we only convert a single page at a
>>> time.
>>>
>>> Is there a known use case where such partial conversions could happen?
>>
>> I don't see such case yet. Actually, I'm trying to follow the behavior
>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
>> doesn't reject the request if the whole range isn't in the opposite
>> state. It just uses xa_store() to update it. Also, I don't see the spec
>> says how to handle such case. To be robust, I just allow this special case.
>>
>>>
>>>> BTW, per the status/bitmap track, the virtio-mem also changes the bitmap
>>>> after the plug/unplug notifier. This is the same, correct?
>>> Right. But because we reject these partial requests, we don't have to
>>> traverse the bitmap and could just adjust the bitmap operations.
>>
>> Yes, If we treat it as a guest error/bug, we can adjust it.
> 
> Hi David, do you think which option is better? If prefer to reject the
> partial requests, I'll change it in my next version.

Hi,

still scratching my head. Having to work around it as in this patch here is
suboptimal.

Could we simplify the whole thing while still allowing for (unexpected) partial
conversions?

Essentially: If states are mixed, fallback to a "1 block at a time" handling.

The only problem is: what to do if we fail halfway through? Well, we can only have
such partial completions for "populate", not for discard.

Option a) Just leave it as "partially completed populate" and return the error. The
bitmap and the notifiers are consistent.

Option b) Just discard everything: someone tried to convert something "partial
shared" to "shared". So maybe, if anything goes wrong, we can just have "all private".

The question is also, what the expectation from the caller is: can the caller
even make progress on failure or do we have to retry until it works?

Both options would be compatible with "the caller must retry either way until it works".

a) is probably easiest. Something like the following on top of your code:



 From 40c001a57c3c1350f3a44288ccb013d903d300cf Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 25 Feb 2025 09:55:38 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  system/memory-attribute-manager.c | 66 +++++++++++++++++--------------
  1 file changed, 36 insertions(+), 30 deletions(-)

diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 17c70cf677..31960e4a54 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
-
-        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
-                                                    memory_attribute_notify_discard_cb);
+        rdl->notify_discard(rdl, &tmp);
      }
  }
  
@@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
-
-        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
-                                                          memory_attribute_notify_populate_cb);
+        ret = rdl->notify_populate(rdl, &tmp);
          if (ret) {
              break;
          }
@@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
              if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                  continue;
              }
-
-            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
-                                                        memory_attribute_notify_discard_cb);
+            rdl2->notify_discard(rdl2, &tmp);
          }
      }
      return ret;
@@ -348,7 +342,10 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
  static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
                                           uint64_t size, bool shared_to_private)
  {
-    int block_size = memory_attribute_manager_get_block_size(mgr);
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    uint64_t cur_offset;
      int ret = 0;
  
      if (!memory_attribute_is_valid_range(mgr, offset, size)) {
@@ -357,32 +354,41 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o
          return -1;
      }
  
-    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
-        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
-        return 0;
-    }
-
      if (shared_to_private) {
-        memory_attribute_notify_discard(mgr, offset, size);
-    } else {
-        ret = memory_attribute_notify_populate(mgr, offset, size);
-    }
-
-    if (!ret) {
-        unsigned long first_bit = offset / block_size;
-        unsigned long nbits = size / block_size;
-
-        g_assert((first_bit + nbits) <= mgr->bitmap_size);
-
-        if (shared_to_private) {
+        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Already private. */
+        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Unexpected mixture: not completely shared. */
+            for (cur_offset = 0; cur_offset < offset; cur_offset += block_size) {
+               memory_attribute_state_change(mgr, cur_offset, block_size,
+                                             true);
+            }
+        } else {
+            /* Completely shared. */
              bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            memory_attribute_notify_discard(mgr, offset, size);
+        }
+    } else {
+        if (memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Already shared. */
+        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Unexpected mixture: not completely private. */
+            for (cur_offset = 0; cur_offset < offset; cur_offset += block_size) {
+               ret = memory_attribute_state_change(mgr, cur_offset, block_size,
+                                                   false);
+               if (ret) {
+                   break;
+               }
+            }
          } else {
+            /* Completely private. */
              bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+            ret = memory_attribute_notify_populate(mgr, offset, size);
+            if (ret) {
+                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            }
          }
-
-        return 0;
      }
-
      return ret;
  }
  
-- 
2.48.1


-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-25  9:41             ` David Hildenbrand
@ 2025-02-26 12:43               ` Chenyi Qiang
  2025-02-27  3:26                 ` Chenyi Qiang
  0 siblings, 1 reply; 13+ messages in thread
From: Chenyi Qiang @ 2025-02-26 12:43 UTC (permalink / raw)
  To: David Hildenbrand, Jean-Philippe Brucker, philmd, peterx,
	pbonzini, peter.maydell, Alexey Kardashevskiy
  Cc: qemu-arm, qemu-devel



On 2/25/2025 5:41 PM, David Hildenbrand wrote:
> On 25.02.25 03:00, Chenyi Qiang wrote:
>>
>>
>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>> IOMMU
>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>>> translation (even without vIOMMU).
>>>>>>>
>>>>>>> At the moment:
>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get
>>>>>>> a VA
>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>
>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>> operation because it uses the old state bitmap to know which
>>>>>>> pages need
>>>>>>> the populate() notifier.
>>>>>>
>>>>>> I assume we talk about this code: [1]
>>>>>>
>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>>>>> chenyi.qiang@intel.com
>>>>>>
>>>>>>
>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager
>>>>>> *mgr,
>>>>>> uint64_t offset,
>>>>>> +                                         uint64_t size, bool
>>>>>> shared_to_private)
>>>>>> +{
>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>> +    int ret = 0;
>>>>>> +
>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>> +                     __func__, offset, size);
>>>>>> +        return -1;
>>>>>> +    }
>>>>>> +
>>>>>> +    if ((shared_to_private &&
>>>>>> memory_attribute_is_range_discarded(mgr,
>>>>>> offset, size)) ||
>>>>>> +        (!shared_to_private &&
>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>> offset, size))) {
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (shared_to_private) {
>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>> +    } else {
>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    if (!ret) {
>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>> +        unsigned long nbits = size / block_size;
>>>>>> +
>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>> +
>>>>>> +        if (shared_to_private) {
>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>> +        } else {
>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>> +        }
>>>>>> +
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>>
>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap
>>>>>> again.
>>>>>>
>>>>>> Why?
>>>>>>
>>>>>> We just checked that it's all in the expected state, no?
>>>>>>
>>>>>>
>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>> have
>>>>>> to do it here?
>>>>>
>>>>> I was concerned about the case where the guest issues a request that
>>>>> only partial of the range is in the desired state.
>>>>> I think the main problem is the policy for the guest conversion
>>>>> request.
>>>>> My current handling is:
>>>>>
>>>>> 1. When a conversion request is made for a range already in the
>>>>> desired
>>>>>     state, the helper simply returns success.
>>>>
>>>> Yes.
>>>>
>>>>> 2. For requests involving a range partially in the desired state, only
>>>>>     the necessary segments are converted, ensuring the entire range
>>>>>     complies with the request efficiently.
>>>>
>>>>
>>>> Ah, now I get:
>>>>
>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>>> offset, size)) ||
>>>> +        (!shared_to_private &&
>>>> memory_attribute_is_range_populated(mgr,
>>>> offset, size))) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>>
>>>> We're not failing if it might already partially be in the other state.
>>>>
>>>>> 3. In scenarios where a conversion request is declined by other
>>>>> systems,
>>>>>     such as a failure from VFIO during notify_populate(), the
>>>>> helper will
>>>>>     roll back the request, maintaining consistency.
>>>>>
>>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>>> blocks are in the opposite state.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> Actually, this part is still a uncertain to me.
>>>>>
>>>>
>>>> IIUC, the problem does not exist if we only convert a single page at a
>>>> time.
>>>>
>>>> Is there a known use case where such partial conversions could happen?
>>>
>>> I don't see such case yet. Actually, I'm trying to follow the behavior
>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
>>> doesn't reject the request if the whole range isn't in the opposite
>>> state. It just uses xa_store() to update it. Also, I don't see the spec
>>> says how to handle such case. To be robust, I just allow this special
>>> case.
>>>
>>>>
>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the
>>>>> bitmap
>>>>> after the plug/unplug notifier. This is the same, correct?
>>>> Right. But because we reject these partial requests, we don't have to
>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>
>>> Yes, If we treat it as a guest error/bug, we can adjust it.
>>
>> Hi David, do you think which option is better? If prefer to reject the
>> partial requests, I'll change it in my next version.
> 
> Hi,
> 
> still scratching my head. Having to work around it as in this patch here is
> suboptimal.
> 
> Could we simplify the whole thing while still allowing for (unexpected)
> partial
> conversions?
> 
> Essentially: If states are mixed, fallback to a "1 block at a time"
> handling.
> 
> The only problem is: what to do if we fail halfway through? Well, we can
> only have
> such partial completions for "populate", not for discard.
> 
> Option a) Just leave it as "partially completed populate" and return the
> error. The
> bitmap and the notifiers are consistent.
> 
> Option b) Just discard everything: someone tried to convert something
> "partial
> shared" to "shared". So maybe, if anything goes wrong, we can just have
> "all private".
> 
> The question is also, what the expectation from the caller is: can the
> caller
> even make progress on failure or do we have to retry until it works?

Yes, That's the key problem.

For core mm side conversion, The caller (guest) handles three case:
success, failure and retry. guest can continue on failure but will keep
the memory in its original attribute and trigger some calltrace. While
in QEMU side, it would cause VM stop if kvm_set_memory_attributes() failed.

As for the VFIO conversion, at present, we allow it to fail and don't
return error code to guest as long as we undo the conversion. It only
causes the device not work in guest.

I think if we view the attribute mismatch between core mm and IOMMU as a
fatal error, we can call VM stop or let guest retry until it converts
successfully.

> 
> Both options would be compatible with "the caller must retry either way
> until it works".
> 
> a) is probably easiest. Something like the following on top of your code:

LGTM, with option a), We need to return the retry status to guest. the
caller (guest) must handle the retry. Not doing so with "partially
completed populate" would cause some problem.

> 
> 
> 
> From 40c001a57c3c1350f3a44288ccb013d903d300cf Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 25 Feb 2025 09:55:38 +0100
> Subject: [PATCH] tmp
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  system/memory-attribute-manager.c | 66 +++++++++++++++++--------------
>  1 file changed, 36 insertions(+), 30 deletions(-)
> 
> diff --git a/system/memory-attribute-manager.c b/system/memory-
> attribute-manager.c
> index 17c70cf677..31960e4a54 100644
> --- a/system/memory-attribute-manager.c
> +++ b/system/memory-attribute-manager.c
> @@ -274,9 +274,7 @@ static void
> memory_attribute_notify_discard(MemoryAttributeManager *mgr,
>          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>              continue;
>          }
> -
> -        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
> -                                                   
> memory_attribute_notify_discard_cb);
> +        rdl->notify_discard(rdl, &tmp);
>      }
>  }
>  
> @@ -292,9 +290,7 @@ static int
> memory_attribute_notify_populate(MemoryAttributeManager *mgr,
>          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>              continue;
>          }
> -
> -        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
> -                                                         
> memory_attribute_notify_populate_cb);
> +        ret = rdl->notify_populate(rdl, &tmp);
>          if (ret) {
>              break;
>          }
> @@ -311,9 +307,7 @@ static int
> memory_attribute_notify_populate(MemoryAttributeManager *mgr,
>              if (!memory_region_section_intersect_range(&tmp, offset,
> size)) {
>                  continue;
>              }
> -
> -            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
> -                                                       
> memory_attribute_notify_discard_cb);
> +            rdl2->notify_discard(rdl2, &tmp);
>          }
>      }
>      return ret;
> @@ -348,7 +342,10 @@ static bool
> memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
>  static int memory_attribute_state_change(MemoryAttributeManager *mgr,
> uint64_t offset,
>                                           uint64_t size, bool
> shared_to_private)
>  {
> -    int block_size = memory_attribute_manager_get_block_size(mgr);
> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long nbits = size / block_size;
> +    uint64_t cur_offset;
>      int ret = 0;
>  
>      if (!memory_attribute_is_valid_range(mgr, offset, size)) {
> @@ -357,32 +354,41 @@ static int
> memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o
>          return -1;
>      }
>  
> -    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
> offset, size)) ||
> -        (!shared_to_private && memory_attribute_is_range_populated(mgr,
> offset, size))) {
> -        return 0;
> -    }
> -
>      if (shared_to_private) {
> -        memory_attribute_notify_discard(mgr, offset, size);
> -    } else {
> -        ret = memory_attribute_notify_populate(mgr, offset, size);
> -    }
> -
> -    if (!ret) {
> -        unsigned long first_bit = offset / block_size;
> -        unsigned long nbits = size / block_size;
> -
> -        g_assert((first_bit + nbits) <= mgr->bitmap_size);
> -
> -        if (shared_to_private) {
> +        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
> +            /* Already private. */
> +        } else if (!memory_attribute_is_range_populated(mgr, offset,
> size)) {
> +            /* Unexpected mixture: not completely shared. */
> +            for (cur_offset = 0; cur_offset < offset; cur_offset +=
> block_size) {
> +               memory_attribute_state_change(mgr, cur_offset, block_size,
> +                                             true);
> +            }
> +        } else {
> +            /* Completely shared. */
>              bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
> +            memory_attribute_notify_discard(mgr, offset, size);
> +        }
> +    } else {
> +        if (memory_attribute_is_range_populated(mgr, offset, size)) {
> +            /* Already shared. */
> +        } else if (!memory_attribute_is_range_discarded(mgr, offset,
> size)) {
> +            /* Unexpected mixture: not completely private. */
> +            for (cur_offset = 0; cur_offset < offset; cur_offset +=
> block_size) {
> +               ret = memory_attribute_state_change(mgr, cur_offset,
> block_size,
> +                                                   false);
> +               if (ret) {
> +                   break;
> +               }
> +            }
>          } else {
> +            /* Completely private. */
>              bitmap_set(mgr->shared_bitmap, first_bit, nbits);
> +            ret = memory_attribute_notify_populate(mgr, offset, size);
> +            if (ret) {
> +                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
> +            }
>          }
> -
> -        return 0;
>      }
> -
>      return ret;
>  }
>  



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-26 12:43               ` Chenyi Qiang
@ 2025-02-27  3:26                 ` Chenyi Qiang
  2025-02-27 11:27                   ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Chenyi Qiang @ 2025-02-27  3:26 UTC (permalink / raw)
  To: David Hildenbrand, Jean-Philippe Brucker, philmd, peterx,
	pbonzini, peter.maydell, Alexey Kardashevskiy, Gao Chao
  Cc: qemu-arm, qemu-devel



On 2/26/2025 8:43 PM, Chenyi Qiang wrote:
> 
> 
> On 2/25/2025 5:41 PM, David Hildenbrand wrote:
>> On 25.02.25 03:00, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>>> IOMMU
>>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>>>> translation (even without vIOMMU).
>>>>>>>>
>>>>>>>> At the moment:
>>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get
>>>>>>>> a VA
>>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>>
>>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>>> operation because it uses the old state bitmap to know which
>>>>>>>> pages need
>>>>>>>> the populate() notifier.
>>>>>>>
>>>>>>> I assume we talk about this code: [1]
>>>>>>>
>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>>>>>> chenyi.qiang@intel.com
>>>>>>>
>>>>>>>
>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager
>>>>>>> *mgr,
>>>>>>> uint64_t offset,
>>>>>>> +                                         uint64_t size, bool
>>>>>>> shared_to_private)
>>>>>>> +{
>>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>>> +    int ret = 0;
>>>>>>> +
>>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>>> +                     __func__, offset, size);
>>>>>>> +        return -1;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if ((shared_to_private &&
>>>>>>> memory_attribute_is_range_discarded(mgr,
>>>>>>> offset, size)) ||
>>>>>>> +        (!shared_to_private &&
>>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>>> offset, size))) {
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (shared_to_private) {
>>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>>> +    } else {
>>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (!ret) {
>>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>>> +        unsigned long nbits = size / block_size;
>>>>>>> +
>>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>>> +
>>>>>>> +        if (shared_to_private) {
>>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>>> +        } else {
>>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>>
>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap
>>>>>>> again.
>>>>>>>
>>>>>>> Why?
>>>>>>>
>>>>>>> We just checked that it's all in the expected state, no?
>>>>>>>
>>>>>>>
>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>>> have
>>>>>>> to do it here?
>>>>>>
>>>>>> I was concerned about the case where the guest issues a request that
>>>>>> only partial of the range is in the desired state.
>>>>>> I think the main problem is the policy for the guest conversion
>>>>>> request.
>>>>>> My current handling is:
>>>>>>
>>>>>> 1. When a conversion request is made for a range already in the
>>>>>> desired
>>>>>>     state, the helper simply returns success.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> 2. For requests involving a range partially in the desired state, only
>>>>>>     the necessary segments are converted, ensuring the entire range
>>>>>>     complies with the request efficiently.
>>>>>
>>>>>
>>>>> Ah, now I get:
>>>>>
>>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>>>> offset, size)) ||
>>>>> +        (!shared_to_private &&
>>>>> memory_attribute_is_range_populated(mgr,
>>>>> offset, size))) {
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>>
>>>>> We're not failing if it might already partially be in the other state.
>>>>>
>>>>>> 3. In scenarios where a conversion request is declined by other
>>>>>> systems,
>>>>>>     such as a failure from VFIO during notify_populate(), the
>>>>>> helper will
>>>>>>     roll back the request, maintaining consistency.
>>>>>>
>>>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>>>> blocks are in the opposite state.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Actually, this part is still a uncertain to me.
>>>>>>
>>>>>
>>>>> IIUC, the problem does not exist if we only convert a single page at a
>>>>> time.
>>>>>
>>>>> Is there a known use case where such partial conversions could happen?
>>>>
>>>> I don't see such case yet. Actually, I'm trying to follow the behavior
>>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
>>>> doesn't reject the request if the whole range isn't in the opposite
>>>> state. It just uses xa_store() to update it. Also, I don't see the spec
>>>> says how to handle such case. To be robust, I just allow this special
>>>> case.
>>>>
>>>>>
>>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the
>>>>>> bitmap
>>>>>> after the plug/unplug notifier. This is the same, correct?
>>>>> Right. But because we reject these partial requests, we don't have to
>>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>>
>>>> Yes, If we treat it as a guest error/bug, we can adjust it.
>>>
>>> Hi David, do you think which option is better? If prefer to reject the
>>> partial requests, I'll change it in my next version.
>>
>> Hi,
>>
>> still scratching my head. Having to work around it as in this patch here is
>> suboptimal.
>>
>> Could we simplify the whole thing while still allowing for (unexpected)
>> partial
>> conversions?
>>
>> Essentially: If states are mixed, fallback to a "1 block at a time"
>> handling.
>>
>> The only problem is: what to do if we fail halfway through? Well, we can
>> only have
>> such partial completions for "populate", not for discard.
>>
>> Option a) Just leave it as "partially completed populate" and return the
>> error. The
>> bitmap and the notifiers are consistent.
>>
>> Option b) Just discard everything: someone tried to convert something
>> "partial
>> shared" to "shared". So maybe, if anything goes wrong, we can just have
>> "all private".
>>
>> The question is also, what the expectation from the caller is: can the
>> caller
>> even make progress on failure or do we have to retry until it works?
> 
> Yes, That's the key problem.
> 
> For core mm side conversion, The caller (guest) handles three case:
> success, failure and retry. guest can continue on failure but will keep
> the memory in its original attribute and trigger some calltrace. While
> in QEMU side, it would cause VM stop if kvm_set_memory_attributes() failed.
> 
> As for the VFIO conversion, at present, we allow it to fail and don't
> return error code to guest as long as we undo the conversion. It only
> causes the device not work in guest.
> 
> I think if we view the attribute mismatch between core mm and IOMMU as a
> fatal error, we can call VM stop or let guest retry until it converts
> successfully.
> 

Just think more about the options for the failure case handling
theoretically as we haven't hit such state_change() failure:

1. Undo + return invalid error
Pros: The guest can make progress
Cons: Complicated undo operations: Option a) is not appliable, because
it leaves it as partial completed populate, but the guest thinks the
operation has failed.
Also need to add the undo for set_memory_attribute() after
state_change() failed. Maybe also apply the attribute bitmap to
set_memory_attribute() operation to handle the mixed request case

2. Undo in VFIO and no undo for set_memory_attribute() + return success
(Current approach in my series)
Pros: The guest can make progress although device doesn't work.
Cons: the attribute bitmap only tracks the status in iommu.

3. No undo + return retry
Pros: It keeps the attribute bitmap aligned in core mm and iommu.
Cons: The guest doesn't know how to handle the retry. It would cause
infinite loop.

4. No undo + no return. Just VM stop.
Pros: simple
Cons: maybe overkill.

Maybe option 1 or 4 is better?

> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-27  3:26                 ` Chenyi Qiang
@ 2025-02-27 11:27                   ` David Hildenbrand
  2025-02-28  5:39                     ` Chenyi Qiang
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-02-27 11:27 UTC (permalink / raw)
  To: Chenyi Qiang, Jean-Philippe Brucker, philmd, peterx, pbonzini,
	peter.maydell, Alexey Kardashevskiy, Gao Chao
  Cc: qemu-arm, qemu-devel

On 27.02.25 04:26, Chenyi Qiang wrote:
> 
> 
> On 2/26/2025 8:43 PM, Chenyi Qiang wrote:
>>
>>
>> On 2/25/2025 5:41 PM, David Hildenbrand wrote:
>>> On 25.02.25 03:00, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>>>> IOMMU
>>>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>>>>> translation (even without vIOMMU).
>>>>>>>>>
>>>>>>>>> At the moment:
>>>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get
>>>>>>>>> a VA
>>>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>>>
>>>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>>>> operation because it uses the old state bitmap to know which
>>>>>>>>> pages need
>>>>>>>>> the populate() notifier.
>>>>>>>>
>>>>>>>> I assume we talk about this code: [1]
>>>>>>>>
>>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>>>>>>> chenyi.qiang@intel.com
>>>>>>>>
>>>>>>>>
>>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager
>>>>>>>> *mgr,
>>>>>>>> uint64_t offset,
>>>>>>>> +                                         uint64_t size, bool
>>>>>>>> shared_to_private)
>>>>>>>> +{
>>>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>>>> +    int ret = 0;
>>>>>>>> +
>>>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>>>> +                     __func__, offset, size);
>>>>>>>> +        return -1;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if ((shared_to_private &&
>>>>>>>> memory_attribute_is_range_discarded(mgr,
>>>>>>>> offset, size)) ||
>>>>>>>> +        (!shared_to_private &&
>>>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>>>> offset, size))) {
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if (shared_to_private) {
>>>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>>>> +    } else {
>>>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if (!ret) {
>>>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>>>> +        unsigned long nbits = size / block_size;
>>>>>>>> +
>>>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>>>> +
>>>>>>>> +        if (shared_to_private) {
>>>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>> +        } else {
>>>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return ret;
>>>>>>>> +}
>>>>>>>>
>>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap
>>>>>>>> again.
>>>>>>>>
>>>>>>>> Why?
>>>>>>>>
>>>>>>>> We just checked that it's all in the expected state, no?
>>>>>>>>
>>>>>>>>
>>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>>>> have
>>>>>>>> to do it here?
>>>>>>>
>>>>>>> I was concerned about the case where the guest issues a request that
>>>>>>> only partial of the range is in the desired state.
>>>>>>> I think the main problem is the policy for the guest conversion
>>>>>>> request.
>>>>>>> My current handling is:
>>>>>>>
>>>>>>> 1. When a conversion request is made for a range already in the
>>>>>>> desired
>>>>>>>      state, the helper simply returns success.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> 2. For requests involving a range partially in the desired state, only
>>>>>>>      the necessary segments are converted, ensuring the entire range
>>>>>>>      complies with the request efficiently.
>>>>>>
>>>>>>
>>>>>> Ah, now I get:
>>>>>>
>>>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>>>>> offset, size)) ||
>>>>>> +        (!shared_to_private &&
>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>> offset, size))) {
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>>
>>>>>> We're not failing if it might already partially be in the other state.
>>>>>>
>>>>>>> 3. In scenarios where a conversion request is declined by other
>>>>>>> systems,
>>>>>>>      such as a failure from VFIO during notify_populate(), the
>>>>>>> helper will
>>>>>>>      roll back the request, maintaining consistency.
>>>>>>>
>>>>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>>>>> blocks are in the opposite state.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> Actually, this part is still a uncertain to me.
>>>>>>>
>>>>>>
>>>>>> IIUC, the problem does not exist if we only convert a single page at a
>>>>>> time.
>>>>>>
>>>>>> Is there a known use case where such partial conversions could happen?
>>>>>
>>>>> I don't see such case yet. Actually, I'm trying to follow the behavior
>>>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
>>>>> doesn't reject the request if the whole range isn't in the opposite
>>>>> state. It just uses xa_store() to update it. Also, I don't see the spec
>>>>> says how to handle such case. To be robust, I just allow this special
>>>>> case.
>>>>>
>>>>>>
>>>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the
>>>>>>> bitmap
>>>>>>> after the plug/unplug notifier. This is the same, correct?
>>>>>> Right. But because we reject these partial requests, we don't have to
>>>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>>>
>>>>> Yes, If we treat it as a guest error/bug, we can adjust it.
>>>>
>>>> Hi David, do you think which option is better? If prefer to reject the
>>>> partial requests, I'll change it in my next version.
>>>
>>> Hi,
>>>
>>> still scratching my head. Having to work around it as in this patch here is
>>> suboptimal.
>>>
>>> Could we simplify the whole thing while still allowing for (unexpected)
>>> partial
>>> conversions?
>>>
>>> Essentially: If states are mixed, fallback to a "1 block at a time"
>>> handling.
>>>
>>> The only problem is: what to do if we fail halfway through? Well, we can
>>> only have
>>> such partial completions for "populate", not for discard.
>>>
>>> Option a) Just leave it as "partially completed populate" and return the
>>> error. The
>>> bitmap and the notifiers are consistent.
>>>
>>> Option b) Just discard everything: someone tried to convert something
>>> "partial
>>> shared" to "shared". So maybe, if anything goes wrong, we can just have
>>> "all private".
>>>
>>> The question is also, what the expectation from the caller is: can the
>>> caller
>>> even make progress on failure or do we have to retry until it works?
>>
>> Yes, That's the key problem.
>>
>> For core mm side conversion, The caller (guest) handles three case:
>> success, failure and retry. guest can continue on failure but will keep
>> the memory in its original attribute and trigger some calltrace. While
>> in QEMU side, it would cause VM stop if kvm_set_memory_attributes() failed.
>>
>> As for the VFIO conversion, at present, we allow it to fail and don't
>> return error code to guest as long as we undo the conversion. It only
>> causes the device not work in guest.
>>
>> I think if we view the attribute mismatch between core mm and IOMMU as a
>> fatal error, we can call VM stop or let guest retry until it converts
>> successfully.
>>
> 
> Just think more about the options for the failure case handling
> theoretically as we haven't hit such state_change() failure:
> 
> 1. Undo + return invalid error
> Pros: The guest can make progress
> Cons: Complicated undo operations: Option a) is not appliable, because
> it leaves it as partial completed populate, but the guest thinks the
> operation has failed.
> Also need to add the undo for set_memory_attribute() after
> state_change() failed. Maybe also apply the attribute bitmap to
> set_memory_attribute() operation to handle the mixed request case
> 
> 2. Undo in VFIO and no undo for set_memory_attribute() + return success
> (Current approach in my series)
> Pros: The guest can make progress although device doesn't work.
> Cons: the attribute bitmap only tracks the status in iommu.

Right, we should avoid that. Bitmap + notifiers should stay in sync.

> 
> 3. No undo + return retry
> Pros: It keeps the attribute bitmap aligned in core mm and iommu.
> Cons: The guest doesn't know how to handle the retry. It would cause
> infinite loop.
> 
> 4. No undo + no return. Just VM stop.
> Pros: simple
> Cons: maybe overkill.
> 
> Maybe option 1 or 4 is better?

Well, we can proper undo working using a temporary bitmap when
converting to shared and we run into this "mixed" scenario.

Something like this on top of your work:


 From f36e3916596ed5952f15233eb7747c65a6424949 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 25 Feb 2025 09:55:38 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  system/memory-attribute-manager.c | 95 +++++++++++++++++++++----------
  1 file changed, 65 insertions(+), 30 deletions(-)

diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 17c70cf677..e98e7367c1 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
-
-        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
-                                                    memory_attribute_notify_discard_cb);
+        rdl->notify_discard(rdl, &tmp);
      }
  }
  
@@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
-
-        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
-                                                          memory_attribute_notify_populate_cb);
+        ret = rdl->notify_populate(rdl, &tmp);
          if (ret) {
              break;
          }
@@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
              if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                  continue;
              }
-
-            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
-                                                        memory_attribute_notify_discard_cb);
+            rdl2->notify_discard(rdl2, &tmp);
          }
      }
      return ret;
@@ -348,7 +342,12 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
  static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
                                           uint64_t size, bool shared_to_private)
  {
-    int block_size = memory_attribute_manager_get_block_size(mgr);
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    const uint64_t end = offset + size;
+    unsigned long bit;
+    uint64_t cur;
      int ret = 0;
  
      if (!memory_attribute_is_valid_range(mgr, offset, size)) {
@@ -357,32 +356,68 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o
          return -1;
      }
  
-    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
-        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
-        return 0;
-    }
-
      if (shared_to_private) {
-        memory_attribute_notify_discard(mgr, offset, size);
-    } else {
-        ret = memory_attribute_notify_populate(mgr, offset, size);
-    }
-
-    if (!ret) {
-        unsigned long first_bit = offset / block_size;
-        unsigned long nbits = size / block_size;
-
-        g_assert((first_bit + nbits) <= mgr->bitmap_size);
-
-        if (shared_to_private) {
+        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Already private. */
+        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (!test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                clear_bit(bit, mgr->shared_bitmap);
+                memory_attribute_notify_discard(mgr, cur, block_size);
+            }
+        } else {
+            /* Completely shared. */
              bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            memory_attribute_notify_discard(mgr, offset, size);
+        }
+    } else {
+        if (memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Already shared. */
+        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            unsigned long *modified_bitmap = bitmap_new(nbits);
+
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                set_bit(bit, mgr->shared_bitmap);
+                ret = memory_attribute_notify_populate(mgr, cur, block_size);
+                if (!ret) {
+                    set_bit(bit - first_bit, modified_bitmap);
+                    continue;
+                }
+                clear_bit(bit, mgr->shared_bitmap);
+                break;
+            }
+            if (ret) {
+                /*
+                 * Very unexpected: something went wrong. Revert to the old
+                 * state, marking only the blocks as private that we converted
+                 * to shared.
+                 */
+                for (cur = offset; cur < end; cur += block_size) {
+                    bit = cur / block_size;
+                    if (!test_bit(bit - first_bit, modified_bitmap))
+                        continue;
+                    assert(test_bit(bit, mgr->shared_bitmap));
+                    clear_bit(bit, mgr->shared_bitmap);
+                    memory_attribute_notify_discard(mgr, offset, block_size);
+                }
+            }
+            g_free(modified_bitmap);
          } else {
+            /* Completely private. */
              bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+            ret = memory_attribute_notify_populate(mgr, offset, size);
+            if (ret) {
+                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            }
          }
-
-        return 0;
      }
-
      return ret;
  }
  
-- 
2.48.1


-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
  2025-02-27 11:27                   ` David Hildenbrand
@ 2025-02-28  5:39                     ` Chenyi Qiang
  0 siblings, 0 replies; 13+ messages in thread
From: Chenyi Qiang @ 2025-02-28  5:39 UTC (permalink / raw)
  To: David Hildenbrand, Jean-Philippe Brucker, philmd, peterx,
	pbonzini, peter.maydell, Alexey Kardashevskiy, Gao Chao
  Cc: qemu-arm, qemu-devel



On 2/27/2025 7:27 PM, David Hildenbrand wrote:
> On 27.02.25 04:26, Chenyi Qiang wrote:
>>
>>
>> On 2/26/2025 8:43 PM, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/25/2025 5:41 PM, David Hildenbrand wrote:
>>>> On 25.02.25 03:00, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call
>>>>>>>>>> the
>>>>>>>>>> IOMMU
>>>>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses
>>>>>>>>>> need a
>>>>>>>>>> translation (even without vIOMMU).
>>>>>>>>>>
>>>>>>>>>> At the moment:
>>>>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get
>>>>>>>>>> a VA
>>>>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>>>>
>>>>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>>>>> calling the populate() notifier. We can't easily invert the
>>>>>>>>>> order of
>>>>>>>>>> operation because it uses the old state bitmap to know which
>>>>>>>>>> pages need
>>>>>>>>>> the populate() notifier.
>>>>>>>>>
>>>>>>>>> I assume we talk about this code: [1]
>>>>>>>>>
>>>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>>>>>>>> chenyi.qiang@intel.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager
>>>>>>>>> *mgr,
>>>>>>>>> uint64_t offset,
>>>>>>>>> +                                         uint64_t size, bool
>>>>>>>>> shared_to_private)
>>>>>>>>> +{
>>>>>>>>> +    int block_size =
>>>>>>>>> memory_attribute_manager_get_block_size(mgr);
>>>>>>>>> +    int ret = 0;
>>>>>>>>> +
>>>>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size
>>>>>>>>> 0x%lx",
>>>>>>>>> +                     __func__, offset, size);
>>>>>>>>> +        return -1;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if ((shared_to_private &&
>>>>>>>>> memory_attribute_is_range_discarded(mgr,
>>>>>>>>> offset, size)) ||
>>>>>>>>> +        (!shared_to_private &&
>>>>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>>>>> offset, size))) {
>>>>>>>>> +        return 0;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if (shared_to_private) {
>>>>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>>>>> +    } else {
>>>>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset,
>>>>>>>>> size);
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if (!ret) {
>>>>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>>>>> +        unsigned long nbits = size / block_size;
>>>>>>>>> +
>>>>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>>>>> +
>>>>>>>>> +        if (shared_to_private) {
>>>>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>>> +        } else {
>>>>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>>> +        }
>>>>>>>>> +
>>>>>>>>> +        return 0;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return ret;
>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> Why?
>>>>>>>>>
>>>>>>>>> We just checked that it's all in the expected state, no?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>>>>> have
>>>>>>>>> to do it here?
>>>>>>>>
>>>>>>>> I was concerned about the case where the guest issues a request
>>>>>>>> that
>>>>>>>> only partial of the range is in the desired state.
>>>>>>>> I think the main problem is the policy for the guest conversion
>>>>>>>> request.
>>>>>>>> My current handling is:
>>>>>>>>
>>>>>>>> 1. When a conversion request is made for a range already in the
>>>>>>>> desired
>>>>>>>>      state, the helper simply returns success.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> 2. For requests involving a range partially in the desired
>>>>>>>> state, only
>>>>>>>>      the necessary segments are converted, ensuring the entire
>>>>>>>> range
>>>>>>>>      complies with the request efficiently.
>>>>>>>
>>>>>>>
>>>>>>> Ah, now I get:
>>>>>>>
>>>>>>> +    if ((shared_to_private &&
>>>>>>> memory_attribute_is_range_discarded(mgr,
>>>>>>> offset, size)) ||
>>>>>>> +        (!shared_to_private &&
>>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>>> offset, size))) {
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>>
>>>>>>> We're not failing if it might already partially be in the other
>>>>>>> state.
>>>>>>>
>>>>>>>> 3. In scenarios where a conversion request is declined by other
>>>>>>>> systems,
>>>>>>>>      such as a failure from VFIO during notify_populate(), the
>>>>>>>> helper will
>>>>>>>>      roll back the request, maintaining consistency.
>>>>>>>>
>>>>>>>> And the policy of virtio-mem is to refuse the state change if
>>>>>>>> not all
>>>>>>>> blocks are in the opposite state.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> Actually, this part is still a uncertain to me.
>>>>>>>>
>>>>>>>
>>>>>>> IIUC, the problem does not exist if we only convert a single page
>>>>>>> at a
>>>>>>> time.
>>>>>>>
>>>>>>> Is there a known use case where such partial conversions could
>>>>>>> happen?
>>>>>>
>>>>>> I don't see such case yet. Actually, I'm trying to follow the
>>>>>> behavior
>>>>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
>>>>>> doesn't reject the request if the whole range isn't in the opposite
>>>>>> state. It just uses xa_store() to update it. Also, I don't see the
>>>>>> spec
>>>>>> says how to handle such case. To be robust, I just allow this special
>>>>>> case.
>>>>>>
>>>>>>>
>>>>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the
>>>>>>>> bitmap
>>>>>>>> after the plug/unplug notifier. This is the same, correct?
>>>>>>> Right. But because we reject these partial requests, we don't
>>>>>>> have to
>>>>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>>>>
>>>>>> Yes, If we treat it as a guest error/bug, we can adjust it.
>>>>>
>>>>> Hi David, do you think which option is better? If prefer to reject the
>>>>> partial requests, I'll change it in my next version.
>>>>
>>>> Hi,
>>>>
>>>> still scratching my head. Having to work around it as in this patch
>>>> here is
>>>> suboptimal.
>>>>
>>>> Could we simplify the whole thing while still allowing for (unexpected)
>>>> partial
>>>> conversions?
>>>>
>>>> Essentially: If states are mixed, fallback to a "1 block at a time"
>>>> handling.
>>>>
>>>> The only problem is: what to do if we fail halfway through? Well, we
>>>> can
>>>> only have
>>>> such partial completions for "populate", not for discard.
>>>>
>>>> Option a) Just leave it as "partially completed populate" and return
>>>> the
>>>> error. The
>>>> bitmap and the notifiers are consistent.
>>>>
>>>> Option b) Just discard everything: someone tried to convert something
>>>> "partial
>>>> shared" to "shared". So maybe, if anything goes wrong, we can just have
>>>> "all private".
>>>>
>>>> The question is also, what the expectation from the caller is: can the
>>>> caller
>>>> even make progress on failure or do we have to retry until it works?
>>>
>>> Yes, That's the key problem.
>>>
>>> For core mm side conversion, The caller (guest) handles three case:
>>> success, failure and retry. guest can continue on failure but will keep
>>> the memory in its original attribute and trigger some calltrace. While
>>> in QEMU side, it would cause VM stop if kvm_set_memory_attributes()
>>> failed.
>>>
>>> As for the VFIO conversion, at present, we allow it to fail and don't
>>> return error code to guest as long as we undo the conversion. It only
>>> causes the device not work in guest.
>>>
>>> I think if we view the attribute mismatch between core mm and IOMMU as a
>>> fatal error, we can call VM stop or let guest retry until it converts
>>> successfully.
>>>
>>
>> Just think more about the options for the failure case handling
>> theoretically as we haven't hit such state_change() failure:
>>
>> 1. Undo + return invalid error
>> Pros: The guest can make progress
>> Cons: Complicated undo operations: Option a) is not appliable, because
>> it leaves it as partial completed populate, but the guest thinks the
>> operation has failed.
>> Also need to add the undo for set_memory_attribute() after
>> state_change() failed. Maybe also apply the attribute bitmap to
>> set_memory_attribute() operation to handle the mixed request case
>>
>> 2. Undo in VFIO and no undo for set_memory_attribute() + return success
>> (Current approach in my series)
>> Pros: The guest can make progress although device doesn't work.
>> Cons: the attribute bitmap only tracks the status in iommu.
> 
> Right, we should avoid that. Bitmap + notifiers should stay in sync.
> 
>>
>> 3. No undo + return retry
>> Pros: It keeps the attribute bitmap aligned in core mm and iommu.
>> Cons: The guest doesn't know how to handle the retry. It would cause
>> infinite loop.
>>
>> 4. No undo + no return. Just VM stop.
>> Pros: simple
>> Cons: maybe overkill.
>>
>> Maybe option 1 or 4 is better?
> 
> Well, we can proper undo working using a temporary bitmap when
> converting to shared and we run into this "mixed" scenario.
> 
> Something like this on top of your work:

LGTM. I'll choose option 1 and add the change in my next version. Thanks
a lot!



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC 2/2] target/arm/kvm-rme: Add DMA remapping for the shared memory region
  2025-02-20 16:13 [RFC 0/2] arm: Add DMA remapping for CCA Jean-Philippe Brucker
  2025-02-20 16:13 ` [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers Jean-Philippe Brucker
@ 2025-02-20 16:13 ` Jean-Philippe Brucker
  1 sibling, 0 replies; 13+ messages in thread
From: Jean-Philippe Brucker @ 2025-02-20 16:13 UTC (permalink / raw)
  To: philmd, david, peterx, pbonzini, peter.maydell
  Cc: qemu-arm, qemu-devel, Jean-Philippe Brucker

In Arm CCA, the guest-physical address space is split in half. The top
half represents memory shared between guest and host, and the bottom
half is private to the guest. From QEMU's point of view, the two halves
are merged into a single region, and pages within this region are
either shared or private.

Addresses used by device DMA can potentially target both halves.
Physical devices assigned to the VM access the top half, until they are
authenticated using features like PCIe CMA-SPDM at which point they can
also access memory private to the guest.

Virtual devices implemented by the host are only allowed to access the
top half. For emulated MMIO, KVM strips the GPA before returning to
QEMU, so the GPA already belongs to QEMU's merged view of guest memory.
However DMA addresses cannot be stripped this way and need special
handling by the VMM:

* When emulating DMA the VMM needs to translate the addresses into its
  merged view. Add an IOMMU memory region on the top half, that
  retargets DMA accesses to the merged sysmem.

* when creating IOMMU mappings for (unauthenticated) VFIO devices, the VMM
  needs to map the top half of guest-physical addresses to the shared pages.
  Install RAM discard listeners that issue IOMMU map and unmap requests
  to IOMMU listeners such as VFIO.

The resulting mtree looks like this:

  address-space: vfio-pci
    0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
      0000000000000000-000001ffffffffff (prio 0, i/o): alias bus master @realm-dma-region 0000000000000000-000001ffffffffff

  memory-region: realm-dma-region
    0000000000000000-000001ffffffffff (prio 0, i/o): realm-dma-region

There are at least two problems with this approach: given that we use
the PCI bus master address space, a vIOMMU cannot install its own
address space at the moment. And since sysbus devices can't have an
IOMMU at the moment, DMA from non-PCI devices isn't supported.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 target/arm/kvm_arm.h |  15 +++
 hw/arm/virt.c        |   2 +
 target/arm/kvm-rme.c | 220 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 237 insertions(+)

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index ede6a7138c..13f2d76e2a 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -268,6 +268,16 @@ int kvm_arm_rme_vcpu_init(CPUState *cs);
  */
 void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size);
 
+/**
+ * kvm_arm_rme_setup_gpa
+ * @highest_gpa: highest address of the lower half of the guest address space
+ * @pci_bus: The main PCI bus, for which PCI queries DMA address spaces
+ *
+ * Setup the guest-physical address space for a Realm. Install a memory region
+ * and notifier to manage the shared upper half of the address space.
+ */
+void kvm_arm_rme_init_gpa_space(hwaddr highest_gpa, PCIBus *pci_bus);
+
 #else
 
 /*
@@ -298,6 +308,11 @@ static inline void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size)
 {
 }
 
+static inline void kvm_arm_rme_init_gpa_space(hwaddr highest_gpa,
+                                              PCIBus *pci_bus)
+{
+}
+
 /*
  * These functions should never actually be called without KVM support.
  */
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index d14f54e3be..c28b2cc3ac 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2456,6 +2456,8 @@ static void machvirt_init(MachineState *machine)
                                vms->fw_cfg, OBJECT(vms));
     }
 
+    kvm_arm_rme_init_gpa_space(vms->highest_gpa, vms->bus);
+
     vms->bootinfo.ram_size = machine->ram_size;
     vms->bootinfo.board_id = -1;
     vms->bootinfo.loader_start = vms->memmap[VIRT_MEM].base;
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index ad099bffbb..f21499f699 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -9,6 +9,7 @@
 #include "hw/boards.h"
 #include "hw/core/cpu.h"
 #include "hw/loader.h"
+#include "hw/pci/pci.h"
 #include "kvm_arm.h"
 #include "migration/blocker.h"
 #include "qapi/error.h"
@@ -24,6 +25,35 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_PAGE_SIZE qemu_real_host_page_size()
 
+/*
+ * Realms have a split guest-physical address space: the bottom half is private
+ * to the realm, and the top half is shared with the host. Within QEMU, we use a
+ * merged view of both halves. Most of RAM is private to the guest and not
+ * accessible to us, but the guest shares some pages with us.
+ *
+ * For DMA, devices generally target the shared half (top) of the guest address
+ * space. Only the devices trusted by the guest (using mechanisms like TDISP for
+ * device authentication) can access the bottom half.
+ *
+ * RealmDmaRegion performs remapping of top-half accesses to system memory.
+ */
+struct RealmDmaRegion {
+    IOMMUMemoryRegion parent_obj;
+};
+
+#define TYPE_REALM_DMA_REGION "realm-dma-region"
+OBJECT_DECLARE_SIMPLE_TYPE(RealmDmaRegion, REALM_DMA_REGION)
+OBJECT_DEFINE_SIMPLE_TYPE(RealmDmaRegion, realm_dma_region,
+                          REALM_DMA_REGION, IOMMU_MEMORY_REGION);
+
+typedef struct RealmRamDiscardListener {
+    MemoryRegion *mr;
+    hwaddr offset_within_region;
+    uint64_t granularity;
+    RamDiscardListener listener;
+    QLIST_ENTRY(RealmRamDiscardListener) rrdl_next;
+} RealmRamDiscardListener;
+
 typedef struct {
     hwaddr base;
     hwaddr size;
@@ -39,6 +69,12 @@ struct RmeGuest {
     RmeGuestMeasurementAlgorithm measurement_algo;
 
     RmeRamRegion init_ram;
+    uint8_t ipa_bits;
+
+    RealmDmaRegion *dma_region;
+    QLIST_HEAD(, RealmRamDiscardListener) ram_discard_list;
+    MemoryListener memory_listener;
+    AddressSpace dma_as;
 };
 
 OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RmeGuest, rme_guest, RME_GUEST,
@@ -305,6 +341,7 @@ static void rme_guest_init(Object *obj)
 
 static void rme_guest_finalize(Object *obj)
 {
+    memory_listener_unregister(&rme_guest->memory_listener);
 }
 
 static gint rme_compare_ram_regions(gconstpointer a, gconstpointer b)
@@ -405,3 +442,186 @@ int kvm_arm_rme_vm_type(MachineState *ms)
     }
     return 0;
 }
+
+static int rme_ram_discard_notify(RamDiscardListener *rdl,
+                                  MemoryRegionSection *section,
+                                  bool populate)
+{
+    hwaddr gpa, next;
+    IOMMUTLBEvent event;
+    const hwaddr end = section->offset_within_address_space +
+                       int128_get64(section->size);
+    const hwaddr address_mask = MAKE_64BIT_MASK(0, rme_guest->ipa_bits - 1);
+    RealmRamDiscardListener *rrdl = container_of(rdl, RealmRamDiscardListener,
+                                                 listener);
+
+    assert(rme_guest->dma_region != NULL);
+
+    event.type = populate ? IOMMU_NOTIFIER_MAP : IOMMU_NOTIFIER_UNMAP;
+    event.entry.target_as = &address_space_memory;
+    event.entry.perm = populate ? (IOMMU_RW | IOMMU_POPULATING) : IOMMU_NONE;
+    event.entry.addr_mask = rrdl->granularity - 1;
+
+    assert(end <= address_mask);
+
+    /*
+     * Create IOMMU mappings from the top half of the address space to the RAM
+     * region.
+     */
+    for (gpa = section->offset_within_address_space; gpa < end; gpa = next) {
+        event.entry.iova = gpa + address_mask + 1;
+        event.entry.translated_addr = gpa;
+        memory_region_notify_iommu(IOMMU_MEMORY_REGION(rme_guest->dma_region),
+                                   0, event);
+
+        next = ROUND_UP(gpa + 1, rrdl->granularity);
+        next = MIN(next, end);
+    }
+
+    return 0;
+}
+
+static int rme_ram_discard_notify_populate(RamDiscardListener *rdl,
+                                           MemoryRegionSection *section)
+{
+    return rme_ram_discard_notify(rdl, section, /* populate */ true);
+}
+
+static void rme_ram_discard_notify_discard(RamDiscardListener *rdl,
+                                           MemoryRegionSection *section)
+{
+    rme_ram_discard_notify(rdl, section, /* populate */ false);
+}
+
+/* Install a RAM discard listener */
+static void rme_listener_region_add(MemoryListener *listener,
+                                    MemoryRegionSection *section)
+{
+    RealmRamDiscardListener *rrdl;
+    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
+
+    if (!rdm) {
+        return;
+    }
+
+    rrdl = g_new0(RealmRamDiscardListener, 1);
+    rrdl->mr = section->mr;
+    rrdl->offset_within_region = section->offset_within_region;
+    rrdl->granularity = ram_discard_manager_get_min_granularity(rdm,
+                                                                section->mr);
+    QLIST_INSERT_HEAD(&rme_guest->ram_discard_list, rrdl, rrdl_next);
+
+    ram_discard_listener_init(&rrdl->listener,
+                              rme_ram_discard_notify_populate,
+                              rme_ram_discard_notify_discard, true);
+    ram_discard_manager_register_listener(rdm, &rrdl->listener, section);
+}
+
+static void rme_listener_region_del(MemoryListener *listener,
+                                    MemoryRegionSection *section)
+{
+    RealmRamDiscardListener *rrdl;
+    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
+
+    if (!rdm) {
+        return;
+    }
+
+    QLIST_FOREACH(rrdl, &rme_guest->ram_discard_list, rrdl_next) {
+        if (MEMORY_REGION(rrdl->mr) == section->mr &&
+            rrdl->offset_within_region == section->offset_within_region) {
+            ram_discard_manager_unregister_listener(rdm, &rrdl->listener);
+            g_free(rrdl);
+            break;
+        }
+    }
+}
+
+static AddressSpace *rme_dma_get_address_space(PCIBus *bus, void *opaque,
+                                               int devfn)
+{
+    return &rme_guest->dma_as;
+}
+
+static const PCIIOMMUOps rme_dma_ops = {
+    .get_address_space = rme_dma_get_address_space,
+};
+
+void kvm_arm_rme_init_gpa_space(hwaddr highest_gpa, PCIBus *pci_bus)
+{
+    RealmDmaRegion *dma_region;
+    const unsigned int ipa_bits = 64 - clz64(highest_gpa) + 1;
+
+    if (!rme_guest) {
+        return;
+    }
+
+    assert(ipa_bits < 64);
+
+    /*
+     * Setup a DMA translation from the shared top half of the guest-physical
+     * address space to our merged view of RAM.
+     */
+    dma_region = g_new0(RealmDmaRegion, 1);
+
+    memory_region_init_iommu(dma_region, sizeof(*dma_region),
+                             TYPE_REALM_DMA_REGION, OBJECT(rme_guest),
+                             "realm-dma-region", 1ULL << ipa_bits);
+    address_space_init(&rme_guest->dma_as, MEMORY_REGION(dma_region),
+                       TYPE_REALM_DMA_REGION);
+    rme_guest->dma_region = dma_region;
+
+    pci_setup_iommu(pci_bus, &rme_dma_ops, NULL);
+
+    /*
+     * Install notifiers to forward RAM discard changes to the IOMMU notifiers
+     * (ie. tell VFIO to map shared pages and unmap private ones).
+     */
+    rme_guest->memory_listener = (MemoryListener) {
+        .name = "rme",
+        .region_add = rme_listener_region_add,
+        .region_del = rme_listener_region_del,
+    };
+    memory_listener_register(&rme_guest->memory_listener,
+                             &address_space_memory);
+
+    rme_guest->ipa_bits = ipa_bits;
+}
+
+static void realm_dma_region_init(Object *obj)
+{
+}
+
+static IOMMUTLBEntry realm_dma_region_translate(IOMMUMemoryRegion *mr,
+                                                hwaddr addr,
+                                                IOMMUAccessFlags flag,
+                                                int iommu_idx)
+{
+    const hwaddr address_mask = MAKE_64BIT_MASK(0, rme_guest->ipa_bits - 1);
+    IOMMUTLBEntry entry = {
+        .target_as = &address_space_memory,
+        .iova = addr,
+        .translated_addr = addr & address_mask,
+        .addr_mask = address_mask,
+        .perm = IOMMU_RW,
+    };
+
+    return entry;
+}
+
+static void realm_dma_region_replay(IOMMUMemoryRegion *mr, IOMMUNotifier *n)
+{
+    /* Nothing is shared at boot */
+}
+
+static void realm_dma_region_finalize(Object *obj)
+{
+}
+
+static void realm_dma_region_class_init(ObjectClass *oc, void *data)
+{
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(oc);
+
+    imrc->translate = realm_dma_region_translate;
+    imrc->replay = realm_dma_region_replay;
+}
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-02-28  5:40 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-20 16:13 [RFC 0/2] arm: Add DMA remapping for CCA Jean-Philippe Brucker
2025-02-20 16:13 ` [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers Jean-Philippe Brucker
2025-02-20 19:39   ` David Hildenbrand
2025-02-21  2:25     ` Chenyi Qiang
2025-02-21  8:09       ` David Hildenbrand
2025-02-21 10:04         ` Chenyi Qiang
2025-02-25  2:00           ` Chenyi Qiang
2025-02-25  9:41             ` David Hildenbrand
2025-02-26 12:43               ` Chenyi Qiang
2025-02-27  3:26                 ` Chenyi Qiang
2025-02-27 11:27                   ` David Hildenbrand
2025-02-28  5:39                     ` Chenyi Qiang
2025-02-20 16:13 ` [RFC 2/2] target/arm/kvm-rme: Add DMA remapping for the shared memory region Jean-Philippe Brucker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).