Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: David Hildenbrand <david@redhat.com>
To: Chenyi Qiang <chenyi.qiang@intel.com>,
	Jean-Philippe Brucker <jean-philippe@linaro.org>,
	philmd@linaro.org, peterx@redhat.com, pbonzini@redhat.com,
	peter.maydell@linaro.org, Alexey Kardashevskiy <aik@amd.com>,
	Gao Chao <chao.gao@intel.com>
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
Date: Thu, 27 Feb 2025 12:27:19 +0100	[thread overview]
Message-ID: <812768d7-a02d-4b29-95f3-fb7a125cf54e@redhat.com> (raw)
In-Reply-To: <39155512-8d71-412e-aa5c-591d7317d210@intel.com>

On 27.02.25 04:26, Chenyi Qiang wrote:
> 
> 
> On 2/26/2025 8:43 PM, Chenyi Qiang wrote:
>>
>>
>> On 2/25/2025 5:41 PM, David Hildenbrand wrote:
>>> On 25.02.25 03:00, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>>>> IOMMU
>>>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>>>>> translation (even without vIOMMU).
>>>>>>>>>
>>>>>>>>> At the moment:
>>>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get
>>>>>>>>> a VA
>>>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>>>
>>>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>>>> operation because it uses the old state bitmap to know which
>>>>>>>>> pages need
>>>>>>>>> the populate() notifier.
>>>>>>>>
>>>>>>>> I assume we talk about this code: [1]
>>>>>>>>
>>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-
>>>>>>>> chenyi.qiang@intel.com
>>>>>>>>
>>>>>>>>
>>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager
>>>>>>>> *mgr,
>>>>>>>> uint64_t offset,
>>>>>>>> +                                         uint64_t size, bool
>>>>>>>> shared_to_private)
>>>>>>>> +{
>>>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>>>> +    int ret = 0;
>>>>>>>> +
>>>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>>>> +                     __func__, offset, size);
>>>>>>>> +        return -1;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if ((shared_to_private &&
>>>>>>>> memory_attribute_is_range_discarded(mgr,
>>>>>>>> offset, size)) ||
>>>>>>>> +        (!shared_to_private &&
>>>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>>>> offset, size))) {
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if (shared_to_private) {
>>>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>>>> +    } else {
>>>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if (!ret) {
>>>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>>>> +        unsigned long nbits = size / block_size;
>>>>>>>> +
>>>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>>>> +
>>>>>>>> +        if (shared_to_private) {
>>>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>> +        } else {
>>>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return ret;
>>>>>>>> +}
>>>>>>>>
>>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap
>>>>>>>> again.
>>>>>>>>
>>>>>>>> Why?
>>>>>>>>
>>>>>>>> We just checked that it's all in the expected state, no?
>>>>>>>>
>>>>>>>>
>>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>>>> have
>>>>>>>> to do it here?
>>>>>>>
>>>>>>> I was concerned about the case where the guest issues a request that
>>>>>>> only partial of the range is in the desired state.
>>>>>>> I think the main problem is the policy for the guest conversion
>>>>>>> request.
>>>>>>> My current handling is:
>>>>>>>
>>>>>>> 1. When a conversion request is made for a range already in the
>>>>>>> desired
>>>>>>>      state, the helper simply returns success.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> 2. For requests involving a range partially in the desired state, only
>>>>>>>      the necessary segments are converted, ensuring the entire range
>>>>>>>      complies with the request efficiently.
>>>>>>
>>>>>>
>>>>>> Ah, now I get:
>>>>>>
>>>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr,
>>>>>> offset, size)) ||
>>>>>> +        (!shared_to_private &&
>>>>>> memory_attribute_is_range_populated(mgr,
>>>>>> offset, size))) {
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>>
>>>>>> We're not failing if it might already partially be in the other state.
>>>>>>
>>>>>>> 3. In scenarios where a conversion request is declined by other
>>>>>>> systems,
>>>>>>>      such as a failure from VFIO during notify_populate(), the
>>>>>>> helper will
>>>>>>>      roll back the request, maintaining consistency.
>>>>>>>
>>>>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>>>>> blocks are in the opposite state.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> Actually, this part is still a uncertain to me.
>>>>>>>
>>>>>>
>>>>>> IIUC, the problem does not exist if we only convert a single page at a
>>>>>> time.
>>>>>>
>>>>>> Is there a known use case where such partial conversions could happen?
>>>>>
>>>>> I don't see such case yet. Actually, I'm trying to follow the behavior
>>>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
>>>>> doesn't reject the request if the whole range isn't in the opposite
>>>>> state. It just uses xa_store() to update it. Also, I don't see the spec
>>>>> says how to handle such case. To be robust, I just allow this special
>>>>> case.
>>>>>
>>>>>>
>>>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the
>>>>>>> bitmap
>>>>>>> after the plug/unplug notifier. This is the same, correct?
>>>>>> Right. But because we reject these partial requests, we don't have to
>>>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>>>
>>>>> Yes, If we treat it as a guest error/bug, we can adjust it.
>>>>
>>>> Hi David, do you think which option is better? If prefer to reject the
>>>> partial requests, I'll change it in my next version.
>>>
>>> Hi,
>>>
>>> still scratching my head. Having to work around it as in this patch here is
>>> suboptimal.
>>>
>>> Could we simplify the whole thing while still allowing for (unexpected)
>>> partial
>>> conversions?
>>>
>>> Essentially: If states are mixed, fallback to a "1 block at a time"
>>> handling.
>>>
>>> The only problem is: what to do if we fail halfway through? Well, we can
>>> only have
>>> such partial completions for "populate", not for discard.
>>>
>>> Option a) Just leave it as "partially completed populate" and return the
>>> error. The
>>> bitmap and the notifiers are consistent.
>>>
>>> Option b) Just discard everything: someone tried to convert something
>>> "partial
>>> shared" to "shared". So maybe, if anything goes wrong, we can just have
>>> "all private".
>>>
>>> The question is also, what the expectation from the caller is: can the
>>> caller
>>> even make progress on failure or do we have to retry until it works?
>>
>> Yes, That's the key problem.
>>
>> For core mm side conversion, The caller (guest) handles three case:
>> success, failure and retry. guest can continue on failure but will keep
>> the memory in its original attribute and trigger some calltrace. While
>> in QEMU side, it would cause VM stop if kvm_set_memory_attributes() failed.
>>
>> As for the VFIO conversion, at present, we allow it to fail and don't
>> return error code to guest as long as we undo the conversion. It only
>> causes the device not work in guest.
>>
>> I think if we view the attribute mismatch between core mm and IOMMU as a
>> fatal error, we can call VM stop or let guest retry until it converts
>> successfully.
>>
> 
> Just think more about the options for the failure case handling
> theoretically as we haven't hit such state_change() failure:
> 
> 1. Undo + return invalid error
> Pros: The guest can make progress
> Cons: Complicated undo operations: Option a) is not appliable, because
> it leaves it as partial completed populate, but the guest thinks the
> operation has failed.
> Also need to add the undo for set_memory_attribute() after
> state_change() failed. Maybe also apply the attribute bitmap to
> set_memory_attribute() operation to handle the mixed request case
> 
> 2. Undo in VFIO and no undo for set_memory_attribute() + return success
> (Current approach in my series)
> Pros: The guest can make progress although device doesn't work.
> Cons: the attribute bitmap only tracks the status in iommu.

Right, we should avoid that. Bitmap + notifiers should stay in sync.

> 
> 3. No undo + return retry
> Pros: It keeps the attribute bitmap aligned in core mm and iommu.
> Cons: The guest doesn't know how to handle the retry. It would cause
> infinite loop.
> 
> 4. No undo + no return. Just VM stop.
> Pros: simple
> Cons: maybe overkill.
> 
> Maybe option 1 or 4 is better?

Well, we can proper undo working using a temporary bitmap when
converting to shared and we run into this "mixed" scenario.

Something like this on top of your work:


 From f36e3916596ed5952f15233eb7747c65a6424949 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 25 Feb 2025 09:55:38 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  system/memory-attribute-manager.c | 95 +++++++++++++++++++++----------
  1 file changed, 65 insertions(+), 30 deletions(-)

diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 17c70cf677..e98e7367c1 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
-
-        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
-                                                    memory_attribute_notify_discard_cb);
+        rdl->notify_discard(rdl, &tmp);
      }
  }
  
@@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
-
-        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
-                                                          memory_attribute_notify_populate_cb);
+        ret = rdl->notify_populate(rdl, &tmp);
          if (ret) {
              break;
          }
@@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
              if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                  continue;
              }
-
-            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
-                                                        memory_attribute_notify_discard_cb);
+            rdl2->notify_discard(rdl2, &tmp);
          }
      }
      return ret;
@@ -348,7 +342,12 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
  static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
                                           uint64_t size, bool shared_to_private)
  {
-    int block_size = memory_attribute_manager_get_block_size(mgr);
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    const uint64_t end = offset + size;
+    unsigned long bit;
+    uint64_t cur;
      int ret = 0;
  
      if (!memory_attribute_is_valid_range(mgr, offset, size)) {
@@ -357,32 +356,68 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o
          return -1;
      }
  
-    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
-        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
-        return 0;
-    }
-
      if (shared_to_private) {
-        memory_attribute_notify_discard(mgr, offset, size);
-    } else {
-        ret = memory_attribute_notify_populate(mgr, offset, size);
-    }
-
-    if (!ret) {
-        unsigned long first_bit = offset / block_size;
-        unsigned long nbits = size / block_size;
-
-        g_assert((first_bit + nbits) <= mgr->bitmap_size);
-
-        if (shared_to_private) {
+        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Already private. */
+        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (!test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                clear_bit(bit, mgr->shared_bitmap);
+                memory_attribute_notify_discard(mgr, cur, block_size);
+            }
+        } else {
+            /* Completely shared. */
              bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            memory_attribute_notify_discard(mgr, offset, size);
+        }
+    } else {
+        if (memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Already shared. */
+        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            unsigned long *modified_bitmap = bitmap_new(nbits);
+
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                set_bit(bit, mgr->shared_bitmap);
+                ret = memory_attribute_notify_populate(mgr, cur, block_size);
+                if (!ret) {
+                    set_bit(bit - first_bit, modified_bitmap);
+                    continue;
+                }
+                clear_bit(bit, mgr->shared_bitmap);
+                break;
+            }
+            if (ret) {
+                /*
+                 * Very unexpected: something went wrong. Revert to the old
+                 * state, marking only the blocks as private that we converted
+                 * to shared.
+                 */
+                for (cur = offset; cur < end; cur += block_size) {
+                    bit = cur / block_size;
+                    if (!test_bit(bit - first_bit, modified_bitmap))
+                        continue;
+                    assert(test_bit(bit, mgr->shared_bitmap));
+                    clear_bit(bit, mgr->shared_bitmap);
+                    memory_attribute_notify_discard(mgr, offset, block_size);
+                }
+            }
+            g_free(modified_bitmap);
          } else {
+            /* Completely private. */
              bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+            ret = memory_attribute_notify_populate(mgr, offset, size);
+            if (ret) {
+                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            }
          }
-
-        return 0;
      }
-
      return ret;
  }
  
-- 
2.48.1


-- 
Cheers,

David / dhildenb

next prev parent reply	other threads:[~2025-02-27 11:28 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-20 16:13 [RFC 0/2] arm: Add DMA remapping for CCA Jean-Philippe Brucker
2025-02-20 16:13 ` [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers Jean-Philippe Brucker
2025-02-20 19:39   ` David Hildenbrand
2025-02-21  2:25     ` Chenyi Qiang
2025-02-21  8:09       ` David Hildenbrand
2025-02-21 10:04         ` Chenyi Qiang
2025-02-25  2:00           ` Chenyi Qiang
2025-02-25  9:41             ` David Hildenbrand
2025-02-26 12:43               ` Chenyi Qiang
2025-02-27  3:26                 ` Chenyi Qiang
2025-02-27 11:27                   ` David Hildenbrand [this message]
2025-02-28  5:39                     ` Chenyi Qiang
2025-02-20 16:13 ` [RFC 2/2] target/arm/kvm-rme: Add DMA remapping for the shared memory region Jean-Philippe Brucker

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:17c70cf67 dfblob:e98e7367c )
 OR (
bs:"tmp" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=812768d7-a02d-4b29-95f3-fb7a125cf54e@redhat.com \
    --to=david@redhat.com \
    --cc=aik@amd.com \
    --cc=chao.gao@intel.com \
    --cc=chenyi.qiang@intel.com \
    --cc=jean-philippe@linaro.org \
    --cc=pbonzini@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=peterx@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).