From: Alexey Kardashevskiy <aik@amd.com>
To: "Chenyi Qiang" <chenyi.qiang@intel.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	"Gupta Pankaj" <pankaj.gupta@amd.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Michael Roth" <michael.roth@amd.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org,
	Williams Dan J <dan.j.williams@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Baolu Lu <baolu.lu@linux.intel.com>,
	Gao Chao <chao.gao@intel.com>, Xu Yilun <yilun.xu@intel.com>,
	Li Xiaoyao <xiaoyao.li@intel.com>
Subject: Re: [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes
Date: Tue, 27 May 2025 19:19:06 +1000
Message-ID: <2702b8d4-2db2-44dc-838f-a67adbb5cf7b@amd.com>
In-Reply-To: <e2ad3d45-68db-41fe-be1d-cefe0484d52e@intel.com>



On 27/5/25 19:06, Chenyi Qiang wrote:
> 
> 
> On 5/27/2025 3:35 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 20/5/25 20:28, Chenyi Qiang wrote:
>>> A new state_change() helper is introduced for RamBlockAttribute
>>> to efficiently notify all registered RamDiscardListeners, including
>>> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
>>> listener can dynamically DMA map/unmap shared pages based on conversion
>>> types:
>>> - For conversions from shared to private, the VFIO system ensures the
>>>     discarding of shared mapping from the IOMMU.
>>> - For conversions from private to shared, it triggers the population of
>>>     the shared mapping into the IOMMU.
>>>
>>> Currently, memory conversion failures cause QEMU to quit instead of
>>> resuming the guest or retrying the operation. It would be a future work
>>> to add more error handling or rollback mechanisms once conversion
>>> failures are allowed. For example, in-place conversion of guest_memfd
>>> could retry the unmap operation during the conversion from shared to
>>> private. However, for now, keep the complex error handling out of the
>>> picture as it is not required:
>>>
>>> - If a conversion request is made for a page already in the desired
>>>     state, the helper simply returns success.
>>> - For requests involving a range partially in the desired state, there
>>>     is no such scenario in practice at present. Simply return error.
>>> - If a conversion request is declined by other systems, such as a
>>>     failure from VFIO during notify_to_populated(), the failure is
>>>     returned directly. As for notify_to_discard(), VFIO cannot fail
>>>     unmap/unpin, so no error is returned.
>>>
>>> Note that the bitmap status is updated before callbacks, allowing
>>> listeners to handle memory based on the latest status.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Change in v5:
>>>       - Move the state_change() back to a helper instead of a callback of
>>>         the class since there's no child for the RamBlockAttributeClass.
>>>       - Remove the error handling and move them to an individual patch for
>>>         simple management.
>>>
>>> Changes in v4:
>>>       - Add the state_change() callback in PrivateSharedManagerClass
>>>         instead of the RamBlockAttribute.
>>>
>>> Changes in v3:
>>>       - Move the bitmap update before notifier callbacks.
>>>       - Call the notifier callbacks directly in notify_discard/populate()
>>>         with the expectation that the request memory range is in the
>>>         desired attribute.
>>>       - For the case where only part of the range is in the desired
>>>         state, handle the range with block_size granularity for ease of
>>>         rollback
>>>         (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-fb7a125cf54e@redhat.com/)
>>>
>>> Changes in v2:
>>>       - Do the alignment changes due to the rename to
>>>         MemoryAttributeManager
>>>       - Move the state_change() helper definition in this patch.
>>> ---
>>>    include/system/ramblock.h    |   2 +
>>>    system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
>>>    2 files changed, 136 insertions(+)
>>>
>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>> index 09255e8495..270dffb2f3 100644
>>> --- a/include/system/ramblock.h
>>> +++ b/include/system/ramblock.h
>>> @@ -108,6 +108,8 @@ struct RamBlockAttribute {
>>>        QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>    };
>>> +int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>>> +                                     uint64_t size, bool to_private);
>>
>> Not sure about the "to_private" name. I'd think private/shared is
>> something KVM operates with and here, in RamBlock, it is discarded/
>> populated.
> 
> Makes sense. To keep it consistent, I will rename it to to_discard.
> 
>>
>>>    RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
>>>    void ram_block_attribute_destroy(RamBlockAttribute *attr);
>>> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
>>> index 8d4a24738c..f12dd4b881 100644
>>> --- a/system/ram-block-attribute.c
>>> +++ b/system/ram-block-attribute.c
>>> @@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const RamDiscardManager *rdm,
>>>                                               ram_block_attribute_rdm_replay_cb);
>>>    }
>>> +static bool ram_block_attribute_is_valid_range(RamBlockAttribute *attr,
>>> +                                               uint64_t offset, uint64_t size)
>>> +{
>>> +    MemoryRegion *mr = attr->mr;
>>> +
>>> +    g_assert(mr);
>>> +
>>> +    uint64_t region_size = memory_region_size(mr);
>>> +    int block_size = ram_block_attribute_get_block_size(attr);
>>
>> It is size_t, not int.
> 
> Fixed this and all below. Thanks!
> 
>>
>>> +
>>> +    if (!QEMU_IS_ALIGNED(offset, block_size)) {
>>
>> Doesn't @size have to be aligned too?
> 
> Yes. Actually, "start" and "size" already get an alignment check in
> kvm_convert_memory(), so I doubt we still need it here.

Sure. My point is either check them both or neither.

> Anyway, in
> case of other users in the future, I'll add it.

Ok.
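
To spell out what I mean by "both or neither", here is a minimal standalone sketch. It is not the QEMU code: is_aligned() is a hypothetical stand-in for QEMU_IS_ALIGNED(), and range_is_block_aligned() is a name made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU_IS_ALIGNED(); align must be non-zero. */
static bool is_aligned(uint64_t value, uint64_t align)
{
    assert(align != 0);
    return value % align == 0;
}

/* Check the start and the length together, so a request cannot end mid-block. */
static bool range_is_block_aligned(uint64_t offset, uint64_t size,
                                   uint64_t block_size)
{
    return is_aligned(offset, block_size) && is_aligned(size, block_size);
}
```

With a 4K block size, (0x2000, 0x1000) passes, while (0x2000, 0x800) and (0x2800, 0x1000) are both rejected.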

>>
>>> +        return false;
>>> +    }
>>> +    if (offset + size < offset || !size) {
>>
>> This could be just (offset + size <= offset).
>> (these overflow checks always blow up my little brain)
> 
> Modified.
> 
>>
>>> +        return false;
>>> +    }
>>> +    if (offset >= region_size || offset + size > region_size) {
>>
>> Just (offset + size > region_size) should do.
> 
> Ditto.
> 
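
A rough sketch of how the two simplified checks fit together, assuming plain uint64_t arithmetic. The function names are hypothetical and this is a standalone illustration, not the helper from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical standalone version of the range check, not the QEMU helper. */
static bool range_is_within_region(uint64_t offset, uint64_t size,
                                   uint64_t region_size)
{
    /* One comparison rejects both size == 0 and unsigned wrap-around. */
    if (offset + size <= offset) {
        return false;
    }
    /* The end is now known to be past offset, so bounding the end suffices. */
    if (offset + size > region_size) {
        return false;
    }
    return true;
}

/* A few cases spelled out as assertions. */
static void range_check_examples(void)
{
    assert(range_is_within_region(0, 0x1000, 0x1000));
    assert(!range_is_within_region(0x1000, 0, 0x2000));         /* empty */
    assert(!range_is_within_region(UINT64_MAX, 2, UINT64_MAX)); /* wraps */
    assert(!range_is_within_region(0x1000, 0x1000, 0x1000));    /* starts at end */
}
```

Once the first test has passed, offset + size is strictly greater than offset, so (offset + size <= region_size) already implies (offset < region_size). That is why the separate (offset >= region_size) test is redundant.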
>>
>>> +        return false;
>>> +    }
>>> +    return true;
>>> +}
>>> +
>>> +static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>>> +                                                  uint64_t offset,
>>> +                                                  uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> +        rdl->notify_discard(rdl, &tmp);
>>> +    }
>>> +}
>>> +
>>> +static int
>>> +ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>>> +                                        uint64_t offset, uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +    int ret = 0;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> +        ret = rdl->notify_populate(rdl, &tmp);
>>> +        if (ret) {
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static bool ram_block_attribute_is_range_populated(RamBlockAttribute *attr,
>>> +                                                   uint64_t offset,
>>> +                                                   uint64_t size)
>>> +{
>>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>>
>> size_t.
>>
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    /* We fake a shorter bitmap to avoid searching too far. */
>>
>> What is "fake" about it? We truthfully check here that every bit in
>> [first_bit, last_bit] is set.
> 
> Aha, you ask this question again :)
> (https://lore.kernel.org/qemu-devel/7131b4a3-a836-4efd-bcfc-982a0112ef05@intel.com/)

ah sorry :)

> If it is really confusing, let me remove this comment in next version.

Yes please. It is quite obvious: if the helper takes a size, then that is the range the caller wants to search within.
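
To illustrate why nothing is "fake" here, a minimal sketch of the same check over a plain bitmap. next_zero_bit() is a simplified, hypothetical stand-in for find_next_zero_bit() (a bit-by-bit loop, not the real implementation), and range_all_set() is a made-up name.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Simplified, hypothetical stand-in for find_next_zero_bit(): returns the
 * index of the first clear bit in [start, end), or end if there is none.
 */
static size_t next_zero_bit(const unsigned long *map, size_t end, size_t start)
{
    for (size_t i = start; i < end; i++) {
        if (!(map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))) {
            return i;
        }
    }
    return end;
}

/* True if every bit in [first_bit, first_bit + nbits) is set. */
static bool range_all_set(const unsigned long *map, size_t first_bit,
                          size_t nbits)
{
    assert(nbits > 0);
    const size_t end = first_bit + nbits;

    /* The search is bounded by the caller's range, which is all we care about. */
    return next_zero_bit(map, end, first_bit) == end;
}
```

Passing the end of the requested range as the search limit is simply the question being asked: is there a clear bit anywhere inside the range?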

> 
>>
>>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>>> +                                   first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
>>> +                                     uint64_t offset, uint64_t size)
>>> +{
>>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>>
>> size_t.
>>
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    /* We fake a shorter bitmap to avoid searching too far. */
>>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>>> +                                     uint64_t size, bool to_private)
>>> +{
>>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>>
>> size_t.
>>
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long nbits = size / block_size;
>>> +    int ret = 0;
>>> +
>>> +    if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>> +                     __func__, offset, size);
>>> +        return -1;
>>
>> Maybe -EINVAL?
> 
> Modified.
> 
>>
>>> +    }
>>> +
>>> +    /* Already discard/populated */
>>> +    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>>> +         to_private) ||
>>> +        (ram_block_attribute_is_range_populated(attr, offset, size) &&
>>> +         !to_private)) {
>>
>> A tracepoint would be useful here imho.
> 
> [...]
> 
>>
>>> +        return 0;
>>> +    }
>>> +
>>> +    /* Unexpected mixture */
>>> +    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>>> +         to_private) ||
>>> +        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>>> +         !to_private)) {
>>> +        error_report("%s, the range is not all in the desired state: "
>>> +                     "(offset 0x%lx, size 0x%lx), %s",
>>> +                     __func__, offset, size,
>>> +                     to_private ? "private" : "shared");
>>> +        return -1;
>>
>> -EBUSY?
> 
> Maybe also -EINVAL since it is due to the invalid provided mixture
> range?

Maybe. I just prefer them to be different; it might save some time on gdb-ing or adding printf's. Thanks,
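
Something along these lines is what I have in mind. It is only a sketch: check_request() and its two flags are hypothetical and stand in for the range validation and the mixed-state test in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of a state-change request. */
static int check_request(bool range_valid, bool range_mixed)
{
    if (!range_valid) {
        return -EINVAL; /* malformed offset/size */
    }
    if (range_mixed) {
        return -EBUSY;  /* valid range, but only partly in the expected state */
    }
    return 0;
}

/* The caller can tell the two failures apart without a debugger. */
static void check_request_examples(void)
{
    assert(check_request(false, false) == -EINVAL);
    assert(check_request(true, true) == -EBUSY);
    assert(check_request(true, false) == 0);
}
```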

> But anyway, according to the discussion in patch #10, I'll add
> the support for this mixture scenario. No need to return the error.

Yeah, the chunk from 10/10 should really be here.

>>
>>> +    }
>>> +
>>> +    if (to_private) {
>>> +        bitmap_clear(attr->bitmap, first_bit, nbits);
>>> +        ram_block_attribute_notify_to_discard(attr, offset, size);
>>> +    } else {
>>> +        bitmap_set(attr->bitmap, first_bit, nbits);
>>> +        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
>>> +    }
>>
>> And maybe a tracepoint for the successful case here?
> 
> Good suggestion! I'll add tracepoint in next version.
> 
>>
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>    RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
>>>    {
>>>        uint64_t bitmap_size;
>>
> 

-- 
Alexey


