From: Chenyi Qiang <chenyi.qiang@intel.com>
To: "Alexey Kardashevskiy" <aik@amd.com>,
"David Hildenbrand" <david@redhat.com>,
"Peter Xu" <peterx@redhat.com>,
"Gupta Pankaj" <pankaj.gupta@amd.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>,
"Michael Roth" <michael.roth@amd.com>
Cc: <qemu-devel@nongnu.org>, <kvm@vger.kernel.org>,
Williams Dan J <dan.j.williams@intel.com>,
Zhao Liu <zhao1.liu@intel.com>,
Baolu Lu <baolu.lu@linux.intel.com>,
Gao Chao <chao.gao@intel.com>, Xu Yilun <yilun.xu@intel.com>,
Li Xiaoyao <xiaoyao.li@intel.com>
Subject: Re: [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes
Date: Tue, 27 May 2025 17:06:31 +0800
Message-ID: <e2ad3d45-68db-41fe-be1d-cefe0484d52e@intel.com>
In-Reply-To: <952ff8ef-815e-484f-a319-3416dd3c03e8@amd.com>
On 5/27/2025 3:35 PM, Alexey Kardashevskiy wrote:
>
>
> On 20/5/25 20:28, Chenyi Qiang wrote:
>> A new state_change() helper is introduced for RamBlockAttribute
>> to efficiently notify all registered RamDiscardListeners, including
>> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
>> listener can dynamically DMA map/unmap shared pages based on conversion
>> types:
>> - For conversions from shared to private, the VFIO system ensures the
>> discarding of shared mapping from the IOMMU.
>> - For conversions from private to shared, it triggers the population of
>> the shared mapping into the IOMMU.
>>
>> Currently, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation. It would be a future work
>> to add more error handling or rollback mechanisms once conversion
>> failures are allowed. For example, in-place conversion of guest_memfd
>> could retry the unmap operation during the conversion from shared to
>> private. However, for now, keep the complex error handling out of the
>> picture as it is not required:
>>
>> - If a conversion request is made for a page already in the desired
>> state, the helper simply returns success.
>> - For requests involving a range partially in the desired state, there
>> is no such scenario in practice at present. Simply return error.
>> - If a conversion request is declined by other systems, such as a
>> failure from VFIO during notify_to_populated(), the failure is
>> returned directly. As for notify_to_discard(), VFIO cannot fail
>> unmap/unpin, so no error is returned.
>>
>> Note that the bitmap status is updated before callbacks, allowing
>> listeners to handle memory based on the latest status.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Change in v5:
>> - Move the state_change() back to a helper instead of a callback of
>> the class since there's no child for the RamBlockAttributeClass.
>> - Remove the error handling and move them to an individual patch for
>> simple management.
>>
>> Changes in v4:
>> - Add the state_change() callback in PrivateSharedManagerClass
>> instead of the RamBlockAttribute.
>>
>> Changes in v3:
>> - Move the bitmap update before notifier callbacks.
>> - Call the notifier callbacks directly in notify_discard/populate()
>> with the expectation that the request memory range is in the
>> desired attribute.
>> - For the case that only partial range in the desire status, handle
>> the range with block_size granularity for ease of rollback
>> (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-
>> fb7a125cf54e@redhat.com/)
>>
>> Changes in v2:
>> - Do the alignment changes due to the rename to
>> MemoryAttributeManager
>> - Move the state_change() helper definition in this patch.
>> ---
>> include/system/ramblock.h | 2 +
>> system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
>> 2 files changed, 136 insertions(+)
>>
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index 09255e8495..270dffb2f3 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -108,6 +108,8 @@ struct RamBlockAttribute {
>> QLIST_HEAD(, RamDiscardListener) rdl_list;
>> };
>> +int ram_block_attribute_state_change(RamBlockAttribute *attr,
>> uint64_t offset,
>> + uint64_t size, bool to_private);
>
> Not sure about the "to_private" name. I'd think private/shared is
> something KVM operates with and here, in RamBlock, it is discarded/
> populated.
Makes sense. To stay consistent, I will rename it to to_discard.
>
>> RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
>> void ram_block_attribute_destroy(RamBlockAttribute *attr);
>> diff --git a/system/ram-block-attribute.c b/system/ram-block-
>> attribute.c
>> index 8d4a24738c..f12dd4b881 100644
>> --- a/system/ram-block-attribute.c
>> +++ b/system/ram-block-attribute.c
>> @@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const
>> RamDiscardManager *rdm,
>>
>> ram_block_attribute_rdm_replay_cb);
>> }
>> +static bool ram_block_attribute_is_valid_range(RamBlockAttribute
>> *attr,
>> + uint64_t offset,
>> uint64_t size)
>> +{
>> + MemoryRegion *mr = attr->mr;
>> +
>> + g_assert(mr);
>> +
>> + uint64_t region_size = memory_region_size(mr);
>> + int block_size = ram_block_attribute_get_block_size(attr);
>
> It is size_t, not int.
Fixed this and all below. Thanks!
>
>> +
>> + if (!QEMU_IS_ALIGNED(offset, block_size)) {
>
> Does not the @size have to be aligned too?
Yes. Actually, "start" and "size" are already checked for alignment in
kvm_convert_memory(), so I doubt we still need the check here. Anyway, in
case other users appear in the future, I'll add it.
>
>> + return false;
>> + }
>> + if (offset + size < offset || !size) {
>
> This could be just (offset + size <= offset).
> (these overflow checks always blow up my little brain)
Modified.
>
>> + return false;
>> + }
>> + if (offset >= region_size || offset + size > region_size) {
>
> Just (offset + size > region_size) should do.
Ditto.
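
For reference, here is a minimal standalone sketch of how the range check
might look with the three comments above folded in (size alignment, the
merged overflow/zero-size test, and the single upper-bound test). It is a
model only, not the code of the next version: is_aligned() is a hypothetical
stand-in for QEMU_IS_ALIGNED(), and region_size/block_size are passed in
directly instead of being read from the RamBlockAttribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU_IS_ALIGNED(); assumes a power-of-two align. */
static bool is_aligned(uint64_t value, size_t align)
{
    return (value & ((uint64_t)align - 1)) == 0;
}

/*
 * Standalone model of the range check. region_size and block_size are
 * plain parameters here (an assumption made to keep the sketch buildable).
 */
static bool is_valid_range(uint64_t region_size, size_t block_size,
                           uint64_t offset, uint64_t size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    /* Both the start and the length must be block aligned. */
    if (!is_aligned(offset, block_size) || !is_aligned(size, block_size)) {
        return false;
    }
    /* Catches size == 0 and unsigned wrap-around of offset + size. */
    if (offset + size <= offset) {
        return false;
    }
    /* No wrap at this point, so this also implies offset < region_size. */
    if (offset + size > region_size) {
        return false;
    }
    return true;
}
```

The second test makes the third one sufficient on its own: once size is
known to be non-zero and the sum does not wrap, offset is strictly below
offset + size, so a separate offset >= region_size test adds nothing.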
>
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static void ram_block_attribute_notify_to_discard(RamBlockAttribute
>> *attr,
>> + uint64_t offset,
>> + uint64_t size)
>> +{
>> + RamDiscardListener *rdl;
>> +
>> + QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> + MemoryRegionSection tmp = *rdl->section;
>> +
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> + continue;
>> + }
>> + rdl->notify_discard(rdl, &tmp);
>> + }
>> +}
>> +
>> +static int
>> +ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>> + uint64_t offset, uint64_t size)
>> +{
>> + RamDiscardListener *rdl;
>> + int ret = 0;
>> +
>> + QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> + MemoryRegionSection tmp = *rdl->section;
>> +
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> + continue;
>> + }
>> + ret = rdl->notify_populate(rdl, &tmp);
>> + if (ret) {
>> + break;
>> + }
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static bool ram_block_attribute_is_range_populated(RamBlockAttribute
>> *attr,
>> + uint64_t offset,
>> + uint64_t size)
>> +{
>> + const int block_size = ram_block_attribute_get_block_size(attr);
>
> size_t.
>
>> + const unsigned long first_bit = offset / block_size;
>> + const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> + unsigned long found_bit;
>> +
>> + /* We fake a shorter bitmap to avoid searching too far. */
>
> What is "fake" about it? We truthfully check here that every bit in
> [first_bit, last_bit] is set.
Aha, you are asking this question again :)
(https://lore.kernel.org/qemu-devel/7131b4a3-a836-4efd-bcfc-982a0112ef05@intel.com/)
If it is really confusing, let me remove this comment in the next version.
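
To illustrate what the comment was trying to say, here is a small standalone
sketch. It assumes a naive model of find_next_zero_bit() with the same
(bitmap, size, offset) argument order; the *_model names and range_all_set()
are hypothetical and exist only for this example. Passing last_bit + 1 as the
bitmap size is what bounds the search, so a result greater than last_bit
means every bit in [first_bit, last_bit] is set.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG_MODEL (sizeof(unsigned long) * CHAR_BIT)

/* Returns true if bit nr is set in the bitmap. */
static bool test_bit_model(const unsigned long *map, unsigned long nr)
{
    return (map[nr / BITS_PER_LONG_MODEL] >> (nr % BITS_PER_LONG_MODEL)) & 1UL;
}

/* Naive model: first zero bit in [offset, size), or size if there is none. */
static unsigned long find_next_zero_bit_model(const unsigned long *map,
                                              unsigned long size,
                                              unsigned long offset)
{
    unsigned long bit;

    for (bit = offset; bit < size; bit++) {
        if (!test_bit_model(map, bit)) {
            return bit;
        }
    }
    return size;
}

/* True if every bit in [first_bit, last_bit] is set. */
static bool range_all_set(const unsigned long *map,
                          unsigned long first_bit, unsigned long last_bit)
{
    /* Using last_bit + 1 as the size stops the search at last_bit. */
    return find_next_zero_bit_model(map, last_bit + 1, first_bit) > last_bit;
}
```

With a bitmap of 0x3C (bits 2..5 set), range_all_set(map, 2, 5) holds, while
a range starting at bit 1 or ending at bit 6 does not.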
>
>> + found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> + first_bit);
>> + return found_bit > last_bit;
>> +}
>> +
>> +static bool
>> +ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
>> + uint64_t offset, uint64_t size)
>> +{
>> + const int block_size = ram_block_attribute_get_block_size(attr);
>
> size_t.
>
>> + const unsigned long first_bit = offset / block_size;
>> + const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> + unsigned long found_bit;
>> +
>> + /* We fake a shorter bitmap to avoid searching too far. */
>> + found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>> + return found_bit > last_bit;
>> +}
>> +
>> +int ram_block_attribute_state_change(RamBlockAttribute *attr,
>> uint64_t offset,
>> + uint64_t size, bool to_private)
>> +{
>> + const int block_size = ram_block_attribute_get_block_size(attr);
>
> size_t.
>
>> + const unsigned long first_bit = offset / block_size;
>> + const unsigned long nbits = size / block_size;
>> + int ret = 0;
>> +
>> + if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> + __func__, offset, size);
>> + return -1;
>
> May be -EINVAL?
Modified.
>
>> + }
>> +
>> + /* Already discard/populated */
>> + if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>> + to_private) ||
>> + (ram_block_attribute_is_range_populated(attr, offset, size) &&
>> + !to_private)) {
>
> A tracepoint would be useful here imho.
[...]
>
>> + return 0;
>> + }
>> +
>> + /* Unexpected mixture */
>> + if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>> + to_private) ||
>> + (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>> + !to_private)) {
>> + error_report("%s, the range is not all in the desired state: "
>> + "(offset 0x%lx, size 0x%lx), %s",
>> + __func__, offset, size,
>> + to_private ? "private" : "shared");
>> + return -1;
>
> -EBUSY?
Maybe also -EINVAL, since it is caused by an invalid mixed range being
provided? But anyway, following the discussion in patch #10, I'll add
support for this mixed scenario, so there will be no need to return an error.
>
>> + }
>> +
>> + if (to_private) {
>> + bitmap_clear(attr->bitmap, first_bit, nbits);
>> + ram_block_attribute_notify_to_discard(attr, offset, size);
>> + } else {
>> + bitmap_set(attr->bitmap, first_bit, nbits);
>> + ret = ram_block_attribute_notify_to_populated(attr, offset,
>> size);
>> + }
>
> and a successful tracepoint here may be?
Good suggestion! I'll add tracepoints in the next version.
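
As a rough model of the flow discussed in this patch (to_discard naming,
-EINVAL for bad input, bitmap updated before the listeners run), a
standalone sketch could look like the following. Everything here is an
assumption made for illustration: ModelAttr, range_mask() and
model_state_change() are hypothetical, the bitmap is a single 64-bit word,
and there is one optional listener per direction instead of the
RamDiscardListener list. A mixed range is still rejected, as in v5.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BLOCKS 64u

typedef struct ModelAttr {
    uint64_t bitmap;      /* bit set = populated (shared) */
    uint64_t block_size;  /* bytes covered by one bit */
    int (*notify_populate)(uint64_t offset, uint64_t size); /* may fail */
    void (*notify_discard)(uint64_t offset, uint64_t size); /* cannot fail */
} ModelAttr;

/* Mask covering nbits bits starting at first; needs first + nbits <= 64. */
static uint64_t range_mask(uint64_t first, uint64_t nbits)
{
    uint64_t m = nbits >= 64 ? UINT64_MAX : (UINT64_C(1) << nbits) - 1;

    return m << first;
}

static int model_state_change(ModelAttr *attr, uint64_t offset,
                              uint64_t size, bool to_discard)
{
    const uint64_t bs = attr->block_size;
    uint64_t first, nbits, mask, cur;

    if (!bs || !size || offset % bs || size % bs ||
        size / bs > MODEL_BLOCKS || offset / bs > MODEL_BLOCKS - size / bs) {
        return -EINVAL;
    }
    first = offset / bs;
    nbits = size / bs;
    mask = range_mask(first, nbits);
    cur = attr->bitmap & mask;

    /* Already entirely in the requested state: nothing to do. */
    if (cur == (to_discard ? 0 : mask)) {
        return 0;
    }
    /* Mixed range: rejected in this sketch. */
    if (cur != (to_discard ? mask : 0)) {
        return -EINVAL;
    }

    /* Update the bitmap first so listeners observe the new state. */
    if (to_discard) {
        attr->bitmap &= ~mask;
        if (attr->notify_discard) {
            attr->notify_discard(offset, size);
        }
        return 0;
    }
    attr->bitmap |= mask;
    return attr->notify_populate ? attr->notify_populate(offset, size) : 0;
}
```

Note that, as in this patch, a failure from the populate listener is simply
propagated and the bitmap is not rolled back; that is left to the separate
error-handling patch.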
>
>> +
>> + return ret;
>> +}
>> +
>> RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
>> {
>> uint64_t bitmap_size;
>