qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Chenyi Qiang <chenyi.qiang@intel.com>
To: "Alexey Kardashevskiy" <aik@amd.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	"Gupta Pankaj" <pankaj.gupta@amd.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Michael Roth" <michael.roth@amd.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org,
	"Williams Dan J" <dan.j.williams@intel.com>,
	"Zhao Liu" <zhao1.liu@intel.com>,
	"Baolu Lu" <baolu.lu@linux.intel.com>,
	"Gao Chao" <chao.gao@intel.com>, "Xu Yilun" <yilun.xu@intel.com>,
	"Li Xiaoyao" <xiaoyao.li@intel.com>,
	"Cédric Le Goater" <clg@kaod.org>,
	"Alex Williamson" <alex.williamson@redhat.com>
Subject: Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
Date: Thu, 5 Jun 2025 14:01:59 +0800	[thread overview]
Message-ID: <0fc5d786-5cb7-4fe6-aab4-b641b815c04d@intel.com> (raw)
In-Reply-To: <55ebb008-a26f-4173-937a-3bb2d8a6c972@amd.com>



On 6/4/2025 7:04 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 30/5/25 18:32, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") highlighted that subsystems like VFIO may disable RAM block
>> discard. However, guest_memfd relies on discard operations for page
>> conversion between private and shared memory, potentially leading to
>> the stale IOMMU mapping issue when assigning hardware devices to
>> confidential VMs via shared memory. To address this and allow shared
>> device assignement, it is crucial to ensure the VFIO system refreshes
>> its IOMMU mappings.
>>
>> RamDiscardManager is an existing interface (used by virtio-mem) to
>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other. Therefore, similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>> have a memory backend while others do not. Notably, virtual BIOS
>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>> backend.
>>
>> To manage RAMBlocks with guest_memfd, define a new object named
>> RamBlockAttributes to implement the RamDiscardManager interface. This
>> object can store the guest_memfd information such as bitmap for shared
>> memory and the registered listeners for event notification. In the
>> context of RamDiscardManager, shared state is analogous to populated, and
>> private state is signified as discarded. To notify the conversion events,
>> a new state_change() helper is exported for the users to notify the
>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
>> shared mapping.
>>
>> Note that the memory state is tracked at the host page size granularity,
>> as the minimum conversion size can be one page per request and VFIO
>> expects the DMA mapping for a specific iova to be mapped and unmapped
>> with the same granularity. Confidential VMs may perform partial
>> conversions, such as conversions on small regions within larger ones.
>> To prevent such invalid cases and until DMA mapping cut operation
>> support is available, all operations are performed with 4K granularity.
>>
>> In addition, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation at present. It would be
>> future work to add more error handling or rollback mechanisms once
>> conversion failures are allowed. For example, in-place conversion of
>> guest_memfd could retry the unmap operation during the conversion from
>> shared to private. For now, keep the complex error handling out of the
>> picture as it is not required.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

[...]

>> +
>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>> +                                      uint64_t offset, uint64_t size,
>> +                                      bool to_discard)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    bool is_range_discarded, is_range_populated;
> 
> Can be reduced to "discarded" and "populated".

[...]

> 
>> +    const uint64_t end = offset + size;
>> +    unsigned long bit;
>> +    uint64_t cur;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    is_range_discarded =
>> ram_block_attributes_is_range_discarded(attr, offset,
>> +                                                                 size);
> 
> See - needlessly long line.

Yes, doing the rename can avoid long line.

> 
>> +    is_range_populated =
>> ram_block_attributes_is_range_populated(attr, offset,
>> +                                                                 size);
> 
> If ram_block_attributes_is_range_populated() returned
> (found_bit*block_size), you could tell from a single call if it is
> populated (found_bit == size) or discarded (found_bit == 0), otherwise
> it is a mix (and dump just this number in the tracepoint below).
> 
> And then ditch ram_block_attributes_is_range_discarded() which is
> practically cut-n-paste. And then open code
> ram_block_attributes_is_range_populated().
> 
> These two are not used elsewhere anyway.
> 
>> +
>> +    trace_ram_block_attributes_state_change(offset, size,
>> +                                            is_range_discarded ?
>> "discarded" :
>> +                                            is_range_populated ?
>> "populated" :
>> +                                            "mixture",
>> +                                            to_discard ? "discarded" :
>> +                                            "populated");
> 
> 
> I'd just dump 3 numbers (is_range_discarded, is_range_populated,
> to_discard) in the tracepoint as:
> 
> ram_block_attributes_state_change(uint64_t offset, uint64_t size, int
> discarded, int populated, int to_discard) "offset 0x%"PRIx64" size
> 0x%"PRIx64" discarded=%d populated=%d to_discard=%d"

Maybe a string of "from xxx to xxx" is more straightforward and it can
also cover your information. Anyway I don't think they have much difference.

> 
> 
> 
>> +    if (to_discard) {
>> +        if (is_range_discarded) {
>> +            /* Already private */
>> +        } else if (is_range_populated) {
>> +            /* Completely shared */
>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            ram_block_attributes_notify_discard(attr, offset, size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
> 
> imho a little bit more accurate to:
> 
> for (bit = first_bit; bit < first_bit + nbits; ++bit) {
> 
> as you already have calculated first_bit, nbits...
> 
>> +                bit = cur / block_size;
> 
> ... and drop this ...
> 
>> +                if (!test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                ram_block_attributes_notify_discard(attr, cur,
>> block_size);
> 
> .. and do: ram_block_attributes_notify_discard(attr, bit * block_size,
> block_size);
> 
> Then you can drop @cur which is used in one place inside the loop.

Yes, looks it can reduce some lines.

> 
> 
>> +            }
>> +        }
>> +    } else {
>> +        if (is_range_populated) {
>> +            /* Already shared */
>> +        } else if (is_range_discarded) {
>> +            /* Complete private */
> 
> s/Complete/Completely/
> 
>> +            bitmap_set(attr->bitmap, first_bit, nbits);
>> +            ret = ram_block_attributes_notify_populate(attr, offset,
>> size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                set_bit(bit, attr->bitmap);
>> +                ret = ram_block_attributes_notify_populate(attr, cur,
>> +                                                           block_size);
>> +                if (ret) {
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
>> +{
>> +    uint64_t bitmap_size;
> 
> Not really needed.
> 
>> +    const int block_size  = qemu_real_host_page_size();
>> +    RamBlockAttributes *attr;
>> +    int ret;
>> +    MemoryRegion *mr = ram_block->mr;
>> +
>> +    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
>> +
>> +    attr->ram_block = ram_block;
>> +    ret = memory_region_set_ram_discard_manager(mr,
>> RAM_DISCARD_MANAGER(attr));
>> +    if (ret) {
> 
> Could just "if (memory_region_set_ram_discard_manager(...))".
> 
>> +        object_unref(OBJECT(attr));
>> +        return NULL;
>> +    }
>> +    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>> +    attr->bitmap_size = bitmap_size;
>> +    attr->bitmap = bitmap_new(bitmap_size);
>> +
>> +    return attr;
>> +}
>> +
>> +void ram_block_attributes_destroy(RamBlockAttributes *attr)
>> +{
>> +    if (!attr) {
> 
> 
> Rather g_assert().
> 
> 
>> +        return;
>> +    }
>> +
>> +    g_free(attr->bitmap);
>> +    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
>> +    object_unref(OBJECT(attr));
>> +}
>> +
>> +static void ram_block_attributes_init(Object *obj)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
>> +
>> +    QLIST_INIT(&attr->rdl_list);
>> +}
> 
> Not used.
> 
>> +
>> +static void ram_block_attributes_finalize(Object *obj)
> 
> Not used.
> 
> Besides these two, feel free to ignore other comments :)

The init() and finalize() calls are used in the
OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES() macro. It is a common
template. I guess you forgot this again[1] :)

Other comments look good to me. But I think they are some minor changes.
I'll see if need a new version to carry these changes.


[1]
https://lore.kernel.org/qemu-devel/89e791a5-b71b-4b9d-a8b4-e225bfbd1bc2@amd.com/

> 
> Otherwise,
> 
> Tested-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
> 
> 
>> +{
>> +}
>> +
>> +static void ram_block_attributes_class_init(ObjectClass *klass,
>> +                                            const void *data)
>> +{
>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
>> +
>> +    rdmc->get_min_granularity =
>> ram_block_attributes_rdm_get_min_granularity;
>> +    rdmc->register_listener =
>> ram_block_attributes_rdm_register_listener;
>> +    rdmc->unregister_listener =
>> ram_block_attributes_rdm_unregister_listener;
>> +    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
>> +    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
>> +    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
>> +}
>> diff --git a/system/trace-events b/system/trace-events
>> index be12ebfb41..82856e44f2 100644
>> --- a/system/trace-events
>> +++ b/system/trace-events
>> @@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
>>   dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t
>> time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time
>> %"PRIi64 " us"
>>   dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set
>> dirty page rate limit %"PRIu64
>>   dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us)
>> "CPU[%d] sleep %"PRIi64 " us"
>> +
>> +# ram-block-attributes.c
>> +ram_block_attributes_state_change(uint64_t offset, uint64_t size,
>> const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64"
>> from '%s' to '%s'"
> 
> 
> 



  parent reply	other threads:[~2025-06-05  6:03 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-30  8:32 [PATCH v6 0/5] Enable shared device assignment Chenyi Qiang
2025-05-30  8:32 ` [PATCH v6 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-06-01  9:19   ` Gupta, Pankaj
2025-06-01 15:44   ` Xiaoyao Li
2025-05-30  8:32 ` [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
2025-06-01  9:21   ` Gupta, Pankaj
2025-06-01 15:53   ` Xiaoyao Li
2025-06-04 14:37   ` Alexey Kardashevskiy
2025-05-30  8:32 ` [PATCH v6 3/5] memory: Unify the definiton of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
2025-06-01 16:12   ` Xiaoyao Li
2025-06-02  9:20   ` Gupta, Pankaj
2025-05-30  8:32 ` [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
2025-06-01  9:58   ` Gupta, Pankaj
2025-06-03  1:26     ` Chenyi Qiang
2025-06-03  5:26       ` Gupta, Pankaj
2025-06-03  6:33         ` Chenyi Qiang
2025-06-03  7:17           ` Gupta, Pankaj
2025-06-03  7:41             ` David Hildenbrand
2025-06-03  9:45               ` Gupta, Pankaj
2025-06-03 12:05                 ` Chenyi Qiang
2025-06-02 21:10   ` David Hildenbrand
2025-06-03  1:57     ` Chenyi Qiang
2025-06-04 11:04   ` Alexey Kardashevskiy
2025-06-05  0:35     ` Alexey Kardashevskiy
2025-06-05  3:22       ` Chenyi Qiang
2025-06-05  6:01     ` Chenyi Qiang [this message]
2025-05-30  8:32 ` [PATCH v6 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
2025-06-04 11:04   ` Alexey Kardashevskiy
2025-06-04 11:29   ` David Hildenbrand
2025-06-09  3:23 ` [PATCH v6 0/5] Enable shared device assignment Chenyi Qiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0fc5d786-5cb7-4fe6-aab4-b641b815c04d@intel.com \
    --to=chenyi.qiang@intel.com \
    --cc=aik@amd.com \
    --cc=alex.williamson@redhat.com \
    --cc=baolu.lu@linux.intel.com \
    --cc=chao.gao@intel.com \
    --cc=clg@kaod.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=michael.roth@amd.com \
    --cc=pankaj.gupta@amd.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=xiaoyao.li@intel.com \
    --cc=yilun.xu@intel.com \
    --cc=zhao1.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).