From: Jonah Palmer <jonah.palmer@oracle.com>
To: Markus Armbruster <armbru@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>,
	qemu-devel@nongnu.org, eperezma@redhat.com, peterx@redhat.com,
	mst@redhat.com, lvivier@redhat.com, dtatulea@nvidia.com,
	leiyang@redhat.com, parav@mellanox.com, sgarzare@redhat.com,
	lingshan.zhu@intel.com, boris.ostrovsky@oracle.com,
	Si-Wei Liu <si-wei.liu@oracle.com>
Subject: Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Date: Mon, 7 Jul 2025 09:21:36 -0400
Message-ID: <face37ee-9850-448f-914b-cd90a39d3451@oracle.com>
In-Reply-To: <87frfcj904.fsf@pond.sub.org>



On 7/4/25 11:00 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
> 
>> On 6/26/25 8:08 AM, Markus Armbruster wrote:
> 
> [...]
> 
>> Apologies for the delay in getting back to you. I just wanted to be thorough and answer everything as accurately and clearly as possible.
>>
>> ----
>>
>> Before these patches, pinning started in vhost_vdpa_dev_start(), where the memory listener was registered; the listener's vhost_vdpa_listener_region_add() callback then performed the actual memory pinning. This happens after entering qemu_main_loop().
>>
>> After these patches, pinning starts in vhost_dev_init() (specifically in vhost_vdpa_set_owner()), where the memory listener registration has been moved. This happens *before* entering qemu_main_loop().
>>
>> However, not all of the pinning happens before qemu_main_loop(). The pinning that happens before we enter qemu_main_loop() is the full guest RAM pinning, which is the main, heavy-lifting part of pinning memory.
>>
>> The rest of the pinning work happens after entering qemu_main_loop() (at approximately the same point as where pinning started before these patches). But since we already did the heavy lifting before qemu_main_loop() (i.e. all pages were already allocated and pinned), we're just re-pinning here (the kernel just updates its IOTLB tables for pages that are already mapped and locked in RAM).
>>
>> This makes the pinning work we do after entering qemu_main_loop() much faster compared to the same pinning we had to do before these patches.
>>
>> However, we have to pay a cost for this. Because we do the heavy-lifting work earlier, before qemu_main_loop(), we're pinning cold memory. That is, the guest hasn't touched its memory yet; all host pages are still anonymous and unallocated. This means that doing the pinning earlier is more expensive time-wise, since we also need to allocate physical pages for each chunk of memory.
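
To give some intuition for the cold vs. warm difference outside of QEMU,
here is a minimal stand-alone sketch. It is my illustration, not code from
the series, and it uses plain mmap()/mlock() rather than the actual vDPA
DMA-map path; the 1 GiB size is arbitrary. It shows that locking
never-touched anonymous memory pays for page allocation on top of the
locking itself, while locking pre-faulted memory is much cheaper:

/*
 * Illustrative micro-benchmark: time mlock() on cold (never-touched) vs.
 * warm (pre-faulted) anonymous memory.
 * Build: gcc -O2 -o pin-cost pin-cost.c
 * Needs RLIMIT_MEMLOCK >= 1 GiB (e.g. ulimit -l unlimited).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE (1UL << 30)                /* 1 GiB, purely illustrative */

static double lock_and_time(int prefault)
{
    struct timespec t0, t1;
    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    if (prefault) {
        memset(buf, 1, SIZE);           /* touch every page first ("warm") */
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (mlock(buf, SIZE) != 0) {        /* pin; cold pages get allocated here */
        perror("mlock");
        exit(1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munlock(buf, SIZE);
    munmap(buf, SIZE);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("mlock() of cold memory: %.3fs\n", lock_and_time(0));
    printf("mlock() of warm memory: %.3fs\n", lock_and_time(1));
    return 0;
}
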
>>
>> To (hopefully) show this more clearly, I ran some tests before and after these patches and averaged the results. I used a 50G guest with real vDPA hardware (Mellanox CX-6Dx):
>>
>> 0.) How many vhost_vdpa_listener_region_add() (pins) calls?
>>
>>                | Total | Before qemu_main_loop | After qemu_main_loop
>> ---------------|-------|-----------------------|---------------------
>> Before patches |   6   |           0           |          6
>> After patches  |  11   |           5           |          6
>>
>> - After the patches, this looks like we doubled the work we're doing (given the extra 5 calls); however, the 6 calls that happen after entering qemu_main_loop() are essentially replays of the first 5 we did.
>>
>>   * In other words, after the patches, the 6 calls made after entering qemu_main_loop() are performed much faster than the same 6 calls before the patches.
>>
>>   * From my measurements, these are the timings it took to perform those 6 calls after entering qemu_main_loop():
>>     > Before patches: 0.0770s
>>     > After patches:  0.0065s
>>
>> ---
>>
>> 1.) Time from starting the guest to entering qemu_main_loop():
>>   * Before patches: 0.112s
>>   * After patches:  3.900s
>>
>> - This is due to the 5 early pins we're now doing with these patches, whereas before the patches no pinning work happened during this phase at all.
>>
>> - Measuring the time between the first and last vhost_vdpa_listener_region_add() calls during this period puts the early pinning at ~3s.
> 
> So, total time increases: early pinning (before main loop) takes more
> time than we save pinning (in the main loop).  Correct?
> 

Correct. We only save ~0.07s from the pinning that happens in the main 
loop. But the extra 3s we now need to spend pinning before 
qemu_main_loop() overshadows it.

> We want this trade, because the time spent in the main loop is a
> problem: guest-visible downtime.  Correct?
> 
> [...]
> 

Correct. Though whether or not we want this trade I suppose is 
subjective. But the 50-60% reduction in guest-visible downtime is pretty 
nice if we can stomach the initial startup costs.

>>> Let's see whether I understand...  Please correct my mistakes.
>>>
>>> Memory pinning takes several seconds for large guests.
>>>
>>> Your patch makes pinning much slower.  You're theorizing this is because
>>> pinning cold memory is slower than pinning warm memory.
>>>
>>> I suppose the extra time is saved elsewhere, i.e. the entire startup
>>> time remains roughly the same.  Have you verified this experimentally?
>>
>> Based on the measurements I did, we pay a ~3s increase in initialization time (pre qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier for a vhost-vDPA device. This resulted in:
>>
>> * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
>>
>> * Shorter downtime phase during live migration (see below).
>>
>> * Slight increase in time for the device to be operational (e.g. guest sets DRIVER_OK).
>>    > This measures the time from guest start to the guest setting DRIVER_OK for the device:
>>
>>      Before patches: 22.46s
>>      After patches:  23.40s
>>
>> The real timesaver here is the guest-visible downtime during live migration (when using a vhost-vDPA device). Since the heavy lifting of the memory pinning is done during the initialization phase, it's no longer included as part of the stop-and-copy phase, which results in a much shorter guest-visible downtime.
>>
>>  From v5's cover letter:
>>
>> Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
>> the series reduces guest-visible downtime during back-to-back live
>> migrations by more than half:
>> - 39G VM:   4.72s -> 2.09s (-2.63s, ~56% improvement)
>> - 128G VM:  14.72s -> 5.83s (-8.89s, ~60% improvement)
>>
>> Essentially, we pay a slightly increased startup-time tax to buy ourselves a much shorter downtime window when we want to perform a live migration with a vhost-vDPA networking device.
>>
>>> Your stated reason for moving the pinning is moving it from within
>>> migration downtime to before migration downtime.  I understand why
>>> that's useful.
>>>
>>> You mentioned "a small drawback [...] a time in the initialization where
>>> QEMU cannot respond to QMP".  Here's what I've been trying to figure out
>>> about this drawback since the beginning:
>>>
>>> * Under what circumstances is QMP responsiveness affected?  I *guess*
>>>    it's only when we start a guest with plenty of memory and a certain
>>>    vhost-vdpa configuration.  What configuration exactly?
>>>
>>
>> Regardless of these patches, as I understand it, QMP cannot actually run any command that requires the BQL while we're pinning memory (the memory pinning work itself needs the lock).
>>
>> However, the BQL is not held for the entirety of the pinning process; it is periodically released. But those windows are *very* short and are only caught if you're hammering QMP with commands very rapidly.
>>
>>  From a realistic point of view, it's more practical to think of QMP as fully ready only once all pinning has finished, i.e. time_spent_memory_pinning ≈ time_QMP_is_blocked.
>>
>> ---
>>
>> As I understand it, QMP is not fully ready and cannot service requests until early on in qemu_main_loop().
> 
> It's a fair bit more complicated than that, but it'll do here.
> 
>> Given that these patches increase the time it takes to reach qemu_main_loop() (due to the early pinning work), this means that QMP will also be delayed for this time.
>>
>> I created a test that hammers QMP with commands until one can be properly serviced, and recorded how long it took from guest start to that first fulfilled request:
>>   * Before patches: 0.167s
>>   * After patches:  4.080s
>>
>> This aligns with the time measured to reach qemu_main_loop() and the time we're spending on the early memory pinning.
>>
>> All in all, the larger the amount of memory we need to pin, the longer it will take for us to reach qemu_main_loop(), the larger time_spent_memory_pinning will be, and thus the longer it will take for QMP to be ready and fully functional.
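
For reference, a "hammer QMP until it answers" probe of that kind can be as
simple as the stand-alone sketch below. This is not the exact test that
produced the numbers above; the socket path and the 10 ms retry interval
are placeholders, and it assumes QEMU was started with something like
-qmp unix:/tmp/qmp.sock,server=on,wait=off:

/*
 * Stand-alone sketch of a QMP readiness probe (not the exact test used
 * for the numbers above).  Connects to the monitor socket, reads the
 * greeting, negotiates capabilities, and reports how long it took until
 * the first query-status reply arrived.
 * Build: gcc -O2 -o qmp-probe qmp-probe.c
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

static const char caps[]  = "{\"execute\":\"qmp_capabilities\"}\n";
static const char query[] = "{\"execute\":\"query-status\"}\n";

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const char *path = "/tmp/qmp.sock";     /* assumed QMP socket path */
    double start = now();
    char buf[4096];

    for (;;) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };

        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
            read(fd, buf, sizeof(buf)) > 0 &&            /* greeting     */
            write(fd, caps, sizeof(caps) - 1) > 0 &&
            read(fd, buf, sizeof(buf)) > 0 &&            /* caps reply   */
            write(fd, query, sizeof(query) - 1) > 0 &&
            read(fd, buf, sizeof(buf)) > 0) {            /* status reply */
            printf("query-status answered %.3fs after probe start\n",
                   now() - start);
            close(fd);
            return 0;
        }
        close(fd);
        usleep(10 * 1000);                               /* retry: 10 ms */
    }
}
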
>>
>> ----
>>
>> I don't believe this is related to any specific vhost-vDPA configuration. I think the bottom line is that if we're using a vhost-vDPA device, we'll spend more time reaching qemu_main_loop(), so QMP has to wait until we get there.
> 
> Let me circle back to my question: Under what circumstances is QMP
> responsiveness affected?
> 
> The answer seems to be "only when we're using a vhost-vDPA device".
> Correct?
> 

Correct, since using a vhost-vDPA device is what triggers this memory 
pinning. If we're not using one, it's business as usual for QEMU.

> We're using one exactly when QEMU is running with one of its
> vhost-vdpa-device-pci* device models.  Correct?
> 

Yes, or something like:

-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
-device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \
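
For completeness, the generic device model you mention would be used along
these lines (I'm recalling the property name from memory, so treat the
exact syntax as approximate):

-device vhost-vdpa-device-pci,vhostdev=/dev/vhost-vdpa-0,...

Either way, it's the vhost-vDPA backend underneath that triggers the early
pinning.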


>>> * How is QMP responsiveness affected?  Delay until the QMP greeting is
>>>    sent?  Delay until the first command gets processed?  Delay at some
>>>    later time?
>>>
>>
>> Responsiveness: Longer initial delay due to the early pinning work we need to do before we can bring QMP up.
>>
>> Greeting delay: No greeting delay. Greeting is flushed earlier, even before we start the early pinning work.
>>
>> * For both before and after the patches, this was ~0.052s for me.
>>
>> Delay until first command processed: Longer initial delay at startup.
>>
>> Delay at later time: None.
> 
> Got it.
> 
>>> * What's the absolute and the relative time of QMP non-responsiveness?
>>>    0.2s were mentioned.  I'm looking for something like "when we're not
>>>    pinning, it takes 0.8s until the first QMP command is processed, and
>>>    when we are, it takes 1.0s".
>>>
>>
>> The numbers below are based on my recent testing and measurements. This was with a 50G guest with real vDPA hardware.
>>
>> Before patches:
>> ---------------
>> * From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 0.167s.
>>    > This timing is pretty much the same regardless of whether or not we're pinning memory.
>>
>> * Time spent pinning memory (QMP cannot handle requests during this window): 0.077s.
>>
>> After patches:
>> --------------
>> * From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 4.08s
>>    > If we're not early pinning memory, it's ~0.167s.
>>
>> * Time spent pinning memory *after entering qemu_main_loop()* (QMP cannot handle requests during this window): 0.0065s.
> 
> Let me recap:
> 
> * No change at all unless we're pinning memory early, and we're doing
>    that only when we're using a vhost-vDPA device.  Correct?
> 
> * If we are using a vhost-vDPA device:
> 
>    - Total startup time (until we're done pinning) increases.
> 

Correct.

>    - QMP becomes available later.
> 

Correct.

>    - Main loop behavior improves: less guest-visible downtime, QMP more
>      responsive (once it's available)
> 

Correct. Though the improvement is modest at best if we put aside the 
guest-visible downtime improvement.

>    This is a tradeoff we want always.  There is no need to let users pick
>    "faster startup, worse main loop behavior."
> 

"Always" might be subjective here. For example, if there's no desire to 
perform live migration, then the user kinda just gets stuck with the cons.

Whether or not we want to make this configurable though is another 
discussion.

> Correct?
> 
> [...]
> 


