From: Jonah Palmer <jonah.palmer@oracle.com>
To: Markus Armbruster <armbru@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>,
qemu-devel@nongnu.org, eperezma@redhat.com, peterx@redhat.com,
mst@redhat.com, lvivier@redhat.com, dtatulea@nvidia.com,
leiyang@redhat.com, parav@mellanox.com, sgarzare@redhat.com,
lingshan.zhu@intel.com, boris.ostrovsky@oracle.com,
Si-Wei Liu <si-wei.liu@oracle.com>
Subject: Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Date: Wed, 2 Jul 2025 15:31:27 -0400
Message-ID: <69bc738c-90fd-4a48-9bee-bb7372388810@oracle.com>
In-Reply-To: <87o6uau2lj.fsf@pond.sub.org>
On 6/26/25 8:08 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
>
>> On 6/2/25 4:29 AM, Markus Armbruster wrote:
>>> Butterfingers... let's try this again.
>>>
>>> Markus Armbruster<armbru@redhat.com> writes:
>>>
>>>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>>>
>>>>> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>>>>>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>>>>>
>>>>>>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>>>>>>> Jason Wang<jasowang@redhat.com> writes:
>>>>>>>>
>>>>>>>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer<jonah.palmer@oracle.com> wrote:
>>>>>>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>>>>>> destination. Currently they are done after the source of the migration is
>>>>>>>>>> stopped, and before the workload is resumed at the destination. This is a
>>>>>>>>>> period where neither traffic can flow nor the VM workload can continue
>>>>>>>>>> (downtime).
>>>>>>>>>>
>>>>>>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>>>>>> destination from the moment all devices are initialized. So moving that
>>>>>>>>>> operation allows QEMU to communicate the maps to the kernel while the
>>>>>>>>>> workload is still running on the source, so Linux can start mapping them.
>>>>>>>>>>
>>>>>>>>>> As a small drawback, there is a time in the initialization where QEMU
>>>>>>>>>> cannot respond to QMP etc. By some testing, this time is about
>>>>>>>>>> 0.2 seconds.
>>>>>>>>> Adding Markus to see if this is a real problem or not.
>>>>>>>> I guess the answer is "depends", and to get a more useful one, we need
>>>>>>>> more information.
>>>>>>>>
>>>>>>>> When all you care is time from executing qemu-system-FOO to guest
>>>>>>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>>>>>>> won't matter much.
>>>>>>> There's no extra delay of 0.2s or more per se; the page pinning hiccup, whether it is 0.2s or something else, is just shifted from the time the guest is booting up to before the guest boots. This saves guest boot time or start-up delay, but in turn the same delay is effectively charged to VM launch time. We follow the same model as VFIO, which sees the same hiccup during launch (at an early stage that no real mgmt software would care about).
>>>>>>>
>>>>>>>> When a management application runs qemu-system-FOO several times to
>>>>>>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>>>>>>
>>>>>>> Not quite; this page pinning hiccup happens only once, at the very early stage of launching QEMU, i.e. there's no consistent delay every time a QMP command is issued. The delay in the QMP response at that point depends on how much memory the VM has, but this is specific to VMs with VFIO or vDPA devices that have to pin memory for DMA. That said, there's no extra delay at all if the QEMU command line has no vDPA device assignment; on the other hand, the same delay or QMP hiccup shows up when VFIO is on the QEMU command line.
>>>>>>>
>>>>>>>> In what scenarios exactly is QMP delayed?
>>>>>>> That said, this is not a new problem for QEMU in particular; this QMP delay is not peculiar to vDPA, it exists with VFIO as well.
>>>>>>
>>>>>> In what scenarios exactly is QMP delayed compared to before the patch?
>>>>>
>>>>> The page pinning process now runs in a pretty early phase at
>>>>> qemu_init() e.g. machine_run_board_init(),
>>>>
>>>> It runs within
>>>>
>>>> qemu_init()
>>>> qmp_x_exit_preconfig()
>>>> qemu_init_board()
>>>> machine_run_board_init()
>>>>
>>>> Except when --preconfig is given, it instead runs within QMP command
>>>> x-exit-preconfig.
>>>>
>>>> Correct?
>>>>
>>>>> before any QMP command can be serviced; the latter typically would
>>>>> not be able to run until qemu_main_loop(), once the AIO context gets
>>>>> a chance to be polled and the request dispatched to a bottom half (bh).
>>>>
>>>> We create the QMP monitor within qemu_create_late_backends(), which runs
>>>> before qmp_x_exit_preconfig(), but commands get processed only in the
>>>> main loop, which we enter later.
>>>>
>>>> Correct?
>>>>
>>>>> Technically it's not a real delay for specific QMP command, but rather
>>>>> an extended span of initialization process may take place before the
>>>>> very first QMP request, usually qmp_capabilities, will be
>>>>> serviced. It's natural for mgmt software to expect initialization
>>>>> delay for the first qmp_capabilities response if it has to immediately
>>>>> issue one after launching qemu, especially when you have a large guest
>>>>> with hundred GBs of memory and with passthrough device that has to pin
>>>>> memory for DMA e.g. VFIO, the delayed effect from the QEMU
>>>>> initialization process is very visible too.
>>>
>>> The work clearly needs to be done. Whether it needs to be blocking
>>> other things is less clear.
>>>
>>> Even if it doesn't need to be blocking, we may choose not to avoid
>>> blocking for now. That should be an informed decision, though.
>>>
>>> All I'm trying to do here is understand the tradeoffs, so I can give
>>> useful advice.
>>>
>>>>> On the other hand, before
>>>>> the patch, if memory happens to be in the middle of being pinned, any
>>>>> ongoing QMP can't be serviced by the QEMU main loop, either.
>>>
>>> When exactly does this pinning happen before the patch? In which
>>> function?
>>
>> Before the patches, the memory listener was registered in
>> vhost_vdpa_dev_start(), well after device initialization.
>>
>> And by device initialization here I mean the
>> qemu_create_late_backends() function.
>>
>> With these patches, the memory listener is now being
>> registered in vhost_vdpa_set_owner(), called from
>> vhost_dev_init(), which is part of the device
>> initialization phase.
>>
>> However, even though the memory_listener_register() is
>> called during the device initialization phase, the actual
>> pinning happens (very shortly) after
>> qemu_create_late_backends() returns (due to RAM being
>> initialized later).
>>
>> ---
>>
>> So, without these patches, and based on my measurements,
>> memory pinning starts ~2.9s after qemu_create_late_backends()
>> returns.
>>
>> With these patches, memory pinning starts ~0.003s after
>> qemu_create_late_backends() returns.
>
> So, we're registering the memory listener earlier, which makes it do its
> expensive work (pinning) earlier ("very shortly after
> qemu_create_late_backends()). I still don't understand where exactly
> the pinning happens (where at runtime and where in the code). Not sure
> I have to.
>
Apologies for the delay in getting back to you. I just wanted to be
thorough and answer everything as accurately and clearly as possible.
----
Before these patches, pinning started in vhost_vdpa_dev_start(), where
the memory listener was registered; the listener then invoked
vhost_vdpa_listener_region_add(), which does the actual memory pinning.
This happens after entering qemu_main_loop().
After these patches, pinning starts in vhost_dev_init() (specifically in
vhost_vdpa_set_owner(), where the memory listener registration has been
moved). This happens *before* entering qemu_main_loop().
However, not all of the pinning happens before qemu_main_loop(). The
pinning that happens before we enter qemu_main_loop() covers the full
guest RAM, which is the heavy-lifting part of the work.
The rest of the pinning work happens after entering qemu_main_loop()
(at roughly the same point where pinning used to start before these
patches). But since the heavy lifting was already done before
qemu_main_loop() (i.e. all pages were already allocated and pinned),
we're just re-pinning here: the kernel only updates its IOTLB tables
for pages that are already mapped and locked in RAM.
This makes the pinning work done after entering qemu_main_loop() much
faster than the same pinning we had to do before these patches.
However, we have to pay a cost for this. Because we do the heavy
lifting earlier, before qemu_main_loop(), we're pinning cold memory.
That is, the guest hasn't touched its memory yet, so all host pages are
still anonymous and unallocated. Doing the pinning earlier is therefore
more expensive time-wise, since we also need to allocate physical pages
for each chunk of memory.
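
To illustrate the cold-vs-warm difference outside of QEMU, here's a
small standalone toy in C (my own illustration, not QEMU code), using
mlock() as a rough stand-in for the DMA pinning that vhost-vDPA does:

    /* Toy demonstration (not QEMU code): time mlock() on cold vs. warm
     * anonymous memory. "Cold" pages have never been touched, so mlock()
     * has to allocate them first; "warm" pages were already faulted in by
     * memset(). Run as root or with `ulimit -l unlimited`, otherwise
     * mlock() of 1 GiB fails with ENOMEM. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    #define SZ (1UL << 30)   /* 1 GiB */

    static double secs(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    static double time_mlock(int warm)
    {
        void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        if (warm) {
            memset(p, 1, SZ);        /* fault the pages in first */
        }
        double t = secs();
        if (mlock(p, SZ) != 0) {     /* rough analogue of pinning for DMA */
            perror("mlock");
            exit(1);
        }
        t = secs() - t;
        munmap(p, SZ);               /* also unlocks */
        return t;
    }

    int main(void)
    {
        printf("cold mlock: %.3fs\n", time_mlock(0));
        printf("warm mlock: %.3fs\n", time_mlock(1));
        return 0;
    }

The cold case includes the page allocation/fault work, which is exactly
the part these patches now front-load before qemu_main_loop().
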
To (hopefully) show this more clearly, I ran some tests before and after
these patches and averaged the results. I used a 50G guest with real
vDPA hardware (Mellanox CX-6Dx):
0.) How many vhost_vdpa_listener_region_add() (pins) calls?

               | Total | Before qemu_main_loop | After qemu_main_loop
---------------|-------|-----------------------|---------------------
Before patches |   6   |           0           |          6
After patches  |  11   |           5           |          6
- After the patches, it looks like we roughly doubled the work we're
doing (given the extra 5 calls). However, the 6 calls that happen after
entering qemu_main_loop() are essentially replays of the work the first
5 already did.
* In other words, after the patches, the 6 calls made after entering
qemu_main_loop() complete much faster than the same 6 calls before the
patches.
* From my measurements, these are the timings it took to perform those
6 calls after entering qemu_main_loop():
> Before patches: 0.0770s
> After patches: 0.0065s
---
1.) Time from starting the guest to entering qemu_main_loop():
* Before patches: 0.112s
* After patches: 3.900s
- This is due to the 5 early pins we now do with these patches, whereas
before the patches no pinning work happened during this phase at all.
- From measuring the time between the first and last
vhost_vdpa_listener_region_add() calls during this period, this comes
out to ~3s for the early pinning.
>>>>> I'd also like to highlight that without this patch, the pretty high
>>>>> delay due to page pinning is even visible to the guest in addition to
>>>>> just QMP delay, which largely affected guest boot time with vDPA
>>>>> device already. It is long standing, and every VM user with vDPA
>>>>> device would like to avoid such high delay for the first boot, which
>>>>> is not seen with similar device e.g. VFIO passthrough.
>>>
>>> I understand that hiding the delay from the guest could be useful.
>>>
>>>>>>> Thanks,
>>>>>>> -Siwei
>>>>>>>
>>>>>>>> You told us an absolute delay you observed. What's the relative delay,
>>>>>>>> i.e. what's the delay with and without these patches?
>>>>>>
>>>>>> Can you answer this question?
>>>>>
>>>>> I thought I already got that answered in earlier reply. The relative
>>>>> delay is subject to the size of memory. Usually mgmt software won't be
>>>>> able to notice, unless the guest has more than 100GB of THP memory to
>>>>> pin, for DMA or whatever reason.
>>>
>>> Alright, what are the delays you observe with and without these patches
>>> for three test cases that pin 50 / 100 / 200 GiB of THP memory
>>> respectively?
>>
>> So with THP memory specifically, these are my measurements below.
>> For these measurements, I simply started up a guest, traced the
>> vhost_vdpa_listener_region_add() calls, and found the difference
>> in time between the first and last calls. In other words, this is
>> roughly the time it took to pin all of guest memory. I did 5 runs
>> for each memory size:
>>
>> Before patches:
>> ===============
>> 50G: 7.652s, 7.992s, 7.981s, 7.631s, 7.953s (Avg. 7.841s)
>> 100G: 8.990s, 8.656s, 9.003s, 8.683s, 8.669s (Avg. 8.800s)
>> 200G: 10.705s, 10.841s, 10.816s, 10.772s, 10.818s (Avg. 10.790s)
>>
>> After patches:
>> ==============
>> 50G: 12.091s, 11.685s, 11.626s, 11.952s, 11.656s (Avg. 11.802s)
>> 100G: 14.121s, 14.079s, 13.700s, 14.023s, 14.130s (Avg. 14.010s)
>> 200G: 18.134s, 18.350s, 18.387s, 17.800s, 18.401s (Avg. 18.214s)
>>
>> The reason we're seeing a jump here may be that with the memory
>> pinning happening earlier, the pinning happens before Qemu has
>> fully faulted in the guest's RAM.
>>
>> As far as I understand, before these patches, by the time we
>> reached vhost_vdpa_dev_start(), all pages were already resident
>> (and THP splits already happened with the prealloc=on step), so
>> get_user_pages() pinned "warm" pages much faster.
>>
>> With these patches, the memory listener is running on cold memory.
>> Every get_user_pages() call would fault in its 4KiB subpage (and
>> if THP was folded, split a 2MiB hugepage) before handing in a
>> 'struct page'.
>
> Let's see whether I understand... Please correct my mistakes.
>
> Memory pinning takes several seconds for large guests.
>
> Your patch makes pinning much slower. You're theorizing this is because
> pinning cold memory is slower than pinning warm memory.
>
> I suppose the extra time is saved elsewhere, i.e. the entire startup
> time remains roughly the same. Have you verified this experimentally?
>
Based on my measurements, we pay a ~3s increase in initialization time
(pre qemu_main_loop()) to handle the heavy lifting of the memory pinning
earlier for a vhost-vDPA device. This resulted in:
* Faster memory pinning during qemu_main_loop() (0.0770s -> 0.0065s).
* A shorter downtime phase during live migration (see below).
* A slight increase in the time for the device to become operational
(e.g. guest sets DRIVER_OK).
> This measures the time from guest start to the guest setting
DRIVER_OK for the device:
Before patches: 22.46s
After patches: 23.40s
The real time saver here is the guest-visible downtime during live
migration (when using a vhost-vDPA device). Since the heavy lifting of
the memory pinning is done during the initialization phase, it's no
longer part of the stop-and-copy phase, which results in a much shorter
guest-visible downtime.
From the v5 cover letter:
Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
the series reduces guest-visible downtime during back-to-back live
migrations by more than half:
- 39G VM: 4.72s -> 2.09s (-2.63s, ~56% improvement)
- 128G VM: 14.72s -> 5.83s (-8.89s, ~60% improvement)
Essentially, we pay a slightly increased startup-time tax to buy
ourselves a much shorter downtime window when performing a live
migration with a vhost-vDPA networking device.
> Your stated reason for moving the pinning is moving it from within
> migration downtime to before migration downtime. I understand why
> that's useful.
>
> You mentioned "a small drawback [...] a time in the initialization where
> QEMU cannot respond to QMP". Here's what I've been trying to figure out
> about this drawback since the beginning:
>
> * Under what circumstances is QMP responsiveness affected? I *guess*
> it's only when we start a guest with plenty of memory and a certain
> vhost-vdpa configuration. What configuration exactly?
>
Regardless of these patches, as I understand it, QMP cannot actually run
any command that requires the BQL while we're pinning memory (memory
pinning needs to use the lock).
However, the BQL is not held for the entire pinning process; it is
periodically released along the way. But those windows are *very* short
and can only be caught if you're hammering QMP with commands very
rapidly.
From a realistic point of view, it's more practical to think of QMP
being fully ready once all pinning has finished, e.g.
time_spent_memory_pinning ≈ time_QMP_is_blocked.
---
As I understand it, QMP is not fully ready and cannot service requests
until early on in qemu_main_loop().
Given that these patches increase the time it takes to reach
qemu_main_loop() (due to the early pinning work), this means that QMP
will also be delayed for this time.
I created a test that hammers QMP with commands until one is actually
serviced, and recorded how long it took from guest start to that first
fulfilled request:
* Before patches: 0.167s
* After patches: 4.080s
This aligns with the time measured to reach qemu_main_loop() and the
time we spend doing the early memory pinning.
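
For reference, the probe was essentially along these lines (a minimal
standalone sketch of the idea, not the exact test I ran; it assumes QMP
is exposed on a UNIX socket, e.g. QEMU started with
-qmp unix:/tmp/qmp.sock,server=on,wait=off, and the socket path here is
just an assumption). Launched together with QEMU, it keeps retrying
until a command actually gets a "return":

    /* Minimal sketch of the QMP readiness probe (not the exact test).
     * Assumes: -qmp unix:/tmp/qmp.sock,server=on,wait=off
     * Retries until a command gets a "return" and prints the elapsed
     * time since the probe (started alongside QEMU) began. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    static double secs(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const char *path = "/tmp/qmp.sock";  /* assumption: socket path */
        double start = secs();
        char buf[8192];

        for (;;) {
            struct sockaddr_un sa = { .sun_family = AF_UNIX };
            int fd = socket(AF_UNIX, SOCK_STREAM, 0);

            strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
            if (fd >= 0 &&
                connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0 &&
                read(fd, buf, sizeof(buf)) > 0) {        /* greeting */
                dprintf(fd, "{\"execute\": \"qmp_capabilities\"}\n");
                dprintf(fd, "{\"execute\": \"query-status\"}\n");
                ssize_t n = read(fd, buf, sizeof(buf) - 1);
                if (n > 0) {
                    buf[n] = 0;
                    if (strstr(buf, "\"return\"")) {
                        printf("QMP serviced a command after %.3fs\n",
                               secs() - start);
                        close(fd);
                        return 0;
                    }
                }
            }
            if (fd >= 0) {
                close(fd);
            }
            usleep(1000);            /* hammer: retry every ~1ms */
        }
    }

Something along these lines is what produced the 0.167s / 4.080s
numbers above.
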
All in all, the larger the amount of memory we need to pin, the longer
it will take for us to reach qemu_main_loop(), the larger
time_spent_memory_pinning will be, and thus the longer it will take for
QMP to be ready and fully functional.
----
I don't believe this is related to any specific vhost-vDPA
configuration. The bottom line is that if we're using a vhost-vDPA
device, we'll spend more time reaching qemu_main_loop(), so QMP has to
wait until we get there.
> * How is QMP responsiveness affected? Delay until the QMP greeting is
> sent? Delay until the first command gets processed? Delay at some
> later time?
>
Responsiveness: Longer initial delay, due to the early pinning work we
need to do before QMP can come up.
Greeting delay: No greeting delay. Greeting is flushed earlier, even
before we start the early pinning work.
* For both before and after the patches, this was ~0.052s for me.
Delay until first command processed: Longer initial delay at startup.
Delay at later time: None.
> * What's the absolute and the relative time of QMP non-responsiveness?
> 0.2s were mentioned. I'm looking for something like "when we're not
> pinning, it takes 0.8s until the first QMP command is processed, and
> when we are, it takes 1.0s".
>
The numbers below are based on my recent testing and measurements. This
was with a 50G guest with real vDPA hardware.
Before patches:
---------------
* From the start time of the guest to the earliest time QMP is able to
process a request (e.g. query-status): 0.167s.
> This timing is pretty much the same regardless of whether or not
we're pinning memory.
* Time spent pinning memory (QMP cannot handle requests during this
window): 0.077s.
After patches:
--------------
* From the start time of the guest to the earliest time QMP is able to
process a request (e.g. query-status): 4.08s
> If we're not early pinning memory, it's ~0.167s.
* Time spent pinning memory *after entering qemu_main_loop()* (QMP
cannot handle requests during this window): 0.0065s.
>> I believe this to be the case since in my measurements I noticed
>> some larger time gaps (fault + split overhead) in between some of
>> the vhost_vdpa_listener_region_add() calls.
>>
>> However I'm still learning some of these memory pinning details,
>> so please let me know if I'm misunderstanding anything here.
>
> Thank you!
>
> [...]
>