From: Eugenio Perez Martin <eperezma@redhat.com>
To: Jonah Palmer <jonah.palmer@oracle.com>
Cc: Peter Xu <peterx@redhat.com>,
	si-wei.liu@oracle.com, qemu-devel@nongnu.org, farosas@suse.de,
	eblake@redhat.com, armbru@redhat.com, jasowang@redhat.com,
	 mst@redhat.com, boris.ostrovsky@oracle.com,
	 Dragos Tatulea DE <dtatulea@nvidia.com>
Subject: Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
Date: Mon, 1 Sep 2025 08:57:56 +0200
Message-ID: <CAJaqyWdRgbuEgWA8OcCfvgJ_ZC-OKHg2oGkirXkfaG2QBankpQ@mail.gmail.com>
In-Reply-To: <f143a9a6-56b5-43a8-bced-bec7c7be8a2d@oracle.com>

On Wed, Aug 27, 2025 at 6:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
> > On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >>
> >> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> >>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>>>> This effort was started to reduce the guest-visible downtime caused by
> >>>>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>>>> vhost-vDPA.
> >>>>>>>>>>>
> >>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
> >>>>>>>>>>> migrate a lot of state, but rather from expensive backend control-plane
> >>>>>>>>>>> latency like CVQ configuration (e.g. MQ queue pairs, RSS, MAC/VLAN filters,
> >>>>>>>>>>> offload settings, MTU, etc.). These require kernel/HW NIC operations, which
> >>>>>>>>>>> dominate the downtime.
> >>>>>>>>>>>
> >>>>>>>>>>> In other words, by migrating the state of virtio-net early (before the
> >>>>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
> >>>>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
> >>>>>>>>>>> device.
> >>>>>>>>>>>
> >>>>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
> >>>>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
> >>>>>>>>>>> of the stop-and-copy phase.
> >>>>>>>>>>
> >>>>>>>>>> I see, thanks.
> >>>>>>>>>>
> >>>>>>>>>> Please add these points to the cover letter of the next post.  IMHO it's
> >>>>>>>>>> extremely important information for explaining the real goal of this work.
> >>>>>>>>>> I bet it is not what most people expect when reading the current cover
> >>>>>>>>>> letter.
> >>>>>>>>>>
> >>>>>>>>>> Then it could have nothing to do with the iterative phase, am I right?
> >>>>>>>>>>
> >>>>>>>>>> What are the data needed for the dest QEMU to start staging backend
> >>>>>>>>>> configurations to the HWs underneath?  Does dest QEMU already have them in
> >>>>>>>>>> the cmdlines?
> >>>>>>>>>>
> >>>>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>>>
> >>>>>>>>>> If src QEMU's data is still needed, please also first consider providing
> >>>>>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
> >>>>>>>>>> refer to commit 3b95a71b22827d26178.
> >>>>>>>>>>
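> >>>>>>>>>> (For illustration only -- assuming the early_setup flag is what is
> >>>>>>>>>> meant by "early VMSD" here, a minimal sketch could look like the
> >>>>>>>>>> following; the section name and fields are made up:)
> >>>>>>>>>>
> >>>>>>>>>>     static const VMStateDescription vmstate_virtio_net_early = {
> >>>>>>>>>>         .name = "virtio-net/early",
> >>>>>>>>>>         .version_id = 1,
> >>>>>>>>>>         .minimum_version_id = 1,
> >>>>>>>>>>         .early_setup = true,  /* sent during migration setup */
> >>>>>>>>>>         .fields = (const VMStateField[]) {
> >>>>>>>>>>             VMSTATE_UINT16(curr_queue_pairs, VirtIONet),
> >>>>>>>>>>             VMSTATE_END_OF_LIST()
> >>>>>>>>>>         },
> >>>>>>>>>>     };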
> >>>>>>>>>
> >>>>>>>>> While it works for this series, it does not allow resending the state
> >>>>>>>>> when the src device changes, for example if the number of virtqueues
> >>>>>>>>> is modified.
> >>>>>>>>
> >>>>>>>> Some explanation of "how syncing the number of vqueues helps downtime" would
> >>>>>>>> help.  Not "it might preheat things", but exactly why, and how that differs
> >>>>>>>> between pure software and the case where hardware is involved.
> >>>>>>>>
> >>>>>>>
> >>>>>>> According to NVIDIA engineers, configuring the vqs (number, size, RSS, etc.)
> >>>>>>> takes about ~200ms:
> >>>>>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>>>>>
> >>>>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>>>> numbers have changed though.
> >>>>>>>
> >>>>>>> And I guess the difference with pure SW will always come down to PCI
> >>>>>>> communication, which I assume is slower than configuring the host SW
> >>>>>>> device in RAM or even CPU cache. But I admit that proper profiling is
> >>>>>>> needed before making those claims.
> >>>>>>>
> >>>>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>>>> with traces vs the time it takes to enable the dataplane of the
> >>>>>>> device? So we can get an idea of how much time we save with this.
> >>>>>>>
> >>>>>>
> >>>>>> Let me know if this isn't what you're looking for.
> >>>>>>
> >>>>>> I'm assuming by "configuration time" you mean:
> >>>>>>      - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>>>>        before we start enabling the vrings (e.g.
> >>>>>>        VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> >>>>>>
> >>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>>>>      - Time right before we start enabling the vrings (see above) to right
> >>>>>>        after we enable the last vring (at the end of
> >>>>>>        vhost_vdpa_net_cvq_load())
> >>>>>>
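> >>>>>> (In case anyone wants to reproduce this: timestamps along these
> >>>>>> lines are enough. This is purely illustrative -- the variable and
> >>>>>> its placement are made up, not the actual tracepoints:)
> >>>>>>
> >>>>>>     /* hw/virtio/vhost-vdpa.c, illustrative instrumentation only */
> >>>>>>     static int64_t cfg_start_ns;
> >>>>>>
> >>>>>>     /* at the top of vhost_vdpa_dev_start(dev, started == true) */
> >>>>>>     cfg_start_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> >>>>>>
> >>>>>>     /* right after the first VHOST_VDPA_SET_VRING_ENABLE succeeds */
> >>>>>>     fprintf(stderr, "vdpa config: %" PRId64 " ns\n",
> >>>>>>             qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - cfg_start_ns);
> >>>>>>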
> >>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>>>
> >>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>>>>             queues=8,x-svq=on
> >>>>>>
> >>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>>>>             romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>>>>             ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>>>>             disable-legacy=on,disable-modern=off
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Configuration time:    ~31s
> >>>>>> Dataplane enable time: ~0.14ms
> >>>>>>
> >>>>>
> >>>>> I was vague, but yes, that's representative enough! It would be more
> >>>>> accurate if the configuration time ended when QEMU enables the
> >>>>> first queue of the dataplane, though.
> >>>>>
> >>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>>>> beginning of vhost_vdpa_dev_start?
> >>>>>
> >>>>
> >>>> Ah, I also realized that the QEMU I was using for measurements was from a
> >>>> version before the listener_registered member was introduced.
> >>>>
> >>>> I retested with the latest changes in QEMU and set x-svq=off, i.e.
> >>>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test 3
> >>>> times for measurements.
> >>>>
> >>>> v->shared->listener_registered == false at the beginning of
> >>>> vhost_vdpa_dev_start().
> >>>>
> >>>
> >>> Let's move the effect of the mem pinning out of the downtime by
> >>> registering the listener before the migration. Can you check why it is
> >>> not registered at vhost_vdpa_set_owner?
> >>>
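> >>> Something along these lines is what I have in mind -- a simplified
> >>> sketch, not the exact upstream code, just to show the idea that the
> >>> mapping/pinning should happen at device setup rather than during the
> >>> stop-and-copy phase:
> >>>
> >>>     static int vhost_vdpa_set_owner(struct vhost_dev *dev)
> >>>     {
> >>>         struct vhost_vdpa *v = dev->opaque;
> >>>         int r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
> >>>
> >>>         if (r) {
> >>>             return r;
> >>>         }
> >>>
> >>>         /* Register early so guest memory is mapped/pinned here,
> >>>          * long before the migration's stop-and-copy phase. */
> >>>         memory_listener_register(&v->shared->listener,
> >>>                                  &address_space_memory);
> >>>         v->shared->listener_registered = true;
> >>>         return 0;
> >>>     }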
> >>
> >> Sorry, I was profiling improperly. The listener is registered at
> >> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> >> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> >> as false and is re-registered later in the function.
> >>
> >> Should we always expect listener_registered == true at every
> >> vhost_vdpa_dev_start call during startup?
> >
> > Yes, that leaves all the memory pinning time out of the downtime.
> >
> >> This is what I traced during
> >> startup of a single guest (no migration).
> >
> > We can trace the destination's QEMU to be more accurate, but it probably
> > makes no difference.
> >
> >> Tracepoint is right at the
> >> start of the vhost_vdpa_dev_start function:
> >>
> >> vhost_vdpa_set_owner() - register memory listener
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >
> > This is surprising. Can you trace how listener_registered goes to 0 again?
> >
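> > For example (debug-only and purely illustrative, the helper name is
> > made up): funnel every write to the flag through a tiny wrapper so the
> > culprit shows up in the log:
> >
> >     static void vdpa_trace_listener_flip(struct vhost_vdpa *v, bool val,
> >                                          const char *who)
> >     {
> >         fprintf(stderr, "listener_registered %d -> %d at %s\n",
> >                 v->shared->listener_registered, val, who);
> >         v->shared->listener_registered = val;
> >     }
> >
> > and replace the direct assignments with
> > vdpa_trace_listener_flip(v, true/false, __func__). A gdb watchpoint on
> > the field works just as well.
> >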
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> ...
> >> * VQs are now being enabled *
> >>
> >> I'm also seeing that when the guest is being shut down,
> >> dev->vhost_ops->vhost_get_vring_base() is failing in
> >> do_vhost_virtqueue_stop():
> >>
> >> ...
> >> [  114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
> >> [  114.719255] systemd-shutdown[1]: Powering off.
> >> [  114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >> [  114.724826] ACPI: PM: Preparing to enter system sleep state S5
> >> [  114.725593] reboot: Power down
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>
> >> However when x-svq=on, I don't see these errors on shutdown.
> >>
> >
> > SVQ can mask this error as it does not need to forward the ring
> > restore message to the device. It can just start with 0 and convert
> > indexes.
> >
> > Let's focus on listener_registered first :).
> >
> >>>> ---
> >>>>
> >>>> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> >>>> right after QEMU enables the first VQ.
> >>>>     - 26.947s, 26.606s, 27.326s
> >>>>
> >>>> Enable dataplane: Time from right after first VQ is enabled to right
> >>>> after the last VQ is enabled.
> >>>>     - 0.081ms, 0.081ms, 0.079ms
> >>>>
> >>>
> >>
> >
>
> I looked into this a bit more and realized I was naive in thinking
> that the vhost-vDPA device startup path of a single VM would be the same
> as that of a destination VM during live migration. This is **not** the
> case, and I apologize for the confusion I caused.
>
> What I described and profiled above is indeed true for the startup of a
> single VM / source VM with a vhost-vDPA device. However, it is not true
> on the destination side, where the configuration time is drastically
> different.
>
> Under the same specs, but now with a live migration performed between a
> source and destination VM (128G Mem, SVQ=off, CVQ=on, 8 queue pairs),
> and using the same tracepoints to find the configuration time and the
> dataplane enable time, these are the measurements I found for the
> **destination VM**:
>
> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> right after QEMU enables the first VQ.
>     - 268.603ms, 241.515ms, 249.007ms
>
> Enable dataplane time: Time from right after the first VQ is enabled to
> right after the last VQ is enabled.
>     - 0.072ms, 0.071ms, 0.070ms
>
> ---
>
> For those curious, using the same printouts as I did above, this is what
> it actually looks like on the destination side:
>
> * Destination VM is started *
>
> vhost_vdpa_set_owner() - register memory listener
> vhost_vdpa_reset_device() - unregistering listener
>
> * Start live migration on source VM *
> (qemu) migrate unix:/tmp/lm.sock
> ...
>
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - register listener
>

That's weird. Can you check why the memory listener is not registered
at vhost_vdpa_set_owner? Or, if it is registered there, why it is no
longer registered by the time vhost_vdpa_dev_start is called? This
changes the downtime a lot: more than half of the time is spent on
this, so it is worth fixing before continuing.
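
From your destination trace above, the listener does get registered at
vhost_vdpa_set_owner, but vhost_vdpa_reset_device drops it again right
away, so that is roughly the path to look at. A simplified sketch of
what I mean (not the exact code, just the shape of it):

    static int vhost_vdpa_reset_device(struct vhost_dev *dev)
    {
        struct vhost_vdpa *v = dev->opaque;
        uint8_t status = 0;
        int r = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);

        /* ... */

        /* This is what undoes the early registration done at set_owner */
        memory_listener_unregister(&v->shared->listener);
        v->shared->listener_registered = false;
        return r;
    }

If that unregistration can be avoided (or the listener re-registered
right away) for the incoming side, the pinning should already be done
by the time vhost_vdpa_dev_start runs.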

> And this is very different from the churn we saw in my previous email,
> which happens on the source / single-guest VM with vhost-vDPA and its
> startup path.
>
> ---
>
> Again, apologies for the confusion this caused. This was my fault for
> not being more careful.
>

No worries!


