public inbox for kvm@vger.kernel.org
From: Peter Xu <peterx@redhat.com>
To: Yishai Hadas <yishaih@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>,
	jgg@nvidia.com, kvm@vger.kernel.org, kevin.tian@intel.com,
	joao.m.martins@oracle.com, leonro@nvidia.com, maorg@nvidia.com,
	avihaih@nvidia.com, clg@redhat.com, liulongfang@huawei.com,
	giovanni.cabiddu@intel.com, kwankhede@nvidia.com
Subject: Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
Date: Mon, 16 Mar 2026 15:24:52 -0400	[thread overview]
Message-ID: <abhZBIe-vR-GX3TT@x1.local> (raw)
In-Reply-To: <fb6cb519-25d0-4cba-b13f-513349fb49db@nvidia.com>

On Sun, Mar 15, 2026 at 04:19:18PM +0200, Yishai Hadas wrote:
> On 12/03/2026 22:16, Peter Xu wrote:
> > On Thu, Mar 12, 2026 at 01:08:17PM -0600, Alex Williamson wrote:
> > > Hey Peter,
> > 
> > Hey, Alex,
> > 
> > > 
> > > On Thu, 12 Mar 2026 13:37:04 -0400
> > > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > > Hi, Yishai,
> > > > 
> > > > Please feel free to treat my comments as pure questions only.
> > > > 
> > > > On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
> > > > > When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
> > > > > driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
> > > > > to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
> > > > > value.
> > > > 
> > > > Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint that
> > > > can be deduced by the userspace too, if it remembers the last fetched
> > > > value of initial_bytes?
> > > 
> > > I'll try to answer some of these.  PRECOPY_INFO is already just a hint.
> > > We essentially define initial_bytes as "please copy this before
> > > migration to avoid high latency setup" and dirty_bytes as "I also have
> > > this much dirty state I could give to you now".  We've defined
> > > initial_bytes as monotonically decreasing, so a user could deduce that
> > > they've passed the intended high latency setup threshold, while
> > > dirty_bytes is purely volatile.
> > 
> > I see..  That might be another problem for switchover decisions, though.
> > 
> > Currently, QEMU relies on dirty reporting to decide when to switchover.
> > 
> > What it does is ask all the modules how much dirty data is left; then the
> > src QEMU sums it up and divides the sum by the estimated bandwidth to
> > guess the downtime.
> > 
> > When the estimated downtime is small enough to satisfy the user-specified
> > downtime limit, the src QEMU will switch over.  This didn't take
> > switchover_ack for VFIO into account, but that's a separate concept.
> > 
> > The above was based on the assumption that the reported values are "total
> > data", not "what you can collect now"..
> > 
> > Is there a possible way to provide a total amount?  It could even be a
> > maximum total amount, just to cap the downtime.
> 
> The total amount is already reported today via the
> VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl and QEMU accounts that in the
> switchover decision.

Ok, I somehow got the impression previously that initial+dirty should be
the total.  It's likely because I was referring to this piece of code in
QEMU:

static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    if (!vfio_device_state_is_precopy(vbasedev)) {
        return;
    }

    *must_precopy +=
        migration->precopy_init_size + migration->precopy_dirty_size;

    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
                                      *can_postcopy,
                                      migration->precopy_init_size,
                                      migration->precopy_dirty_size);
}

After you said so, I found that the exact() version is indeed fetching the
stop-copy size:

static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                     uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;
    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;

    /*
     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
     * reported so downtime limit won't be violated.
     */
    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
    *must_precopy += stop_copy_size;

    if (vfio_device_state_is_precopy(vbasedev)) {
        vfio_query_precopy_size(migration);
    }

    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
                                   stop_copy_size, migration->precopy_init_size,
                                   migration->precopy_dirty_size);
}

Do you know why the estimate version doesn't report a cached stop_size
instead?

Reporting different things will also confuse QEMU in its estimate() and
exact() hooks.  They should report the same thing, except that estimate()
can use a fast path for the cached value.

> 
> > If with the current reporting definition, the VM is destined to have
> > unpredictable live migration downtime when relevant VFIO devices are
> > involved.
> > 
> > The larger the diff between the currently reported dirty value vs. the
> > "total data", the larger the downtime mistake can be.
> > 
> > > 
> > > The trouble comes, for example, if the device has undergone a
> > > reconfiguration during migration, which may effectively negate the
> > > initial_bytes and switchover-ack.
> > 
> > Ah, so it's about that, thanks.  IMHO it would be great if Yishai could
> > mention the source of growing initial_bytes somewhere in the commit log,
> > or even when documenting the new feature bit.
> 
> Sure, in V2 we can add the below chunk when documenting the new
> feature.
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 90e51e84539d..bb4a2df0550d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1268,6 +1268,8 @@ enum vfio_device_mig_state {
>   * value and decrease as migration data is read from the device.
>   * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates
>   * that new initial data is present on the stream.
> + * The new initial data may result, for example, from device reconfiguration
> + * during migration that requires additional initialization data.

This is helpful at least to me, thanks.

> 
> 
> > 
> > > 
> > > A user deducing they've sent enough device data to cover initial_bytes
> > > is essentially what we have now because our protocol doesn't allow the
> > > driver to reset initial_bytes.  The driver may choose to send that
> > > reconfiguration information in dirty_bytes, but we don't
> > > currently have any way to indicate to the user that the data remaining
> > > there is of higher importance for startup on the target than any other
> > > dirtying of device state.
> > > 
> > > Hopefully the user/VMM is already polling the interface for dirty
> > > bytes, where the opt-in for the protocol change here allows the
> > > driver to split out the priority bytes versus the background dirtying.
> > >
> > > > It definitely sounds a bit weird when some initial_* data can actually
> > > > change, because it's not "initial_" anymore.
> > > 
> > > It's just a priority scheme.  In the case I've outlined above it might
> > > be more aptly named setup_bytes or critical_bytes as you've used, but
> > > another driver might just use it for detecting migration compatibility.
> > > Naming is hard.
> > 
> > Yep. :) initial_bytes is still fine, at least to me.  I wonder if we
> > could still update the documentation of this field; then it'll be good
> > enough.
> 
> As Alex mentioned, initial_bytes can be used for various purposes.
> 
> So, I would keep the existing description in the uAPI.
> 
> In the context of the new feature, the uAPI commit message refers to
> initial_bytes as 'critical data' to explain the motivation behind the
> feature.  Together with the extra chunk in the uAPI suggested above, I
> believe this clarifies the intended usage.
> 
> Makes sense?

As long as Alex is happy with it, I'm OK either way.

> 
> > 
> > > > Another question is, if initial_bytes reached zero, could it be boosted
> > > > again to be non-zero?
> > > 
> > > Under the new protocol, yes, and the REINIT flag would be set to
> > > indicate it had been reset.  Under the old protocol, no.
> > > 
> > > > I don't see what stops it from happening, if the "we get some fresh new
> > > > critical data" seems to be able to happen anytime..  But if so, I wonder
> > > > if it's a problem for QEMU: when initial_bytes is reported as 0 at least
> > > > _once_, it means the src QEMU may decide to switch over.  Then it looks
> > > > like it beats the purpose of the whole "don't switch over until we flush
> > > > the critical data" idea.
> > > 
> > > The definition of the protocol in the header stops it from happening.
> > > We can't know that there isn't some userspace that follows the
> > > deduction protocol rather than polling.  We don't know that there isn't
> > > some userspace that segfaults if initial_bytes doesn't follow the
> > > published protocol.  Therefore it's opt-in, where we have a mechanism to
> > > expose a new initial_bytes session without it becoming a purely volatile
> > > value.
> > 
> > Here, IMHO the problem is QEMU still needs to know when a switchover can
> > happen.
> > 
> > After a new QEMU probes this new driver feature bit and enables it,
> > initial_bytes can be incremented when the REINIT flag is set.  This is
> > fine on its own.  But then, the src QEMU still needs to decide when it
> > can switch over.
> > 
> > It seems to me the only way to do it (with or without the new feature
> > bit enabled) is to rely on initial_bytes being zero.  When it's zero, it
> > means all possible "critical data" has been moved, so the src QEMU can
> > kick off that "switchover" message.
> > 
> > After that, IIUC we need to be prepared to trigger switchover anytime.
> > 
> > With the new REINIT, it means we can still observe a REINIT event after
> > the src QEMU makes that decision.  Would that be a problem?
> > 
> > Nowadays, looking at the vfio code, what happens is that the src QEMU,
> > after seeing initial_bytes==0, sends one VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
> > to the dest QEMU; later the dst QEMU acks that by sending back
> > MIG_RP_MSG_SWITCHOVER_ACK.  Then switchover can happen anytime per the
> > downtime calculation above.
> > 
> > Maybe there should be a solution in userspace to fix it, but we'll need
> > to figure it out.  Likely, we need one way or another to revoke the
> > switchover message; ultimately we need to stop the VM, query one last
> > time, and, on seeing initial_bytes==0, proceed with switchover.  If it
> > sees initial_bytes nonzero again, it will need to restart the VM and
> > revoke the previous message somehow.
> 
> The counterpart QEMU series that we pointed to handles that in a similar
> way to what you described.
> 
> The switchover-ack mechanism is modified to be revocable, and a final
> query to check initial_bytes == 0 is added after the vCPUs are stopped.

I had only roughly skimmed the series and overlooked the QEMU branch link.
I've now read it; indeed it does most of the above, except for one possible
issue I found: QEMU shouldn't fail the migration when REINIT happens after
exact() but before vm_stop().  Instead, IIUC, it should fall back to
iterations, try to move over the initial_bytes, and retry a switchover.

Thanks,

> 
> Thanks,
> Yishai
> 
> > 
> > > > Is there a way the HW can report and confidently say no further critical
> > > > data will be generated?
> > > 
> > > So long as there's a guest userspace running that can reconfigure the
> > > device, no.  But if you stop the vCPUs and test PRECOPY_INFO, it should
> > > be reliable.
> > 
> > This is definitely an important piece of info.  I recall Zhiyi used to
> > tell me there's no way to really stop a VFIO device from generating dirty
> > data.  Happy to know there still seems to be a way.  And now I suspect
> > what Zhiyi observed was exactly dirty_bytes growing even after the VM
> > stopped.  If that counter means "how much you can read now", it all makes
> > more sense (even though it may suffer from the issue I mentioned above).
> > 
> > > 
> > > > > The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
> > > > > caller that new initial data is available in the migration stream.
> > > > > 
> > > > > If the firmware reports a new initial-data chunk, any previously dirty
> > > > > bytes in memory are treated as initial bytes, since the caller must read
> > > > > both sets before reaching the end of the initial-data region.
> > > > 
> > > > This is unfortunate.  I believe it's a limitation of the current
> > > > single-fd streaming protocol, so the HW can only append things because
> > > > it's kind of a pipeline.
> > > > 
> > > > One thing to mention is, I recall VFIO migration suffers from a major
> > > > bottleneck on read() of the VFIO FD; it means this whole streaming
> > > > design is also causing other perf issues.
> > > > 
> > > > Have you or anyone thought about making it not a stream anymore?  Take
> > > > the example of RAM blocks: they are pagesize-accessible, and with that
> > > > we can do a lot more, e.g. we don't need to serialize pages into a
> > > > stream; we can send pages in whatever order.  Meanwhile, we can send
> > > > pages concurrently because they're not serialized either.
> > > > 
> > > > I wonder if VFIO FDs could provide something like that too.  As a
> > > > start it doesn't need to be as fine-grained; maybe instead of using one
> > > > stream it could provide two streams, one for initial_bytes (or, I
> > > > really think this should be called "critical data" or something
> > > > similar, if that's what it represents rather than "some initial
> > > > state"), another one for dirty.  Then at least when you attach new
> > > > critical data you don't need to flush the dirty queue too.
> > > > 
> > > > To extend it a bit more, we could also make e.g. the dirty queue span
> > > > multiple FDs, so that userspace can read() in multiple threads,
> > > > speeding up the switchover phase.
> > > > 
> > > > I have a vague memory that there are sometimes kernel big locks
> > > > blocking it, but from an interfacing POV it always sounds better to
> > > > avoid using one fd to stream everything.
> > > 
> > > I'll leave it to others to brainstorm improvements, but I'll note that
> > > flushing dirty_bytes is a driver policy; another driver could consider
> > > unread dirty bytes as invalidated by new initial_bytes and reset the
> > > counters.
> > > 
> > > It's not clear to me that there's a generic algorithm for handling
> > > device state as addressable blocks rather than serializing it into a
> > > data stream.  Multiple streams of different priorities seem feasible,
> > > but now we're talking about a v3 migration protocol.  Thanks,
> > 
> > Yep, definitely not a request to invent v3 yet, just brainstorming.  It
> > doesn't need to make all things addressable; being index-able (e.g. via
> > more than one object) would also be nice even through one fd, since then
> > it could be threadified somehow.
> > 
> > It seems the HW designers need to understand how the hypervisor collects
> > these HW data, so it does look like a hard problem when it cuts across
> > the stack from the silicon layer up..
> > 
> > I just had a feeling that v3 (or more) will come at some point when we want
> > to finally resolve the VFIO downtime problems..
> > 
> > Thanks,
> > 
> 

-- 
Peter Xu


