public inbox for kvm@vger.kernel.org
From: Peter Xu <peterx@redhat.com>
To: Yishai Hadas <yishaih@nvidia.com>
Cc: alex@shazbot.org, jgg@nvidia.com, kvm@vger.kernel.org,
	kevin.tian@intel.com, joao.m.martins@oracle.com,
	leonro@nvidia.com, maorg@nvidia.com, avihaih@nvidia.com,
	clg@redhat.com, liulongfang@huawei.com,
	giovanni.cabiddu@intel.com, kwankhede@nvidia.com
Subject: Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
Date: Thu, 12 Mar 2026 13:37:04 -0400
Message-ID: <abL5wKfPGzi88iBy@x1.local>
In-Reply-To: <20260310164006.4020-7-yishaih@nvidia.com>

Hi, Yishai,

Please feel free to treat my comments as pure questions only.

On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
> When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
> driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
> to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
> value.

Does this also mean that VFIO_PRECOPY_INFO_REINIT is little more than a
hint, one that userspace could deduce on its own if it remembers the
initial_bytes value from its last fetch?

It definitely sounds a bit weird when some initial_* data can actually
change, because it's not "initial_" anymore.

Another question is, if initial_bytes reached zero, could it be boosted
again to be non-zero?

I don't see what stops that from happening, if "we get some fresh new
critical data" seems to be possible at any time.  But if so, I wonder
whether it's a problem for QEMU: once initial_bytes has been reported as 0
at least _once_, the src QEMU may decide to switch over.  That looks like it
defeats the whole idea of "don't switch over until we flush the critical
data".

Is there a way for the HW to report, and confidently say, that no further
critical data will be generated?

> 
> The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
> caller that new initial data is available in the migration stream.
> 
> If the firmware reports a new initial-data chunk, any previously dirty
> bytes in memory are treated as initial bytes, since the caller must read
> both sets before reaching the end of the initial-data region.

This is unfortunate.  I believe it's a limitation of the current single-fd
streaming protocol: the HW can only append things because it's kind of a
pipeline.

One thing to mention: I recall VFIO migration suffering from a major
bottleneck on read() of the VFIO FD, which means this whole streaming
design is also causing other perf issues.

Have you or anyone thought about making it not a stream anymore?  Take RAM
blocks as an example: they are accessible at page-size granularity, which
lets us do a lot more.  E.g. we don't need to serialize pages into one
stream, we can send them in whatever order.  We can also send pages
concurrently, because they're not serialized.

I wonder if VFIO FDs can provide something like that too.  As a start it
doesn't need to be as fine-grained: instead of using one stream, it could
provide at least two streams, one for initial_bytes (or, I really think
this should be called "critical data" or something similar, if that is what
it represents rather than "some initial states"), and another one for
dirty.  Then at least when you attach new critical data you don't need to
flush the dirty queue too.
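
A toy accounting model of the difference, based only on the commit message
quoted above.  The struct and function names are invented for this sketch,
and the two-stream variant is the hypothetical interface being suggested
here, not something that exists today:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of what userspace still has to read; not driver code. */
struct stream_counts {
    uint64_t initial_bytes; /* must be read before switchover */
    uint64_t dirty_bytes;   /* may be left for the stop-copy phase */
};

/*
 * Single stream: the new critical chunk is appended behind the dirty
 * data already queued, so that dirty data has to be read first and is
 * therefore counted as initial as well.
 */
static struct stream_counts reinit_single_stream(struct stream_counts c,
                                                 uint64_t new_critical)
{
    c.initial_bytes += c.dirty_bytes + new_critical;
    c.dirty_bytes = 0;
    return c;
}

/*
 * Two streams (hypothetical): critical data has its own queue, so the
 * queued dirty data is left untouched.
 */
static struct stream_counts reinit_two_streams(struct stream_counts c,
                                               uint64_t new_critical)
{
    c.initial_bytes += new_critical;
    return c;
}
```

With 1000 dirty bytes queued and a new 500-byte critical chunk, the
single-stream model ends up with 1500 bytes to read before switchover,
versus 500 in the two-stream model.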

Extending that a bit more, we could also split e.g. the dirty queue across
multiple FDs, so that userspace can read() from multiple threads, speeding
up the switchover phase.

I have a vague memory that big kernel locks sometimes get in the way of
that, but from an interface POV it always sounds better to avoid using one
fd to stream everything.

Thanks,

> 
> In this case, the driver issues a new SAVE command to fetch the data and
> prepare it for a subsequent read() from userspace.

-- 
Peter Xu



Thread overview: 16+ messages
2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 1/6] vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 2/6] vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 3/6] vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 4/6] net/mlx5: Add IFC bits for migration state Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 5/6] vfio/mlx5: consider inflight SAVE during PRE_COPY Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO Yishai Hadas
2026-03-12 17:37   ` Peter Xu [this message]
2026-03-12 19:08     ` Alex Williamson
2026-03-12 20:16       ` Peter Xu
2026-03-15 14:19         ` Yishai Hadas
2026-03-16 19:24           ` Peter Xu
2026-03-17  9:58             ` Avihai Horon
2026-03-17 14:06               ` Peter Xu
2026-03-17 15:22                 ` Avihai Horon
2026-03-17 15:52                   ` Peter Xu
