From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: "Malcolm Crossley" <mcrossley@nvidia.com>,
"Neo Jia" <cjia@nvidia.com>,
"Juan Quintela" <quintela@redhat.com>,
"Cornelia Huck" <cohuck@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
qemu-devel@nongnu.org, "Dheeraj Nigam" <dnigam@nvidia.com>,
"Philippe Mathieu-Daudé" <philmd@redhat.com>
Subject: Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Date: Tue, 10 Nov 2020 08:20:50 -0700
Message-ID: <20201110082050.10ca74d4@w520.home>
In-Reply-To: <898ba98f-9967-f3b3-737c-2d18b0281b51@nvidia.com>

On Tue, 10 Nov 2020 19:46:20 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
> On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> >> On Mon, 9 Nov 2020 19:44:17 +0000
> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >>
> >>> * Alex Williamson (alex.williamson@redhat.com) wrote:
> >>>> Per the proposed documentation for vfio device migration:
> >>>>
> >>>> Dirty pages are tracked when the device is in the stop-and-copy
> >>>> phase because, if pages are marked dirty during the pre-copy phase
> >>>> and their content is transferred from source to destination, there
> >>>> is no way to know which pages are newly dirtied after that copy
> >>>> until the device stops. To avoid repeatedly copying the same
> >>>> content, pinned pages are marked dirty only during the
> >>>> stop-and-copy phase.
> >>>>
> >>>> Essentially, since we don't have hardware dirty page tracking for
> >>>> assigned devices at this point, we consider any page that is pinned
> >>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
> >>>> be perpetually dirty. In the worst case, this may result in all of
> >>>> guest memory being considered dirty during every iteration of live
> >>>> migration. The current vfio implementation of migration has chosen
> >>>> to mask device dirtied pages until the final stages of migration in
> >>>> order to avoid this worst case scenario.
> >>>>
> >>>> Allowing the device to implement a policy decision to prioritize
> >>>> reduced migration data like this jeopardizes QEMU's overall ability
> >>>> to implement any degree of service level guarantees during migration.
> >>>> For example, any estimates towards achieving acceptable downtime
> >>>> margins cannot be trusted when such a device is present. The vfio
> >>>> device should participate in dirty page tracking to the best of its
> >>>> ability throughout migration, even if that means the dirty footprint
> >>>> of the device impedes migration progress, allowing both QEMU and
> >>>> higher level management tools to decide whether to continue the
> >>>> migration or abort due to failure to achieve the desired behavior.
> >>>
> >>> I don't feel particularly badly about the decision to squash it in
> >>> during the stop-and-copy phase; for devices where the pinned memory
> >>> is large, I don't think doing it during the main phase makes much sense;
> >>> especially if you then have to deal with tracking changes in pinning.
> >>
> >>
> >> AFAIK the kernel support for tracking changes in page pinning already
> >> exists; it's largely the vfio device in QEMU that decides when to
> >> start exposing the device dirty footprint to QEMU. I'm a bit surprised
> >> by this answer though; we don't really know what the device memory
> >> footprint is. It might be large, it might be nothing, but by not
> >> participating in dirty page tracking until the VM is stopped, we can't
> >> know what the footprint is and how it will affect downtime. Is it
> >> really the place of a QEMU device driver to impose this sort of policy?
> >
> > If it could actually track changes then I'd agree we shouldn't impose
> > any policy; but if it's just marking the whole area as dirty we're going
> > to need a bodge somewhere; this bodge doesn't look any worse than the
> > others to me.
> >
> >>
> >>> Having said that, I agree with marking it as experimental, because
> >>> for the same reason I'm dubious about how useful it will be; I worry
> >>> that the downtime will be so large as to make it pointless.
> >>
>
> Not all device state is large; for example, a NIC might only report
> currently mapped RX buffers, which are usually not more than 1 GB and
> could be as low as tens of MB. A GPU might or might not have large
> data; that depends on its use case.

Right, it's only if we have a vendor driver that doesn't pin any memory
when dirty tracking is enabled and we're running without a viommu that
we would expect all of guest memory to be continuously dirty.
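
For reference, a rough userspace sketch of the type1 IOMMU dirty
tracking UAPI that this relies on, assuming the dirty bitmap ioctl as
merged in <linux/vfio.h> for the v1 migration protocol; simplified and
untested, with the container fd, IOVA range, and page size as
caller-supplied placeholders:

  /*
   * Rough sketch, untested: start dirty page tracking on a type1 IOMMU
   * container and read back the dirty bitmap for one IOVA range.
   * Error handling is omitted for brevity.
   */
  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static int start_dirty_tracking(int container)
  {
      struct vfio_iommu_type1_dirty_bitmap dirty = {
          .argsz = sizeof(dirty),
          .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
      };

      return ioctl(container, VFIO_IOMMU_DIRTY_PAGES, &dirty);
  }

  static int get_dirty_bitmap(int container, uint64_t iova, uint64_t size,
                              uint64_t pgsize, uint64_t *bitmap)
  {
      size_t argsz = sizeof(struct vfio_iommu_type1_dirty_bitmap) +
                     sizeof(struct vfio_iommu_type1_dirty_bitmap_get);
      struct vfio_iommu_type1_dirty_bitmap *dirty = calloc(1, argsz);
      struct vfio_iommu_type1_dirty_bitmap_get *range = (void *)dirty->data;
      int ret;

      dirty->argsz = argsz;
      dirty->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
      range->iova = iova;
      range->size = size;
      range->bitmap.pgsize = pgsize;
      range->bitmap.size = (size / pgsize + 7) / 8;   /* bitmap bytes */
      range->bitmap.data = (void *)bitmap;

      ret = ioctl(container, VFIO_IOMMU_DIRTY_PAGES, dirty);
      free(dirty);
      return ret;
  }

The kernel side is willing to do this at any point; it's the vfio code
in QEMU that chooses when to start asking for the bitmap.
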
> >> TBH I think that's the wrong reason to mark it experimental. There's
> >> clearly demand for vfio device migration and even if the practical use
> >> cases are initially small, they will expand over time and hardware will
> >> get better. My objection is that the current behavior masks the
> >> hardware and device limitations, leading to unrealistic expectations.
> >> If the user expects minimal downtime, configures convergence to account
> >> for that, QEMU thinks it can achieve it, and then the device marks
> >> everything dirty, that's not supportable.
> >
> > Yes, agreed.
>
> Yes, there is demand for vfio device migration and many device owners
> have started scoping and developing migration support.
> Instead of marking the whole migration support as experimental, we
> could have an opt-in option to decide whether to mark system memory
> pages dirty during the iterative (pre-copy) phase of migration.

Per my previous suggestion, I'd think an opt-out would be more
appropriate, i.e. implementing pre-copy dirty page tracking by default.
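
(Just to illustrate the shape such an opt-out could take; the property
name and field below are invented for illustration, nothing like this
exists today:)

  /*
   * Hypothetical sketch only: an opt-out knob on vfio-pci, default on,
   * so pre-copy dirty page tracking happens unless the user disables it.
   * The property name and the pre_copy_dirty_page_tracking field are
   * made up for this example.
   */
  static Property vfio_pci_dev_properties[] = {
      /* ... existing vfio-pci properties ... */
      DEFINE_PROP_BOOL("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
                       vbasedev.pre_copy_dirty_page_tracking, true),
      DEFINE_PROP_END_OF_LIST(),
  };

A user who really wants today's behavior could then disable the knob on
the command line, while the default keeps QEMU's convergence and
downtime estimates honest.
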
> >> OTOH if the vfio device
> >> participates in dirty tracking through pre-copy, then the practical use
> >> cases will find themselves as migrations will either be aborted because
> >> downtime tolerances cannot be achieved or downtimes will be configured
> >> to match reality. Thanks,
> >
> > Without a way to prioritise the unpinned memory during that period,
> > we're going to be repeatedly sending the pinned memory, which is going
> > to lead to much larger bandwidth usage than required; that's going in
> > completely the wrong direction and is also wrong from the point of
> > view of the user.

Who decides which is the wrong direction for the user? If the user
wants minimal bandwidth regardless of downtime, can't they create a
procedure to pause the VM, do the migration, then resume? Are there
already migration tunables to effectively achieve this? If a user
attempts to do a "live" migration, isn't our priority then shifted to
managing the downtime constraints over the bandwidth? IOW the policy
decision is implied by the user's actions and configuration of the
migration; I don't think we should be guessing at the device level
which feature to prioritize, just like a vCPU doesn't get to stop
marking dirty pages during pre-copy because it's touching too much
memory. Higher level policies and configurations should determine the
inflection points... imo. Thanks,
Alex