From: Kirti Wankhede <kwankhede@nvidia.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Alex Williamson <alex.williamson@redhat.com>
Cc: "Malcolm Crossley" <mcrossley@nvidia.com>,
"Neo Jia" <cjia@nvidia.com>,
"Juan Quintela" <quintela@redhat.com>,
"Cornelia Huck" <cohuck@redhat.com>,
qemu-devel@nongnu.org, "Dheeraj Nigam" <dnigam@nvidia.com>,
"Philippe Mathieu-Daudé" <philmd@redhat.com>
Subject: Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Date: Tue, 10 Nov 2020 19:46:20 +0530
Message-ID: <898ba98f-9967-f3b3-737c-2d18b0281b51@nvidia.com>
In-Reply-To: <20201110091037.GA3108@work-vm>
On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> * Alex Williamson (alex.williamson@redhat.com) wrote:
>> On Mon, 9 Nov 2020 19:44:17 +0000
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>>
>>> * Alex Williamson (alex.williamson@redhat.com) wrote:
>>>> Per the proposed documentation for vfio device migration:
>>>>
>>>> Dirty pages are tracked when the device is in the stop-and-copy
>>>> phase, because if pages are marked dirty during the pre-copy phase
>>>> and their content is transferred from source to destination, there
>>>> is no way to distinguish newly dirtied pages from the point they
>>>> were copied until the device stops. To avoid repeated copies of
>>>> the same content, pinned pages are marked dirty only during the
>>>> stop-and-copy phase.
>>>>
>>>> Essentially, since we don't have hardware dirty page tracking for
>>>> assigned devices at this point, we consider any page that is pinned
>>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
>>>> be perpetually dirty. In the worst case, this may result in all of
>>>> guest memory being considered dirty during every iteration of live
>>>> migration. The current vfio implementation of migration has chosen
>>>> to mask device dirtied pages until the final stages of migration in
>>>> order to avoid this worst case scenario.
>>>>
>>>> Allowing the device to implement a policy decision to prioritize
>>>> reduced migration data like this jeopardizes QEMU's overall ability
>>>> to implement any degree of service level guarantees during migration.
>>>> For example, any estimates towards achieving acceptable downtime
>>>> margins cannot be trusted when such a device is present. The vfio
>>>> device should participate in dirty page tracking to the best of its
>>>> ability throughout migration, even if that means the dirty footprint
>>>> of the device impedes migration progress, allowing both QEMU and
>>>> higher level management tools to decide whether to continue the
>>>> migration or abort due to failure to achieve the desired behavior.
>>>
>>> I don't feel particularly bad about the decision to squash it in
>>> during the stop-and-copy phase; for devices where the pinned memory
>>> is large, I don't think doing it during the main phase makes much
>>> sense, especially if you then have to deal with tracking changes in
>>> pinning.
>>
>>
>> AFAIK the kernel support for tracking changes in page pinning already
>> exists; it's largely the vfio device in QEMU that decides when to
>> start exposing the device's dirty footprint to the rest of QEMU. I'm
>> a bit surprised by this answer though; we don't really know what the
>> device memory footprint is. It might be large, it might be nothing,
>> but by not participating in dirty page tracking until the VM is
>> stopped, we can't know what the footprint is or how it will affect
>> downtime. Is it really the place of a QEMU device driver to impose
>> this sort of policy?
>
> If it could actually track changes then I'd agree we shouldn't impose
> any policy; but if it's just marking the whole area as dirty we're going
> to need a bodge somewhere; this bodge doesn't look any worse than the
> others to me.
>
>>
>>> Having said that, I agree with marking it as experimental, because
>>> I'm dubious about how useful it will be for the same reason: I worry
>>> that the downtime will be so large as to make it pointless.
>>
Not all device state is large. For example, a NIC might only report its
currently mapped RX buffers, which are usually no more than 1GB and can
be as low as tens of MB. A GPU might or might not have a large data
footprint, depending on its use case.
>>
>> TBH I think that's the wrong reason to mark it experimental. There's
>> clearly demand for vfio device migration and even if the practical use
>> cases are initially small, they will expand over time and hardware will
>> get better. My objection is that the current behavior masks the
>> hardware and device limitations, leading to unrealistic expectations.
>> If the user expects minimal downtime, configures convergence to account
>> for that, QEMU thinks it can achieve it, and then the device marks
>> everything dirty, that's not supportable.
>
> Yes, agreed.
Yes, there is demand for vfio device migration, and many device owners
have started scoping and developing migration support.
Instead of marking the whole of migration support experimental, we could
add an opt-in option that decides whether to mark system memory pages
dirty during the iterative (pre-copy) phase of migration.
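
A minimal sketch of what such an opt-in could look like, following the
same pattern as the x-enable-migration property in the patch below; the
property name, field name, and call site here are illustrative only,
not an agreed interface:

    /* hw/vfio/pci.c: hypothetical opt-in property, default off */
    DEFINE_PROP_BOOL("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
                     vbasedev.pre_copy_dirty_page_tracking, false),

    /* include/hw/vfio/vfio-common.h: matching flag in VFIODevice */
    bool pre_copy_dirty_page_tracking;

    /* hw/vfio/migration.c: consulted wherever the device decides
     * whether to report its pinned-page footprint during the
     * iterative phase; the exact hook depends on the final series. */
    if (vbasedev->pre_copy_dirty_page_tracking) {
        /* report pinned pages as dirty during pre-copy as well */
    }

With the flag left at its default, behavior would match today's
stop-and-copy-only reporting.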
Thanks,
Kirti
>
>> OTOH if the vfio device
>> participates in dirty tracking through pre-copy, then the practical use
>> cases will sort themselves out: migrations will either be aborted
>> because downtime tolerances cannot be achieved, or downtimes will be
>> configured to match reality. Thanks,
>
> Without a way to prioritise the unpinned memory during that period,
> we're going to be repeatedly sending the pinned memory, which is going
> to lead to much larger bandwidth usage than required; so that's going
> in completely the wrong direction, and it's also wrong from the point
> of view of the user.
>
> Dave
>
>>
>> Alex
>>
>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>
>>>> Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
>>>> Cc: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Cc: Neo Jia <cjia@nvidia.com>
>>>> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>> Cc: Juan Quintela <quintela@redhat.com>
>>>> Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
>>>> Cc: Cornelia Huck <cohuck@redhat.com>
>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>> ---
>>>>
>>>> Given that our discussion in the link above seems to be going in
>>>> circles, I'm afraid it seems necessary to both have a contingency
>>>> plan and to raise the visibility of the current behavior to
>>>> determine whether others agree that this is a sufficiently
>>>> troubling behavior to consider migration support experimental
>>>> at this stage. Please voice your opinion or contribute patches
>>>> to resolve this before QEMU 5.2. Thanks,
>>>>
>>>> Alex
>>>>
>>>> hw/vfio/migration.c | 2 +-
>>>> hw/vfio/pci.c | 2 ++
>>>> include/hw/vfio/vfio-common.h | 1 +
>>>> 3 files changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 3ce285ea395d..cd44d465a50b 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>>> Error *local_err = NULL;
>>>> int ret = -ENOTSUP;
>>>>
>>>> - if (!container->dirty_pages_supported) {
>>>> + if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
>>>> goto add_blocker;
>>>> }
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 58c0ce8971e3..1349b900e513 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
>>>> VFIO_FEATURE_ENABLE_REQ_BIT, true),
>>>> DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
>>>> VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>>>> + DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
>>>> + vbasedev.enable_migration, false),
>>>> DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
>>>> DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
>>>> vbasedev.ram_block_discard_allowed, false),
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index baeb4dcff102..2119872c8af1 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>>>> bool needs_reset;
>>>> bool no_mmap;
>>>> bool ram_block_discard_allowed;
>>>> + bool enable_migration;
>>>> VFIODeviceOps *ops;
>>>> unsigned int num_irqs;
>>>> unsigned int num_regions;
>>>>
>>
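
For reference, with this patch applied migration support becomes
strictly opt-in; a sketch of enabling it on the QEMU command line (the
host device address is illustrative only):

    -device vfio-pci,host=0000:65:00.0,x-enable-migration=on

Without the option (it defaults to false), vfio_migration_probe() takes
the add_blocker path and the device registers a migration blocker.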
Thread overview: 8+ messages
2020-11-09 18:56 [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental Alex Williamson
2020-11-09 19:44 ` Dr. David Alan Gilbert
2020-11-09 20:29 ` Alex Williamson
2020-11-10 9:10 ` Dr. David Alan Gilbert
2020-11-10 14:16 ` Kirti Wankhede [this message]
2020-11-10 15:20 ` Alex Williamson
2020-11-10 21:26 ` Neo Jia
2020-11-10 12:03 ` Cornelia Huck