From: Avihai Horon <avihaih@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org,
"Alex Williamson" <alex.williamson@redhat.com>,
"Juan Quintela" <quintela@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
"Cornelia Huck" <cohuck@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Vladimir Sementsov-Ogievskiy" <vsementsov@yandex-team.ru>,
"Cédric Le Goater" <clg@redhat.com>,
"Yishai Hadas" <yishaih@nvidia.com>,
"Jason Gunthorpe" <jgg@nvidia.com>,
"Maor Gottlieb" <maorg@nvidia.com>,
"Kirti Wankhede" <kwankhede@nvidia.com>,
"Tarun Gupta" <targupta@nvidia.com>,
"Joao Martins" <joao.m.martins@oracle.com>,
"Fabiano Rosas" <farosas@suse.de>, "Zhiyi Guo" <zhguo@redhat.com>
Subject: Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
Date: Thu, 5 Sep 2024 19:07:28 +0300
Message-ID: <812e89c4-35d8-4fc0-ac10-ec36d57f215c@nvidia.com>
In-Reply-To: <ZtnLhW-2eo8hA7bQ@x1n>
On 05/09/2024 18:17, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 02:41:09PM +0300, Avihai Horon wrote:
>> On 04/09/2024 19:16, Peter Xu wrote:
>>> On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
>>>> On 04/09/2024 16:00, Peter Xu wrote:
>>>>> Hello, Avihai,
>>>>>
>>>>> Reviving this thread just to discuss one issue below..
>>>>>
>>>>> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
>>>>>> +/*
>>>>>> + * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>>>> + * many GBs. This value should be big enough to cover the worst case.
>>>>>> + */
>>>>>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>>>> +
>>>>>> +/*
>>>>>> + * Only the exact function is implemented, not the estimate function. The
>>>>>> + * reason is that during the pre-copy phase of migration the estimate
>>>>>> + * function is called repeatedly while the pending RAM size is over the
>>>>>> + * threshold, so migration can't converge and querying the VFIO device
>>>>>> + * pending data size is useless.
>>>>>> + */
>>>>>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>>>> + uint64_t *can_postcopy)
>>>>>> +{
>>>>>> + VFIODevice *vbasedev = opaque;
>>>>>> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>>>> +
>>>>>> + /*
>>>>>> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>>>>>> + * reported so downtime limit won't be violated.
>>>>>> + */
>>>>>> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>>>> + *must_precopy += stop_copy_size;
>>>>> Is this the chunk of data that can only be copied while the VM is stopped?
>>>>> If so, I wonder why it's reported as "must precopy" if we know precopy
>>>>> won't ever move it..
>>>> A VFIO device that doesn't support precopy will send this data only when
>>>> the VM is stopped.
>>>> A VFIO device that supports precopy may or may not send this data (or part
>>>> of it) during precopy, depending on the specific VFIO device.
>>>>
>>>> According to the state_pending_{estimate,exact} documentation, must_precopy
>>>> is the amount of data that must be migrated before the target starts, and
>>>> this VFIO data indeed must be migrated before the target starts.
>>> Correct. It's just that the estimate/exact API is more suitable when
>>> exact() reports something close to estimate(), with estimate() being only
>>> a fast path of exact(). It causes chaos like this when that doesn't hold.
>>>
>>> Since we have a discussion elsewhere on the fact that we currently ignore
>>> non-iterative device state when migration decides to switch over, I was
>>> wondering, while replying, whether this stop-size could be the first thing
>>> to start providing such non-iterable pending data. But that might indeed
>>> need more thought; at least we may want to collect more outliers of
>>> non-iterative device state outside VFIO that can cause downtime to be too
>>> large.
>> Ah, I see.
>>
>>>>> The issue is that with such reporting (and in the latest master branch we
>>>>> now have the precopy size too, reported in both exact() and estimate()),
>>>>> we can observe weird reports like this:
>>>>>
>>>>> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>>>> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
>>>>> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>>>> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
>>>>> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
>>>>> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>>
>>>>> So estimate() keeps reporting zero while exact() reports a much larger
>>>>> size, and it keeps spinning like this. I think that's not how it was
>>>>> designed to be used..
>>>> It keeps spinning and migration doesn't converge?
>>>> If so, configuring a higher downtime limit or the avail-switchover-bandwidth
>>>> parameter may solve it.
>>> Yes, this is the only way to go, but the reporting of estimate()/exact()
>>> is a separate issue. More below.
>>>
>>>>> Does this stop copy size change for a VFIO device or not?
>>>> It depends on the specific VFIO device.
>>>> If the device supports precopy and all (or part) of its data is
>>>> precopy-able, then the stopcopy size will change.
>>>> Besides that, the amount of resources currently used by the VFIO device
>>>> can also affect the stopcopy size, and it may increase or decrease as
>>>> resources are created or destroyed.
>>> I see, thanks.
>>>
>>>>> IIUC, we may want some other mechanism to report the stop-copy size for a
>>>>> device, rather than reporting it with the current exact()/estimate() API.
>>>>> That API is, per my understanding, only used for iterable data, while
>>>>> stop-copy-size may not fall into that category.
>>>> The above situation is caused by the fact that VFIO data may not be fully
>>>> precopy-able (as opposed to RAM data).
>>>> I don't think reporting the stop-copy-size via a different API will help
>>>> the above situation -- we would still have to take stop-copy-size into
>>>> account before converging, so as not to violate the downtime limit.
>>> It will help some other situation, though.
>>>
>>> One issue with the above frequent estimate()/exact() calls is that QEMU
>>> goes into a madness loop, alternating between thinking "we're close" and
>>> "we're far from converging" even when the reality is "we're far away".
>>> The bad side effect is that when this loop happens it affects not only
>>> VFIO but also other devices (e.g. KVM, vhost, etc.), so we do the
>>> high-overhead sync() extremely frequently. IMHO that's a total waste of
>>> resources, because all the other modules follow the default rules of
>>> estimate()/exact().
>>>
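(Spelling this out, since the loop shape matters below: the core convergence
check is roughly the following -- a simplified sketch from memory, not the
exact QEMU code. It shows why an estimate() that stays at zero forces the
expensive exact() path on every single iteration:)

    /* Cheap path first: no ioctls, just cached/derived numbers. */
    qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);

    if (must_precopy <= threshold_size) {
        /*
         * Looks close to convergence, so pay for the precise answer:
         * this is what syncs the dirty bitmaps (KVM GET_DIRTY_LOG /
         * CLEAR_DIRTY_LOG) and queries the VFIO device.
         */
        qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
    }

    if (must_precopy + can_postcopy <= threshold_size) {
        migration_completion(s);    /* switchover: stop VM, send the rest */
    }
    /*
     * Otherwise iterate and loop again. With estimate() pinned at zero,
     * every pass takes the expensive exact() branch even though we are
     * nowhere near converging.
     */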
>>> One simple but efficient fix for VFIO, IMHO, is that VFIO should at least
>>> cache the stop-size internally and report it in estimate() too, e.g.:
>>>
>>> /* Give an estimate of the amount left to be transferred,
>>> * the result is split into the amount for units that can and
>>> * for units that can't do postcopy.
>>> */
>>> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
>>> uint64_t *can_postcopy)
>>> {
>>> }
>>>
>>> If it's justified for the stop-size to be reported in exact(), it should
>>> also be reported in estimate(), per the comment above. It should also fall
>>> into the precopy category in this case.
>>>
>>> Then with that we should avoid calling exact() frequently, not only for
>>> VFIO but also for others (especially the KVM GET_DIRTY_LOG /
>>> CLEAR_DIRTY_LOG ioctls), since we know it won't converge anyway without
>>> raising the downtime limit or adjusting avail-switchover-bandwidth.
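(Re-reading this now: if I follow, a minimal sketch of that caching idea
would be something like the below -- untested, and the stop_copy_size field
on VFIOMigration is hypothetical here:)

    static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
                                            uint64_t *can_postcopy)
    {
        VFIODevice *vbasedev = opaque;
        VFIOMigration *migration = vbasedev->migration;

        /* Cheap path: report the cached stop-copy size, no device ioctl. */
        *must_precopy += migration->stop_copy_size;
    }

    static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                         uint64_t *can_postcopy)
    {
        VFIODevice *vbasedev = opaque;
        VFIOMigration *migration = vbasedev->migration;
        uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;

        /* The worst-case constant remains the fallback if the query fails. */
        vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
        migration->stop_copy_size = stop_copy_size; /* cache for estimate() */
        *must_precopy += stop_copy_size;
    }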
>> Yes, it will solve this problem, but IIRC, the reason I didn't cache and
>> return stop-copy-size in estimate in the first place was that it isn't
>> fully precopy-able, and that could cause some problems:
>> Once we cache and report this size in estimate, it may never decrease --
>> we can't send it during precopy, and we might never call get
>> stop-copy-size again from the exact path (because estimate might never
>> drop below the threshold).
>>
>> For example, assume the threshold for convergence is 1GB.
>> A VFIO device may report a 2GB stop-copy-size (not precopy-able) at the
>> beginning of the migration.
>> If the VFIO device's stop-copy-size changes in the middle of migration
>> (e.g., some of its resources are destroyed) and drops to 900MB, we will
>> still see the 2GB report in estimate.
>> Only when the exact handler is called again will we get the updated 900MB
>> value and be able to converge. But that won't happen, because the estimate
>> size will always stay above the 1GB threshold.
>>
>> We could prevent these situations by querying the stop-copy size once
>> every X seconds, but that feels a bit awkward.
> IMHO the confusion is caused by an unclear definition of stop-size and
> precopy-size in VFIO terms.
>
> What I would hope is that the reported stop-size is constant and not
> affected by the precopy process. Whatever _can_ change at all should be
> reported as precopy-size.
>
> So I wonder why the stop-size can change from a driver's perspective, and
> whether it can be reported in a more predictable fashion. Otherwise I see
> little point in providing both stop-size and precopy-size, since we'll
> always add them up into VFIO's total pending size. The line dividing which
> data falls into which bucket doesn't seem clear to me.
Stopcopy-size is reported by the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl,
whose documentation states that this is "[...] the estimated data length
that will be required to complete stop copy".
So by this definition, stopcopy-size can change during precopy (it can also
change as device resources are created or destroyed).
Precopy-size is reported by the VFIO_MIG_GET_PRECOPY_INFO ioctl, whose
documentation states that this is "[...] estimates of data available from
the device during the PRE_COPY states".
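(For reference, the relevant uapi structures as I read the current
linux-headers -- abridged from memory, so double-check against
include/uapi/linux/vfio.h:)

    struct vfio_device_feature_mig_data_size {
            __aligned_u64 stop_copy_length;  /* estimated stop-copy bytes */
    };

    struct vfio_precopy_info {
            __u32 argsz;
            __u32 flags;
            __aligned_u64 initial_bytes;     /* precopy-able initial data */
            __aligned_u64 dirty_bytes;       /* precopy-able dirty data */
    };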
Maybe the confusion was caused by this line in vfio_state_pending_exact():

    *must_precopy += migration->precopy_init_size +
                     migration->precopy_dirty_size;

which I think should be removed, as stopcopy-size should already cover the
precopy data.
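For example (hypothetical numbers): if a device reports a 2GB stop-copy size
and 1.5GB of precopy initial+dirty data, today's exact() reports 3.5GB
pending, although only about 2GB actually remain to be transferred -- the
precopy-able bytes are counted twice.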
Thanks.
>
>>> This may improve the situation but still leaves one other issue: IIUC,
>>> even with the above change, and even if we can avoid syncing the dirty
>>> bitmap frequently, the migration thread can still spin at 100% calling
>>> estimate() and keep seeing pending data (which is not iterable...). For
>>> the longer term we may still need to report the non-iterable stop-size in
>>> another way, so QEMU knows that iterating over all the registered VMState
>>> handlers won't help in this case and it should go to sleep without eating
>>> the cores. I hope that explains why I think a new API will still be
>>> needed in the long run.
>> Yes, I get your point.
>> But is the spinning case very common? AFAIU, if it happens, it may indicate
>> some misconfiguration of the migration parameters.
> Agreed.
>
> It's just that our QE actively tests such migrations and there are always
> similar weird reports of CPU spinning, so it might be nice to fix it at
> some point.
>
> This shouldn't be super urgent, indeed. It just starts to make sense given
> the other discussions where reporting non-iterable pending data might be
> pretty useful as an idea on its own. We still need to figure out the
> above, i.e. how VFIO can report a predictable stop-size in the drivers,
> and I wonder whether the drivers can be fixed (without breaking existing
> QEMU; after all, they currently always add up both counters..).
>
>> Anyway, I think your idea may fit VFIO, but we still need to think through
>> all the small details.
> Thanks,
>
> --
> Peter Xu
>