From: Avihai Horon <avihaih@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
Joao Martins <joao.m.martins@oracle.com>,
Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: [PATCH] migration: Don't serialize migration while can't switchover
Date: Wed, 28 Feb 2024 12:27:40 +0200 [thread overview]
Message-ID: <dec8b4a6-3bbc-4db7-aabf-7c85c43b47d5@nvidia.com> (raw)
In-Reply-To: <Zd8IUJErmhDYIiXR@x1n>
On 28/02/2024 12:17, Peter Xu wrote:
> On Wed, Feb 28, 2024 at 11:39:52AM +0200, Avihai Horon wrote:
>> On 28/02/2024 5:04, Peter Xu wrote:
>>> On Wed, Feb 28, 2024 at 02:00:26AM +0200, Avihai Horon wrote:
>>>> On 27/02/2024 9:41, Peter Xu wrote:
>>>>> On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:
>>>>>> Currently, the migration code serializes device data sending during the
>>>>>> pre-copy iterative phase. As noted in the code comment, this is done to
>>>>>> prevent a faster-changing device from sending its data over and over.
>>>>> Frankly speaking, I don't understand the rationale behind 90697be889 ("live
>>>>> migration: Serialize vmstate saving in stage 2"). I don't think I even
>>>>> noticed this logic before, even though I worked on migration for a few years...
>>>>>
>>>>> I was thinking all devices should always get their chance to run for some
>>>>> period during iterations. Do you know the reasoning behind it? And I must
>>>>> confess I also know little about block migration, which seems to be relevant
>>>>> to this change. Anyway, I also copy Jan just in case he'll be able to chime
>>>>> in.
>>>> I am not 100% sure either, but I can guess:
>>>> This commit is pretty old (dates to 2009), so maybe back then the only
>>>> iterative devices were block devices and RAM.
>>>> Block devices didn't change as fast as RAM (and were probably much bigger
>>>> than RAM), so it made sense to send them first and only then send RAM, which
>>>> changed constantly.
>>> Makes sense. For some reason I read it the other way round previously.
>>>
>>>>> If there is a fast-changing device, even if we don't proceed with the other
>>>>> device iterators and stick with the current one until it finally finishes
>>>>> dumping all its data, won't the fast-changing device accumulate dirty data
>>>>> again once we proceed with the other devices?
>>>> I guess this logic only makes sense for the case where we only have block
>>>> devices and a RAM device, because the block devices wouldn't change that
>>>> much?
>>>>
>>>>>> However, with switchover-ack capability enabled, this behavior can be
>>>>>> problematic and may prevent migration from converging. The problem lies
>>>>>> in the fact that an earlier device may never finish sending its data and
>>>>>> thus block other devices from sending theirs.
>>>>> Yes, this is a problem.
>>>>>
>>>>>> This bug was observed in several VFIO migration scenarios where some
>>>>>> workload on the VM prevented RAM from ever reaching a hard zero, not
>>>>>> allowing the VFIO initial pre-copy data to be sent, and thus the
>>>>>> destination could not ack the switchover. Note that the same scenario,
>>>>>> but without switchover-ack, would converge.
>>>>>>
>>>>>> Fix it by not serializing device data sending during the pre-copy
>>>>>> iterative phase if switchover has not been acked yet.
>>>>> I am still not fully convinced that it's even legal for one device to
>>>>> consume all of the iteration bandwidth, ignoring the rest.. Though again,
>>>>> it's not about this patch, but about commit 90697be889.
>>>> Yes, I agree. As I wrote above, maybe this was valid back then when the only
>>>> iterative devices were block and RAM.
>>>>
>>>>> I'm wondering whether we should allow each device its own share of the
>>>>> chance to push data on each call to qemu_savevm_state_iterate(),
>>>>> irrespective of VFIO's switchover-ack capability.
>>>> I wasn't 100% sure why we have this logic in the first place, so I wrote
>>>> my patch to be as minimally invasive as possible, keeping migration
>>>> behavior as is (except for switchover-ack).
>>>> But I tend to agree with you for three reasons:
>>>>
>>>> 1. I think block migration is deprecated (see commits 66db46ca83b8,
>>>> 40101f320d6b and 8846b5bfca76).
>>>> Instead, users are instructed to use blockdev-mirror and co. [1]. If I'm not
>>>> mistaken, this operates using a different infrastructure than migration.
>>>> So this logic is not relevant anymore.
>>>>
>>>> 2. As you pointed out earlier, the fast-changing device can accumulate dirty
>>>> data over and over. VFIO devices come after RAM, so this logic doesn't
>>>> achieve its goal in this case (we may sync fast-changing RAM over and over).
>>>>
>>>> 3. My fix in this patch won't solve a similar problem that could happen,
>>>> where a VFIO device with a lot of pre-copy data (not necessarily initial
>>>> data) may never be able to send it, thus not realizing the full potential of
>>>> pre-copy for VFIO.
>>>> (I personally have not encountered this problem yet, but maybe it can happen
>>>> with a vGPU).
>>> Thanks for the summary.
>>>
>>>> If you agree, I can send a v2 that simply removes this logic and gives every
>>>> device an equal chance to send its data (like Joao showed), with some
>>>> explanation of why we do it.
>>> Let's see whether others have an opinion, but I think we can give it
>>> a shot. In that case we can "break" in the previous "ret < 0" check
>>> already.
>> Sure.
>> So I will wait a bit longer and may send a v2 next week, if no further
>> feedback is received.
>>
>>> One more thing to mention: I think we then need to handle the case where
>>> all iterators returned 1 (i.e., "all complete") explicitly. With the old
>>> check, if the loop iterated over all the iterators, it was guaranteed that
>>> all of them had completed. Now that we allow ret == 0 to keep iterating,
>>> the check needs to be done explicitly.
>>>
>>> In general, it could be something like:
>>>
>>>     all_done = 1;
>>>     loop {
>>>         ...
>>>         ret = se->ops->save_live_iterate(f, se->opaque);
>>>         if (ret < 0) {
>>>             /* error handling.. */
>>>             ...
>>>             break;
>>>         } else if (ret == 0) {
>>>             all_done = 0;
>>>         }
>>>     }
>>>     return all_done;
>> Yes, this looks good.
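
A rough sketch of what the revised qemu_savevm_state_iterate() loop could
look like with the serialization removed, assuming the surrounding structure
in migration/savevm.c stays as it is today (the is_active/is_active_iterate/
postcopy checks, section header/footer and tracing of the real function are
elided here):

    int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
    {
        SaveStateEntry *se;
        int all_done = 1;

        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
            int ret;

            if (!se->ops || !se->ops->save_live_iterate) {
                continue;
            }
            /* ... is_active/postcopy checks, header/footer, tracing ... */

            if (migration_rate_exceeded(f)) {
                /* Bandwidth budget used up; resume on the next call. */
                return 0;
            }

            ret = se->ops->save_live_iterate(f, se->opaque);
            if (ret < 0) {
                qemu_file_set_error(f, ret);
                return ret;
            } else if (ret == 0) {
                /*
                 * This device has not finished yet, but keep going so
                 * the remaining devices also get a chance this round.
                 */
                all_done = 0;
            }
        }

        /* 1 only when every iterator reported completion. */
        return all_done;
    }

This way a single device can no longer starve the others within one call,
and the return value still tells the caller whether all iterators have
completed.
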
>>
>>>> We could also give RAM precedence over other devices only during the first
>>>> iteration of sending RAM (i.e., only until the first dirty sync), but I
>>>> don't know how much benefit this would give.
>>> That sounds a bit tricky. We can leave that for later, until it is
>>> justified as being useful.
>> I agree.
>>
>>> Now, when I checked the VFIO code, I found that VFIO may still have two issues:
>>>
>>> ====
>>> 1) VFIO doesn't yet respect migration_rate_exceeded()
>>>
>>> That's the knob we use to limit send throughput: when the user specifies a
>>> non-zero bandwidth, throttling applies. Currently it seems VFIO always sends
>>> a limited amount of data per iteration - a little bit more than 1MB-ish?
>> Correct.
>>
>>> I'm mostly only looking at VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
>>> which seems fine, because in qemu_savevm_state_iterate() there's one more
>>> migration_rate_exceeded() check anyway:
>>>
>>>     if (migration_rate_exceeded(f)) {
>>>         return 0;
>>>     }
>>>
>>> However, it may be good for VFIO to also use migration_rate_exceeded()
>>> inside the iterator (until we have some better way to do throttling..).
>> You mean that you want to add a migration_rate_exceeded() check inside
>> vfio_save_iterate() before we send any data?
>> E.g.:
>>
>> static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> {
>>     VFIODevice *vbasedev = opaque;
>>     VFIOMigration *migration = vbasedev->migration;
>>     ssize_t data_size;
>>
>> +   if (migration_rate_exceeded(f)) {
>> +       /* return early */
>> +   }
>>
>>     data_size = vfio_save_block(f, migration);
>>     if (data_size < 0) {
>>         return data_size;
>>     }
>>
>> Or check migration_rate_exceeded() not only here, but also during sending of
>> the 1MB buffer (inside vfio_save_block())?
>>
>> Either way, I think this would only complicate things for no real benefit:
>> even if we do exceed BW, it would only be by 1MB max, and IMHO this
>> shouldn't make any real difference.
> So far, with the ~1MB guard, it indeed shouldn't matter a great deal. But
> please keep that in mind in case VFIO can iterate much more than that in
> the future, because the current 1MB limitation came from nowhere, afaict..
> so I won't be surprised either if someday someone thinks it's a good idea
> to send more than that while still ignoring migration_rate_exceeded().
>
> If you think it's worthwhile, maybe we can add a comment in
> vfio_save_iterate() explaining why migration_rate_exceeded() is not
> necessary so far, so that anyone in the future will be aware of its
> existence.
Of course, I will add it.
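
For example, it could look roughly like this (the wording is just a
suggestion; vfio_save_block() and VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE are the
existing names in hw/vfio/migration.c):

    /*
     * We intentionally don't check migration_rate_exceeded() here: each
     * call sends at most one VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (~1MB)
     * buffer via vfio_save_block(), and qemu_savevm_state_iterate()
     * already checks the rate limit between iterations, so any overshoot
     * is bounded by a single buffer. Revisit this if vfio_save_iterate()
     * ever starts sending significantly more data per call.
     */
    data_size = vfio_save_block(f, migration);
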
Thanks.