From: Avihai Horon <avihaih@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Cédric Le Goater" <clg@redhat.com>,
	"Juan Quintela" <quintela@redhat.com>,
	"Leonardo Bras" <leobras@redhat.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Thomas Huth" <thuth@redhat.com>,
	"Laurent Vivier" <lvivier@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Yishai Hadas" <yishaih@nvidia.com>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	"Maor Gottlieb" <maorg@nvidia.com>,
	"Kirti Wankhede" <kwankhede@nvidia.com>,
	"Tarun Gupta" <targupta@nvidia.com>,
	"Joao Martins" <joao.m.martins@oracle.com>
Subject: Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
Date: Sun, 7 May 2023 15:54:00 +0300	[thread overview]
Message-ID: <95db8c9d-c649-09b7-843e-215756820bb1@nvidia.com> (raw)
In-Reply-To: <ZFPUOeuICJ1gehNk@x1n>
On 04/05/2023 18:50, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote:
>> On 03/05/2023 18:49, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote:
>>>> On 03/05/2023 1:49, Peter Xu wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
>>>>>> Hello everyone,
>>>>> Hi, Avihai,
>>>>>
>>>>>> === Flow of operation ===
>>>>>>
>>>>>> To use precopy initial data, the capability must be enabled in the
>>>>>> source.
>>>>>>
>>>>>> As this capability must be supported also in the destination, a
>>>>>> handshake is performed during migration setup. The purpose of the
>>>>>> handshake is to notify the destination that precopy initial data is used
>>>>>> and to check if it's supported.
>>>>>>
>>>>>> The handshake is done in two levels. First, a general handshake is done
>>>>>> with the destination migration code to notify that precopy initial data
>>>>>> is used. Then, for each migration user in the source that supports
>>>>>> precopy initial data, a handshake is done with its counterpart in the
>>>>>> destination:
>>>>>> If both support it, precopy initial data will be used for them.
>>>>>> If source doesn't support it, precopy initial data will not be used for
>>>>>> them.
>>>>>> If source supports it and destination doesn't, migration will be failed.
>>>>>>
>>>>>> Assuming the handshake succeeded, migration starts to send precopy data
>>>>>> and as part of it also the initial precopy data. Initial precopy data is
>>>>>> just like any other precopy data and as such, migration code is not
>>>>>> aware of it. Therefore, it's the responsibility of the migration users
>>>>>> (such as VFIO devices) to notify their counterparts in the destination
>>>>>> that their initial precopy data has been sent (for example, VFIO
>>>>>> migration does it when its initial bytes reach zero).
>>>>>>
>>>>>> In the destination, migration code will query each migration user that
>>>>>> supports precopy initial data and check if its initial data has been
>>>>>> loaded. If initial data has been loaded by all of them, an ACK will be
>>>>>> sent to the source which will now be able to complete migration when
>>>>>> appropriate.
>>>>> I can understand why this is useful, what I'm not 100% sure is whether the
>>>>> complexity is needed.  The idea seems to be that src never switchover
>>>>> unless it receives a READY notification from dst.
>>>>>
>>>>> I'm imaging below simplified and more general workflow, not sure whether it
>>>>> could work for you:
>>>>>
>>>>>      - Introduce a new cap "switchover-ready", it means whether there'll be a
>>>>>        ready event sent from dst -> src for "being ready for switchover"
>>>>>
>>>>>      - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
>>>>>        handled on src showing that dest is ready for switchover. It'll be sent
>>>>>        only if dest is ready for the switchover
>>>>>
>>>>>      - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
>>>>>        special device like vfio that would like to participate in the decision
>>>>>        making, device can set its explicit_switchover_needed=1.  This field is
>>>>>        ignored if the new cap is not set.
>>>>>
>>>>>      - Dst qemu: when new cap set, remember how many special devices are there
>>>>>        requesting explicit switchover (count of SaveVMHandlers that has the
>>>>>        bit set during load setup) as switch_over_pending=N.
>>>>>
>>>>>      - Dst qemu: Once a device thinks its fine to switchover (probably in the
>>>>>        load_state() callback), it calls migration_notify_switchover_ready().
>>>>>        That decreases switch_over_pending and when it hits zero, one msg
>>>>>        MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
>>>>>
>>>>> Only until READY msg received on src could src switchover the precopy to
>>>>> dst.
>>>>>
>>>>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
>>>>> more msg (dst->src).
>>>>>
>>>>> This is based on the fact that right now we always set caps on both qemus
>>>>> so I suppose it already means either both have or don't have the feature
>>>>> (even if one has, not setting the cap means disabled on both).
>>>>>
>>>>> Would it work for this case and cleaner?
>>>> Hi Peter, thanks for the response!
>>>> Your approach is indeed much simpler, however I have a few concerns
>>>> regarding compatibility.
>>>>
>>>> You are saying that caps are always set both in src and dest.
>>>> But what happens if we set the cap only on one side?
>>>> Should we care about these scenarios?
>>> I think it's not needed for now, but I am aware that this is a problem.
>>> It's just that it is a more generic problem to me rather than very special
>>> in the current feature being proposed.  At least there're a few times
>>> Daniel showed concern on keeping this way and hoped we can have a better
>>> handshake in general with migration framework.
>>>
>>> I'd be perfectly fine if you want to approach this with a handshake
>>> methodology, but I hope if so we should provide a more generic handshake.
>>> So potentially that can make this new feature rely on the handshake work,
>>> and slower to get into shape.  Your call on how to address this, at least
>>> fine by me either way.
>> I'd really like this feature to get in, and I'm afraid making it dependent
>> on first implementing a general migration handshake may take a long time,
>> like you said.
>> What about keeping current approach but changing it such that the capability
>> will have to be set in both src and dest, to make it similar to other
>> capability usages?
>> I.e., we will remove the "general" handshake:
>>
>>      /* Enable precopy initial data generally in the migration */
>>      memset(&buf, 0, sizeof(buf));
>>      buf.general_enable = 1;
>>      qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>>                               (uint8_t *)&buf);
>>
>> but keep the per-device handshake, which is not a handshake for migration
>> capabilities, but a part of the protocol when the capability is set, like in
>> multifd, postcopy, etc.
>> This way we can advance with this feature while making the general migration
>> handshake an independent effort.
>> Will that work for you?
> Yes it's fine by me.
>
>> BTW, with your suggestion to add a notification mechanism to notify when
>> initial data is loaded in dest, I think we can drop these two SaveVMHandlers
>> handlers:
>>      /*
>>       * Checks if precopy initial data is active. If it's inactive,
>>       * initial_data_loaded check is skipped.
>>       */
>>      bool (*is_initial_data_active)(void *opaque);
>>      /* Checks if precopy initial data has been loaded in dest */
>>      bool (*initial_data_loaded)(void *opaque);
>>
>>> In my imagination a generic handshake should happen at the very start of
>>> migration and negociate feature bits between src/dst qemu, so they can
>>> reach a consensus on what to do next.
>>>
>>>> For example, if we set the cap only in src, then src will wait indefinitely
>>>> for dest to notify that switchover is ready.
>>>> Would you expect migration to fail instead of just keep running
>>>> indefinitely?
>>>> In current approach we only need to enable the cap in the source, so such
>>>> scenario can't happen.
>>>>
>>>> Let's look at some other scenario.
>>>> Src QEMU supports explicit-switchover for device X but *not* for device Y
>>>> (i.e., src QEMU is some older version of QEMU that supports
>>>> explicit-switchover for device X but not for Y).
>>>> Dest QEMU supports explicit-switchover for device X and device Y.
>>>> The capability is set in both src and dest.
>>>> In the destination we will have switchover_pending=2 because both X and Y
>>>> support explicit-switchover.
>>>> We do migration, but switchover_pending will never reach 0 because only X
>>>> supports it in the source, so the migration will run indefinitely.
>>>> The per-device handshake solves this by making device Y not use
>>>> explicit-switchover in this case.
>>> Hmm, right.  When I was replying obviously I thought that decision can be
>>> made sololy by the dest qemu, then I assumed it's fine.  Because IIUC in
>>> that case how many devices that supports switchover_pending on src qemu
>>> doesn't really matter but only dest.
>>>
>>> But I re-read the last patch and I do see that there's a new bit that will
>>> change the device protocol of migration:
>>>
>>>     if (migration->initial_data_active && !migration->precopy_init_size &&
>>>         !migration->initial_data_sent) {
>>>         qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>>>         migration->initial_data_sent = true;
>>>     } else {
>>>         qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>     }
>>>
>>> With this, I think what you said makes sense because then the src qemu
>>> matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it
>>> also needs to make sure dst qemu will recognize it.
>>>
>>> Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have?
>>> Can this decision be made on dest qemu only?
>>>
>>> To ask in another way, I saw that precopy_init_size is the fundation to
>>> decide whether to send this flag.  Then it's a matter of whether dest qemu
>>> is also aware of precopy_init_size, then it can already tell when it's
>>> ready to handle the switchover.
>> The destination is not aware of precopy_init_size, only the source knows it.
>> So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that
>> the initial data was sent.
> Then, can the src qemu notify instead?
>
> We can have similar notification mechanism on src qemu and if that can work
> we can further same the other MIG_RP_MSG.  The counter counts how many
> special src devices are there and we don't switchover only if all agree.
Do you mean the following:
We will have the pending counter and notification mechanism in source 
instead of destination.
The MIG_RP_MSG_INITIAL_DATA_LOADED_ACK message will be sent by each 
device in the destination that loaded its initial data.
Each such RP_MSG will decrease the pending counter in source.
When the pending counter in source reaches zero, source can complete 
migration.
Or did I misunderstand you?
> I know that even if !precopy_init_size on src, it doesn't mean that dest
> has already digested all the data in the send buffer.  However since we'll
> anyway make sure queued data landed before switch over happens (e.g., when
> we only have 1 migration channel data are sent in sequential manner), it
> means when switchover the dst qemu should have these loaded?
>
> Thanks,
>
> --
> Peter Xu
>
next prev parent reply	other threads:[~2023-05-07 13:01 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
2023-05-10  8:24   ` Juan Quintela
2023-05-17  9:17   ` Markus Armbruster
2023-05-17 10:16     ` Avihai Horon
2023-05-17 12:21       ` Markus Armbruster
2023-05-17 13:23         ` Avihai Horon
2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
2023-05-02 22:54   ` Peter Xu
2023-05-03 15:31     ` Avihai Horon
2023-05-10  8:40   ` Juan Quintela
2023-05-10 15:32     ` Avihai Horon
2023-05-14 16:42   ` Cédric Le Goater
2023-05-15  7:56     ` Avihai Horon
2023-05-01 14:01 ` [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality Avihai Horon
2023-05-02 22:56   ` Peter Xu
2023-05-03 15:36     ` Avihai Horon
2023-05-10  8:54   ` Juan Quintela
2023-05-10 15:52     ` Avihai Horon
2023-05-10 15:59       ` Juan Quintela
2023-05-01 14:01 ` [PATCH 4/8] migration: Enable precopy initial data capability Avihai Horon
2023-05-10  8:55   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 5/8] tests: Add migration precopy initial data capability test Avihai Horon
2023-05-10  8:55   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
2023-05-10  9:00   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 7/8] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
2023-05-01 14:01 ` [PATCH 8/8] vfio/migration: Add support for precopy initial data capability Avihai Horon
2023-05-02 22:49 ` [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Peter Xu
2023-05-03 15:22   ` Avihai Horon
2023-05-03 15:49     ` Peter Xu
2023-05-04 10:18       ` Avihai Horon
2023-05-04 15:50         ` Peter Xu
2023-05-07 12:54           ` Avihai Horon [this message]
2023-05-08  0:49             ` Peter Xu
2023-05-08 11:11               ` Avihai Horon
2023-05-10  9:12     ` Juan Quintela
2023-05-10 16:01       ` Avihai Horon
2023-05-10 16:41         ` Juan Quintela
2023-05-11 11:31           ` Avihai Horon
2023-05-11 13:09             ` Juan Quintela
2023-05-11 15:08               ` Avihai Horon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):
  git send-email \
    --in-reply-to=95db8c9d-c649-09b7-843e-215756820bb1@nvidia.com \
    --to=avihaih@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=armbru@redhat.com \
    --cc=clg@redhat.com \
    --cc=eblake@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=joao.m.martins@oracle.com \
    --cc=kwankhede@nvidia.com \
    --cc=leobras@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=maorg@nvidia.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=targupta@nvidia.com \
    --cc=thuth@redhat.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY
  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).