From: "Cédric Le Goater" <clg@redhat.com>
To: Avihai Horon <avihaih@nvidia.com>, qemu-devel@nongnu.org
Cc: Alex Williamson <alex.williamson@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Joao Martins <joao.m.martins@oracle.com>,
Yishai Hadas <yishaih@nvidia.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Maor Gottlieb <maorg@nvidia.com>,
Kirti Wankhede <kwankhede@nvidia.com>,
Tarun Gupta <targupta@nvidia.com>
Subject: Re: [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration
Date: Fri, 21 Jul 2023 13:48:31 +0200 [thread overview]
Message-ID: <99a770f6-7f45-d7df-9555-bcc854914472@redhat.com> (raw)
In-Reply-To: <20230716081541.27900-6-avihaih@nvidia.com>
On 7/16/23 10:15, Avihai Horon wrote:
> VFIO migration uAPI defines an optional intermediate P2P quiescent
> state. While in the P2P quiescent state, P2P DMA transactions cannot be
> initiated by the device, but the device can respond to incoming ones.
> Additionally, all outstanding P2P transactions are guaranteed to have
> been completed by the time the device enters this state.
>
> The purpose of this state is to support migration of multiple devices
> that are doing P2P transactions between themselves.
>
> Add support for P2P migration by transitioning all the devices to the
> P2P quiescent state before stopping or starting the devices. Use the new
> VMChangeStateHandler pre_change_cb to achieve that behavior.
>
> This will allow migration of multiple VFIO devices if all of them
> support P2P migration.
LGTM, one small comment below
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> docs/devel/vfio-migration.rst | 93 +++++++++++++++++++++--------------
> hw/vfio/common.c | 6 ++-
> hw/vfio/migration.c | 58 +++++++++++++++++-----
> hw/vfio/trace-events | 2 +-
> 4 files changed, 107 insertions(+), 52 deletions(-)
>
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index b433cb5bb2..b9c57ba651 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination
> before stopping the source VM. Enabling this migration capability will
> guarantee that and thus, can potentially reduce downtime even further.
>
> -Note that currently VFIO migration is supported only for a single device. This
> -is due to VFIO migration's lack of P2P support. However, P2P support is planned
> -to be added later on.
> +To support migration of multiple devices that are doing P2P transactions
> +between themselves, VFIO migration uAPI defines an intermediate P2P quiescent
> +state. While in the P2P quiescent state, P2P DMA transactions cannot be
> +initiated by the device, but the device can respond to incoming ones.
> +Additionally, all outstanding P2P transactions are guaranteed to have been
> +completed by the time the device enters this state.
> +
> +All the devices that support P2P migration are first transitioned to the P2P
> +quiescent state and only then are they stopped or started. This makes migration
> +safe P2P-wise, since starting and stopping the devices is not done atomically
> +for all the devices together.
> +
> +Thus, multiple VFIO devices migration is allowed only if all the devices
> +support P2P migration. Single VFIO device migration is allowed regardless of
> +P2P migration support.
>
> A detailed description of the UAPI for VFIO device migration can be found in
> the comment for the ``vfio_device_mig_state`` structure in the header file
> @@ -132,54 +144,63 @@ will be blocked.
> Flow of state changes during Live migration
> ===========================================
>
> -Below is the flow of state change during live migration.
> +Below is the state change flow during live migration for a VFIO device that
> +supports both precopy and P2P migration. The flow for devices that don't
> +support it is similar, except that the relevant states for precopy and P2P are
> +skipped.
> The values in the parentheses represent the VM state, the migration state, and
> the VFIO device state, respectively.
> -The text in the square brackets represents the flow if the VFIO device supports
> -pre-copy.
>
> Live migration save path
> ------------------------
>
> ::
>
> - QEMU normal running state
> - (RUNNING, _NONE, _RUNNING)
> - |
> + QEMU normal running state
> + (RUNNING, _NONE, _RUNNING)
> + |
> migrate_init spawns migration_thread
> - Migration thread then calls each device's .save_setup()
> - (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
> - |
> - (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> - If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> - If total pending_bytes >= threshold_size, call .save_live_iterate()
> - [Data of VFIO device for pre-copy phase is copied]
> - Iterate till total pending bytes converge and are less than threshold
> - |
> - On migration completion, vCPU stops and calls .save_live_complete_precopy for
> - each active device. The VFIO device is then transitioned into _STOP_COPY state
> - (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
> - |
> - For the VFIO device, iterate in .save_live_complete_precopy until
> - pending data is 0
> - (FINISH_MIGRATE, _DEVICE, _STOP)
> - |
> - (FINISH_MIGRATE, _COMPLETED, _STOP)
> - Migraton thread schedules cleanup bottom half and exits
> + Migration thread then calls each device's .save_setup()
> + (RUNNING, _SETUP, _PRE_COPY)
> + |
> + (RUNNING, _ACTIVE, _PRE_COPY)
> + If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> + If total pending_bytes >= threshold_size, call .save_live_iterate()
> + Data of VFIO device for pre-copy phase is copied
> + Iterate till total pending bytes converge and are less than threshold
> + |
> + On migration completion, the vCPUs and the VFIO device are stopped
> + The VFIO device is first put in P2P quiescent state
> + (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
> + |
> + Then the VFIO device is put in _STOP_COPY state
> + (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
> + .save_live_complete_precopy() is called for each active device
> + For the VFIO device, iterate in .save_live_complete_precopy() until
> + pending data is 0
> + |
> + (POSTMIGRATE, _COMPLETED, _STOP_COPY)
> + Migraton thread schedules cleanup bottom half and exits
> + |
> + .save_cleanup() is called
> + (POSTMIGRATE, _COMPLETED, _STOP)
>
> Live migration resume path
> --------------------------
>
> ::
>
> - Incoming migration calls .load_setup for each device
> - (RESTORE_VM, _ACTIVE, _STOP)
> - |
> - For each device, .load_state is called for that device section data
> - (RESTORE_VM, _ACTIVE, _RESUMING)
> - |
> - At the end, .load_cleanup is called for each device and vCPUs are started
> - (RUNNING, _NONE, _RUNNING)
> + Incoming migration calls .load_setup() for each device
> + (RESTORE_VM, _ACTIVE, _STOP)
> + |
> + For each device, .load_state() is called for that device section data
> + (RESTORE_VM, _ACTIVE, _RESUMING)
> + |
> + At the end, .load_cleanup() is called for each device and vCPUs are started
> + The VFIO device is first put in P2P quiescent state
> + (RUNNING, _ACTIVE, _RUNNING_P2P)
> + |
> + (RUNNING, _NONE, _RUNNING)
>
> Postcopy
> ========
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 16cf79a76c..7c3d636025 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -441,14 +441,16 @@ bool vfio_device_state_is_running(VFIODevice *vbasedev)
> {
> VFIOMigration *migration = vbasedev->migration;
>
> - return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
> + return migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P;
> }
>
> bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
> {
> VFIOMigration *migration = vbasedev->migration;
>
> - return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> + return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
> }
>
> static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 48f9c23cbe..02ee99c938 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -71,8 +71,13 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
> return "STOP_COPY";
> case VFIO_DEVICE_STATE_RESUMING:
> return "RESUMING";
> + case VFIO_DEVICE_STATE_RUNNING_P2P:
> + return "RUNNING_P2P";
> case VFIO_DEVICE_STATE_PRE_COPY:
> return "PRE_COPY";
> + case VFIO_DEVICE_STATE_PRE_COPY_P2P:
> + return "PRE_COPY_P2P";
> +
> default:
> return "UNKNOWN STATE";
> }
> @@ -652,20 +657,31 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>
> /* ---------------------------------------------------------------------- */
>
> -static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> +static void vfio_vmstate_change(VFIODevice *vbasedev, bool running,
> + RunState state, bool pre_state_change)
Instead of mixing both vmstate change handlers in one routine, I would prefer
a new pre_state handler to be introduced.
Thanks,
C.
> {
> - VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> enum vfio_device_mig_state new_state;
> int ret;
>
> - if (running) {
> - new_state = VFIO_DEVICE_STATE_RUNNING;
> + if (pre_state_change && !(migration->mig_flags & VFIO_MIGRATION_P2P)) {
> + return;
> + }
> +
> + if (pre_state_change) {
> + new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
> + VFIO_DEVICE_STATE_PRE_COPY_P2P :
> + VFIO_DEVICE_STATE_RUNNING_P2P;
> } else {
> - new_state =
> - (vfio_device_state_is_precopy(vbasedev) &&
> - (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> - VFIO_DEVICE_STATE_STOP_COPY :
> - VFIO_DEVICE_STATE_STOP;
> + if (running) {
> + new_state = VFIO_DEVICE_STATE_RUNNING;
> + } else {
> + new_state = (vfio_device_state_is_precopy(vbasedev) &&
> + (state == RUN_STATE_FINISH_MIGRATE ||
> + state == RUN_STATE_PAUSED)) ?
> + VFIO_DEVICE_STATE_STOP_COPY :
> + VFIO_DEVICE_STATE_STOP;
> + }
> }
>
> /*
> @@ -685,7 +701,23 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> }
>
> trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> - mig_state_to_str(new_state));
> + pre_state_change, mig_state_to_str(new_state));
> +}
> +
> +static void vfio_vmstate_pre_change_handler(void *opaque, bool running,
> + RunState state)
> +{
> + VFIODevice *vbasedev = opaque;
> +
> + vfio_vmstate_change(vbasedev, running, state, true);
> +}
> +
> +static void vfio_vmstate_change_handler(void *opaque, bool running,
> + RunState state)
> +{
> + VFIODevice *vbasedev = opaque;
> +
> + vfio_vmstate_change(vbasedev, running, state, false);
> }
>
> static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> @@ -798,9 +830,9 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> vbasedev);
>
> - migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
> - vfio_vmstate_change,
> - vbasedev);
> + migration->vm_state = qdev_add_vm_change_state_handler_full(
> + vbasedev->dev, vfio_vmstate_change_handler,
> + vfio_vmstate_pre_change_handler, vbasedev);
> migration->migration_state.notify = vfio_migration_state_notifier;
> add_migration_state_change_notifier(&migration->migration_state);
>
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ee7509e68e..efafe613f2 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -166,4 +166,4 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
> vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> -vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> +vfio_vmstate_change(const char *name, int running, const char *reason, bool pre_state_change, const char *dev_state) " (%s) running %d reason %s pre state change %d device state %s"
next prev parent reply other threads:[~2023-07-21 11:49 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-16 8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
2023-07-16 8:15 ` [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup() Avihai Horon
2023-07-30 16:25 ` Cédric Le Goater
2023-07-31 6:32 ` Avihai Horon
2023-07-31 7:05 ` Cédric Le Goater
2023-07-16 8:15 ` [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback Avihai Horon
2023-07-27 16:23 ` Cédric Le Goater
2023-07-30 6:31 ` Avihai Horon
2023-07-30 16:22 ` Cédric Le Goater
2023-07-16 8:15 ` [PATCH for-8.2 3/6] qdev: Add qdev_add_vm_change_state_handler_full() Avihai Horon
2023-07-16 8:15 ` [PATCH for-8.2 4/6] vfio/migration: Refactor PRE_COPY and RUNNING state checks Avihai Horon
2023-07-21 11:33 ` Cédric Le Goater
2023-07-16 8:15 ` [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
2023-07-21 11:48 ` Cédric Le Goater [this message]
2023-07-23 10:37 ` Avihai Horon
2023-07-16 8:15 ` [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices Avihai Horon
2023-07-21 12:09 ` Cédric Le Goater
2023-07-23 10:42 ` Avihai Horon
2023-07-18 15:46 ` [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Jason Gunthorpe
2023-07-30 6:50 ` Avihai Horon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=99a770f6-7f45-d7df-9555-bcc854914472@redhat.com \
--to=clg@redhat.com \
--cc=alex.williamson@redhat.com \
--cc=avihaih@nvidia.com \
--cc=jgg@nvidia.com \
--cc=joao.m.martins@oracle.com \
--cc=kwankhede@nvidia.com \
--cc=maorg@nvidia.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=targupta@nvidia.com \
--cc=yishaih@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).