From: "Cédric Le Goater" <clg@redhat.com>
To: Avihai Horon <avihaih@nvidia.com>, qemu-devel@nongnu.org
Cc: Alex Williamson <alex.williamson@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Joao Martins <joao.m.martins@oracle.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Maor Gottlieb <maorg@nvidia.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Tarun Gupta <targupta@nvidia.com>
Subject: Re: [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration
Date: Fri, 21 Jul 2023 13:48:31 +0200
Message-ID: <99a770f6-7f45-d7df-9555-bcc854914472@redhat.com>
In-Reply-To: <20230716081541.27900-6-avihaih@nvidia.com>

On 7/16/23 10:15, Avihai Horon wrote:
> VFIO migration uAPI defines an optional intermediate P2P quiescent
> state. While in the P2P quiescent state, P2P DMA transactions cannot be
> initiated by the device, but the device can respond to incoming ones.
> Additionally, all outstanding P2P transactions are guaranteed to have
> been completed by the time the device enters this state.
> 
> The purpose of this state is to support migration of multiple devices
> that are doing P2P transactions between themselves.
> 
> Add support for P2P migration by transitioning all the devices to the
> P2P quiescent state before stopping or starting the devices. Use the new
> VMChangeStateHandler pre_change_cb to achieve that behavior.
> 
> This will allow migration of multiple VFIO devices if all of them
> support P2P migration.

LGTM, one small comment below
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   docs/devel/vfio-migration.rst | 93 +++++++++++++++++++++--------------
>   hw/vfio/common.c              |  6 ++-
>   hw/vfio/migration.c           | 58 +++++++++++++++++-----
>   hw/vfio/trace-events          |  2 +-
>   4 files changed, 107 insertions(+), 52 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index b433cb5bb2..b9c57ba651 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination
>   before stopping the source VM. Enabling this migration capability will
>   guarantee that and thus, can potentially reduce downtime even further.
>   
> -Note that currently VFIO migration is supported only for a single device. This
> -is due to VFIO migration's lack of P2P support. However, P2P support is planned
> -to be added later on.
> +To support migration of multiple devices that are doing P2P transactions
> +between themselves, VFIO migration uAPI defines an intermediate P2P quiescent
> +state. While in the P2P quiescent state, P2P DMA transactions cannot be
> +initiated by the device, but the device can respond to incoming ones.
> +Additionally, all outstanding P2P transactions are guaranteed to have been
> +completed by the time the device enters this state.
> +
> +All the devices that support P2P migration are first transitioned to the P2P
> +quiescent state and only then are they stopped or started. This makes migration
> +safe P2P-wise, since starting and stopping the devices is not done atomically
> +for all the devices together.
> +
> +Thus, multiple VFIO devices migration is allowed only if all the devices
> +support P2P migration. Single VFIO device migration is allowed regardless of
> +P2P migration support.
>   
>   A detailed description of the UAPI for VFIO device migration can be found in
>   the comment for the ``vfio_device_mig_state`` structure in the header file
> @@ -132,54 +144,63 @@ will be blocked.
>   Flow of state changes during Live migration
>   ===========================================
>   
> -Below is the flow of state change during live migration.
> +Below is the state change flow during live migration for a VFIO device that
> +supports both precopy and P2P migration. The flow for devices that don't
> +support it is similar, except that the relevant states for precopy and P2P are
> +skipped.
>   The values in the parentheses represent the VM state, the migration state, and
>   the VFIO device state, respectively.
> -The text in the square brackets represents the flow if the VFIO device supports
> -pre-copy.
>   
>   Live migration save path
>   ------------------------
>   
>   ::
>   
> -                        QEMU normal running state
> -                        (RUNNING, _NONE, _RUNNING)
> -                                  |
> +                           QEMU normal running state
> +                           (RUNNING, _NONE, _RUNNING)
> +                                      |
>                        migrate_init spawns migration_thread
> -                Migration thread then calls each device's .save_setup()
> -                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
> -                                  |
> -                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> -      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> -          If total pending_bytes >= threshold_size, call .save_live_iterate()
> -                  [Data of VFIO device for pre-copy phase is copied]
> -        Iterate till total pending bytes converge and are less than threshold
> -                                  |
> -  On migration completion, vCPU stops and calls .save_live_complete_precopy for
> -  each active device. The VFIO device is then transitioned into _STOP_COPY state
> -                  (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
> -                                  |
> -     For the VFIO device, iterate in .save_live_complete_precopy until
> -                         pending data is 0
> -                   (FINISH_MIGRATE, _DEVICE, _STOP)
> -                                  |
> -                 (FINISH_MIGRATE, _COMPLETED, _STOP)
> -             Migraton thread schedules cleanup bottom half and exits
> +            Migration thread then calls each device's .save_setup()
> +                          (RUNNING, _SETUP, _PRE_COPY)
> +                                      |
> +                         (RUNNING, _ACTIVE, _PRE_COPY)
> +  If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> +       If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                Data of VFIO device for pre-copy phase is copied
> +      Iterate till total pending bytes converge and are less than threshold
> +                                      |
> +       On migration completion, the vCPUs and the VFIO device are stopped
> +              The VFIO device is first put in P2P quiescent state
> +                    (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
> +                                      |
> +                Then the VFIO device is put in _STOP_COPY state
> +                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
> +         .save_live_complete_precopy() is called for each active device
> +      For the VFIO device, iterate in .save_live_complete_precopy() until
> +                               pending data is 0
> +                                      |
> +                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
> +            Migration thread schedules cleanup bottom half and exits
> +                                      |
> +                           .save_cleanup() is called
> +                        (POSTMIGRATE, _COMPLETED, _STOP)
>   
>   Live migration resume path
>   --------------------------
>   
>   ::
>   
> -              Incoming migration calls .load_setup for each device
> -                       (RESTORE_VM, _ACTIVE, _STOP)
> -                                 |
> -       For each device, .load_state is called for that device section data
> -                       (RESTORE_VM, _ACTIVE, _RESUMING)
> -                                 |
> -    At the end, .load_cleanup is called for each device and vCPUs are started
> -                       (RUNNING, _NONE, _RUNNING)
> +             Incoming migration calls .load_setup() for each device
> +                          (RESTORE_VM, _ACTIVE, _STOP)
> +                                      |
> +     For each device, .load_state() is called for that device section data
> +                        (RESTORE_VM, _ACTIVE, _RESUMING)
> +                                      |
> +  At the end, .load_cleanup() is called for each device and vCPUs are started
> +              The VFIO device is first put in P2P quiescent state
> +                        (RUNNING, _ACTIVE, _RUNNING_P2P)
> +                                      |
> +                           (RUNNING, _NONE, _RUNNING)
>   
>   Postcopy
>   ========
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 16cf79a76c..7c3d636025 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -441,14 +441,16 @@ bool vfio_device_state_is_running(VFIODevice *vbasedev)
>   {
>       VFIOMigration *migration = vbasedev->migration;
>   
> -    return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
> +    return migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +           migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P;
>   }
>   
>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
>   {
>       VFIOMigration *migration = vbasedev->migration;
>   
> -    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> +           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
>   }
>   
>   static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 48f9c23cbe..02ee99c938 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -71,8 +71,13 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>           return "STOP_COPY";
>       case VFIO_DEVICE_STATE_RESUMING:
>           return "RESUMING";
> +    case VFIO_DEVICE_STATE_RUNNING_P2P:
> +        return "RUNNING_P2P";
>       case VFIO_DEVICE_STATE_PRE_COPY:
>           return "PRE_COPY";
> +    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
> +        return "PRE_COPY_P2P";
> +
>       default:
>           return "UNKNOWN STATE";
>       }
> @@ -652,20 +657,31 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>   
>   /* ---------------------------------------------------------------------- */
>   
> -static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> +static void vfio_vmstate_change(VFIODevice *vbasedev, bool running,
> +                                RunState state, bool pre_state_change)

Instead of mixing both vmstate change handlers in one routine and switching
on a boolean, I would prefer a separate pre-change handler to be introduced,
so each callback stays small and self-contained.
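
Something along these lines is what I have in mind. It is only an untested
sketch to illustrate the split: the name vfio_vmstate_change_prepare() is a
placeholder and the error handling is reduced to a comment.

static void vfio_vmstate_change_prepare(void *opaque, bool running,
                                        RunState state)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;
    enum vfio_device_mig_state new_state;

    /* Devices without P2P support have no intermediate quiescent state */
    if (!(migration->mig_flags & VFIO_MIGRATION_P2P)) {
        return;
    }

    new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
                    VFIO_DEVICE_STATE_PRE_COPY_P2P :
                    VFIO_DEVICE_STATE_RUNNING_P2P;

    /*
     * Same policy as vfio_vmstate_change(): use ERROR as the recover
     * state so a failed transition resets the device. A real patch
     * would propagate the error the way vfio_vmstate_change() does.
     */
    vfio_migration_set_state(vbasedev, new_state, VFIO_DEVICE_STATE_ERROR);
}

vfio_vmstate_change() would then keep its current signature, and the
registration in vfio_migration_init() would simply pass both handlers:

    migration->vm_state = qdev_add_vm_change_state_handler_full(
        vbasedev->dev, vfio_vmstate_change, vfio_vmstate_change_prepare,
        vbasedev);

That also leaves the vfio_vmstate_change trace event unchanged; the prepare
handler could get its own trace event if needed.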

Thanks,

C.


>   {
> -    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       enum vfio_device_mig_state new_state;
>       int ret;
>   
> -    if (running) {
> -        new_state = VFIO_DEVICE_STATE_RUNNING;
> +    if (pre_state_change && !(migration->mig_flags & VFIO_MIGRATION_P2P)) {
> +        return;
> +    }
> +
> +    if (pre_state_change) {
> +        new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
> +                        VFIO_DEVICE_STATE_PRE_COPY_P2P :
> +                        VFIO_DEVICE_STATE_RUNNING_P2P;
>       } else {
> -        new_state =
> -            (vfio_device_state_is_precopy(vbasedev) &&
> -             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> -                VFIO_DEVICE_STATE_STOP_COPY :
> -                VFIO_DEVICE_STATE_STOP;
> +        if (running) {
> +            new_state = VFIO_DEVICE_STATE_RUNNING;
> +        } else {
> +            new_state = (vfio_device_state_is_precopy(vbasedev) &&
> +                         (state == RUN_STATE_FINISH_MIGRATE ||
> +                          state == RUN_STATE_PAUSED)) ?
> +                            VFIO_DEVICE_STATE_STOP_COPY :
> +                            VFIO_DEVICE_STATE_STOP;
> +        }
>       }
>   
>       /*
> @@ -685,7 +701,23 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>       }
>   
>       trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> -                              mig_state_to_str(new_state));
> +                              pre_state_change, mig_state_to_str(new_state));
> +}
> +
> +static void vfio_vmstate_pre_change_handler(void *opaque, bool running,
> +                                            RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_vmstate_change(vbasedev, running, state, true);
> +}
> +
> +static void vfio_vmstate_change_handler(void *opaque, bool running,
> +                                        RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_vmstate_change(vbasedev, running, state, false);
>   }
>   
>   static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> @@ -798,9 +830,9 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>       register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>                            vbasedev);
>   
> -    migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
> -                                                           vfio_vmstate_change,
> -                                                           vbasedev);
> +    migration->vm_state = qdev_add_vm_change_state_handler_full(
> +        vbasedev->dev, vfio_vmstate_change_handler,
> +        vfio_vmstate_pre_change_handler, vbasedev);
>       migration->migration_state.notify = vfio_migration_state_notifier;
>       add_migration_state_change_notifier(&migration->migration_state);
>   
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ee7509e68e..efafe613f2 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -166,4 +166,4 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>   vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>   vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> -vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> +vfio_vmstate_change(const char *name, int running, const char *reason, bool pre_state_change, const char *dev_state) " (%s) running %d reason %s pre state change %d device state %s"



