From: Cornelia Huck <cohuck@redhat.com>
To: Yishai Hadas <yishaih@nvidia.com>,
alex.williamson@redhat.com, bhelgaas@google.com, jgg@nvidia.com,
saeedm@nvidia.com
Cc: linux-pci@vger.kernel.org, kvm@vger.kernel.org,
netdev@vger.kernel.org, kuba@kernel.org, leonro@nvidia.com,
kwankhede@nvidia.com, mgurtovoy@nvidia.com, yishaih@nvidia.com,
maorg@nvidia.com, ashok.raj@intel.com, kevin.tian@intel.com,
shameerali.kolothum.thodi@huawei.com
Subject: Re: [PATCH V9 mlx5-next 09/15] vfio: Define device migration protocol v2
Date: Wed, 02 Mar 2022 12:19:20 +0100 [thread overview]
Message-ID: <87tucgiouf.fsf@redhat.com> (raw)
In-Reply-To: <20220224142024.147653-10-yishaih@nvidia.com>
On Thu, Feb 24 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> Replace the existing region based migration protocol with an ioctl based
> protocol. The two protocols have the same general semantic behaviors, but
> the way the data is transported is changed.
>
> This is the STOP_COPY portion of the new protocol, it defines the 5 states
> for basic stop and copy migration and the protocol to move the migration
> data in/out of the kernel.
>
> Compared to the clarification of the v1 protocol Alex proposed:
>
> https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen
>
> This has a few deliberate functional differences:
>
> - ERROR arcs allow the device function to remain unchanged.
>
> - The protocol is not required to return to the original state on
> transition failure. Instead userspace can execute an unwind back to
> the original state, reset, or do something else without needing kernel
> support. This simplifies the kernel design and should userspace choose
> a policy like always reset, avoids doing useless work in the kernel
> on error handling paths.
>
> - PRE_COPY is made optional, userspace must discover it before using it.
> This reflects the fact that the majority of drivers we are aware of
> right now will not implement PRE_COPY.
>
> - segmentation is not part of the data stream protocol, the receiver
> does not have to reproduce the framing boundaries.
>
> The hybrid FSM for the device_state is described as a Mealy machine by
> documenting each of the arcs the driver is required to implement. Defining
> the remaining set of old/new device_state transitions as 'combination
> transitions' which are naturally defined as taking multiple FSM arcs along
> the shortest path within the FSM's digraph allows a complete matrix of
> transitions.
>
> A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is
> defined to replace writing to the device_state field in the region. This
> allows returning a brand new FD whenever the requested transition opens
> a data transfer session.
>
> The VFIO core code implements the new feature and provides a helper
> function to the driver. Using the helper the driver only has to
> implement 6 of the FSM arcs and the other combination transitions are
> elaborated consistently from those arcs.
>
> A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
> report the capability for migration and indicate which set of states and
> arcs are supported by the device. The FSM provides a lot of flexibility to
> make backwards compatible extensions but the VFIO_DEVICE_FEATURE also
> allows for future breaking extensions for scenarios that cannot support
> even the basic STOP_COPY requirements.
>
> The VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE with the GET option (i.e.
> VFIO_DEVICE_FEATURE_GET) can be used to read the current migration state
> of the VFIO device.
>
> Data transfer sessions are now carried over a file descriptor, instead of
> the region. The FD functions for the lifetime of the data transfer
> session. read() and write() transfer the data with normal Linux stream FD
> semantics. This design allows future expansion to support poll(),
> io_uring, and other performance optimizations.
>
> The complicated mmap mode for data transfer is discarded as current qemu
> doesn't take meaningful advantage of it, and the new qemu implementation
> avoids substantially all the performance penalty of using a read() on the
> region.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
> drivers/vfio/vfio.c | 199 ++++++++++++++++++++++++++++++++++++++
> include/linux/vfio.h | 20 ++++
> include/uapi/linux/vfio.h | 174 ++++++++++++++++++++++++++++++---
> 3 files changed, 380 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 71763e2ac561..b37ab27b511f 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1557,6 +1557,197 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
> return 0;
> }
>
> +/*
> + * vfio_mig_get_next_state - Compute the next step in the FSM
> + * @cur_fsm - The current state the device is in
> + * @new_fsm - The target state to reach
> + * @next_fsm - Pointer to the next step to get to new_fsm
> + *
> + * Return 0 upon success, otherwise -errno
> + * Upon success the next step in the state progression between cur_fsm and
> + * new_fsm will be set in next_fsm.
What about non-success? Can the caller make any assumption about
next_fsm in that case? Because...
> + *
> + * This breaks down requests for combination transitions into smaller steps and
> + * returns the next step to get to new_fsm. The function may need to be called
> + * multiple times before reaching new_fsm.
> + *
> + */
> +int vfio_mig_get_next_state(struct vfio_device *device,
> + enum vfio_device_mig_state cur_fsm,
> + enum vfio_device_mig_state new_fsm,
> + enum vfio_device_mig_state *next_fsm)
> +{
> + enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
> + /*
> + * The coding in this table requires the driver to implement 6
> + * FSM arcs:
> + * RESUMING -> STOP
> + * RUNNING -> STOP
> + * STOP -> RESUMING
> + * STOP -> RUNNING
> + * STOP -> STOP_COPY
> + * STOP_COPY -> STOP
> + *
> + * The coding will step through multiple states for these combination
> + * transitions:
> + * RESUMING -> STOP -> RUNNING
> + * RESUMING -> STOP -> STOP_COPY
> + * RUNNING -> STOP -> RESUMING
> + * RUNNING -> STOP -> STOP_COPY
> + * STOP_COPY -> STOP -> RESUMING
> + * STOP_COPY -> STOP -> RUNNING
> + */
> + static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
> + [VFIO_DEVICE_STATE_STOP] = {
> + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> + },
> + [VFIO_DEVICE_STATE_RUNNING] = {
> + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> + },
> + [VFIO_DEVICE_STATE_STOP_COPY] = {
> + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> + },
> + [VFIO_DEVICE_STATE_RESUMING] = {
> + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> + },
> + [VFIO_DEVICE_STATE_ERROR] = {
> + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
> + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
> + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
> + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
> + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> + },
> + };
> +
> + if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table)))
> + return -EINVAL;
> +
> + if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> + return -EINVAL;
> +
> + *next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> + return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL;
...next_fsm will contain STATE_ERROR if we try to transition from or to
STATE_ERROR, but it remains unchanged if the input states are out of
range, yet in both cases the return value is -EINVAL. Looking further, ...
> + * any -> ERROR
> + * ERROR cannot be specified as a device state, however any transition request
> + * can be failed with an errno return and may then move the device_state into
> + * ERROR. In this case the device was unable to execute the requested arc and
> + * was also unable to restore the device to any valid device_state.
> + * To recover from ERROR VFIO_DEVICE_RESET must be used to return the
> + * device_state back to RUNNING.
...this seems to indicate that not moving into STATE_ERROR is an
option anyway. Do we need any extra guidance in the description for
vfio_mig_get_next_state()?
next prev parent reply other threads:[~2022-03-02 11:20 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-24 14:20 [PATCH V9 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 07/15] net/mlx5: Add migration commands definitions Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 08/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
2022-03-02 10:00 ` Cornelia Huck
2022-03-02 14:23 ` Jason Gunthorpe
2022-02-24 14:20 ` [PATCH V9 mlx5-next 09/15] vfio: Define device migration protocol v2 Yishai Hadas
2022-03-02 11:19 ` Cornelia Huck [this message]
2022-03-02 14:27 ` Jason Gunthorpe
2022-03-02 15:34 ` Alex Williamson
2022-03-02 16:07 ` Cornelia Huck
2022-03-02 16:34 ` Alex Williamson
2022-03-02 16:56 ` Cornelia Huck
2022-03-02 16:34 ` Jason Gunthorpe
2022-03-02 16:57 ` Cornelia Huck
2022-02-24 14:20 ` [PATCH V9 mlx5-next 10/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
2022-02-24 15:21 ` Cornelia Huck
2022-02-24 15:30 ` Alex Williamson
2022-02-24 16:13 ` Jason Gunthorpe
2022-02-24 16:35 ` Alex Williamson
2022-02-24 16:53 ` Cornelia Huck
2022-02-24 20:46 ` Alex Williamson
2022-03-02 11:51 ` Cornelia Huck
2022-02-24 14:20 ` [PATCH V9 mlx5-next 11/15] vfio: Remove migration protocol v1 documentation Yishai Hadas
2022-03-02 10:09 ` Cornelia Huck
2022-03-02 10:09 ` Cornelia Huck
2022-02-24 14:20 ` [PATCH V9 mlx5-next 12/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 13/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 14/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
2022-02-24 14:20 ` [PATCH V9 mlx5-next 15/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87tucgiouf.fsf@redhat.com \
--to=cohuck@redhat.com \
--cc=alex.williamson@redhat.com \
--cc=ashok.raj@intel.com \
--cc=bhelgaas@google.com \
--cc=jgg@nvidia.com \
--cc=kevin.tian@intel.com \
--cc=kuba@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=kwankhede@nvidia.com \
--cc=leonro@nvidia.com \
--cc=linux-pci@vger.kernel.org \
--cc=maorg@nvidia.com \
--cc=mgurtovoy@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=saeedm@nvidia.com \
--cc=shameerali.kolothum.thodi@huawei.com \
--cc=yishaih@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.