From: Jason Gunthorpe <jgg@nvidia.com>
To: Brett Creeley <brett.creeley@amd.com>
Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
	alex.williamson@redhat.com, yishaih@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
	shannon.nelson@amd.com, drivers@pensando.io,
	simon.horman@corigine.com
Subject: Re: [PATCH v8 vfio 6/7] vfio/pds: Add support for firmware recovery
Date: Fri, 14 Apr 2023 09:56:27 -0300
Message-ID: <ZDlNeyv/HLG4SPwB@nvidia.com>
In-Reply-To: <20230404190141.57762-7-brett.creeley@amd.com>

On Tue, Apr 04, 2023 at 12:01:40PM -0700, Brett Creeley wrote:
> It's possible that the device firmware crashes and is able to recover
> due to some configuration and/or other issue. If a live migration
> is in progress while the firmware crashes, the live migration will
> fail. However, the VF PCI device should still be functional post
> crash recovery and subsequent migrations should go through as
> expected.
> 
> When the pds_core device notices that firmware crashes it sends an
> event to all its client drivers. When the pds_vfio driver receives
> this event while migration is in progress it will request a deferred
> reset on the next migration state transition. This state transition
> will report failure as well as any subsequent state transition
> requests from the VMM/VFIO. Based on uapi/vfio.h the only way out of
> VFIO_DEVICE_STATE_ERROR is by issuing VFIO_DEVICE_RESET. Once this
> reset is done, the migration state will be reset to
> VFIO_DEVICE_STATE_RUNNING and migration can be performed.
> 
> If the event is received while no migration is in progress (i.e.
> the VM is in normal operating mode), then no actions are taken
> and the migration state remains VFIO_DEVICE_STATE_RUNNING.
> 
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
> ---
>  drivers/vfio/pci/pds/pci_drv.c  | 110 +++++++++++++++++++++++++++++++-
>  drivers/vfio/pci/pds/vfio_dev.c |  34 +++++++++-
>  drivers/vfio/pci/pds/vfio_dev.h |   6 +-
>  3 files changed, 146 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
> index b0781d9f4246..b155ac9b98ae 100644
> --- a/drivers/vfio/pci/pds/pci_drv.c
> +++ b/drivers/vfio/pci/pds/pci_drv.c
> @@ -20,6 +20,104 @@
>  #define PDS_VFIO_DRV_DESCRIPTION	"AMD/Pensando VFIO Device Driver"
>  #define PCI_VENDOR_ID_PENSANDO		0x1dd8
>  
> +static void
> +pds_vfio_recovery_work(struct work_struct *work)
> +{
> +	struct pds_vfio_pci_device *pds_vfio =
> +		container_of(work, struct pds_vfio_pci_device, work);
> +	bool deferred_reset_needed = false;
> +
> +	/* Documentation states that the kernel migration driver must not
> +	 * generate asynchronous device state transitions outside of
> +	 * manipulation by the user or the VFIO_DEVICE_RESET ioctl.
> +	 *
> +	 * Since recovery is an asynchronous event received from the device,
> +	 * initiate a deferred reset. Only issue the deferred reset if a
> +	 * migration is in progress, which will cause the next step of the
> +	 * migration to fail. Also, if the device is in a state that will
> +	 * be set to VFIO_DEVICE_STATE_RUNNING on the next action (i.e. VM is
> +	 * shutdown and device is in VFIO_DEVICE_STATE_STOP) as that will clear
> +	 * the VFIO_DEVICE_STATE_ERROR when the VM starts back up.
> +	 */
> +	mutex_lock(&pds_vfio->state_mutex);
> +	if ((pds_vfio->state != VFIO_DEVICE_STATE_RUNNING &&
> +	     pds_vfio->state != VFIO_DEVICE_STATE_ERROR) ||
> +	    (pds_vfio->state == VFIO_DEVICE_STATE_RUNNING &&
> +	     pds_vfio_dirty_is_enabled(pds_vfio)))
> +		deferred_reset_needed = true;
> +	mutex_unlock(&pds_vfio->state_mutex);
> +
> +	/* On the next user initiated state transition, the device will
> +	 * transition to the VFIO_DEVICE_STATE_ERROR. At this point it's the user's
> +	 * responsibility to reset the device.
> +	 *
> +	 * If a VFIO_DEVICE_RESET is requested post recovery and before the next
> +	 * state transition, then the deferred reset state will be set to
> +	 * VFIO_DEVICE_STATE_RUNNING.
> +	 */
> +	if (deferred_reset_needed)
> +		pds_vfio_deferred_reset(pds_vfio, VFIO_DEVICE_STATE_ERROR);
> +}

Why is this a work item? It runs on a blocking_notifier_chain, so it
can take the mutex directly?

Why is the locking like this? Can't you just call
pds_vfio_deferred_reset() under the mutex?

Jason
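
For readers following the review, here is a minimal userspace sketch of the
alternative being asked about: doing the state check and the deferred-reset
request in one critical section, directly from the event callback, with no
work item in between. It is a model only, not the driver's code. Every name
in it (model_dev, model_recovery_event, model_deferred_reset_locked, the
STATE_* values) is hypothetical, a pthread mutex stands in for the kernel
mutex, and it assumes the deferred-reset helper may be called with
state_mutex already held, which the quoted patch does not confirm.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the VFIO migration states named in the patch. */
enum model_state { STATE_ERROR, STATE_STOP, STATE_RUNNING, STATE_STOP_COPY };

/* Hypothetical, reduced model of the per-device data. */
struct model_dev {
	pthread_mutex_t state_mutex;	/* stands in for the kernel mutex */
	enum model_state state;
	bool dirty_enabled;		/* models pds_vfio_dirty_is_enabled() */
	bool deferred_reset;
	enum model_state deferred_reset_state;
};

static void model_dev_init(struct model_dev *d, enum model_state state,
			   bool dirty_enabled)
{
	pthread_mutex_init(&d->state_mutex, NULL);
	d->state = state;
	d->dirty_enabled = dirty_enabled;
	d->deferred_reset = false;
	d->deferred_reset_state = STATE_RUNNING;
}

/* Records the deferred reset request. Caller must hold state_mutex. */
static void model_deferred_reset_locked(struct model_dev *d,
					enum model_state reset_state)
{
	d->deferred_reset = true;
	d->deferred_reset_state = reset_state;
}

/*
 * Models the recovery event handler invoked from a notifier chain that is
 * allowed to sleep. The condition is the one from the quoted patch; the
 * difference is that the request is made before the lock is dropped.
 */
static void model_recovery_event(struct model_dev *d)
{
	pthread_mutex_lock(&d->state_mutex);
	if ((d->state != STATE_RUNNING && d->state != STATE_ERROR) ||
	    (d->state == STATE_RUNNING && d->dirty_enabled))
		model_deferred_reset_locked(d, STATE_ERROR);
	pthread_mutex_unlock(&d->state_mutex);
}
```

In this model the check and the request cannot be separated by a concurrent
state transition, because both happen while state_mutex is held. Whether the
real pds_vfio_deferred_reset() can be called this way depends on its own
locking, which is not shown in the quoted hunk.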
