Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: intel-xe <intel-xe@lists.freedesktop.org>,
	Francois Dugast <francois.dugast@intel.com>,
	Riana Tauro <riana.tauro@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe: Move survivability back to xe
Date: Tue, 11 Mar 2025 16:55:16 -0400	[thread overview]
Message-ID: <Z9CjNDYFg40wqT5l@intel.com> (raw)
In-Reply-To: <20250311-fix-survivability-v2-1-729ce081155e@intel.com>

On Tue, Mar 11, 2025 at 11:34:55AM -0700, Lucas De Marchi wrote:
> Commit d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci")
> moved the survivability handling to be done entirely in the xe_pci
> layer. However there are some issues with that approach:
> 
> 1) Survivability mode needs at least the mmio initialized, otherwise it
>    can't really read a register to decide if it should enter that state
> 2) SR-IOV mode should be initialized, otherwise it's not possible to
>    check if it's VF
> 
> Besides, as pointed by Riana the check for
> xe_survivability_mode_enable() was wrong in xe_pci_probe() since it's
> not a bool return.
> 
> Fix that by moving the initialization to be entirely in the xe_device
> layer, with the correct dependencies handled. The xe_pci now only checks
> for "is it enabled?", like it's doing in
> xe_pci_suspend()/xe_pci_remove(), etc.
> 
> Cc: Riana Tauro <riana.tauro@intel.com>
> Fixes: d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci")
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c             | 14 +++++++++++++-
>  drivers/gpu/drm/xe/xe_pci.c                | 16 +++++++---------
>  drivers/gpu/drm/xe/xe_survivability_mode.c | 14 +++++++++-----
>  3 files changed, 29 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 5d79b439dd625..023290e5be392 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -53,6 +53,7 @@
>  #include "xe_pxp.h"
>  #include "xe_query.h"
>  #include "xe_shrinker.h"
> +#include "xe_survivability_mode.h"
>  #include "xe_sriov.h"
>  #include "xe_tile.h"
>  #include "xe_ttm_stolen_mgr.h"
> @@ -705,8 +706,19 @@ int xe_device_probe_early(struct xe_device *xe)
>  	sriov_update_device_info(xe);
>  
>  	err = xe_pcode_probe_early(xe);
> -	if (err)
> +	if (err) {
> +		int save_err = err;
> +
> +		/*
> +		 * Try to leave device in survivability mode if device is
> +		 * capable
> +		 */
> +		err = xe_survivability_mode_enable(xe);
> +		if (!err || err == -ENOTRECOVERABLE)
> +			return save_err;
> +
>  		return err;
> +	}
>  
>  	err = wait_for_lmem_ready(xe);

I'm not 100% sure if we can skip this...
Before this patch if lmem_ready failed we would also check for the
survivability bit...

>  	if (err)
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index 4d982a5a4ffd9..6fea3091e2348 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -808,16 +808,14 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		return err;
>  
>  	err = xe_device_probe_early(xe);
> -
> -	/*
> -	 * In Boot Survivability mode, no drm card is exposed and driver is
> -	 * loaded with bare minimum to allow for firmware to be flashed through
> -	 * mei. If early probe fails, check if survivability mode is flagged by
> -	 * HW to be enabled. In that case enable it and return success.
> -	 */
>  	if (err) {
> -		if (xe_survivability_mode_required(xe) &&
> -		    xe_survivability_mode_enable(xe))
> +		/*
> +		 * In Boot Survivability mode, no drm card is exposed and driver
> +		 * is loaded with bare minimum to allow for firmware to be
> +		 * flashed through mei. If early probe failed, but it managed to
> +		 * enable survivability mode, return success.
> +		 */
> +		if (xe_survivability_mode_is_enabled(xe))
>  			return 0;
>  
>  		return err;
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index d939ce70e6fa8..153b8d598a270 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -178,15 +178,16 @@ bool xe_survivability_mode_is_enabled(struct xe_device *xe)
>  	return xe->survivability.mode;
>  }
>  
> -/**
> - * xe_survivability_mode_required - checks if survivability mode is required
> +/*
> + * xe_survivability_mode_capable - checks if it's possible to enable
> + * survivability mode

that bit doesn't indicate capabitility, but if mode is required by FW
after identifying if something went wrong like done checks and lmem
link trainings...

>   * @xe: xe device instance
>   *
> - * This function reads the boot status from Pcode
> + * This function reads the boot status from Pcode.
>   *
> - * Return: true if boot status indicates failure, false otherwise
> + * Return: true if boot status indicates failure, false otherwise.
>   */
> -bool xe_survivability_mode_required(struct xe_device *xe)
> +static bool xe_survivability_mode_capable(struct xe_device *xe)
>  {
>  	struct xe_survivability *survivability = &xe->survivability;
>  	struct xe_mmio *mmio = xe_root_tile_mmio(xe);
> @@ -216,6 +217,9 @@ int xe_survivability_mode_enable(struct xe_device *xe)
>  	struct xe_survivability_info *info;
>  	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>  
> +	if (!xe_survivability_mode_capable(xe))
> +		return -ENOTRECOVERABLE;

We might want to enter survivability regardless of that bit as well.
With that configfs work that Riana is doing for instance, you set
that, but you disregard if that was requested by PCODE in that
'required' bit.

> +
>  	survivability->size = MAX_SCRATCH_MMIO;
>  
>  	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
> 
> -- 
> 2.48.1
> 

  reply	other threads:[~2025-03-11 20:55 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-11 18:34 [PATCH v2 0/2] drm/xe: Fix survivability Lucas De Marchi
2025-03-11 18:34 ` [PATCH v2 1/2] drm/xe: Move survivability back to xe Lucas De Marchi
2025-03-11 20:55   ` Rodrigo Vivi [this message]
2025-03-11 21:09     ` Lucas De Marchi
2025-03-12 20:12       ` Rodrigo Vivi
2025-03-11 18:34 ` [PATCH v2 2/2] drm/xe: Allow to inject error in early probe Lucas De Marchi
2025-03-12  8:20   ` Francois Dugast
2025-03-11 22:02 ` ✓ CI.Patch_applied: success for drm/xe: Fix survivability (rev2) Patchwork
2025-03-11 22:03 ` ✓ CI.checkpatch: " Patchwork
2025-03-11 22:04 ` ✓ CI.KUnit: " Patchwork
2025-03-11 22:21 ` ✓ CI.Build: " Patchwork
2025-03-11 22:23 ` ✓ CI.Hooks: " Patchwork
2025-03-11 22:24 ` ✓ CI.checksparse: " Patchwork
2025-03-12  5:46 ` ✓ Xe.CI.BAT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z9CjNDYFg40wqT5l@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=francois.dugast@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=riana.tauro@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox