Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: intel-xe <intel-xe@lists.freedesktop.org>,
	Francois Dugast <francois.dugast@intel.com>,
	Riana Tauro <riana.tauro@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe: Move survivability back to xe
Date: Wed, 12 Mar 2025 16:12:17 -0400	[thread overview]
Message-ID: <Z9HqoTBhvwscVxOE@intel.com> (raw)
In-Reply-To: <d5o2j2fff2ntlmzq43marpyx3r3uec3cyarxjhlq72e64xxydl@gmudrfgn7xhm>

On Tue, Mar 11, 2025 at 04:09:44PM -0500, Lucas De Marchi wrote:
> On Tue, Mar 11, 2025 at 04:55:16PM -0400, Rodrigo Vivi wrote:
> > On Tue, Mar 11, 2025 at 11:34:55AM -0700, Lucas De Marchi wrote:
> > > Commit d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci")
> > > moved the survivability handling to be done entirely in the xe_pci
> > > layer. However there are some issues with that approach:
> > > 
> > > 1) Survivability mode needs at least the mmio initialized, otherwise it
> > >    can't really read a register to decide if it should enter that state
> > > 2) SR-IOV mode should be initialized, otherwise it's not possible to
> > >    check if it's VF
> > > 
> > > Besides, as pointed by Riana the check for
> > > xe_survivability_mode_enable() was wrong in xe_pci_probe() since it's
> > > not a bool return.
> > > 
> > > Fix that by moving the initialization to be entirely in the xe_device
> > > layer, with the correct dependencies handled. The xe_pci now only checks
> > > for "is it enabled?", like it's doing in
> > > xe_pci_suspend()/xe_pci_remove(), etc.
> > > 
> > > Cc: Riana Tauro <riana.tauro@intel.com>
> > > Fixes: d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci")
> > > Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device.c             | 14 +++++++++++++-
> > >  drivers/gpu/drm/xe/xe_pci.c                | 16 +++++++---------
> > >  drivers/gpu/drm/xe/xe_survivability_mode.c | 14 +++++++++-----
> > >  3 files changed, 29 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > index 5d79b439dd625..023290e5be392 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -53,6 +53,7 @@
> > >  #include "xe_pxp.h"
> > >  #include "xe_query.h"
> > >  #include "xe_shrinker.h"
> > > +#include "xe_survivability_mode.h"
> > >  #include "xe_sriov.h"
> > >  #include "xe_tile.h"
> > >  #include "xe_ttm_stolen_mgr.h"
> > > @@ -705,8 +706,19 @@ int xe_device_probe_early(struct xe_device *xe)
> > >  	sriov_update_device_info(xe);
> > > 
> > >  	err = xe_pcode_probe_early(xe);
> > > -	if (err)
> > > +	if (err) {
> > > +		int save_err = err;
> > > +
> > > +		/*
> > > +		 * Try to leave device in survivability mode if device is
> > > +		 * capable
> > > +		 */
> > > +		err = xe_survivability_mode_enable(xe);
> > > +		if (!err || err == -ENOTRECOVERABLE)
> > > +			return save_err;
> > > +
> > >  		return err;
> > > +	}
> > > 
> > >  	err = wait_for_lmem_ready(xe);
> > 
> > I'm not 100% sure if we can skip this...
> > Before this patch if lmem_ready failed we would also check for the
> > survivability bit...
> 
> which was not the case before the breakage in
> commit d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci").
> This restores that behavior, without the odd split of "init in the inner
> layer (xe_device) / enable in the outer layer (xe_pci)".

hmm... okay. perhaps we should just mention in the commit message then...

> 
> > 
> > >  	if (err)
> > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > index 4d982a5a4ffd9..6fea3091e2348 100644
> > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > @@ -808,16 +808,14 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> > >  		return err;
> > > 
> > >  	err = xe_device_probe_early(xe);
> > > -
> > > -	/*
> > > -	 * In Boot Survivability mode, no drm card is exposed and driver is
> > > -	 * loaded with bare minimum to allow for firmware to be flashed through
> > > -	 * mei. If early probe fails, check if survivability mode is flagged by
> > > -	 * HW to be enabled. In that case enable it and return success.
> > > -	 */
> > >  	if (err) {
> > > -		if (xe_survivability_mode_required(xe) &&
> > > -		    xe_survivability_mode_enable(xe))
> > > +		/*
> > > +		 * In Boot Survivability mode, no drm card is exposed and driver
> > > +		 * is loaded with bare minimum to allow for firmware to be
> > > +		 * flashed through mei. If early probe failed, but it managed to
> > > +		 * enable survivability mode, return success.
> > > +		 */
> > > +		if (xe_survivability_mode_is_enabled(xe))
> > >  			return 0;
> > > 
> > >  		return err;
> > > diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > index d939ce70e6fa8..153b8d598a270 100644
> > > --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > @@ -178,15 +178,16 @@ bool xe_survivability_mode_is_enabled(struct xe_device *xe)
> > >  	return xe->survivability.mode;
> > >  }
> > > 
> > > -/**
> > > - * xe_survivability_mode_required - checks if survivability mode is required
> > > +/*
> > > + * xe_survivability_mode_capable - checks if it's possible to enable
> > > + * survivability mode
> > 
> > that bit doesn't indicate capabitility, but if mode is required by FW
> > after identifying if something went wrong like done checks and lmem
> > link trainings...
> 
> You are saying the HW **requires** to enter survivability mode and yet
> there are multiple checks for "is the HW even capable of doing that" by
> means of platform and IP version.  Combining all of them to me means
> more a capability: platform supports it, and during initialization it
> **signaled that capability** rather than required anything.
> 
> maybe neither capable() nor required()... but requested()?

+1 on 'requested'!

> 
> > 
> > >   * @xe: xe device instance
> > >   *
> > > - * This function reads the boot status from Pcode
> > > + * This function reads the boot status from Pcode.
> > >   *
> > > - * Return: true if boot status indicates failure, false otherwise
> > > + * Return: true if boot status indicates failure, false otherwise.
> > >   */
> > > -bool xe_survivability_mode_required(struct xe_device *xe)
> > > +static bool xe_survivability_mode_capable(struct xe_device *xe)
> > >  {
> > >  	struct xe_survivability *survivability = &xe->survivability;
> > >  	struct xe_mmio *mmio = xe_root_tile_mmio(xe);
> > > @@ -216,6 +217,9 @@ int xe_survivability_mode_enable(struct xe_device *xe)
> > >  	struct xe_survivability_info *info;
> > >  	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> > > 
> > > +	if (!xe_survivability_mode_capable(xe))
> > > +		return -ENOTRECOVERABLE;
> > 
> > We might want to enter survivability regardless of that bit as well.
> > With that configfs work that Riana is doing for instance, you set
> > that, but you disregard if that was requested by PCODE in that
> > 'required' bit.
> 
> we'd add the check inside xe_survivability_mode_capable() to account for
> that, still checking for platform. For my tests I was just doing a
> 
> 	return .... || true;
> 
> in that function, to be replaced by the configfs check.

ah okay then!

> 
> Lucas De Marchi
> 
> > 
> > > +
> > >  	survivability->size = MAX_SCRATCH_MMIO;
> > > 
> > >  	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
> > > 
> > > --
> > > 2.48.1
> > > 

  reply	other threads:[~2025-03-12 20:12 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-11 18:34 [PATCH v2 0/2] drm/xe: Fix survivability Lucas De Marchi
2025-03-11 18:34 ` [PATCH v2 1/2] drm/xe: Move survivability back to xe Lucas De Marchi
2025-03-11 20:55   ` Rodrigo Vivi
2025-03-11 21:09     ` Lucas De Marchi
2025-03-12 20:12       ` Rodrigo Vivi [this message]
2025-03-11 18:34 ` [PATCH v2 2/2] drm/xe: Allow to inject error in early probe Lucas De Marchi
2025-03-12  8:20   ` Francois Dugast
2025-03-11 22:02 ` ✓ CI.Patch_applied: success for drm/xe: Fix survivability (rev2) Patchwork
2025-03-11 22:03 ` ✓ CI.checkpatch: " Patchwork
2025-03-11 22:04 ` ✓ CI.KUnit: " Patchwork
2025-03-11 22:21 ` ✓ CI.Build: " Patchwork
2025-03-11 22:23 ` ✓ CI.Hooks: " Patchwork
2025-03-11 22:24 ` ✓ CI.checksparse: " Patchwork
2025-03-12  5:46 ` ✓ Xe.CI.BAT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z9HqoTBhvwscVxOE@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=francois.dugast@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=riana.tauro@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox