Re: [PATCH 1/3] drm/xe: Introduce a simple wedged state

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Anshuman Gupta <anshuman.gupta@intel.com>
Subject: Re: [PATCH 1/3] drm/xe: Introduce a simple wedged state
Date: Wed, 13 Mar 2024 17:40:09 -0400	[thread overview]
Message-ID: <ZfIdOd_wHgm0VcQx@intel.com> (raw)
In-Reply-To: <qjap3n4uesovz4eaxnxjwvdiylv7gu7mz6p75umc2lvvcqi2hx@ghdxl2vwkz4g>

On Wed, Mar 13, 2024 at 03:27:05PM -0500, Lucas De Marchi wrote:
> On Wed, Mar 13, 2024 at 03:54:57PM -0400, Rodrigo Vivi wrote:
> > Introduce a very simple 'wedged' state where any attempt
> > to access the GPU is entirely blocked.
> > 
> > On some critical cases, like on gt_reset failure, we need to
> > block any other attempt to use the GPU. Otherwise we are at
> > a risk of reaching cases that would force us to reboot the machine.
> > 
> > So, when this cases are identified we corner and block any GPU
> > access. No IOCTL and not even another GT reset should be attempted.
> > 
> > The 'wedged' state in Xe is an end state with no way back.
> > Only a module reload can restore the GPU access.
> 
> I don´t really like this last line. See my feedback on igt: this should
> not be a module thing, but rather a device thing. I don't want to nuke
> xe from my igpu when debugging something in DG2.

yes, I will take care of this and everything else...
it is per device, so unbind + bind of the device should be enough.

> 
> > 
> > Cc: Anshuman Gupta <anshuman.gupta@intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_device.c       |  6 ++++++
> > drivers/gpu/drm/xe/xe_device.h       | 11 +++++++++++
> > drivers/gpu/drm/xe/xe_device_types.h |  6 ++++++
> > drivers/gpu/drm/xe/xe_gt.c           |  4 ++++
> > drivers/gpu/drm/xe/xe_migrate.c      |  6 ++++++
> > 5 files changed, 33 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 919ad88f0495..5f0a2bdb7c24 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -142,6 +142,9 @@ static long xe_drm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > 	struct xe_device *xe = to_xe_device(file_priv->minor->dev);
> > 	long ret;
> > 
> > +	if (xe_device_wedged(xe))
> > +		return -ECANCELED;
> > +
> > 	ret = xe_pm_runtime_get_ioctl(xe);
> > 	if (ret >= 0)
> > 		ret = drm_ioctl(file, cmd, arg);
> > @@ -157,6 +160,9 @@ static long xe_drm_compat_ioctl(struct file *file, unsigned int cmd, unsigned lo
> > 	struct xe_device *xe = to_xe_device(file_priv->minor->dev);
> > 	long ret;
> > 
> > +	if (xe_device_wedged(xe))
> > +		return -ECANCELED;
> > +
> > 	ret = xe_pm_runtime_get_ioctl(xe);
> > 	if (ret >= 0)
> > 		ret = drm_compat_ioctl(file, cmd, arg);
> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > index 14be34d9f543..d10664d32f7f 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -176,4 +176,15 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > 
> > +static inline bool xe_device_wedged(struct xe_device *xe)
> > +{
> > +	return atomic_read(&xe->wedged);
> > +}
> > +
> > +static inline void xe_device_declare_wedged(struct xe_device *xe)
> > +{
> > +	atomic_set(&xe->wedged, 1);
> > +	drm_err(&xe->drm, "CRITICAL: Xe has been declared wedged. A module reload is required.\n");
> 
> should probably identify the device and remove the mention to module
> reload. At most say "Device needs to be re-initialized" (or "re-probed",
> not sure which is better). Do we want to mention it will be FLR'ed in
> that case?

I don't want to mention the FLR specifically.
We have the driver-initiated flr that will be likely trigerred on
this unbind, but that is not the full PCI FLR.

For that we would need to write 1 to the 'reset' entry of the PCI
device. And in some cases we might even need that done by the user.

Or perhaps a big message explaining everything?

> 
> the rest of the patch seems to be device-based, so it's more the commit
> message and log above.
> 
> > +}
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > index 9785eef2e5a4..13971eb2334f 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -455,6 +455,12 @@ struct xe_device {
> > 	/** @needs_flr_on_fini: requests function-reset on fini */
> > 	bool needs_flr_on_fini;
> > 
> > +	/**
> > +	 * @wedged: Xe device faced a critical error and is now blocked.
> > +	 * It cannot return to life without a module reload.
> > +	 */
> > +	atomic_t wedged;
> > +
> > 	/* private: */
> > 
> > #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 85408e7a932b..972c0c6d0608 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -633,6 +633,9 @@ static int gt_reset(struct xe_gt *gt)
> > {
> > 	int err;
> > 
> > +	if (xe_device_wedged(gt_to_xe(gt)))
> > +	    return -ECANCELED;
> > +
> > 	/* We only support GT resets with GuC submission */
> > 	if (!xe_device_uc_enabled(gt_to_xe(gt)))
> > 		return -ENODEV;
> > @@ -686,6 +689,7 @@ static int gt_reset(struct xe_gt *gt)
> > 	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > 
> > 	gt_to_xe(gt)->needs_flr_on_fini = true;
> > +	xe_device_declare_wedged(gt_to_xe(gt));
> > 
> > 	return err;
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > index ee1bb938c493..5b2eeb2048b5 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -713,6 +713,9 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
> > 		xe_bo_needs_ccs_pages(src_bo) && xe_bo_needs_ccs_pages(dst_bo);
> > 	bool copy_system_ccs = copy_ccs && (!src_is_vram || !dst_is_vram);
> > 
> > +	if (xe_device_wedged(xe))
> > +	    return ERR_PTR(-ECANCELED);
> > +
> > 	/* Copying CCS between two different BOs is not supported yet. */
> > 	if (XE_WARN_ON(copy_ccs && src_bo != dst_bo))
> > 		return ERR_PTR(-EINVAL);
> > @@ -986,6 +989,9 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > 	int err;
> > 	int pass = 0;
> > 
> > +	if (xe_device_wedged(xe))
> > +	    return ERR_PTR(-ECANCELED);
> > +
> 
> bikeshed: I never liked the term "wedged" much from i915.
> Wouldn't "dead" describe the situation better?

I'm with you here. I was even pushing back on 'wedge' so people
would think that we would have a complicated machinery state like
i915 where people get that to plug some workarounds.

I had 'busted' in mind. thoughts about 'busted'?

device is not completely 'dead' because display is still working...

> 
> what was the motivation to add the checks on migration and migration
> only? If we want to make it die/wedge, shouldn't we be checking this in
> other places as well?

on migration we try to submit ccs commands... and on unbind/unprobe
we call the eviction of the buffers that ends up on that place.

So far these are the minimal places that I found that could block
the execution without entirely blocking the display. So I'd like
to start with this.

Perhaps we will end up finding more places to protect, but for now
I don't want the big hammer to block display.

> 
> Lucas De Marchi
> 
> 
> 
> > 	if (!clear_vram)
> > 		xe_res_first_sg(xe_bo_sg(bo), 0, bo->size, &src_it);
> > 	else
> > -- 
> > 2.44.0
> >

next prev parent reply	other threads:[~2024-03-13 21:40 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-13 19:54 [PATCH 1/3] drm/xe: Introduce a simple wedged state Rodrigo Vivi
2024-03-13 19:54 ` [PATCH 2/3] drm/xe: declare wedged and abort probe upon GuC load failure Rodrigo Vivi
2024-03-13 19:54 ` [PATCH 3/3] drm/xe: Force wedged state and block GT reset upon any GPU hang Rodrigo Vivi
2024-03-13 20:49   ` Lucas De Marchi
2024-03-13 20:56     ` Somaiya, Himanshu
2024-03-13 21:44     ` Rodrigo Vivi
2024-03-13 21:54       ` Lucas De Marchi
2024-03-13 22:06         ` Rodrigo Vivi
2024-03-14  4:00           ` Lucas De Marchi
2024-03-15  1:06             ` Rodrigo Vivi
2024-03-15  6:28               ` Teres Alexis, Alan Previn
2024-03-13 20:00 ` ✓ CI.Patch_applied: success for series starting with [1/3] drm/xe: Introduce a simple wedged state Patchwork
2024-03-13 20:01 ` ✗ CI.checkpatch: warning " Patchwork
2024-03-13 20:01 ` ✓ CI.KUnit: success " Patchwork
2024-03-13 20:12 ` ✓ CI.Build: " Patchwork
2024-03-13 20:16 ` ✓ CI.Hooks: " Patchwork
2024-03-13 20:18 ` ✓ CI.checksparse: " Patchwork
2024-03-13 20:27 ` [PATCH 1/3] " Lucas De Marchi
2024-03-13 21:40   ` Rodrigo Vivi [this message]
2024-03-13 20:39 ` ✓ CI.BAT: success for series starting with [1/3] " Patchwork
2024-03-14  1:40 ` [PATCH 1/3] " Aravind Iddamsetty
2024-03-14  3:33   ` Aravind Iddamsetty
2024-03-14 18:53     ` Rodrigo Vivi
2024-03-15  7:01       ` Aravind Iddamsetty

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZfIdOd_wHgm0VcQx@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox