Re: [PATCH 3/3] drm/xe: Force busted state and block GT reset upon any GPU hang

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"De Marchi, Lucas" <lucas.demarchi@intel.com>,
	"Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>,
	"Somaiya, Himanshu" <himanshu.somaiya@intel.com>,
	<matthew.brost@intel.com>
Subject: Re: [PATCH 3/3] drm/xe: Force busted state and block GT reset upon any GPU hang
Date: Fri, 15 Mar 2024 09:59:19 -0400	[thread overview]
Message-ID: <ZfRUN4oinqjQfTJz@intel.com> (raw)
In-Reply-To: <MW4PR11MB7056DA60DFEDC7ECDCEA2D0FB3282@MW4PR11MB7056.namprd11.prod.outlook.com>

On Fri, Mar 15, 2024 at 04:31:54AM +0000, Ghimiray, Himal Prasad wrote:
> 
> 
> > -----Original Message-----
> > From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf Of
> > Rodrigo Vivi
> > Sent: 15 March 2024 06:33
> > To: intel-xe@lists.freedesktop.org
> > Cc: Vivi, Rodrigo <rodrigo.vivi@intel.com>; De Marchi, Lucas
> > <lucas.demarchi@intel.com>; Teres Alexis, Alan Previn
> > <alan.previn.teres.alexis@intel.com>; Somaiya, Himanshu
> > <himanshu.somaiya@intel.com>
> > Subject: [PATCH 3/3] drm/xe: Force busted state and block GT reset upon any
> > GPU hang
> > 
> > In many validation situations when debugging GPU Hangs, it is useful to
> > preserve the GT situation from the moment that the timeout occurred.
> > 
> > This patch introduces a module parameter that could be used on situations
> > like this.
> > 
> > If xe.busted module parameter is set to 2, Xe will be declared busted on
> > every single execution timeout (a.k.a. GPU hang) right after devcoredump
> > snapshot capture and without attempting any kind of GT reset and blocking
> > entirely any kind of execution.
> > 
> > v2: Really block gt_reset from guc side. (Lucas)
> >     s/wedged/busted (Lucas)
> > 
> > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> > Cc: Himanshu Somaiya <himanshu.somaiya@intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c     | 30 ++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_device.h     | 13 +------------
> >  drivers/gpu/drm/xe/xe_guc_ads.c    |  7 +++++++
> >  drivers/gpu/drm/xe/xe_guc_submit.c |  4 ++++
> >  drivers/gpu/drm/xe/xe_module.c     |  5 +++++
> >  drivers/gpu/drm/xe/xe_module.h     |  1 +
> >  6 files changed, 48 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > b/drivers/gpu/drm/xe/xe_device.c index d02e59fb49eb..e28e3628744f
> > 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -774,3 +774,33 @@ u64 xe_device_uncanonicalize_addr(struct
> > xe_device *xe, u64 address)  {
> >  	return address & GENMASK_ULL(xe->info.va_bits - 1, 0);  }
> > +
> > +/**
> > + * xe_device_declare_busted - Declare device busted
> > + * @xe: xe device instance
> > + *
> > + * This is a final state that can only be cleared with a module
> > + * re-probe (unbind + bind).
> > + * In this state every IOCTL will be blocked so the GT cannot be used.
> > + * In general it will be called upon any critical error such as gt
> > +reset
> > + * failure or guc loading failure.
> > + * If xe.busted module parameter is set to 2, this function will be
> > +called
> > + * on every single execution timeout (a.k.a. GPU hang) right after
> > +devcoredump
> > + * snapshot capture. In this mode, GT reset won't be attempted so the
> > +state of
> > + * the issue is preserved for further debugging.
> > + */
> > +void xe_device_declare_busted(struct xe_device *xe) {
> > +	if (xe_modparam.busted_mode == 0)
> > +		return;
>  
> Do you see any usecase or benefit of providing the option to disable busted mode with modparam ?

honestly? No!
I just wanted to have a chicken way back to the current status quo...

Matt?
Lucas?
thoughts?

> 
> BR
> Himal 
> 
> > +
> > +	if (!atomic_xchg(&xe->busted, 1))
> > +		drm_err(&xe->drm,
> > +			"CRITICAL: Xe has declared device %s as busted.\n"
> > +			"IOCTLs and executions are blocked until device is
> > probed again with unbind and bind operations:\n"
> > +			"echo '%s' | sudo tee
> > /sys/bus/pci/drivers/xe/unbind\n"
> > +			"echo '%s' | sudo tee
> > /sys/bus/pci/drivers/xe/bind\n"
> > +			"Please file a _new_ bug report at
> > https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> > +			dev_name(xe->drm.dev), dev_name(xe->drm.dev),
> > +			dev_name(xe->drm.dev));
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > b/drivers/gpu/drm/xe/xe_device.h index 2c6d9b77821a..e6edf2d3ee4a
> > 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -181,17 +181,6 @@ static inline bool xe_device_busted(struct xe_device
> > *xe)
> >  	return atomic_read(&xe->busted);
> >  }
> > 
> > -static inline void xe_device_declare_busted(struct xe_device *xe) -{
> > -	if (!atomic_xchg(&xe->busted, 1))
> > -		drm_err(&xe->drm,
> > -			"CRITICAL: Xe has declared device %s as busted.\n"
> > -			"IOCTLs and executions are blocked until device is
> > probed again with unbind and bind operations:\n"
> > -			"echo '%s' | sudo tee
> > /sys/bus/pci/drivers/xe/unbind\n"
> > -			"echo '%s' | sudo tee
> > /sys/bus/pci/drivers/xe/bind\n"
> > -			"Please file a _new_ bug report at
> > https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> > -			dev_name(xe->drm.dev), dev_name(xe->drm.dev),
> > -			dev_name(xe->drm.dev));
> > -}
> > +void xe_device_declare_busted(struct xe_device *xe);
> > 
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ads.c
> > b/drivers/gpu/drm/xe/xe_guc_ads.c index 6ad4c1a90a78..ecf45289b187
> > 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ads.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ads.c
> > @@ -18,6 +18,7 @@
> >  #include "xe_lrc.h"
> >  #include "xe_map.h"
> >  #include "xe_mmio.h"
> > +#include "xe_module.h"
> >  #include "xe_platform_types.h"
> > 
> >  /* Slack of a few additional entries per engine */ @@ -312,10 +313,16 @@
> > int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
> > 
> >  static void guc_policies_init(struct xe_guc_ads *ads)  {
> > +	u32 global_flags = 0;
> > +
> >  	ads_blob_write(ads, policies.dpc_promote_time,
> >  		       GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US);
> >  	ads_blob_write(ads, policies.max_num_work_items,
> >  		       GLOBAL_POLICY_MAX_NUM_WI);
> > +
> > +	if (xe_modparam.busted_mode == 2)
> > +		global_flags |= GLOBAL_POLICY_DISABLE_ENGINE_RESET;
> > +
> >  	ads_blob_write(ads, policies.global_flags, 0);
> >  	ads_blob_write(ads, policies.is_valid, 1);  } diff --git
> > a/drivers/gpu/drm/xe/xe_guc_submit.c
> > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 82c955a2a15c..e7ddf35c1dac 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -34,6 +34,7 @@
> >  #include "xe_macros.h"
> >  #include "xe_map.h"
> >  #include "xe_mocs.h"
> > +#include "xe_module.h"
> >  #include "xe_ring_ops_types.h"
> >  #include "xe_sched_job.h"
> >  #include "xe_trace.h"
> > @@ -949,6 +950,9 @@ guc_exec_queue_timedout_job(struct
> > drm_sched_job *drm_job)
> >  	simple_error_capture(q);
> >  	xe_devcoredump(job);
> > 
> > +	if (xe_modparam.busted_mode == 2)
> > +		xe_device_declare_busted(xe);
> > +
> >  	trace_xe_sched_job_timedout(job);
> > 
> >  	/* Kill the run_job entry point */
> > diff --git a/drivers/gpu/drm/xe/xe_module.c
> > b/drivers/gpu/drm/xe/xe_module.c index 110b69864656..f81970e8d713
> > 100644
> > --- a/drivers/gpu/drm/xe/xe_module.c
> > +++ b/drivers/gpu/drm/xe/xe_module.c
> > @@ -17,6 +17,7 @@ struct xe_modparam xe_modparam = {
> >  	.enable_display = true,
> >  	.guc_log_level = 5,
> >  	.force_probe = CONFIG_DRM_XE_FORCE_PROBE,
> > +	.busted_mode = 1,
> >  	/* the rest are 0 by default */
> >  };
> > 
> > @@ -48,6 +49,10 @@ module_param_named_unsafe(force_probe,
> > xe_modparam.force_probe, charp, 0400);
> > MODULE_PARM_DESC(force_probe,
> >  		 "Force probe options for specified devices. See
> > CONFIG_DRM_XE_FORCE_PROBE for details.");
> > 
> > +module_param_named_unsafe(busted_mode,
> > xe_modparam.busted_mode, int,
> > +0600); MODULE_PARM_DESC(busted_mode,
> > +		 "Module's default policy for the busted mode - 0=never,
> > +1=upon-critical-errors[default], 2=upon-any-hang");
> > +
> >  struct init_funcs {
> >  	int (*init)(void);
> >  	void (*exit)(void);
> > diff --git a/drivers/gpu/drm/xe/xe_module.h
> > b/drivers/gpu/drm/xe/xe_module.h index 88ef0e8b2bfd..bbf88c34e4f4
> > 100644
> > --- a/drivers/gpu/drm/xe/xe_module.h
> > +++ b/drivers/gpu/drm/xe/xe_module.h
> > @@ -18,6 +18,7 @@ struct xe_modparam {
> >  	char *huc_firmware_path;
> >  	char *gsc_firmware_path;
> >  	char *force_probe;
> > +	int busted_mode;
> >  };
> > 
> >  extern struct xe_modparam xe_modparam;
> > --
> > 2.44.0
>

next prev parent reply	other threads:[~2024-03-15 13:59 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-15  1:03 [PATCH 1/3] drm/xe: Introduce a simple busted state Rodrigo Vivi
2024-03-15  1:03 ` [PATCH 2/3] drm/xe: declare busted upon GuC load failure Rodrigo Vivi
2024-03-15 11:16   ` Ghimiray, Himal Prasad
2024-03-15 13:54     ` Rodrigo Vivi
2024-03-15 14:23   ` Ghimiray, Himal Prasad
2024-03-15  1:03 ` [PATCH 3/3] drm/xe: Force busted state and block GT reset upon any GPU hang Rodrigo Vivi
2024-03-15  4:31   ` Ghimiray, Himal Prasad
2024-03-15 13:59     ` Rodrigo Vivi [this message]
2024-03-15  1:08 ` ✓ CI.Patch_applied: success for series starting with [1/3] drm/xe: Introduce a simple busted state Patchwork
2024-03-15  1:08 ` ✓ CI.checkpatch: " Patchwork
2024-03-15  1:09 ` ✓ CI.KUnit: " Patchwork
2024-03-15  1:20 ` ✓ CI.Build: " Patchwork
2024-03-15  1:22 ` ✓ CI.Hooks: " Patchwork
2024-03-15  1:23 ` [PATCH 1/3] " Dixit, Ashutosh
2024-03-15  3:08   ` Lucas De Marchi
2024-03-15  3:09   ` Rodrigo Vivi
2024-03-15 14:14     ` Tvrtko Ursulin
2024-03-18 23:23       ` Dixit, Ashutosh
2024-03-15  1:24 ` ✓ CI.checksparse: success for series starting with [1/3] " Patchwork
2024-03-15  1:48 ` ✓ CI.BAT: " Patchwork
2024-03-15  7:13 ` [PATCH 1/3] " Aravind Iddamsetty
2024-03-15 13:50   ` Rodrigo Vivi
2024-03-15 11:52 ` Ghimiray, Himal Prasad

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZfRUN4oinqjQfTJz@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=alan.previn.teres.alexis@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=himanshu.somaiya@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=matthew.brost@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.