All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Matthew Brost <matthew.brost@intel.com>,
	 Dafna Hirschfeld <dhirschfeld@habana.ai>,
	Alan Previn <alan.previn.teres.alexis@intel.com>,
	Himanshu Somaiya <himanshu.somaiya@intel.com>
Subject: Re: [PATCH 3/4] drm/xe: Force wedged state and block GT reset upon any GPU hang
Date: Tue, 16 Apr 2024 15:08:15 -0400	[thread overview]
Message-ID: <Zh7Mn0Wr_OHCkGf6@intel.com> (raw)
In-Reply-To: <y3bo36g6qldbxafa6mbukof2zfq5bnxlcd4xagbwkapowgyjwa@5vbtlvpu4hcg>

On Tue, Apr 16, 2024 at 12:19:50PM -0500, Lucas De Marchi wrote:
> On Tue, Apr 09, 2024 at 06:15:06PM GMT, Rodrigo Vivi wrote:
> > In many validation situations when debugging GPU Hangs,
> > it is useful to preserve the GT situation from the moment
> > that the timeout occurred.
> > 
> > This patch introduces a module parameter that could be used
> > on situations like this.
> > 
> > If xe.wedged module parameter is set to 2, Xe will be declared
> > wedged on every single execution timeout (a.k.a. GPU hang) right
> > after devcoredump snapshot capture and without attempting any
> > kind of GT reset and blocking entirely any kind of execution.
> > 
> > v2: Really block gt_reset from guc side. (Lucas)
> >    s/wedged/busted (Lucas)
> > 
> > v3: - s/busted/wedged
> >    - Really use global_flags (Dafna)
> >    - More robust timeout handling when wedging it.
> > 
> > v4: A really robust clean exit done by Matt Brost.
> >    No more kernel warns on unbind.
> > 
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Dafna Hirschfeld <dhirschfeld@habana.ai>
> > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> > Cc: Himanshu Somaiya <himanshu.somaiya@intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_device.c              | 32 ++++++++
> > drivers/gpu/drm/xe/xe_device.h              | 15 +---
> > drivers/gpu/drm/xe/xe_exec_queue.h          |  9 +++
> > drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |  2 +-
> > drivers/gpu/drm/xe/xe_guc_ads.c             |  9 ++-
> > drivers/gpu/drm/xe/xe_guc_submit.c          | 90 +++++++++++++++++----
> > drivers/gpu/drm/xe/xe_module.c              |  5 ++
> > drivers/gpu/drm/xe/xe_module.h              |  1 +
> > 8 files changed, 132 insertions(+), 31 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 67de795e43b3..7928a5470cee 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -785,3 +785,35 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address)
> > {
> > 	return address & GENMASK_ULL(xe->info.va_bits - 1, 0);
> > }
> > +
> > +/**
> > + * xe_device_declare_wedged - Declare device wedged
> > + * @xe: xe device instance
> > + *
> > + * This is a final state that can only be cleared with a module
> > + * re-probe (unbind + bind).
> > + * In this state every IOCTL will be blocked so the GT cannot be used.
> > + * In general it will be called upon any critical error such as gt reset
> > + * failure or guc loading failure.
> > + * If xe.wedged module parameter is set to 2, this function will be called
> > + * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
> > + * snapshot capture. In this mode, GT reset won't be attempted so the state of
> > + * the issue is preserved for further debugging.
> > + */
> > +void xe_device_declare_wedged(struct xe_device *xe)
> 
> nit: this function could have been added in this file already in the
> previous patches.

indeed.. I would had struggled less with the rebases as well :/

> 
> > +{
> > +	if (xe_modparam.wedged_mode == 0)
> > +		return;
> 
> ^ so this would be the only diff

yeap, sorry about that :/

> 
> > +
> > +	if (!atomic_xchg(&xe->wedged, 1)) {
> > +		xe->needs_flr_on_fini = true;
> > +		drm_err(&xe->drm,
> > +			"CRITICAL: Xe has declared device %s as wedged.\n"
> > +			"IOCTLs and executions are blocked until device is probed again with unbind and bind operations:\n"
> > +			"echo '%s' | sudo tee /sys/bus/pci/drivers/xe/unbind\n"
> > +			"echo '%s' | sudo tee /sys/bus/pci/drivers/xe/bind\n"
> 
> this brings back the old version. Alternatively we may simplify the
> error message with:
> 
>  		drm_err(&xe->drm,
>  			"CRITICAL: Xe has declared device %s as wedged.\n"
>  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>  			dev_name(xe->drm.dev));
> 
> 
> up to you.
> 
> > +			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> > +			dev_name(xe->drm.dev), dev_name(xe->drm.dev),
> > +			dev_name(xe->drm.dev));
> > +	}
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > index c532209c5bbd..0fea5c18f76d 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -181,19 +181,6 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> > 	return atomic_read(&xe->wedged);
> > }
> > 
> > -static inline void xe_device_declare_wedged(struct xe_device *xe)
> > -{
> > -	if (!atomic_xchg(&xe->wedged, 1)) {
> > -		xe->needs_flr_on_fini = true;
> > -		drm_err(&xe->drm,
> > -			"CRITICAL: Xe has declared device %s as wedged.\n"
> > -			"IOCTLs and executions are blocked until device is probed again with unbind and bind operations:\n"
> > -			"echo '%s' > /sys/bus/pci/drivers/xe/unbind\n"
> > -			"echo '%s' > /sys/bus/pci/drivers/xe/bind\n"
> > -			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> > -			dev_name(xe->drm.dev), dev_name(xe->drm.dev),
> > -			dev_name(xe->drm.dev));
> > -	}
> > -}
> > +void xe_device_declare_wedged(struct xe_device *xe);
> > 
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
> > index 02ce8d204622..48f6da53a292 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue.h
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue.h
> > @@ -26,6 +26,15 @@ void xe_exec_queue_fini(struct xe_exec_queue *q);
> > void xe_exec_queue_destroy(struct kref *ref);
> > void xe_exec_queue_assign_name(struct xe_exec_queue *q, u32 instance);
> > 
> > +static inline struct xe_exec_queue *
> > +xe_exec_queue_get_unless_zero(struct xe_exec_queue *q)
> > +{
> > +	if (kref_get_unless_zero(&q->refcount))
> > +		return q;
> > +
> > +	return NULL;
> > +}
> > +
> > struct xe_exec_queue *xe_exec_queue_lookup(struct xe_file *xef, u32 id);
> > 
> > static inline struct xe_exec_queue *xe_exec_queue_get(struct xe_exec_queue *q)
> > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > index 93df2d7969b3..8e9c4b990fbb 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > @@ -245,7 +245,7 @@ int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt)
> > 			return seqno;
> > 
> > 		xe_gt_tlb_invalidation_wait(gt, seqno);
> > -	} else if (xe_device_uc_enabled(xe)) {
> > +	} else if (xe_device_uc_enabled(xe) && !xe_device_wedged(xe)) {
> > 		xe_gt_WARN_ON(gt, xe_force_wake_get(gt_to_fw(gt), XE_FW_GT));
> > 		if (xe->info.platform == XE_PVC || GRAPHICS_VER(xe) >= 20) {
> > 			xe_mmio_write32(gt, PVC_GUC_TLB_INV_DESC1,
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ads.c b/drivers/gpu/drm/xe/xe_guc_ads.c
> > index 757cbbb87869..dbd88ae20aa3 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ads.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ads.c
> > @@ -20,6 +20,7 @@
> > #include "xe_lrc.h"
> > #include "xe_map.h"
> > #include "xe_mmio.h"
> > +#include "xe_module.h"
> > #include "xe_platform_types.h"
> > #include "xe_wa.h"
> > 
> > @@ -394,11 +395,17 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
> > 
> > static void guc_policies_init(struct xe_guc_ads *ads)
> > {
> > +	u32 global_flags = 0;
> > +
> > 	ads_blob_write(ads, policies.dpc_promote_time,
> > 		       GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US);
> > 	ads_blob_write(ads, policies.max_num_work_items,
> > 		       GLOBAL_POLICY_MAX_NUM_WI);
> > -	ads_blob_write(ads, policies.global_flags, 0);
> > +
> > +	if (xe_modparam.wedged_mode == 2)
> > +		global_flags |= GLOBAL_POLICY_DISABLE_ENGINE_RESET;
> > +
> > +	ads_blob_write(ads, policies.global_flags, global_flags);
> > 	ads_blob_write(ads, policies.is_valid, 1);
> > }
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index c7d38469fb46..0bea17536659 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -35,6 +35,7 @@
> > #include "xe_macros.h"
> > #include "xe_map.h"
> > #include "xe_mocs.h"
> > +#include "xe_module.h"
> > #include "xe_ring_ops_types.h"
> > #include "xe_sched_job.h"
> > #include "xe_trace.h"
> > @@ -59,6 +60,7 @@ exec_queue_to_guc(struct xe_exec_queue *q)
> > #define ENGINE_STATE_SUSPENDED		(1 << 5)
> > #define EXEC_QUEUE_STATE_RESET		(1 << 6)
> > #define ENGINE_STATE_KILLED		(1 << 7)
> > +#define EXEC_QUEUE_STATE_WEDGED		(1 << 8)
> > 
> > static bool exec_queue_registered(struct xe_exec_queue *q)
> > {
> > @@ -175,9 +177,20 @@ static void set_exec_queue_killed(struct xe_exec_queue *q)
> > 	atomic_or(ENGINE_STATE_KILLED, &q->guc->state);
> > }
> > 
> > -static bool exec_queue_killed_or_banned(struct xe_exec_queue *q)
> > +static bool exec_queue_wedged(struct xe_exec_queue *q)
> > {
> > -	return exec_queue_killed(q) || exec_queue_banned(q);
> > +	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_WEDGED;
> > +}
> > +
> > +static void set_exec_queue_wedged(struct xe_exec_queue *q)
> > +{
> > +	atomic_or(EXEC_QUEUE_STATE_WEDGED, &q->guc->state);
> > +}
> > +
> > +static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
> > +{
> > +	return exec_queue_banned(q) || (atomic_read(&q->guc->state) &
> > +		(EXEC_QUEUE_STATE_WEDGED | ENGINE_STATE_KILLED));
> > }
> > 
> > #ifdef CONFIG_PROVE_LOCKING
> > @@ -240,6 +253,17 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> > 	free_submit_wq(guc);
> > }
> > 
> > +static void guc_submit_wedged_fini(struct drm_device *drm, void *arg)
> > +{
> > +	struct xe_guc *guc = arg;
> > +	struct xe_exec_queue *q;
> > +	unsigned long index;
> > +
> > +	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> > +		if (exec_queue_wedged(q))
> > +			xe_exec_queue_put(q);
> > +}
> > +
> > static const struct xe_exec_queue_ops guc_exec_queue_ops;
> > 
> > static void primelockdep(struct xe_guc *guc)
> > @@ -708,7 +732,7 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
> > 
> > 	trace_xe_sched_job_run(job);
> > 
> > -	if (!exec_queue_killed_or_banned(q) && !xe_sched_job_is_error(job)) {
> > +	if (!exec_queue_killed_or_banned_or_wedged(q) && !xe_sched_job_is_error(job)) {
> > 		if (!exec_queue_registered(q))
> > 			register_engine(q);
> > 		if (!lr)	/* LR jobs are emitted in the exec IOCTL */
> > @@ -844,6 +868,28 @@ static void xe_guc_exec_queue_trigger_cleanup(struct xe_exec_queue *q)
> > 		xe_sched_tdr_queue_imm(&q->guc->sched);
> > }
> > 
> > +static void guc_submit_wedged(struct xe_guc *guc)
> > +{
> > +	struct xe_exec_queue *q;
> > +	unsigned long index;
> > +	int err;
> > +
> > +	xe_device_declare_wedged(guc_to_xe(guc));
> > +	xe_guc_submit_reset_prepare(guc);
> > +	xe_guc_ct_stop(&guc->ct);
> > +
> > +	err = drmm_add_action_or_reset(&guc_to_xe(guc)->drm,
> > +				       guc_submit_wedged_fini, guc);
> > +	if (err)
> > +		return;
> > +
> > +	mutex_lock(&guc->submission_state.lock);
> > +	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> > +		if (xe_exec_queue_get_unless_zero(q))
> > +			set_exec_queue_wedged(q);
> > +	mutex_unlock(&guc->submission_state.lock);
> > +}
> > +
> > static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> > {
> > 	struct xe_guc_exec_queue *ge =
> > @@ -852,10 +898,16 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> > 	struct xe_guc *guc = exec_queue_to_guc(q);
> > 	struct xe_device *xe = guc_to_xe(guc);
> > 	struct xe_gpu_scheduler *sched = &ge->sched;
> > +	bool wedged = xe_device_wedged(xe);
> > 
> > 	xe_assert(xe, xe_exec_queue_is_lr(q));
> > 	trace_xe_exec_queue_lr_cleanup(q);
> > 
> > +	if (!wedged && xe_modparam.wedged_mode == 2) {
> 
> I wonder if these checks shouldn't be inside guc_submit_wedged()

true, but since this is anyway changed in the next patch,
I decided to keep it here.

> 
> 
> anyway,
> 
> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>

Thank you

> 
> Lucas De Marchi

  reply	other threads:[~2024-04-16 19:08 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-09 22:15 [PATCH 1/4] drm/xe: Introduce a simple wedged state Rodrigo Vivi
2024-04-09 22:15 ` [PATCH 2/4] drm/xe: declare wedged upon GuC load failure Rodrigo Vivi
2024-04-09 23:41   ` Matthew Brost
2024-04-10  6:50   ` Ghimiray, Himal Prasad
2024-04-16 17:40     ` Lucas De Marchi
2024-04-16 17:06   ` Lucas De Marchi
2024-04-16 19:05   ` Rodrigo Vivi
2024-04-16 19:13     ` Lucas De Marchi
2024-04-16 19:19       ` Ghimiray, Himal Prasad
2024-04-09 22:15 ` [PATCH 3/4] drm/xe: Force wedged state and block GT reset upon any GPU hang Rodrigo Vivi
2024-04-10 17:58   ` Matthew Brost
2024-04-16 17:19   ` Lucas De Marchi
2024-04-16 19:08     ` Rodrigo Vivi [this message]
2024-04-09 22:15 ` [PATCH 4/4] drm/xe: Introduce the wedged_mode debugfs Rodrigo Vivi
2024-04-17 19:51   ` Lucas De Marchi
2024-04-17 20:29     ` Rodrigo Vivi
2024-04-17 22:50       ` Lucas De Marchi
2024-04-18  5:14   ` Ghimiray, Himal Prasad
2024-04-18 10:44     ` Ghimiray, Himal Prasad
2024-04-09 22:21 ` ✓ CI.Patch_applied: success for series starting with [1/4] drm/xe: Introduce a simple wedged state Patchwork
2024-04-09 22:22 ` ✓ CI.checkpatch: " Patchwork
2024-04-09 22:23 ` ✓ CI.KUnit: " Patchwork
2024-04-09 22:34 ` ✓ CI.Build: " Patchwork
2024-04-09 22:37 ` ✓ CI.Hooks: " Patchwork
2024-04-09 22:38 ` ✓ CI.checksparse: " Patchwork
2024-04-09 23:02 ` ✓ CI.BAT: " Patchwork
2024-04-10  0:14 ` ✗ CI.FULL: failure " Patchwork
2024-04-16 17:03 ` [PATCH 1/4] " Lucas De Marchi
2024-04-16 19:20 ` Ghimiray, Himal Prasad
  -- strict thread matches above, loose matches on Subject: below --
2024-04-23 22:18 Rodrigo Vivi
2024-04-23 22:18 ` [PATCH 3/4] drm/xe: Force wedged state and block GT reset upon any GPU hang Rodrigo Vivi
2024-04-24  3:20   ` Ghimiray, Himal Prasad
2024-04-24 12:25     ` Rodrigo Vivi
2024-04-24 17:01       ` Ghimiray, Himal Prasad
2024-04-03 15:07 [PATCH 0/4] Introduce a wedged state Rodrigo Vivi
2024-04-03 15:07 ` [PATCH 3/4] drm/xe: Force wedged state and block GT reset upon any GPU hang Rodrigo Vivi
2024-04-04 17:52   ` Matthew Brost
2024-04-04 18:01     ` Rodrigo Vivi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zh7Mn0Wr_OHCkGf6@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=alan.previn.teres.alexis@intel.com \
    --cc=dhirschfeld@habana.ai \
    --cc=himanshu.somaiya@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=matthew.brost@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.