* [PATCH v4 1/5] drm/xe: Always kill exec queues in xe_guc_submit_pause_abort
From: Zhanjun Dong @ 2026-01-27 17:04 UTC (permalink / raw)
To: intel-xe; +Cc: Matthew Brost, stable, Zhanjun Dong, Stuart Summers
From: Matthew Brost <matthew.brost@intel.com>
xe_guc_submit_pause_abort is intended to be called after something
disastrous occurs (e.g., a VF migration failure, device wedging, or
driver unload) and should immediately trigger teardown of the remaining
submission state. Accordingly, kill any remaining exec queues in this
function.
Fixes: 7c4b7e34c83b ("drm/xe/vf: Abort VF post migration recovery on failure")
Cc: stable@vger.kernel.org
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 456f549c16f6..d61bd0094e0b 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2774,8 +2774,7 @@ void xe_guc_submit_pause_abort(struct xe_guc *guc)
continue;
xe_sched_submission_start(sched);
- if (exec_queue_killed_or_banned_or_wedged(q))
- xe_guc_exec_queue_trigger_cleanup(q);
+ guc_exec_queue_kill(q);
}
mutex_unlock(&guc->submission_state.lock);
}
--
2.34.1
* [PATCH v4 2/5] drm/xe: Forcefully tear down exec queues in GuC submit fini
From: Zhanjun Dong @ 2026-01-27 17:04 UTC (permalink / raw)
To: intel-xe; +Cc: Matthew Brost, stable, Zhanjun Dong
From: Matthew Brost <matthew.brost@intel.com>
In GuC submit fini, forcefully tear down any exec queues by disabling
CTs, stopping the scheduler (which cleans up lost G2H), killing all
remaining queues, and resuming scheduling to allow any remaining cleanup
actions to complete and signal any remaining fences.
guc_submit_fini requires access to the device hardware. Registering it
as a device-managed (devm) action guarantees the correct ordering of
cleanup relative to device removal.
v3:
- Add page fault fix
v2:
- Fix VF failure (CI)
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 31 +++++++++++++++++++++---------
1 file changed, 22 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index d61bd0094e0b..92ea32423838 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -239,13 +239,21 @@ static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
EXEC_QUEUE_STATE_BANNED));
}
-static void guc_submit_fini(struct drm_device *drm, void *arg)
+static int __xe_guc_submit_reset_prepare(struct xe_guc *guc);
+
+static void guc_submit_fini(void *arg)
{
struct xe_guc *guc = arg;
struct xe_device *xe = guc_to_xe(guc);
struct xe_gt *gt = guc_to_gt(guc);
int ret;
+ /* Forcefully kill any remaining exec queues */
+ xe_guc_ct_stop(&guc->ct);
+ __xe_guc_submit_reset_prepare(guc);
+ xe_guc_submit_stop(guc);
+ xe_guc_submit_pause_abort(guc);
+
ret = wait_event_timeout(guc->submission_state.fini_wq,
xa_empty(&guc->submission_state.exec_queue_lookup),
HZ * 5);
@@ -326,7 +334,7 @@ int xe_guc_submit_init(struct xe_guc *guc, unsigned int num_ids)
guc->submission_state.initialized = true;
- return drmm_add_action_or_reset(&xe->drm, guc_submit_fini, guc);
+ return devm_add_action_or_reset(xe->drm.dev, guc_submit_fini, guc);
}
/*
@@ -2354,16 +2362,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
}
}
-int xe_guc_submit_reset_prepare(struct xe_guc *guc)
+static int __xe_guc_submit_reset_prepare(struct xe_guc *guc)
{
int ret;
- if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
- return 0;
-
- if (!guc->submission_state.initialized)
- return 0;
-
/*
* Using an atomic here rather than submission_state.lock as this
* function can be called while holding the CT lock (engine reset
@@ -2378,6 +2380,17 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
return ret;
}
+int xe_guc_submit_reset_prepare(struct xe_guc *guc)
+{
+ if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
+ return 0;
+
+ if (!guc->submission_state.initialized)
+ return 0;
+
+ return __xe_guc_submit_reset_prepare(guc);
+}
+
void xe_guc_submit_reset_wait(struct xe_guc *guc)
{
wait_event(guc->ct.wq, xe_device_wedged(guc_to_xe(guc)) ||
--
2.34.1
* [PATCH v4 3/5] drm/xe: Trigger queue cleanup if not in wedged mode 2
From: Zhanjun Dong @ 2026-01-27 17:04 UTC (permalink / raw)
To: intel-xe; +Cc: Matthew Brost, stable, Zhanjun Dong
From: Matthew Brost <matthew.brost@intel.com>
The intent of wedging a device is that queues continue running only in
wedged mode 2. In the other modes, queues should initiate cleanup and
signal all remaining fences. Fix xe_guc_submit_wedge to correctly clean
up queues when the wedge mode is not 2.
Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 34 ++++++++++++++++++------------
1 file changed, 21 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 92ea32423838..f29ed62d2b12 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1326,6 +1326,7 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
*/
void xe_guc_submit_wedge(struct xe_guc *guc)
{
+ struct xe_device *xe = guc_to_xe(guc);
struct xe_gt *gt = guc_to_gt(guc);
struct xe_exec_queue *q;
unsigned long index;
@@ -1340,20 +1341,27 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
if (!guc->submission_state.initialized)
return;
- err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
- guc_submit_wedged_fini, guc);
- if (err) {
- xe_gt_err(gt, "Failed to register clean-up in wedged.mode=%s; "
- "Although device is wedged.\n",
- xe_wedged_mode_to_string(XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET));
- return;
- }
+ if (xe->wedged.mode == 2) {
+ err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
+ guc_submit_wedged_fini, guc);
+ if (err) {
+ xe_gt_err(gt, "Failed to register clean-up on wedged.mode=2; "
+ "Although device is wedged.\n");
+ return;
+ }
- mutex_lock(&guc->submission_state.lock);
- xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
- if (xe_exec_queue_get_unless_zero(q))
- set_exec_queue_wedged(q);
- mutex_unlock(&guc->submission_state.lock);
+ mutex_lock(&guc->submission_state.lock);
+ xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
+ if (xe_exec_queue_get_unless_zero(q))
+ set_exec_queue_wedged(q);
+ mutex_unlock(&guc->submission_state.lock);
+ } else {
+ /* Forcefully kill any remaining exec queues, signal fences */
+ xe_guc_ct_stop(&guc->ct);
+ __xe_guc_submit_reset_prepare(guc);
+ xe_guc_submit_stop(guc);
+ xe_guc_submit_pause_abort(guc);
+ }
}
static bool guc_submit_hint_wedged(struct xe_guc *guc)
--
2.34.1
* [PATCH v4 4/5] drm/xe/guc: Ensure CT state transitions via STOP before DISABLED
From: Zhanjun Dong @ 2026-01-27 17:04 UTC (permalink / raw)
To: intel-xe; +Cc: Zhanjun Dong, stable, Matthew Brost
The GuC CT state transition requires moving to the STOP state before
entering the DISABLED state. Update the driver teardown sequence to make
the proper state machine transitions.
Fixes: ee4b32220a6b ("drm/xe/guc: Add devm release action to safely tear down CT")
Cc: stable@vger.kernel.org
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_guc_ct.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index dfbf76037b04..6a658f085e0f 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -345,6 +345,7 @@ static void guc_action_disable_ct(void *arg)
{
struct xe_guc_ct *ct = arg;
+ xe_guc_ct_stop(ct);
guc_ct_change_state(ct, XE_GUC_CT_STATE_DISABLED);
}
--
2.34.1
* Re: [PATCH v4 3/5] drm/xe: Trigger queue cleanup if not in wedged mode 2
From: Michal Wajdeczko @ 2026-01-27 22:02 UTC (permalink / raw)
To: Zhanjun Dong, intel-xe; +Cc: Matthew Brost, stable
On 1/27/2026 6:04 PM, Zhanjun Dong wrote:
> From: Matthew Brost <matthew.brost@intel.com>
>
> The intent of wedging a device is to allow queues to continue running
> only in wedged mode 2. In other modes, queues should initiate cleanup
> and signal all remaining fences. Fix xe_guc_submit_wedge to correctly
> clean up queues when wedge mode != 2.
>
> Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
> Cc: stable@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 34 ++++++++++++++++++------------
> 1 file changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 92ea32423838..f29ed62d2b12 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1326,6 +1326,7 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
> */
> void xe_guc_submit_wedge(struct xe_guc *guc)
> {
> + struct xe_device *xe = guc_to_xe(guc);
> struct xe_gt *gt = guc_to_gt(guc);
> struct xe_exec_queue *q;
> unsigned long index;
> @@ -1340,20 +1341,27 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> if (!guc->submission_state.initialized)
> return;
>
> - err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
> - guc_submit_wedged_fini, guc);
> - if (err) {
> - xe_gt_err(gt, "Failed to register clean-up in wedged.mode=%s; "
> - "Although device is wedged.\n",
> - xe_wedged_mode_to_string(XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET));
> - return;
> - }
> + if (xe->wedged.mode == 2) {
wedged.mode is now an enum
you should use XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET instead of plain 2
> + err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
> + guc_submit_wedged_fini, guc);
> + if (err) {
> + xe_gt_err(gt, "Failed to register clean-up on wedged.mode=2; "
> + "Although device is wedged.\n");
> + return;
if we want to continue, shouldn't we call just devm_add_action() here?
some default cleanup will be done later anyway, no?
> + }
>
> - mutex_lock(&guc->submission_state.lock);
> - xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> - if (xe_exec_queue_get_unless_zero(q))
> - set_exec_queue_wedged(q);
> - mutex_unlock(&guc->submission_state.lock);
> + mutex_lock(&guc->submission_state.lock);
> + xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> + if (xe_exec_queue_get_unless_zero(q))
> + set_exec_queue_wedged(q);
> + mutex_unlock(&guc->submission_state.lock);
> + } else {
> + /* Forcefully kill any remaining exec queues, signal fences */
> + xe_guc_ct_stop(&guc->ct);
this was already called by xe_guc_declare_wedged() before calling us here
> + __xe_guc_submit_reset_prepare(guc);
> + xe_guc_submit_stop(guc);
> + xe_guc_submit_pause_abort(guc);
> + }
> }
>
> static bool guc_submit_hint_wedged(struct xe_guc *guc)
* Re: [PATCH v4 3/5] drm/xe: Trigger queue cleanup if not in wedged mode 2
From: Matthew Brost @ 2026-01-28 3:39 UTC (permalink / raw)
To: Michal Wajdeczko; +Cc: Zhanjun Dong, intel-xe, stable
On Tue, Jan 27, 2026 at 11:02:42PM +0100, Michal Wajdeczko wrote:
>
>
> On 1/27/2026 6:04 PM, Zhanjun Dong wrote:
> > From: Matthew Brost <matthew.brost@intel.com>
> >
> > The intent of wedging a device is to allow queues to continue running
> > only in wedged mode 2. In other modes, queues should initiate cleanup
> > and signal all remaining fences. Fix xe_guc_submit_wedge to correctly
> > clean up queues when wedge mode != 2.
> >
> > Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_guc_submit.c | 34 ++++++++++++++++++------------
> > 1 file changed, 21 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 92ea32423838..f29ed62d2b12 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1326,6 +1326,7 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
> > */
> > void xe_guc_submit_wedge(struct xe_guc *guc)
> > {
> > + struct xe_device *xe = guc_to_xe(guc);
> > struct xe_gt *gt = guc_to_gt(guc);
> > struct xe_exec_queue *q;
> > unsigned long index;
> > @@ -1340,20 +1341,27 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> > if (!guc->submission_state.initialized)
> > return;
> >
> > - err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
> > - guc_submit_wedged_fini, guc);
> > - if (err) {
> > - xe_gt_err(gt, "Failed to register clean-up in wedged.mode=%s; "
> > - "Although device is wedged.\n",
> > - xe_wedged_mode_to_string(XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET));
> > - return;
> > - }
> > + if (xe->wedged.mode == 2) {
>
> wedged.mode is now an enum
> you should use XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET instead of plain 2
>
Yes, it should be. This is a fixes patch, though, and with
XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET it will not cleanly apply to
stable kernels. A later patch in the series could add the enum, I
guess, but maybe it is not an issue, as xe_guc_submit_pause_abort isn't
present in a lot of kernels either.
> > + err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
> > + guc_submit_wedged_fini, guc);
> > + if (err) {
> > + xe_gt_err(gt, "Failed to register clean-up on wedged.mode=2; "
> > + "Although device is wedged.\n");
> > + return;
>
> if we want to continue, shouldn't we call just devm_add_action() here?
> some default cleanup will be done later anyway, no?
>
Ah, no. If guc_submit_wedged_fini doesn't run at the end, the driver
will not unload cleanly. This could be refactored, but it is correct as
is.
> > + }
> >
> > - mutex_lock(&guc->submission_state.lock);
> > - xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> > - if (xe_exec_queue_get_unless_zero(q))
> > - set_exec_queue_wedged(q);
> > - mutex_unlock(&guc->submission_state.lock);
> > + mutex_lock(&guc->submission_state.lock);
> > + xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> > + if (xe_exec_queue_get_unless_zero(q))
> > + set_exec_queue_wedged(q);
> > + mutex_unlock(&guc->submission_state.lock);
> > + } else {
> > + /* Forcefully kill any remaining exec queues, signal fences */
> > + xe_guc_ct_stop(&guc->ct);
>
> this was already called by xe_guc_declare_wedged() before calling us here
>
> > + __xe_guc_submit_reset_prepare(guc);
We can actually skip __xe_guc_submit_reset_prepare too, I believe, as
that is also called by an upper layer.
Matt
> > + xe_guc_submit_stop(guc);
> > + xe_guc_submit_pause_abort(guc);
> > + }
> > }
> >
> > static bool guc_submit_hint_wedged(struct xe_guc *guc)
>
* Re: [PATCH v4 3/5] drm/xe: Trigger queue cleanup if not in wedged mode 2
From: Dong, Zhanjun @ 2026-01-28 21:55 UTC (permalink / raw)
To: Matthew Brost, Michal Wajdeczko; +Cc: intel-xe, stable
On 2026-01-27 10:39 p.m., Matthew Brost wrote:
> On Tue, Jan 27, 2026 at 11:02:42PM +0100, Michal Wajdeczko wrote:
>>
>>
>> On 1/27/2026 6:04 PM, Zhanjun Dong wrote:
>>> From: Matthew Brost <matthew.brost@intel.com>
>>>
>>> The intent of wedging a device is to allow queues to continue running
>>> only in wedged mode 2. In other modes, queues should initiate cleanup
>>> and signal all remaining fences. Fix xe_guc_submit_wedge to correctly
>>> clean up queues when wedge mode != 2.
>>>
>>> Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_guc_submit.c | 34 ++++++++++++++++++------------
>>> 1 file changed, 21 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> index 92ea32423838..f29ed62d2b12 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> @@ -1326,6 +1326,7 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
>>> */
>>> void xe_guc_submit_wedge(struct xe_guc *guc)
>>> {
>>> + struct xe_device *xe = guc_to_xe(guc);
>>> struct xe_gt *gt = guc_to_gt(guc);
>>> struct xe_exec_queue *q;
>>> unsigned long index;
>>> @@ -1340,20 +1341,27 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
>>> if (!guc->submission_state.initialized)
>>> return;
>>>
>>> - err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
>>> - guc_submit_wedged_fini, guc);
>>> - if (err) {
>>> - xe_gt_err(gt, "Failed to register clean-up in wedged.mode=%s; "
>>> - "Although device is wedged.\n",
>>> - xe_wedged_mode_to_string(XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET));
>>> - return;
>>> - }
>>> + if (xe->wedged.mode == 2) {
>>
>> wedged.mode is now an enum
>> you should use XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET instead of plain 2
>>
>
> Yes it should be. This is a fixes patch though and with
> XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET, this patch will not cleanly apply
> to stable kernels. A later patch in the series could add in the enum I
> guess but maybe not an issue as xe_guc_submit_pause_abort isn't present
> in a lot kernels either.
>
>>> + err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
>>> + guc_submit_wedged_fini, guc);
>>> + if (err) {
>>> + xe_gt_err(gt, "Failed to register clean-up on wedged.mode=2; "
>>> + "Although device is wedged.\n");
>>> + return;
>>
>> if we want to continue, shouldn't we call just devm_add_action() here?
>> some default cleanup will be done later anyway, no?
>>
>
> Ah, no. If guc_submit_wedged_fini doesn't run at the end the driver will
> not unload cleanly. This could be refactored but it is correct as is.
>
>>> + }
>>>
>>> - mutex_lock(&guc->submission_state.lock);
>>> - xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
>>> - if (xe_exec_queue_get_unless_zero(q))
>>> - set_exec_queue_wedged(q);
>>> - mutex_unlock(&guc->submission_state.lock);
>>> + mutex_lock(&guc->submission_state.lock);
>>> + xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
>>> + if (xe_exec_queue_get_unless_zero(q))
>>> + set_exec_queue_wedged(q);
>>> + mutex_unlock(&guc->submission_state.lock);
>>> + } else {
>>> + /* Forcefully kill any remaining exec queues, signal fences */
>>> + xe_guc_ct_stop(&guc->ct);
>>
>> this was already called by xe_guc_declare_wedged() before calling us here
>>
>>> + __xe_guc_submit_reset_prepare(guc);
>
> We actually can skip __xe_guc_submit_reset_prepare too I believe as that
> is also called by an upper layer.
Will remove the above 2 calls in the next rev.
Regards,
Zhanjun Dong
>
> Matt
>
>>> + xe_guc_submit_stop(guc);
>>> + xe_guc_submit_pause_abort(guc);
>>> + }
>>> }
>>>
>>> static bool guc_submit_hint_wedged(struct xe_guc *guc)
>>