public inbox for intel-xe@lists.freedesktop.org
From: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
To: Riana Tauro <riana.tauro@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: <anshuman.gupta@intel.com>, <rodrigo.vivi@intel.com>,
	<aravind.iddamsetty@linux.intel.com>, <badal.nilawar@intel.com>,
	<raag.jadav@intel.com>, <ravi.kishore.koppuravuri@intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Subject: Re: [PATCH v2 05/11] drm/xe: Skip device access during PCI error recovery
Date: Wed, 4 Mar 2026 16:29:28 +0530
Message-ID: <32b0bdbd-e9bb-476f-af1c-7843be3099b0@intel.com>
In-Reply-To: <20260302102155.4074630-18-riana.tauro@intel.com>


On 02-03-2026 03:52 pm, Riana Tauro wrote:
> When a fatal error occurs and the error_detected callback is
> invoked the device is inaccessible. The error_detected callback
> wedges the device causing the jobs to timeout.
>
> The timedout handler acquires forcewake to dump devcoredump and
> triggers a GT reset. Since the device is inaccessible this causes
> errors. Skip all mmio accesses and gt reset when the device
> is in recovery.
>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt.c         | 11 ++++++++---
>   drivers/gpu/drm/xe/xe_guc_submit.c |  9 +++++----
>   2 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index b455af1e6072..6f41090063bf 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -933,18 +933,23 @@ static void gt_reset_worker(struct work_struct *w)
>   
>   void xe_gt_reset_async(struct xe_gt *gt)
>   {
> -	xe_gt_info(gt, "trying reset from %ps\n", __builtin_return_address(0));
> +	struct xe_device *xe = gt_to_xe(gt);
> +
> +	if (xe_device_is_in_recovery(xe))
> +		return;

The in_recovery flag also needs to be checked in gt_reset_worker(), so
that the GT reset is skipped while the device is in PCI recovery.
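
To illustrate the idea, here is a rough userspace sketch, not the actual
xe code. All mock_* types and functions are hypothetical stand-ins; only
the shape of the check is taken from the patch. It assumes the worker
can run while the device is in recovery, for example because the work
was queued before recovery started.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver structures, not the real xe types. */
struct mock_xe_device {
	bool in_recovery;	/* set while PCI error recovery is in progress */
};

struct mock_xe_gt {
	struct mock_xe_device *xe;
	bool reset_queued;	/* models work pending on the ordered workqueue */
	int resets_done;	/* counts resets that actually touched "hardware" */
};

static bool mock_device_is_in_recovery(const struct mock_xe_device *xe)
{
	return xe->in_recovery;
}

/* Queue side: mirrors the early return the patch adds to xe_gt_reset_async(). */
static void mock_gt_reset_async(struct mock_xe_gt *gt)
{
	if (mock_device_is_in_recovery(gt->xe))
		return;

	gt->reset_queued = true;
}

/* Worker side: the additional check suggested in this review. */
static void mock_gt_reset_worker(struct mock_xe_gt *gt)
{
	if (!gt->reset_queued)
		return;
	gt->reset_queued = false;

	/* Recovery may have started after the work was queued. */
	if (mock_device_is_in_recovery(gt->xe)) {
		printf("reset skipped: device in recovery\n");
		return;
	}

	gt->resets_done++;
}
```

In the real worker, an early return would presumably still have to drop
the runtime PM reference taken in xe_gt_reset_async(), since the comment
there says the get is paired with a put in gt_reset_worker().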

Thanks

-/Mallesh

>   	/* Don't do a reset while one is already in flight */
>   	if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(&gt->uc))
>   		return;
>   
> +	xe_gt_info(gt, "trying reset from %ps\n", __builtin_return_address(0));
> +
>   	xe_gt_info(gt, "reset queued\n");
>   
>   	/* Pair with put in gt_reset_worker() if work is enqueued */
> -	xe_pm_runtime_get_noresume(gt_to_xe(gt));
> +	xe_pm_runtime_get_noresume(xe);
>   	if (!queue_work(gt->ordered_wq, &gt->reset.worker))
> -		xe_pm_runtime_put(gt_to_xe(gt));
> +		xe_pm_runtime_put(xe);
>   }
>   
>   void xe_gt_suspend_prepare(struct xe_gt *gt)
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index ca7aa4f358d0..c25658f1e44b 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1508,7 +1508,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   	 * If devcoredump not captured and GuC capture for the job is not ready
>   	 * do manual capture first and decide later if we need to use it
>   	 */
> -	if (!exec_queue_killed(q) && !xe->devcoredump.captured &&
> +	if (!xe_device_is_in_recovery(xe) && !exec_queue_killed(q) && !xe->devcoredump.captured &&
>   	    !xe_guc_capture_get_matching_and_lock(q)) {
>   		/* take force wake before engine register manual capture */
>   		CLASS(xe_force_wake, fw_ref)(gt_to_fw(q->gt), XE_FORCEWAKE_ALL);
> @@ -1530,8 +1530,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   	set_exec_queue_banned(q);
>   
>   	/* Kick job / queue off hardware */
> -	if (!wedged && (exec_queue_enabled(primary) ||
> -			exec_queue_pending_disable(primary))) {
> +	if (!xe_device_is_in_recovery(xe) && !wedged &&
> +	    (exec_queue_enabled(primary) || exec_queue_pending_disable(primary))) {
>   		int ret;
>   
>   		if (exec_queue_reset(primary))
> @@ -1599,7 +1599,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   
>   	trace_xe_sched_job_timedout(job);
>   
> -	if (!exec_queue_killed(q))
> +	/* Do not access device if in recovery */
> +	if (!xe_device_is_in_recovery(xe) && !exec_queue_killed(q))
>   		xe_devcoredump(q, job,
>   			       "Timedout job - seqno=%u, lrc_seqno=%u, guc_id=%d, flags=0x%lx",
>   			       xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),


