From: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
To: Riana Tauro <riana.tauro@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: <anshuman.gupta@intel.com>, <rodrigo.vivi@intel.com>,
<aravind.iddamsetty@linux.intel.com>, <badal.nilawar@intel.com>,
<raag.jadav@intel.com>, <ravi.kishore.koppuravuri@intel.com>,
Matthew Brost <matthew.brost@intel.com>,
Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Subject: Re: [PATCH v2 05/11] drm/xe: Skip device access during PCI error recovery
Date: Wed, 4 Mar 2026 16:29:28 +0530
Message-ID: <32b0bdbd-e9bb-476f-af1c-7843be3099b0@intel.com>
In-Reply-To: <20260302102155.4074630-18-riana.tauro@intel.com>
On 02-03-2026 03:52 pm, Riana Tauro wrote:
> When a fatal error occurs and the error_detected callback is
> invoked, the device is inaccessible. The error_detected callback
> wedges the device, causing the jobs to time out.
>
> The timeout handler acquires forcewake to dump devcoredump and
> triggers a GT reset. Since the device is inaccessible, this causes
> errors. Skip all MMIO accesses and the GT reset when the device
> is in recovery.
>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> drivers/gpu/drm/xe/xe_gt.c | 11 ++++++++---
> drivers/gpu/drm/xe/xe_guc_submit.c | 9 +++++----
> 2 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index b455af1e6072..6f41090063bf 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -933,18 +933,23 @@ static void gt_reset_worker(struct work_struct *w)
>
> void xe_gt_reset_async(struct xe_gt *gt)
> {
> - xe_gt_info(gt, "trying reset from %ps\n", __builtin_return_address(0));
> + struct xe_device *xe = gt_to_xe(gt);
> +
> + if (xe_device_is_in_recovery(xe))
> + return;
We also need to check the in_recovery flag in gt_reset_worker(), so that
the GT reset is skipped there too when the device is in PCI recovery.
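
A minimal sketch of why a worker-side check may matter, assuming a reset
can be queued just before recovery starts and the worker only runs
afterwards. Everything named model_* below is a hypothetical stand-in and
not the real xe code; the in_recovery field only models what
xe_device_is_in_recovery() would report. In the real worker, an early
return would presumably still have to drop the runtime PM reference taken
in xe_gt_reset_async(), since that get is paired with a put in
gt_reset_worker().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver structures; not the real xe types. */
struct model_device {
	bool in_recovery;	/* models the result of xe_device_is_in_recovery() */
	int resets_done;	/* counts resets that actually reached the "hardware" */
};

struct model_gt {
	struct model_device *xe;
	bool reset_queued;	/* models reset work pending on gt->ordered_wq */
};

/* Models xe_gt_reset_async(): refuse to queue a reset while in recovery. */
static void model_gt_reset_async(struct model_gt *gt)
{
	if (gt->xe->in_recovery)
		return;

	gt->reset_queued = true;
}

/*
 * Models gt_reset_worker(): re-check the flag, because recovery may have
 * started after the work was queued but before it ran.
 */
static void model_gt_reset_worker(struct model_gt *gt)
{
	if (!gt->reset_queued)
		return;

	gt->reset_queued = false;

	/* Skip the reset entirely; the device must not be touched. */
	if (gt->xe->in_recovery)
		return;

	gt->xe->resets_done++;
}
```

With only the check in the async path, a reset queued before recovery
began would still run in this model; the second check is what stops it.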
Thanks
-/Mallesh
> /* Don't do a reset while one is already in flight */
> if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(&gt->uc))
> return;
>
> + xe_gt_info(gt, "trying reset from %ps\n", __builtin_return_address(0));
> +
> xe_gt_info(gt, "reset queued\n");
>
> /* Pair with put in gt_reset_worker() if work is enqueued */
> - xe_pm_runtime_get_noresume(gt_to_xe(gt));
> + xe_pm_runtime_get_noresume(xe);
> if (!queue_work(gt->ordered_wq, &gt->reset.worker))
> - xe_pm_runtime_put(gt_to_xe(gt));
> + xe_pm_runtime_put(xe);
> }
>
> void xe_gt_suspend_prepare(struct xe_gt *gt)
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index ca7aa4f358d0..c25658f1e44b 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1508,7 +1508,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> * If devcoredump not captured and GuC capture for the job is not ready
> * do manual capture first and decide later if we need to use it
> */
> - if (!exec_queue_killed(q) && !xe->devcoredump.captured &&
> + if (!xe_device_is_in_recovery(xe) && !exec_queue_killed(q) && !xe->devcoredump.captured &&
> !xe_guc_capture_get_matching_and_lock(q)) {
> /* take force wake before engine register manual capture */
> CLASS(xe_force_wake, fw_ref)(gt_to_fw(q->gt), XE_FORCEWAKE_ALL);
> @@ -1530,8 +1530,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> set_exec_queue_banned(q);
>
> /* Kick job / queue off hardware */
> - if (!wedged && (exec_queue_enabled(primary) ||
> - exec_queue_pending_disable(primary))) {
> + if (!xe_device_is_in_recovery(xe) && !wedged &&
> + (exec_queue_enabled(primary) || exec_queue_pending_disable(primary))) {
> int ret;
>
> if (exec_queue_reset(primary))
> @@ -1599,7 +1599,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>
> trace_xe_sched_job_timedout(job);
>
> - if (!exec_queue_killed(q))
> + /* Do not access device if in recovery */
> + if (!xe_device_is_in_recovery(xe) && !exec_queue_killed(q))
> xe_devcoredump(q, job,
> "Timedout job - seqno=%u, lrc_seqno=%u, guc_id=%d, flags=0x%lx",
> xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),