From: Tomasz Lis <tomasz.lis@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: "Michał Winiarski" <michal.winiarski@intel.com>,
"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
"Piotr Piórkowski" <piotr.piorkowski@intel.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Lucas De Marchi" <lucas.demarchi@intel.com>
Subject: [PATCH v4 4/8] drm/xe: Block reset while recovering from VF migration
Date: Fri, 6 Jun 2025 02:18:19 +0200
Message-ID: <20250606001823.1010994-5-tomasz.lis@intel.com> (raw)
In-Reply-To: <20250606001823.1010994-1-tomasz.lis@intel.com>

Resetting the GuC during recovery could interfere with the recovery
process. Such a reset might also be triggered without justification,
due to the migration taking time, rather than due to the workload not
progressing.

A GuC reset during the recovery would cause an exit from the RESFIX
state, and therefore continuation of GuC work while fixups are still
being applied. To avoid that, resets need to be blocked during the
recovery.

This patch blocks resets during recovery. A reset request made within
that time range is stalled, and unblocked only after the GuC goes out
of the RESFIX state.

In case a reset procedure has already started when the recovery is
triggered, there isn't much we can do - we cannot wait for it to
finish, as it involves waiting for hardware, and we can't be sure
at which exact point of the reset procedure the GPU got switched.
Therefore, the rare cases where migration happens while a reset is
in progress remain dangerous. Resets are not part of the standard
flow and cause unfinished workloads - that will happen during a reset
interrupted by migration as well, so it doesn't diverge much from
what normally happens during such resets.

v2: Introduce a new atomic for reset blocking, as we cannot reuse
the `stopped` atomic (that could lead to losing a workload).
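
The blocking scheme can be sketched as a small standalone C model
(illustrative only: in the driver the flag lives in
guc->submission_state.reset_blocked, and wait_event_timeout() /
wake_up_all() on guc->ct.wq replace the polling loop below):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified userspace stand-in for the reset-block flag. */
static atomic_int reset_blocked;

/* Returns the previous value, so the caller can tell whether a
 * blocker was already in place (the patch logs that case as an
 * ongoing reset). */
static int reset_block(void)
{
	return atomic_fetch_or(&reset_blocked, 1);
}

static void reset_unblock(void)
{
	atomic_store(&reset_blocked, 0);
	/* kernel version: wake_up_all(&guc->ct.wq) so waiters re-check */
}

/* Models xe_guc_wait_reset_unblock(): true once the flag is clear.
 * The kernel version bounds the sleep with a 5 second timeout
 * instead of a poll count. */
static bool wait_reset_unblock(int max_polls)
{
	for (int i = 0; i < max_polls; i++)
		if (!atomic_load(&reset_blocked))
			return true;
	return false;
}
```

A reset path would call wait_reset_unblock() before proceeding, while
the recovery brackets its fixups with reset_block()/reset_unblock().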
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michal Winiarski <michal.winiarski@intel.com>
---
drivers/gpu/drm/xe/xe_gt.c | 10 +++++++++
drivers/gpu/drm/xe/xe_guc_submit.c | 36 ++++++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_guc_submit.h | 3 +++
drivers/gpu/drm/xe/xe_guc_types.h | 6 +++++
drivers/gpu/drm/xe/xe_sriov_vf.c | 12 ++++++++--
5 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 858b5398c01b..96647c5801cc 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -41,6 +41,7 @@
#include "xe_gt_topology.h"
#include "xe_guc_exec_queue_types.h"
#include "xe_guc_pc.h"
+#include "xe_guc_submit.h"
#include "xe_hw_fence.h"
#include "xe_hw_engine_class_sysfs.h"
#include "xe_irq.h"
@@ -810,6 +811,11 @@ static int do_gt_restart(struct xe_gt *gt)
return 0;
}
+static int gt_wait_reset_unblock(struct xe_gt *gt)
+{
+ return xe_guc_wait_reset_unblock(&gt->uc.guc);
+}
+
static int gt_reset(struct xe_gt *gt)
{
unsigned int fw_ref;
@@ -824,6 +830,10 @@ static int gt_reset(struct xe_gt *gt)
xe_gt_info(gt, "reset started\n");
+ err = gt_wait_reset_unblock(gt);
+ if (!err)
+ xe_gt_warn(gt, "reset block failed to get lifted\n");
+
xe_pm_runtime_get(gt_to_xe(gt));
if (xe_fault_inject_gt_reset()) {
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 61a1aac7a976..0d6d9e07fc48 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1772,6 +1772,42 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
}
}
+/**
+ * xe_guc_submit_reset_block - Disallow reset calls on given GuC.
+ * @guc: the &xe_guc struct instance
+ */
+int xe_guc_submit_reset_block(struct xe_guc *guc)
+{
+ return atomic_fetch_or(1, &guc->submission_state.reset_blocked);
+}
+
+/**
+ * xe_guc_submit_reset_unblock - Re-allow reset calls on given GuC.
+ * @guc: the &xe_guc struct instance
+ */
+void xe_guc_submit_reset_unblock(struct xe_guc *guc)
+{
+ atomic_and(0, &guc->submission_state.reset_blocked);
+ /* we are going to wake workers, make sure they all see the value */
+ smp_wmb();
+ wake_up_all(&guc->ct.wq);
+}
+
+static int guc_submit_reset_is_blocked(struct xe_guc *guc)
+{
+ return atomic_read(&guc->submission_state.reset_blocked);
+}
+
+/**
+ * xe_guc_wait_reset_unblock - Wait until reset blocking flag is lifted, or timeout.
+ * @guc: the &xe_guc struct instance
+ */
+int xe_guc_wait_reset_unblock(struct xe_guc *guc)
+{
+ return wait_event_timeout(guc->ct.wq,
+ !guc_submit_reset_is_blocked(guc), HZ * 5);
+}
+
int xe_guc_submit_reset_prepare(struct xe_guc *guc)
{
int ret;
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index f1cf271492ae..45b978b047c9 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -20,6 +20,9 @@ void xe_guc_submit_stop(struct xe_guc *guc);
int xe_guc_submit_start(struct xe_guc *guc);
void xe_guc_submit_pause(struct xe_guc *guc);
void xe_guc_submit_unpause(struct xe_guc *guc);
+int xe_guc_submit_reset_block(struct xe_guc *guc);
+void xe_guc_submit_reset_unblock(struct xe_guc *guc);
+int xe_guc_wait_reset_unblock(struct xe_guc *guc);
void xe_guc_submit_wedge(struct xe_guc *guc);
int xe_guc_read_stopped(struct xe_guc *guc);
diff --git a/drivers/gpu/drm/xe/xe_guc_types.h b/drivers/gpu/drm/xe/xe_guc_types.h
index 1fde7614fcc5..c7b9642b41ba 100644
--- a/drivers/gpu/drm/xe/xe_guc_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_types.h
@@ -85,6 +85,12 @@ struct xe_guc {
struct xarray exec_queue_lookup;
/** @submission_state.stopped: submissions are stopped */
atomic_t stopped;
+ /**
+ * @submission_state.reset_blocked: reset attempts are blocked;
+ * blocking reset in order to delay it may be required if running
+ * an operation which is sensitive to resets.
+ */
+ atomic_t reset_blocked;
/** @submission_state.lock: protects submission state */
struct mutex lock;
/** @submission_state.enabled: submission is enabled */
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index b72542953004..29f3082c8fa5 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -163,9 +163,15 @@ static void vf_post_migration_shutdown(struct xe_device *xe)
{
struct xe_gt *gt;
unsigned int id;
+ int ret = 0;
- for_each_gt(gt, xe, id)
+ for_each_gt(gt, xe, id) {
xe_guc_submit_pause(&gt->uc.guc);
+ ret |= xe_guc_submit_reset_block(&gt->uc.guc);
+ }
+
+ if (ret)
+ drm_info(&xe->drm, "migration recovery encountered ongoing reset\n");
}
/**
@@ -187,8 +193,10 @@ static void vf_post_migration_kickstart(struct xe_device *xe)
*/
xe_irq_resume(xe);
- for_each_gt(gt, xe, id)
+ for_each_gt(gt, xe, id) {
+ xe_guc_submit_reset_unblock(&gt->uc.guc);
xe_guc_submit_unpause(&gt->uc.guc);
+ }
}
/**
--
2.25.1
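
The per-GT ordering used by vf_post_migration_shutdown() and
vf_post_migration_kickstart() in the hunk above (pause then block on
entry; unblock then unpause on exit) can be modeled with a minimal
userspace sketch; struct fake_gt and its flag fields are illustrative
stand-ins for the driver's GuC state, not driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-GT state; the driver keeps the real flags in
 * struct xe_guc's submission_state. */
struct fake_gt {
	bool paused;
	bool reset_blocked;
};

/* Models vf_post_migration_shutdown(): pause submission first, then
 * block resets; returns true if any GT already had resets blocked
 * (the patch logs that case as an ongoing reset). */
static bool shutdown(struct fake_gt *gts, int n)
{
	bool was_blocked = false;

	for (int i = 0; i < n; i++) {
		gts[i].paused = true;
		was_blocked |= gts[i].reset_blocked;
		gts[i].reset_blocked = true;
	}
	return was_blocked;
}

/* Models vf_post_migration_kickstart(): lift the reset block before
 * unpausing, so a stalled reset may proceed once work resumes. */
static void kickstart(struct fake_gt *gts, int n)
{
	for (int i = 0; i < n; i++) {
		gts[i].reset_blocked = false;
		gts[i].paused = false;
	}
}
```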
Thread overview: 22+ messages
2025-06-06 0:18 [PATCH v4 0/8] drm/xe/vf: Post-migration recovery of queues and jobs Tomasz Lis
2025-06-06 0:18 ` [PATCH v4 1/8] drm/xe/sa: Avoid caching GGTT address within the manager Tomasz Lis
2025-06-06 0:18 ` [PATCH v4 2/8] drm/xe/vf: Finish RESFIX by reset if CTB not enabled Tomasz Lis
2025-06-09 11:00 ` Michal Wajdeczko
2025-06-11 20:18 ` Lis, Tomasz
2025-06-06 0:18 ` [PATCH v4 3/8] drm/xe/vf: Pause submissions during RESFIX fixups Tomasz Lis
2025-06-06 0:18 ` Tomasz Lis [this message]
2025-06-06 0:18 ` [PATCH v4 5/8] drm/xe/vf: Rebase HWSP of all contexts after migration Tomasz Lis
2025-06-09 11:03 ` Michal Wajdeczko
2025-06-11 20:18 ` Lis, Tomasz
2025-06-06 0:18 ` [PATCH v4 6/8] drm/xe/vf: Rebase MEMIRQ structures for " Tomasz Lis
2025-06-09 11:15 ` Michal Wajdeczko
2025-06-06 0:18 ` [PATCH v4 7/8] drm/xe/vf: Post migration, repopulate ring area for pending request Tomasz Lis
2025-06-06 0:18 ` [PATCH v4 8/8] drm/xe/vf: Refresh utilization buffer during migration recovery Tomasz Lis
2025-06-06 2:43 ` ✓ CI.Patch_applied: success for drm/xe/vf: Post-migration recovery of queues and jobs (rev4) Patchwork
2025-06-06 2:44 ` ✗ CI.checkpatch: warning " Patchwork
2025-06-06 2:45 ` ✓ CI.KUnit: success " Patchwork
2025-06-06 2:56 ` ✓ CI.Build: " Patchwork
2025-06-06 2:58 ` ✗ CI.Hooks: failure " Patchwork
2025-06-06 3:00 ` ✓ CI.checksparse: success " Patchwork
2025-06-06 3:46 ` ✓ Xe.CI.BAT: " Patchwork
2025-06-08 7:58 ` ✗ Xe.CI.Full: failure " Patchwork