Intel-XE Archive on lore.kernel.org
* [PATCH v3 00/36] VF migration redesign
@ 2025-09-29  2:55 Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
                   ` (38 more replies)
  0 siblings, 39 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Rather than modifying buffers in place using GGTT addresses during VF
migration, this approach relies on the submission backend's stop/start
mechanism to issue fixups. The patch titled "Document GuC Submission
Backend" provides a detailed explanation of the design.

Testing was performed using an out-of-tree PF/VFIO driver, with VF
migration triggered manually while IGT test cases were running.

IGT test cases:

- A new series [1] that exercises active contexts, job resubmission, and
  compressed memory.

- A new test [2] that creates and destroys a queue on each submission.

- xe_exec_threads basic sections, which test context registration loss,
  schedule enable loss, and job resubmission.

- xe_exec_threads balancer sections, which follow the same flows as the
  basic sections but include a work queue (GGTT address shift; see the
  note after this list).

- xe_exec_threads compute mode user pointer invalidation sections, which
  exercise the same flow as the basic sections, plus replaying
  suspend/resume flows.
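
For reference, the "GGTT address shift" above is simply the delta
between the GGTT base assigned to the VF before and after migration, as
computed in vf_get_ggtt_info(); rebasing a stale reference is then a
single add:

        config->ggtt_shift = start - (s64)config->ggtt_base; /* new - old base */
        config->ggtt_base = start;
        /* any stale GGTT reference is rebased as: addr += ggtt_shift */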

All state-replay code paths in "Replay GuC submission state on pause /
unpause" have been manually verified via the debug messages added in
"Add debug prints for GuC replaying state during VF recovery".

v2:
 - Fix lockdep splat
 - Fix checkpatch
 - Fix PTL issue with LRC W/A buffer
 - Fix race creating / destroying queues across migration exposed by [2]
 - Include a version of Satya's patches in [3] which enable CCS save /
   restore across VF migration with GGTT shift
v3:
 - Address feedback
 - Fix preempt fence mode deadlock with work queues + VF recovery (Testing)
 - Add NULL checks to scratch LRC allocation

Matt

[1] https://patchwork.freedesktop.org/series/154616/ 
[2] https://patchwork.freedesktop.org/series/154931/
[3] https://patchwork.freedesktop.org/series/154682/

Matthew Brost (33):
  drm/xe: Add NULL checks to scratch LRC allocation
  Revert "drm/xe/vf: Rebase exec queue parallel commands during
    migration recovery"
  Revert "drm/xe/vf: Post migration, repopulate ring area for pending
    request"
  Revert "drm/xe/vf: Fixup CTB send buffer messages after migration"
  drm/xe: Save off position in ring in which a job was programmed
  drm/xe/guc: Track pending-enable source in submission state
  drm/xe: Track LR jobs in DRM scheduler pending list
  drm/xe: Don't change LRC ring head on job resubmission
  drm/xe: Make LRC W/A scratch buffer usage consistent
  drm/xe/guc: Document GuC submission backend
  drm/xe/vf: Add xe_gt_recovery_inprogress helper
  drm/xe/vf: Make VF recovery run on per-GT worker
  drm/xe/vf: Abort H2G sends during VF post-migration recovery
  drm/xe/vf: Remove memory allocations from VF post migration recovery
  drm/xe/vf: Close multi-GT GGTT shift race
  drm/xe/vf: Teardown VF post migration worker on driver unload
  drm/xe/vf: Don't allow GT reset to be queued during VF post migration
    recovery
  drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs
    supporting migration
  drm/xe/vf: Extra debug on GGTT shift
  drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  drm/xe/vf: Flush and stop CTs in VF post migration recovery
  drm/xe/vf: Reset TLB invalidations during VF post migration recovery
  drm/xe/vf: Kickstart after resfix in VF post migration recovery
  drm/xe/vf: Start CTs before resfix VF post migration recovery
  drm/xe/vf: Abort VF post migration recovery on failure
  drm/xe/vf: Replay GuC submission state on pause / unpause
  drm/xe: Move queue init before LRC creation
  drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
  drm/xe/vf: Workaround for race condition in GuC firmware during VF
    pause
  drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
  drm/xe/vf: Rebase CCS save/restore BB GGTT addresses

Satyanarayana K V P (2):
  drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
  drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC

Tomasz Lis (1):
  drm/xe/vf: Lock querying GGTT config during driver init

 Documentation/gpu/xe/index.rst               |   1 +
 drivers/gpu/drm/xe/abi/guc_actions_abi.h     |   8 -
 drivers/gpu/drm/xe/xe_device_types.h         |   2 +
 drivers/gpu/drm/xe/xe_exec.c                 |  12 +-
 drivers/gpu/drm/xe/xe_exec_queue.c           |  86 +-
 drivers/gpu/drm/xe/xe_exec_queue.h           |   5 +-
 drivers/gpu/drm/xe/xe_execlist.c             |   2 +-
 drivers/gpu/drm/xe/xe_gpu_scheduler.c        |  14 +
 drivers/gpu/drm/xe/xe_gpu_scheduler.h        |   2 +
 drivers/gpu/drm/xe/xe_gt.c                   |  37 +-
 drivers/gpu/drm/xe/xe_gt.h                   |  15 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c          | 445 ++++++++--
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h          |  11 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h    |  36 +-
 drivers/gpu/drm/xe/xe_guc.c                  |   4 +-
 drivers/gpu/drm/xe/xe_guc_ct.c               | 293 ++-----
 drivers/gpu/drm/xe/xe_guc_ct.h               |   4 +-
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |  15 +
 drivers/gpu/drm/xe/xe_guc_submit.c           | 834 +++++++++++++++----
 drivers/gpu/drm/xe/xe_guc_submit.h           |   7 +-
 drivers/gpu/drm/xe/xe_lrc.c                  |  12 +-
 drivers/gpu/drm/xe/xe_lrc.h                  |  10 +
 drivers/gpu/drm/xe/xe_map.h                  |  18 -
 drivers/gpu/drm/xe/xe_memirq.c               |  48 +-
 drivers/gpu/drm/xe/xe_memirq.h               |   2 +
 drivers/gpu/drm/xe/xe_migrate.c              |  28 +-
 drivers/gpu/drm/xe/xe_pci.c                  |   6 +-
 drivers/gpu/drm/xe/xe_pci_types.h            |   1 +
 drivers/gpu/drm/xe/xe_preempt_fence.c        |  11 +
 drivers/gpu/drm/xe/xe_ring_ops.c             |  23 +-
 drivers/gpu/drm/xe/xe_sched_job_types.h      |   9 +
 drivers/gpu/drm/xe/xe_sriov_vf.c             | 243 ------
 drivers/gpu/drm/xe/xe_sriov_vf.h             |   1 -
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.c         |  28 +
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.h         |   1 +
 drivers/gpu/drm/xe/xe_sriov_vf_types.h       |   4 -
 drivers/gpu/drm/xe/xe_tile.c                 |   2 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.c        |   6 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.h        |   1 -
 drivers/gpu/drm/xe/xe_vm.c                   |  29 +-
 40 files changed, 1499 insertions(+), 817 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30  2:06   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init Matthew Brost
                   ` (37 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

kmalloc() can fail, so the returned value must be checked for NULL.

Fixes: 168b5867318b ("drm/xe/vf: Refresh utilization buffer during migration recovery")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_lrc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 47e9df775072..e1bc102a6cae 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1303,8 +1303,11 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
 	u32 *buf = NULL;
 	int ret;
 
-	if (lrc->bo->vmap.is_iomem)
+	if (lrc->bo->vmap.is_iomem) {
 		buf = kmalloc(LRC_WA_BB_SIZE, GFP_KERNEL);
+		if (!buf)
+			return -ENOMEM;
+	}
 
 	ret = xe_lrc_setup_wa_bb_with_scratch(lrc, hwe, buf);
 
@@ -1347,8 +1350,11 @@ setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
 	if (xe_gt_WARN_ON(lrc->gt, !state.funcs))
 		return 0;
 
-	if (lrc->bo->vmap.is_iomem)
+	if (lrc->bo->vmap.is_iomem) {
 		state.buffer = kmalloc(state.max_size, GFP_KERNEL);
+		if (!state.buffer)
+			return -ENOMEM;
+	}
 
 	ret = setup_bo(&state);
 	if (ret) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  7:42   ` Michal Wajdeczko
  2025-09-29  8:13   ` Ville Syrjälä
  2025-09-29  2:55 ` [PATCH v3 03/36] Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery" Matthew Brost
                   ` (36 subsequent siblings)
  38 siblings, 2 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

From: Tomasz Lis <tomasz.lis@intel.com>

Protect access to the GGTT config with a lock, as this information is
not static and can change across VF migration.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 96 ++++++++++++++++++-----
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  3 +
 drivers/gpu/drm/xe/xe_sriov_vf.c          |  6 ++
 3 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 0461d5513487..016c867e5e2b 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -440,18 +440,21 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
+	down_write(&config->lock);
+
 	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	if (config->ggtt_size && config->ggtt_size != size) {
 		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
 				size / SZ_1K, config->ggtt_size / SZ_1K);
-		return -EREMCHG;
+		err = -EREMCHG;
+		goto out;
 	}
 
 	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
@@ -460,8 +463,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
 	config->ggtt_shift = start - (s64)config->ggtt_base;
 	config->ggtt_base = start;
 	config->ggtt_size = size;
+	err = config->ggtt_size ? 0 : -ENODATA;
 
-	return config->ggtt_size ? 0 : -ENODATA;
+out:
+	up_write(&config->lock);
+	return err;
 }
 
 static int vf_get_lmem_info(struct xe_gt *gt)
@@ -474,22 +480,28 @@ static int vf_get_lmem_info(struct xe_gt *gt)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
+	down_write(&config->lock);
+
 	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	if (config->lmem_size && config->lmem_size != size) {
 		xe_gt_sriov_err(gt, "Unexpected LMEM reassignment: %lluM != %lluM\n",
 				size / SZ_1M, config->lmem_size / SZ_1M);
-		return -EREMCHG;
+		err = -EREMCHG;
+		goto out;
 	}
 
 	string_get_size(size, 1, STRING_UNITS_2, size_str, sizeof(size_str));
 	xe_gt_sriov_dbg_verbose(gt, "LMEM %lluM %s\n", size / SZ_1M, size_str);
 
 	config->lmem_size = size;
+	err = config->lmem_size ? 0 : -ENODATA;
 
-	return config->lmem_size ? 0 : -ENODATA;
+out:
+	up_write(&config->lock);
+	return err;
 }
 
 static int vf_get_submission_cfg(struct xe_gt *gt)
@@ -501,23 +513,27 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
+	down_write(&config->lock);
+
 	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_DOORBELLS_KEY, &num_dbs);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	if (config->num_ctxs && config->num_ctxs != num_ctxs) {
 		xe_gt_sriov_err(gt, "Unexpected CTXs reassignment: %u != %u\n",
 				num_ctxs, config->num_ctxs);
-		return -EREMCHG;
+		err = -EREMCHG;
+		goto out;
 	}
 	if (config->num_dbs && config->num_dbs != num_dbs) {
 		xe_gt_sriov_err(gt, "Unexpected DBs reassignment: %u != %u\n",
 				num_dbs, config->num_dbs);
-		return -EREMCHG;
+		err = -EREMCHG;
+		goto out;
 	}
 
 	xe_gt_sriov_dbg_verbose(gt, "CTXs %u DBs %u\n", num_ctxs, num_dbs);
@@ -525,7 +541,11 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
 	config->num_ctxs = num_ctxs;
 	config->num_dbs = num_dbs;
 
-	return config->num_ctxs ? 0 : -ENODATA;
+	err = config->num_ctxs ? 0 : -ENODATA;
+
+out:
+	up_write(&config->lock);
+	return err;
 }
 
 static void vf_cache_gmdid(struct xe_gt *gt)
@@ -579,11 +599,18 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
  */
 u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
 {
+	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	u16 val;
+
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
 
-	return gt->sriov.vf.self_config.num_ctxs;
+	down_read(&config->lock);
+	xe_gt_assert(gt, config->num_ctxs);
+	val = config->num_ctxs;
+	up_read(&config->lock);
+
+	return val;
 }
 
 /**
@@ -596,11 +623,18 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
  */
 u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
 {
+	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	u64 val;
+
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
 
-	return gt->sriov.vf.self_config.lmem_size;
+	down_read(&config->lock);
+	xe_gt_assert(gt, config->lmem_size);
+	val = config->lmem_size;
+	up_read(&config->lock);
+
+	return val;
 }
 
 /**
@@ -613,11 +647,17 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
  */
 u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
 {
+	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	u64 val;
+
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
 
-	return gt->sriov.vf.self_config.ggtt_size;
+	down_read(&config->lock);
+	val = config->ggtt_size;
+	up_read(&config->lock);
+
+	return val;
 }
 
 /**
@@ -630,11 +670,18 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
  */
 u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
 {
+	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	u64 val;
+
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
 
-	return gt->sriov.vf.self_config.ggtt_base;
+	down_read(&config->lock);
+	xe_gt_assert(gt, config->ggtt_size);
+	val = config->ggtt_base;
+	up_read(&config->lock);
+
+	return val;
 }
 
 /**
@@ -648,11 +695,16 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
 s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
 {
 	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	s64 val;
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, xe_gt_is_main_type(gt));
 
-	return config->ggtt_shift;
+	down_read(&config->lock);
+	val = config->ggtt_shift;
+	up_read(&config->lock);
+
+	return val;
 }
 
 static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
@@ -1044,6 +1096,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
+	down_read(&config->lock);
 	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
 		   config->ggtt_base,
 		   config->ggtt_base + config->ggtt_size - 1);
@@ -1060,6 +1113,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
 
 	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
 	drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
+	up_read(&config->lock);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 298dedf4b009..d95857bd789b 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -6,6 +6,7 @@
 #ifndef _XE_GT_SRIOV_VF_TYPES_H_
 #define _XE_GT_SRIOV_VF_TYPES_H_
 
+#include <linux/rwsem.h>
 #include <linux/types.h>
 #include "xe_uc_fw_types.h"
 
@@ -25,6 +26,8 @@ struct xe_gt_sriov_vf_selfconfig {
 	u16 num_ctxs;
 	/** @num_dbs: assigned number of GuC doorbells IDs. */
 	u16 num_dbs;
+	/** @lock: lock for protecting access to all selfconfig fields. */
+	struct rw_semaphore lock;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index cdd9f8e78b2a..d6e2ed9b9bbc 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -197,6 +197,12 @@ static void vf_migration_init_early(struct xe_device *xe)
  */
 void xe_sriov_vf_init_early(struct xe_device *xe)
 {
+	struct xe_gt *gt;
+	unsigned int id;
+
+	for_each_gt(gt, xe, id)
+		init_rwsem(&gt->sriov.vf.self_config.lock);
+
 	vf_migration_init_early(xe);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 03/36] Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery"
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30 15:22   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 04/36] Revert "drm/xe/vf: Post migration, repopulate ring area for pending request" Matthew Brost
                   ` (35 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

This reverts commit ba180a362128cb71d16c3f0ce6645448011d2607.

Due to a change in the VF migration recovery design, this code is no
longer needed.

v3:
 - Add commit message (Michal / Lucas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/abi/guc_actions_abi.h |  8 ----
 drivers/gpu/drm/xe/xe_guc_submit.c       | 54 ------------------------
 2 files changed, 62 deletions(-)

diff --git a/drivers/gpu/drm/xe/abi/guc_actions_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_abi.h
index 31090c69dfbe..47756e4674a1 100644
--- a/drivers/gpu/drm/xe/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/xe/abi/guc_actions_abi.h
@@ -196,14 +196,6 @@ enum xe_guc_register_context_multi_lrc_param_offsets {
 	XE_GUC_REGISTER_CONTEXT_MULTI_LRC_MSG_MIN_LEN = 11,
 };
 
-enum xe_guc_context_wq_item_offsets {
-	XE_GUC_CONTEXT_WQ_HEADER_DATA_0_TYPE_LEN = 0,
-	XE_GUC_CONTEXT_WQ_EL_INFO_DATA_1_CTX_DESC_LOW,
-	XE_GUC_CONTEXT_WQ_EL_INFO_DATA_2_GUCCTX_RINGTAIL_FREEZEPOCS,
-	XE_GUC_CONTEXT_WQ_EL_INFO_DATA_3_WI_FENCE_ID,
-	XE_GUC_CONTEXT_WQ_EL_CHILD_LIST_DATA_4_RINGTAIL,
-};
-
 enum xe_guc_report_status {
 	XE_GUC_REPORT_STATUS_UNKNOWN = 0x0,
 	XE_GUC_REPORT_STATUS_ACKED = 0x1,
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 53024eb5670b..3ac0950f55be 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -735,18 +735,12 @@ static void wq_item_append(struct xe_exec_queue *q)
 	if (wq_wait_for_space(q, wqi_size))
 		return;
 
-	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_HEADER_DATA_0_TYPE_LEN);
 	wqi[i++] = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_MULTI_LRC) |
 		FIELD_PREP(WQ_LEN_MASK, len_dw);
-	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_EL_INFO_DATA_1_CTX_DESC_LOW);
 	wqi[i++] = xe_lrc_descriptor(q->lrc[0]);
-	xe_gt_assert(guc_to_gt(guc), i ==
-		     XE_GUC_CONTEXT_WQ_EL_INFO_DATA_2_GUCCTX_RINGTAIL_FREEZEPOCS);
 	wqi[i++] = FIELD_PREP(WQ_GUC_ID_MASK, q->guc->id) |
 		FIELD_PREP(WQ_RING_TAIL_MASK, q->lrc[0]->ring.tail / sizeof(u64));
-	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_EL_INFO_DATA_3_WI_FENCE_ID);
 	wqi[i++] = 0;
-	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_EL_CHILD_LIST_DATA_4_RINGTAIL);
 	for (j = 1; j < q->width; ++j) {
 		struct xe_lrc *lrc = q->lrc[j];
 
@@ -767,50 +761,6 @@ static void wq_item_append(struct xe_exec_queue *q)
 	parallel_write(xe, map, wq_desc.tail, q->guc->wqi_tail);
 }
 
-static int wq_items_rebase(struct xe_exec_queue *q)
-{
-	struct xe_guc *guc = exec_queue_to_guc(q);
-	struct xe_device *xe = guc_to_xe(guc);
-	struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
-	int i = q->guc->wqi_head;
-
-	/* the ring starts after a header struct */
-	iosys_map_incr(&map, offsetof(struct guc_submit_parallel_scratch, wq[0]));
-
-	while ((i % WQ_SIZE) != (q->guc->wqi_tail % WQ_SIZE)) {
-		u32 len_dw, type, val;
-
-		if (drm_WARN_ON_ONCE(&xe->drm, i < 0 || i > 2 * WQ_SIZE))
-			break;
-
-		val = xe_map_rd_ring_u32(xe, &map, i / sizeof(u32) +
-					 XE_GUC_CONTEXT_WQ_HEADER_DATA_0_TYPE_LEN,
-					 WQ_SIZE / sizeof(u32));
-		len_dw = FIELD_GET(WQ_LEN_MASK, val);
-		type = FIELD_GET(WQ_TYPE_MASK, val);
-
-		if (drm_WARN_ON_ONCE(&xe->drm, len_dw >= WQ_SIZE / sizeof(u32)))
-			break;
-
-		if (type == WQ_TYPE_MULTI_LRC) {
-			val = xe_lrc_descriptor(q->lrc[0]);
-			xe_map_wr_ring_u32(xe, &map, i / sizeof(u32) +
-					   XE_GUC_CONTEXT_WQ_EL_INFO_DATA_1_CTX_DESC_LOW,
-					   WQ_SIZE / sizeof(u32), val);
-		} else if (drm_WARN_ON_ONCE(&xe->drm, type != WQ_TYPE_NOOP)) {
-			break;
-		}
-
-		i += (len_dw + 1) * sizeof(u32);
-	}
-
-	if ((i % WQ_SIZE) != (q->guc->wqi_tail % WQ_SIZE)) {
-		xe_gt_err(q->gt, "Exec queue fixups incomplete - wqi parse failed\n");
-		return -EBADMSG;
-	}
-	return 0;
-}
-
 #define RESUME_PENDING	~0x0ull
 static void submit_exec_queue(struct xe_exec_queue *q)
 {
@@ -2669,10 +2619,6 @@ int xe_guc_contexts_hwsp_rebase(struct xe_guc *guc, void *scratch)
 		err = xe_exec_queue_contexts_hwsp_rebase(q, scratch);
 		if (err)
 			break;
-		if (xe_exec_queue_is_parallel(q))
-			err = wq_items_rebase(q);
-		if (err)
-			break;
 	}
 	mutex_unlock(&guc->submission_state.lock);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 04/36] Revert "drm/xe/vf: Post migration, repopulate ring area for pending request"
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (2 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 03/36] Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery" Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30 15:24   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 05/36] Revert "drm/xe/vf: Fixup CTB send buffer messages after migration" Matthew Brost
                   ` (34 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

This reverts commit a0dda25d24e636df5c30a9370464b7cebc709faf.

Due to a change in the VF migration recovery design, this code is no
longer needed.

v3:
 - Add commit message (Michal / Lucas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_exec_queue.c | 24 ------------------------
 drivers/gpu/drm/xe/xe_exec_queue.h |  3 +--
 drivers/gpu/drm/xe/xe_guc_submit.c | 24 ------------------------
 drivers/gpu/drm/xe/xe_guc_submit.h |  2 --
 drivers/gpu/drm/xe/xe_sriov_vf.c   |  1 -
 5 files changed, 1 insertion(+), 53 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 37b2b93b73d6..6bfaca424ca3 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -1123,27 +1123,3 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
 
 	return err;
 }
-
-/**
- * xe_exec_queue_jobs_ring_restore - Re-emit ring commands of requests pending on given queue.
- * @q: the &xe_exec_queue struct instance
- */
-void xe_exec_queue_jobs_ring_restore(struct xe_exec_queue *q)
-{
-	struct xe_gpu_scheduler *sched = &q->guc->sched;
-	struct xe_sched_job *job;
-
-	/*
-	 * This routine is used within VF migration recovery. This means
-	 * using the lock here introduces a restriction: we cannot wait
-	 * for any GFX HW response while the lock is taken.
-	 */
-	spin_lock(&sched->base.job_list_lock);
-	list_for_each_entry(job, &sched->base.pending_list, drm.list) {
-		if (xe_sched_job_is_error(job))
-			continue;
-
-		q->ring_ops->emit_job(job);
-	}
-	spin_unlock(&sched->base.job_list_lock);
-}
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
index 15ec852e7f7e..8821ceb838d0 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -92,7 +92,6 @@ void xe_exec_queue_update_run_ticks(struct xe_exec_queue *q);
 
 int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch);
 
-void xe_exec_queue_jobs_ring_restore(struct xe_exec_queue *q);
-
 struct xe_lrc *xe_exec_queue_lrc(struct xe_exec_queue *q);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 3ac0950f55be..16f78376f196 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -845,30 +845,6 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	return fence;
 }
 
-/**
- * xe_guc_jobs_ring_rebase - Re-emit ring commands of requests pending
- * on all queues under a guc.
- * @guc: the &xe_guc struct instance
- */
-void xe_guc_jobs_ring_rebase(struct xe_guc *guc)
-{
-	struct xe_exec_queue *q;
-	unsigned long index;
-
-	/*
-	 * This routine is used within VF migration recovery. This means
-	 * using the lock here introduces a restriction: we cannot wait
-	 * for any GFX HW response while the lock is taken.
-	 */
-	mutex_lock(&guc->submission_state.lock);
-	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
-		if (exec_queue_killed_or_banned_or_wedged(q))
-			continue;
-		xe_exec_queue_jobs_ring_restore(q);
-	}
-	mutex_unlock(&guc->submission_state.lock);
-}
-
 static void guc_exec_queue_free_job(struct drm_sched_job *drm_job)
 {
 	struct xe_sched_job *job = to_xe_sched_job(drm_job);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index 78c3f07e31a0..5b4a0a6fd818 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -36,8 +36,6 @@ int xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32 *msg,
 int xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc, u32 *msg, u32 len);
 int xe_guc_error_capture_handler(struct xe_guc *guc, u32 *msg, u32 len);
 
-void xe_guc_jobs_ring_rebase(struct xe_guc *guc);
-
 struct xe_guc_submit_exec_queue_snapshot *
 xe_guc_exec_queue_snapshot_capture(struct xe_exec_queue *q);
 void
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index d6e2ed9b9bbc..0581b881b628 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -340,7 +340,6 @@ static int gt_vf_post_migration_fixups(struct xe_gt *gt)
 		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
 		if (err)
 			goto out;
-		xe_guc_jobs_ring_rebase(&gt->uc.guc);
 		xe_guc_ct_fixup_messages_with_ggtt(&gt->uc.guc.ct, shift);
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 05/36] Revert "drm/xe/vf: Fixup CTB send buffer messages after migration"
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (3 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 04/36] Revert "drm/xe/vf: Post migration, repopulate ring area for pending request" Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30 15:27   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 06/36] drm/xe: Save off position in ring in which a job was programmed Matthew Brost
                   ` (33 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

This reverts commit cef88d1265cac7d415606af73ba58926fd3cd8b7.

Due to a change in the VF migration recovery design, this code is no
longer needed.

v3:
 - Add commit message (Michal / Lucas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c   | 183 -------------------------------
 drivers/gpu/drm/xe/xe_guc_ct.h   |   2 -
 drivers/gpu/drm/xe/xe_map.h      |  18 ---
 drivers/gpu/drm/xe/xe_sriov_vf.c |   2 -
 4 files changed, 205 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 18f6327bf552..47079ab9922c 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -25,7 +25,6 @@
 #include "xe_gt_printk.h"
 #include "xe_gt_sriov_pf_control.h"
 #include "xe_gt_sriov_pf_monitor.h"
-#include "xe_gt_sriov_printk.h"
 #include "xe_guc.h"
 #include "xe_guc_log.h"
 #include "xe_guc_relay.h"
@@ -93,8 +92,6 @@ struct g2h_fence {
 	bool done;
 };
 
-#define make_u64(hi, lo) ((u64)((u64)(u32)(hi) << 32 | (u32)(lo)))
-
 static void g2h_fence_init(struct g2h_fence *g2h_fence, u32 *response_buffer)
 {
 	memset(g2h_fence, 0, sizeof(*g2h_fence));
@@ -1793,186 +1790,6 @@ static void g2h_worker_func(struct work_struct *w)
 	receive_g2h(ct);
 }
 
-static void xe_fixup_u64_in_cmds(struct xe_device *xe, struct iosys_map *cmds,
-				 u32 size, u32 idx, s64 shift)
-{
-	u32 hi, lo;
-	u64 offset;
-
-	lo = xe_map_rd_ring_u32(xe, cmds, idx, size);
-	hi = xe_map_rd_ring_u32(xe, cmds, idx + 1, size);
-	offset = make_u64(hi, lo);
-	offset += shift;
-	lo = lower_32_bits(offset);
-	hi = upper_32_bits(offset);
-	xe_map_wr_ring_u32(xe, cmds, idx, size, lo);
-	xe_map_wr_ring_u32(xe, cmds, idx + 1, size, hi);
-}
-
-/*
- * Shift any GGTT addresses within a single message left within CTB from
- * before post-migration recovery.
- * @ct: pointer to CT struct of the target GuC
- * @cmds: iomap buffer containing CT messages
- * @head: start of the target message within the buffer
- * @len: length of the target message
- * @size: size of the commands buffer
- * @shift: the address shift to be added to each GGTT reference
- * Return: true if the message was fixed or needed no fixups, false on failure
- */
-static bool ct_fixup_ggtt_in_message(struct xe_guc_ct *ct,
-				     struct iosys_map *cmds, u32 head,
-				     u32 len, u32 size, s64 shift)
-{
-	struct xe_gt *gt = ct_to_gt(ct);
-	struct xe_device *xe = ct_to_xe(ct);
-	u32 msg[GUC_HXG_MSG_MIN_LEN];
-	u32 action, i, n;
-
-	xe_gt_assert(gt, len >= GUC_HXG_MSG_MIN_LEN);
-
-	msg[0] = xe_map_rd_ring_u32(xe, cmds, head, size);
-	action = FIELD_GET(GUC_HXG_REQUEST_MSG_0_ACTION, msg[0]);
-
-	xe_gt_sriov_dbg_verbose(gt, "fixing H2G %#x\n", action);
-
-	switch (action) {
-	case XE_GUC_ACTION_REGISTER_CONTEXT:
-		if (len != XE_GUC_REGISTER_CONTEXT_MSG_LEN)
-			goto err_len;
-		xe_fixup_u64_in_cmds(xe, cmds, size, head +
-				     XE_GUC_REGISTER_CONTEXT_DATA_5_WQ_DESC_ADDR_LOWER,
-				     shift);
-		xe_fixup_u64_in_cmds(xe, cmds, size, head +
-				     XE_GUC_REGISTER_CONTEXT_DATA_7_WQ_BUF_BASE_LOWER,
-				     shift);
-		xe_fixup_u64_in_cmds(xe, cmds, size, head +
-				     XE_GUC_REGISTER_CONTEXT_DATA_10_HW_LRC_ADDR, shift);
-		break;
-	case XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC:
-		if (len < XE_GUC_REGISTER_CONTEXT_MULTI_LRC_MSG_MIN_LEN)
-			goto err_len;
-		n = xe_map_rd_ring_u32(xe, cmds, head +
-				       XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_10_NUM_CTXS, size);
-		if (len != XE_GUC_REGISTER_CONTEXT_MULTI_LRC_MSG_MIN_LEN + 2 * n)
-			goto err_len;
-		xe_fixup_u64_in_cmds(xe, cmds, size, head +
-				     XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_5_WQ_DESC_ADDR_LOWER,
-				     shift);
-		xe_fixup_u64_in_cmds(xe, cmds, size, head +
-				     XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_7_WQ_BUF_BASE_LOWER,
-				     shift);
-		for (i = 0; i < n; i++)
-			xe_fixup_u64_in_cmds(xe, cmds, size, head +
-					     XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_11_HW_LRC_ADDR
-					     + 2 * i, shift);
-		break;
-	default:
-		break;
-	}
-	return true;
-
-err_len:
-	xe_gt_err(gt, "Skipped G2G %#x message fixups, unexpected length (%u)\n", action, len);
-	return false;
-}
-
-/*
- * Apply fixups to the next outgoing CT message within given CTB
- * @ct: the &xe_guc_ct struct instance representing the target GuC
- * @h2g: the &guc_ctb struct instance of the target buffer
- * @shift: shift to be added to all GGTT addresses within the CTB
- * @mhead: pointer to an integer storing message start position; the
- *   position is changed to next message before this function return
- * @avail: size of the area available for parsing, that is length
- *   of all remaining messages stored within the CTB
- * Return: size of the area available for parsing after one message
- *   has been parsed, that is length remaining from the updated mhead
- */
-static int ct_fixup_ggtt_in_buffer(struct xe_guc_ct *ct, struct guc_ctb *h2g,
-				   s64 shift, u32 *mhead, s32 avail)
-{
-	struct xe_gt *gt = ct_to_gt(ct);
-	struct xe_device *xe = ct_to_xe(ct);
-	u32 msg[GUC_HXG_MSG_MIN_LEN];
-	u32 size = h2g->info.size;
-	u32 head = *mhead;
-	u32 len;
-
-	xe_gt_assert(gt, avail >= (s32)GUC_CTB_MSG_MIN_LEN);
-
-	/* Read header */
-	msg[0] = xe_map_rd_ring_u32(xe, &h2g->cmds, head, size);
-	len = FIELD_GET(GUC_CTB_MSG_0_NUM_DWORDS, msg[0]) + GUC_CTB_MSG_MIN_LEN;
-
-	if (unlikely(len > (u32)avail)) {
-		xe_gt_err(gt, "H2G channel broken on read, avail=%d, len=%d, fixups skipped\n",
-			  avail, len);
-		return 0;
-	}
-
-	head = (head + GUC_CTB_MSG_MIN_LEN) % size;
-	if (!ct_fixup_ggtt_in_message(ct, &h2g->cmds, head, msg_len_to_hxg_len(len), size, shift))
-		return 0;
-	*mhead = (head + msg_len_to_hxg_len(len)) % size;
-
-	return avail - len;
-}
-
-/**
- * xe_guc_ct_fixup_messages_with_ggtt - Fixup any pending H2G CTB messages
- * @ct: pointer to CT struct of the target GuC
- * @ggtt_shift: shift to be added to all GGTT addresses within the CTB
- *
- * Messages in GuC to Host CTB are owned by GuC and any fixups in them
- * are made by GuC. But content of the Host to GuC CTB is owned by the
- * KMD, so fixups to GGTT references in any pending messages need to be
- * applied here.
- * This function updates GGTT offsets in payloads of pending H2G CTB
- * messages (messages which were not consumed by GuC before the VF got
- * paused).
- */
-void xe_guc_ct_fixup_messages_with_ggtt(struct xe_guc_ct *ct, s64 ggtt_shift)
-{
-	struct guc_ctb *h2g = &ct->ctbs.h2g;
-	struct xe_guc *guc = ct_to_guc(ct);
-	struct xe_gt *gt = guc_to_gt(guc);
-	u32 head, tail, size;
-	s32 avail;
-
-	if (unlikely(h2g->info.broken))
-		return;
-
-	h2g->info.head = desc_read(ct_to_xe(ct), h2g, head);
-	head = h2g->info.head;
-	tail = READ_ONCE(h2g->info.tail);
-	size = h2g->info.size;
-
-	if (unlikely(head > size))
-		goto corrupted;
-
-	if (unlikely(tail >= size))
-		goto corrupted;
-
-	avail = tail - head;
-
-	/* beware of buffer wrap case */
-	if (unlikely(avail < 0))
-		avail += size;
-	xe_gt_dbg(gt, "available %d (%u:%u:%u)\n", avail, head, tail, size);
-	xe_gt_assert(gt, avail >= 0);
-
-	while (avail > 0)
-		avail = ct_fixup_ggtt_in_buffer(ct, h2g, ggtt_shift, &head, avail);
-
-	return;
-
-corrupted:
-	xe_gt_err(gt, "Corrupted H2G descriptor head=%u tail=%u size=%u, fixups not applied\n",
-		  head, tail, size);
-	h2g->info.broken = true;
-}
-
 static struct xe_guc_ct_snapshot *guc_ct_snapshot_alloc(struct xe_guc_ct *ct, bool atomic,
 							bool want_ctb)
 {
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index cf41210ab30a..d6c81325a76c 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -24,8 +24,6 @@ void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot, struct drm_pr
 void xe_guc_ct_snapshot_free(struct xe_guc_ct_snapshot *snapshot);
 void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool want_ctb);
 
-void xe_guc_ct_fixup_messages_with_ggtt(struct xe_guc_ct *ct, s64 ggtt_shift);
-
 static inline bool xe_guc_ct_initialized(struct xe_guc_ct *ct)
 {
 	return ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED;
diff --git a/drivers/gpu/drm/xe/xe_map.h b/drivers/gpu/drm/xe/xe_map.h
index 8d67f6ba2d95..f62e0c8b67ab 100644
--- a/drivers/gpu/drm/xe/xe_map.h
+++ b/drivers/gpu/drm/xe/xe_map.h
@@ -78,24 +78,6 @@ static inline void xe_map_write32(struct xe_device *xe, struct iosys_map *map,
 	iosys_map_wr(map__, offset__, type__, val__);			\
 })
 
-#define xe_map_rd_array(xe__, map__, index__, type__) \
-	xe_map_rd(xe__, map__, (index__) * sizeof(type__), type__)
-
-#define xe_map_wr_array(xe__, map__, index__, type__, val__) \
-	xe_map_wr(xe__, map__, (index__) * sizeof(type__), type__, val__)
-
-#define xe_map_rd_array_u32(xe__, map__, index__) \
-	xe_map_rd_array(xe__, map__, index__, u32)
-
-#define xe_map_wr_array_u32(xe__, map__, index__, val__) \
-	xe_map_wr_array(xe__, map__, index__, u32, val__)
-
-#define xe_map_rd_ring_u32(xe__, map__, index__, size__) \
-	xe_map_rd_array_u32(xe__, map__, (index__) % (size__))
-
-#define xe_map_wr_ring_u32(xe__, map__, index__, size__, val__) \
-	xe_map_wr_array_u32(xe__, map__, (index__) % (size__), val__)
-
 #define xe_map_rd_field(xe__, map__, struct_offset__, struct_type__, field__) ({	\
 	struct xe_device *__xe = xe__;					\
 	xe_device_assert_mem_access(__xe);				\
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index 0581b881b628..da064a1e7419 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -12,7 +12,6 @@
 #include "xe_gt_sriov_printk.h"
 #include "xe_gt_sriov_vf.h"
 #include "xe_guc.h"
-#include "xe_guc_ct.h"
 #include "xe_guc_submit.h"
 #include "xe_irq.h"
 #include "xe_lrc.h"
@@ -340,7 +339,6 @@ static int gt_vf_post_migration_fixups(struct xe_gt *gt)
 		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
 		if (err)
 			goto out;
-		xe_guc_ct_fixup_messages_with_ggtt(&gt->uc.guc.ct, shift);
 	}
 
 out:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 06/36] drm/xe: Save off position in ring in which a job was programmed
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (4 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 05/36] Revert "drm/xe/vf: Fixup CTB send buffer messages after migration" Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 07/36] drm/xe/guc: Track pending-enable source in submission state Matthew Brost
                   ` (32 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

VF post-migration recovery needs to modify the ring with updated GGTT
addresses for pending jobs. Save off the position in the ring at which
each job was programmed to facilitate this.

v4:
 - s/VF resume/VF post-migration recovery (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_ring_ops.c        | 23 +++++++++++++++++++----
 drivers/gpu/drm/xe/xe_sched_job_types.h |  5 +++++
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index d71837773d6c..ac0c6dcffe15 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -245,12 +245,14 @@ static int emit_copy_timestamp(struct xe_lrc *lrc, u32 *dw, int i)
 
 /* for engines that don't require any special HW handling (no EUs, no aux inval, etc) */
 static void __emit_job_gen12_simple(struct xe_sched_job *job, struct xe_lrc *lrc,
-				    u64 batch_addr, u32 seqno)
+				    u64 batch_addr, u32 *head, u32 seqno)
 {
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 	u32 ppgtt_flag = get_ppgtt_flag(job);
 	struct xe_gt *gt = job->q->gt;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	if (job->ring_ops_flush_tlb) {
@@ -296,7 +298,7 @@ static bool has_aux_ccs(struct xe_device *xe)
 }
 
 static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
-				   u64 batch_addr, u32 seqno)
+				   u64 batch_addr, u32 *head, u32 seqno)
 {
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 	u32 ppgtt_flag = get_ppgtt_flag(job);
@@ -304,6 +306,8 @@ static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
 	struct xe_device *xe = gt_to_xe(gt);
 	bool decode = job->q->class == XE_ENGINE_CLASS_VIDEO_DECODE;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	dw[i++] = preparser_disable(true);
@@ -346,7 +350,8 @@ static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
 
 static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 					    struct xe_lrc *lrc,
-					    u64 batch_addr, u32 seqno)
+					    u64 batch_addr, u32 *head,
+					    u32 seqno)
 {
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 	u32 ppgtt_flag = get_ppgtt_flag(job);
@@ -355,6 +360,8 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 	bool lacks_render = !(gt->info.engine_mask & XE_HW_ENGINE_RCS_MASK);
 	u32 mask_flags = 0;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	dw[i++] = preparser_disable(true);
@@ -396,11 +403,14 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 }
 
 static void emit_migration_job_gen12(struct xe_sched_job *job,
-				     struct xe_lrc *lrc, u32 seqno)
+				     struct xe_lrc *lrc, u32 *head,
+				     u32 seqno)
 {
 	u32 saddr = xe_lrc_start_seqno_ggtt_addr(lrc);
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	i = emit_store_imm_ggtt(saddr, seqno, dw, i);
@@ -434,6 +444,7 @@ static void emit_job_gen12_gsc(struct xe_sched_job *job)
 
 	__emit_job_gen12_simple(job, job->q->lrc[0],
 				job->ptrs[0].batch_addr,
+				&job->ptrs[0].head,
 				xe_sched_job_lrc_seqno(job));
 }
 
@@ -443,6 +454,7 @@ static void emit_job_gen12_copy(struct xe_sched_job *job)
 
 	if (xe_sched_job_is_migration(job->q)) {
 		emit_migration_job_gen12(job, job->q->lrc[0],
+					 &job->ptrs[0].head,
 					 xe_sched_job_lrc_seqno(job));
 		return;
 	}
@@ -450,6 +462,7 @@ static void emit_job_gen12_copy(struct xe_sched_job *job)
 	for (i = 0; i < job->q->width; ++i)
 		__emit_job_gen12_simple(job, job->q->lrc[i],
 					job->ptrs[i].batch_addr,
+					&job->ptrs[i].head,
 					xe_sched_job_lrc_seqno(job));
 }
 
@@ -461,6 +474,7 @@ static void emit_job_gen12_video(struct xe_sched_job *job)
 	for (i = 0; i < job->q->width; ++i)
 		__emit_job_gen12_video(job, job->q->lrc[i],
 				       job->ptrs[i].batch_addr,
+				       &job->ptrs[i].head,
 				       xe_sched_job_lrc_seqno(job));
 }
 
@@ -471,6 +485,7 @@ static void emit_job_gen12_render_compute(struct xe_sched_job *job)
 	for (i = 0; i < job->q->width; ++i)
 		__emit_job_gen12_render_compute(job, job->q->lrc[i],
 						job->ptrs[i].batch_addr,
+						&job->ptrs[i].head,
 						xe_sched_job_lrc_seqno(job));
 }
 
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index dbf260dded8d..7ce58765a34a 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -24,6 +24,11 @@ struct xe_job_ptrs {
 	struct dma_fence_chain *chain_fence;
 	/** @batch_addr: Batch buffer address. */
 	u64 batch_addr;
+	/**
+	 * @head: The tail pointer of the LRC (so head pointer of job) when the
+	 * job was submitted
+	 */
+	u32 head;
 };
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 07/36] drm/xe/guc: Track pending-enable source in submission state
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (5 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 06/36] drm/xe: Save off position in ring in which a job was programmed Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 08/36] drm/xe: Track LR jobs in DRM scheduler pending list Matthew Brost
                   ` (31 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Add explicit tracking in the GuC submission state to record the source
of a pending enable (TDR vs. queue resume path vs. submission).
Disambiguating the origin lets the GuC submission state machine apply
the correct recovery/replay behavior.

This helps VF restore: when the device comes back, the state machine
knows whether the pending enable stems from timeout recovery, a queue
resume sequence, or job submission, and can gate sequencing and fixups
accordingly.
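
A hedged sketch of how a consumer might branch on these bits follows;
this is an assumption for illustration only, the actual consumers land
later in the series ("Replay GuC submission state on pause / unpause"):

        /* Hypothetical consumer -- illustrative, not the series' code. */
        static void replay_pending_enable(struct xe_exec_queue *q)
        {
                if (exec_queue_pending_resume(q) ||
                    exec_queue_pending_tdr_exit(q))
                        enable_scheduling(q);  /* re-issue sched-enable only */
                else
                        submit_exec_queue(q);  /* enable came from submission */
        }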

v4:
 - Clarify commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 36 ++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 16f78376f196..13746f32b231 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -69,6 +69,8 @@ exec_queue_to_guc(struct xe_exec_queue *q)
 #define EXEC_QUEUE_STATE_BANNED			(1 << 9)
 #define EXEC_QUEUE_STATE_CHECK_TIMEOUT		(1 << 10)
 #define EXEC_QUEUE_STATE_EXTRA_REF		(1 << 11)
+#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 12)
+#define EXEC_QUEUE_STATE_PENDING_TDR_EXIT	(1 << 13)
 
 static bool exec_queue_registered(struct xe_exec_queue *q)
 {
@@ -220,6 +222,36 @@ static void set_exec_queue_extra_ref(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
 }
 
+static bool __maybe_unused exec_queue_pending_resume(struct xe_exec_queue *q)
+{
+	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
+}
+
+static void set_exec_queue_pending_resume(struct xe_exec_queue *q)
+{
+	atomic_or(EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
+}
+
+static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
+}
+
+static bool __maybe_unused exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+{
+	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
+}
+
+static void set_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+{
+	atomic_or(EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
+}
+
+static void clear_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
+}
+
 static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
 {
 	return (atomic_read(&q->guc->state) &
@@ -1334,6 +1366,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	return DRM_GPU_SCHED_STAT_RESET;
 
 sched_enable:
+	set_exec_queue_pending_tdr_exit(q);
 	enable_scheduling(q);
 rearm:
 	/*
@@ -1493,6 +1526,7 @@ static void __guc_exec_queue_process_msg_resume(struct xe_sched_msg *msg)
 		clear_exec_queue_suspended(q);
 		if (!exec_queue_enabled(q)) {
 			q->guc->resume_time = RESUME_PENDING;
+			set_exec_queue_pending_resume(q);
 			enable_scheduling(q);
 		}
 	} else {
@@ -2065,6 +2099,8 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
 		xe_gt_assert(guc_to_gt(guc), exec_queue_pending_enable(q));
 
 		q->guc->resume_time = ktime_get();
+		clear_exec_queue_pending_resume(q);
+		clear_exec_queue_pending_tdr_exit(q);
 		clear_exec_queue_pending_enable(q);
 		smp_wmb();
 		wake_up_all(&guc->ct.wq);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 08/36] drm/xe: Track LR jobs in DRM scheduler pending list
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (6 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 07/36] drm/xe/guc: Track pending-enable source in submission state Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 09/36] drm/xe: Don't change LRC ring head on job resubmission Matthew Brost
                   ` (30 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

VF migration requires jobs to remain pending so they can be replayed
after the VF comes back. Previously, LR job fences were intentionally
signaled immediately after submission to avoid the risk of exporting
them, as these fences do not naturally signal in a timely manner and
could break dma-fence contracts. A side effect of this approach was that
LR jobs were never added to the DRM scheduler’s pending list, preventing
them from being tracked for later resubmission.

We now avoid signaling LR job fences and ensure they are never exported;
Xe already guards against exporting these internal fences. With that
guarantee in place, we can safely track LR jobs in the scheduler’s
pending list so they are eligible for resubmission during VF
post-migration recovery (and similar recovery paths).

An added benefit is that LR queues now gain the DRM scheduler’s built-in
flow control over ring usage rather than rejecting new jobs in the exec
IOCTL if the ring is full.

v2:
 - Ensure DRM scheduler TDR doesn't run for LR jobs
 - Stack variable for killed_or_banned_or_wedged
v4:
 - Clarify commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_exec.c       | 12 ++-------
 drivers/gpu/drm/xe/xe_exec_queue.c | 19 -------------
 drivers/gpu/drm/xe/xe_exec_queue.h |  2 --
 drivers/gpu/drm/xe/xe_guc_submit.c | 43 ++++++++++++++++++++----------
 4 files changed, 31 insertions(+), 45 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
index 83897950f0da..0dc27476832b 100644
--- a/drivers/gpu/drm/xe/xe_exec.c
+++ b/drivers/gpu/drm/xe/xe_exec.c
@@ -124,7 +124,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	struct xe_validation_ctx ctx;
 	struct xe_sched_job *job;
 	struct xe_vm *vm;
-	bool write_locked, skip_retry = false;
+	bool write_locked;
 	int err = 0;
 	struct xe_hw_engine_group *group;
 	enum xe_hw_engine_group_execution_mode mode, previous_mode;
@@ -266,12 +266,6 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto err_exec;
 	}
 
-	if (xe_exec_queue_is_lr(q) && xe_exec_queue_ring_full(q)) {
-		err = -EWOULDBLOCK;	/* Aliased to -EAGAIN */
-		skip_retry = true;
-		goto err_exec;
-	}
-
 	if (xe_exec_queue_uses_pxp(q)) {
 		err = xe_vm_validate_protected(q->vm);
 		if (err)
@@ -328,8 +322,6 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		xe_sched_job_init_user_fence(job, &syncs[i]);
 	}
 
-	if (xe_exec_queue_is_lr(q))
-		q->ring_ops->emit_job(job);
 	if (!xe_vm_in_lr_mode(vm))
 		xe_exec_queue_last_fence_set(q, vm, &job->drm.s_fence->finished);
 	xe_sched_job_push(job);
@@ -355,7 +347,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		xe_validation_ctx_fini(&ctx);
 err_unlock_list:
 	up_read(&vm->lock);
-	if (err == -EAGAIN && !skip_retry)
+	if (err == -EAGAIN)
 		goto retry;
 err_hw_exec_mode:
 	if (mode == EXEC_MODE_DMA_FENCE)
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 6bfaca424ca3..81f707d2c388 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -824,25 +824,6 @@ bool xe_exec_queue_is_lr(struct xe_exec_queue *q)
 		!(q->flags & EXEC_QUEUE_FLAG_VM);
 }
 
-static s32 xe_exec_queue_num_job_inflight(struct xe_exec_queue *q)
-{
-	return q->lrc[0]->fence_ctx.next_seqno - xe_lrc_seqno(q->lrc[0]) - 1;
-}
-
-/**
- * xe_exec_queue_ring_full() - Whether an exec_queue's ring is full
- * @q: The exec_queue
- *
- * Return: True if the exec_queue's ring is full, false otherwise.
- */
-bool xe_exec_queue_ring_full(struct xe_exec_queue *q)
-{
-	struct xe_lrc *lrc = q->lrc[0];
-	s32 max_job = lrc->ring.size / MAX_JOB_SIZE_BYTES;
-
-	return xe_exec_queue_num_job_inflight(q) >= max_job;
-}
-
 /**
  * xe_exec_queue_is_idle() - Whether an exec_queue is idle.
  * @q: The exec_queue
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
index 8821ceb838d0..a4dfbe858bda 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -64,8 +64,6 @@ static inline bool xe_exec_queue_uses_pxp(struct xe_exec_queue *q)
 
 bool xe_exec_queue_is_lr(struct xe_exec_queue *q);
 
-bool xe_exec_queue_ring_full(struct xe_exec_queue *q);
-
 bool xe_exec_queue_is_idle(struct xe_exec_queue *q);
 
 void xe_exec_queue_kill(struct xe_exec_queue *q);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 13746f32b231..3a534d93505f 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -851,30 +851,31 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	struct xe_sched_job *job = to_xe_sched_job(drm_job);
 	struct xe_exec_queue *q = job->q;
 	struct xe_guc *guc = exec_queue_to_guc(q);
-	struct dma_fence *fence = NULL;
-	bool lr = xe_exec_queue_is_lr(q);
+	bool lr = xe_exec_queue_is_lr(q), killed_or_banned_or_wedged =
+		exec_queue_killed_or_banned_or_wedged(q);
 
 	xe_gt_assert(guc_to_gt(guc), !(exec_queue_destroyed(q) || exec_queue_pending_disable(q)) ||
 		     exec_queue_banned(q) || exec_queue_suspended(q));
 
 	trace_xe_sched_job_run(job);
 
-	if (!exec_queue_killed_or_banned_or_wedged(q) && !xe_sched_job_is_error(job)) {
+	if (!killed_or_banned_or_wedged && !xe_sched_job_is_error(job)) {
 		if (!exec_queue_registered(q))
 			register_exec_queue(q, GUC_CONTEXT_NORMAL);
-		if (!lr)	/* LR jobs are emitted in the exec IOCTL */
-			q->ring_ops->emit_job(job);
+		q->ring_ops->emit_job(job);
 		submit_exec_queue(q);
 	}
 
-	if (lr) {
-		xe_sched_job_set_error(job, -EOPNOTSUPP);
-		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
-	} else {
-		fence = job->fence;
-	}
+	/*
+	 * We don't care about job-fence ordering in LR VMs because these fences
+	 * are never exported; they are used solely to keep jobs on the pending
+	 * list. Once a queue enters an error state, there's no need to track
+	 * them.
+	 */
+	if (killed_or_banned_or_wedged && lr)
+		xe_sched_job_set_error(job, -ECANCELED);
 
-	return fence;
+	return job->fence;
 }
 
 static void guc_exec_queue_free_job(struct drm_sched_job *drm_job)
@@ -916,7 +917,8 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
 		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
 		xe_sched_submission_start(sched);
 		xe_gt_reset_async(q->gt);
-		xe_sched_tdr_queue_imm(sched);
+		if (!xe_exec_queue_is_lr(q))
+			xe_sched_tdr_queue_imm(sched);
 		return;
 	}
 
@@ -1008,6 +1010,7 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	struct xe_exec_queue *q = ge->q;
 	struct xe_guc *guc = exec_queue_to_guc(q);
 	struct xe_gpu_scheduler *sched = &ge->sched;
+	struct xe_sched_job *job;
 	bool wedged = false;
 
 	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
@@ -1058,7 +1061,16 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	if (!exec_queue_killed(q) && !xe_lrc_ring_is_idle(q->lrc[0]))
 		xe_devcoredump(q, NULL, "LR job cleanup, guc_id=%d", q->guc->id);
 
+	xe_hw_fence_irq_stop(q->fence_irq);
+
 	xe_sched_submission_start(sched);
+
+	spin_lock(&sched->base.job_list_lock);
+	list_for_each_entry(job, &sched->base.pending_list, drm.list)
+		xe_sched_job_set_error(job, -ECANCELED);
+	spin_unlock(&sched->base.job_list_lock);
+
+	xe_hw_fence_irq_start(q->fence_irq);
 }
 
 #define ADJUST_FIVE_PERCENT(__t)	mul_u64_u32_div(__t, 105, 100)
@@ -1129,7 +1141,8 @@ static void enable_scheduling(struct xe_exec_queue *q)
 		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
 		set_exec_queue_banned(q);
 		xe_gt_reset_async(q->gt);
-		xe_sched_tdr_queue_imm(&q->guc->sched);
+		if (!xe_exec_queue_is_lr(q))
+			xe_sched_tdr_queue_imm(&q->guc->sched);
 	}
 }
 
@@ -1187,6 +1200,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	int i = 0;
 	bool wedged = false, skip_timeout_check;
 
+	xe_gt_assert(guc_to_gt(guc), !xe_exec_queue_is_lr(q));
+
 	/*
 	 * TDR has fired before free job worker. Common if exec queue
 	 * immediately closed after last fence signaled. Add back to pending
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 09/36] drm/xe: Don't change LRC ring head on job resubmission
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (7 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 08/36] drm/xe: Track LR jobs in DRM scheduler pending list Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30  2:38   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 10/36] drm/xe: Make LRC W/A scratch buffer usage consistent Matthew Brost
                   ` (29 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Now that we save the job's head during submission, it's no longer
necessary to adjust the LRC ring head during resubmission. Instead, a
software-based adjustment of the tail will overwrite the old jobs in
place. For some odd reason, adjusting the LRC ring head didn't work on
parallel queues, which was causing issues in our CI.

v6:
 - Also set LRC tail to head so queue is idle coming out of reset

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 3a534d93505f..70306f902ba5 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2008,11 +2008,17 @@ static void guc_exec_queue_start(struct xe_exec_queue *q)
 	struct xe_gpu_scheduler *sched = &q->guc->sched;
 
 	if (!exec_queue_killed_or_banned_or_wedged(q)) {
+		struct xe_sched_job *job = xe_sched_first_pending_job(sched);
 		int i;
 
 		trace_xe_exec_queue_resubmit(q);
-		for (i = 0; i < q->width; ++i)
-			xe_lrc_set_ring_head(q->lrc[i], q->lrc[i]->ring.tail);
+		if (job) {
+			for (i = 0; i < q->width; ++i) {
+				q->lrc[i]->ring.tail = job->ptrs[i].head;
+				xe_lrc_set_ring_tail(q->lrc[i],
+						     xe_lrc_ring_head(q->lrc[i]));
+			}
+		}
 		xe_sched_resubmit_jobs(sched);
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 10/36] drm/xe: Make LRC W/A scratch buffer usage consistent
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (8 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 09/36] drm/xe: Don't change LRC ring head on job resubmission Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 11/36] drm/xe/guc: Document GuC submission backend Matthew Brost
                   ` (28 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

The LRC W/A currently checks for LRC being iomem in some places, while
in others it checks if the scratch buffer is non-NULL. This
inconsistency causes issues with the VF post-migration recovery code,
which blindly passes in a scratch buffer.

This patch standardizes the check by consistently verifying whether the
LRC is iomem to determine if the scratch buffer should be used.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 drivers/gpu/drm/xe/xe_lrc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index e1bc102a6cae..f13737ac707e 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1248,7 +1248,7 @@ static int setup_bo(struct bo_setup_state *state)
 
 static void finish_bo(struct bo_setup_state *state)
 {
-	if (!state->buffer)
+	if (!state->lrc->bo->vmap.is_iomem)
 		return;
 
 	xe_map_memcpy_to(gt_to_xe(state->lrc->gt), &state->lrc->bo->vmap,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 11/36] drm/xe/guc: Document GuC submission backend
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (9 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 10/36] drm/xe: Make LRC W/A scratch buffer usage consistent Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30  3:28   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 12/36] drm/xe/vf: Add xe_gt_recovery_inprogress helper Matthew Brost
                   ` (27 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Add kernel-doc to xe_guc_submit.c describing the submission path,
the per-queue single-threaded model with pause/resume, the driver shadow
state machine and lost-H2G replay, job timeout handling, recovery flows
(GT reset, PM resume, VF resume), and reclaim constraints.

v2:
 - Mirror tweaks for clarity
 - Add new doc to Xe rst files
v3:
 - Clarify global vs per-queue stop / start
 - Clarify VF resume flow
 - Add section for 'Waiters during VF resume'
 - Add section for 'Page-faulting queues during VF migration'
 - Add section for 'GuC-ID assignment'
 - Add section for 'Reference counting and final queue destruction'
v4:
 - s/VF resume/VF post migration recovery (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 Documentation/gpu/xe/index.rst     |   1 +
 drivers/gpu/drm/xe/xe_guc_submit.c | 282 +++++++++++++++++++++++++++++
 2 files changed, 283 insertions(+)

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index 88b22fad880e..692c544b164c 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -28,3 +28,4 @@ DG2, etc is provided to prototype the driver.
    xe_device
    xe-drm-usage-stats.rst
    xe_configfs
+   xe_guc_submit
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 70306f902ba5..cd5e506527fe 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -46,6 +46,288 @@
 #include "xe_trace.h"
 #include "xe_vm.h"
 
+/*
+ * DOC: Overview
+ *
+ * The GuC submission backend is responsible for submitting GPU jobs to the GuC
+ * firmware, assigning per-queue GuC IDs, tracking submission state via a
+ * driver-side state machine, handling GuC-to-host (G2H) messages, tracking
+ * outstanding jobs, managing job timeouts and queue teardown, and providing
+ * recovery when GuC state is lost. It is built on top of the DRM scheduler
+ * (drm_sched).
+ *
+ * GuC ID assignment:
+ * ------------------
+ * Each queue is assigned a unique GuC ID at queue init. The ID is used in all
+ * H2G/G2H to identify the queue and remains reserved until final destruction,
+ * when the GuC is known to hold no references to it.
+ *
+ * The backend maintains a reverse map GuC-ID -> queue to resolve targets for
+ * G2H handlers and to iterate all queues when required (e.g., recovery). This
+ * map is protected by submission_state.lock, a global (per-GT) lock. Lockless
+ * lookups are acceptable in paths where the queue's lifetime is otherwise
+ * pinned and it cannot disappear underneath the operation (e.g., G2H handlers).
+ *
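+ * A minimal sketch of a G2H-side lookup (illustrative only; it assumes the
+ * xarray-backed reverse map used by this file, and exact names may differ)::
+ *
+ *	struct xe_exec_queue *q;
+ *
+ *	q = xa_load(&guc->submission_state.exec_queue_lookup, guc_id);
+ *	if (unlikely(!q))
+ *		return -EPROTO;	/* GuC referenced an unknown queue */
+ *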
+ * Basic submission flow
+ * ---------------------
+ * Submission is driven by the DRM scheduler vfunc ->run_job(). The flow is:
+ *
+ * 1) Emit the job's ring instructions.
+ * 2) Advance the LRC ring tail:
+ *    - width == 1: simple memory write,
+ *    - width  > 1: append a GuC workqueue (WQ) item.
+ * 3) If the queue is unregistered, issue a register H2G for the context.
+ * 4) Trigger execution via a scheduler enable or context submit command.
+ * 5) Return the job's hardware fence to the DRM scheduler.
+ *
+ * Registration, scheduler enable, and submit commands are issued as host-to-GuC
+ * (H2G) messages over the Command Transport (CT) layer, like all GuC
+ * interactions.
+ *
+ * Completion path
+ * ---------------
+ * When the job's hardware fence signals, the DRM scheduler vfunc ->free_job()
+ * is called; it drops the job's reference, typically freeing it.
+ *
+ * Control-plane messages:
+ * -----------------------
+ * GuC submission scheduler messages form the control plane for queue cleanup,
+ * toggling runnability, and modifying queue properties (e.g., scheduler
+ * priority, timeslice, preemption timeout). Messages are initiated via queue
+ * vfuncs that append a control message to the queue. They are processed on the
+ * same single-threaded DRM scheduler workqueue that runs ->run_job() and
+ * ->free_job().
+ *
+ * Lockless model:
+ * ---------------
+ * ->run_job(), ->free_job(), and the message handlers execute as work items on
+ * a single-threaded DRM scheduler workqueue. Per queue, this provides built-in
+ * mutual exclusion: only one of these items can run at a time. As a result,
+ * these paths are lockless with respect to per-queue state tracking. (Global
+ * or cross-queue data structures still use their own synchronization.)
+ *
+ * Stopping / starting:
+ * --------------------
+ * The submission backend supports two scopes of quiesce control:
+ *
+ *  - Per-queue stop/start:
+ *    The single-threaded DRM scheduler workqueue for a specific queue can be
+ *    stopped and started dynamically. Stopping synchronously quiesces that
+ *    queue's worker (lets any in-flight item finish and prevents new items from
+ *    starting), yielding a stable snapshot while an external operation (e.g.,
+ *    job timeout handling) inspects/updates state and performs any required
+ *    fixups. While stopped, no submission, message, or ->free_job() work runs
+ *    for that queue. When the operation completes, the queue is started; any
+ *    pending items are then processed in order on the same worker. Other queues
+ *    continue to run unaffected.
+ *
+ *  - Global (per-GT) stop/start:
+ *    Implemented on top of the per-queue stop/start primitive: the driver
+ *    stops (or starts) each queue on the GT to obtain a device-wide stable
+ *    snapshot. This is used by coordinated recovery flows (GT reset, PM resume,
+ *    VF post migration recovery). Queues created while the global stop is in
+ *    effect (i.e., future queues) initialize in the stopped state and remain
+ *    stopped until the global start. After recovery fixups are complete, a
+ *    global start iterates queues to start all eligible ones and resumes normal
+ *    submission.
+ *
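+ * A minimal usage sketch of the per-queue primitive, as an external
+ * operation would use it (stop/start map to the xe_sched_submission_stop()
+ * and xe_sched_submission_start() helpers seen elsewhere in this file)::
+ *
+ *	xe_sched_submission_stop(sched);	/* quiesce this queue's worker */
+ *	/* inspect/update queue state, perform fixups */
+ *	xe_sched_submission_start(sched);	/* drain pending items in order */
+ *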
+ * State machine:
+ * --------------
+ * The submission state machine is the driver's shadow of the GuC-visible queue
+ * state (e.g., registered, runnable, scheduler properties). It tracks the
+ * transitions we intend to make (issued as H2G commands), marking them pending
+ * until acknowledged via G2H or otherwise observed as applied. It also records
+ * the origin of each transition (->run_job(), timeout handler, explicit control
+ * message, etc.).
+ *
+ * Because H2G commands and/or GuC submission state can be lost across GT reset,
+ * PM resume, or VF post migration recovery, this bookkeeping lets recovery
+ * decide which operations to replay, which to elide, and which need fixups,
+ * restoring a consistent queue state without additional per-queue locks.
+ *
+ * Job timeouts:
+ * -------------
+ * To prevent jobs from running indefinitely and violating dma-fence signaling
+ * rules, the DRM scheduler tracks how long each job has been running. If a
+ * threshold is exceeded, it calls ->timeout_job().
+ *
+ * ->timeout_job() stops the queue, samples the LRC context timestamps to
+ * confirm the job actually started and has exceeded the allowed runtime, and
+ * then, if confirmed, signals all pending jobs' fences and initiates queue
+ * teardown. Finally, the queue is started.
+ *
+ * Job timeout handling runs on a per-GT, single-threaded recovery workqueue
+ * that is shared with other recovery paths (e.g., GT reset handling, VF post
+ * migration recovery). This guarantees only one recovery action executes at
+ * a time.
+ *
+ * Queue teardown:
+ * ---------------
+ * Teardown can be triggered by: (1) userspace closing the queue; (2) a G2H
+ * queue-reset notification; (3) a G2H memory_cat_error for the queue; or (4)
+ * in-flight jobs detected on the queue during GT reset.
+ *
+ * In all cases teardown is driven via the timeout path by setting the queue's
+ * DRM scheduler timeout to zero, forcing an immediate ->timeout_job() pass.
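+ *
+ * A minimal sketch of such a trigger (illustrative; both helpers appear in
+ * this file)::
+ *
+ *	set_exec_queue_banned(q);		/* record the teardown reason */
+ *	xe_sched_tdr_queue_imm(&q->guc->sched);	/* force ->timeout_job() now */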
+ *
+ * Reference counting and final queue destruction:
+ * -----------------------------------------------
+ * Jobs reference-count the queue; queues hold a reference to the VM. When a
+ * queue's reference count reaches zero (e.g., all jobs are freed and the
+ * userspace handle is closed), the queue is not destroyed immediately because
+ * the GuC may still reference its state.
+ *
+ * Instead, a control-plane cleanup message is appended to remove GuC-side
+ * references (e.g., disable runnability, deregister). Once the final G2H
+ * confirming that the GuC no longer references the queue is received, the
+ * queue becomes eligible for destruction.
+ *
+ * To avoid freeing the queue from within its own DRM scheduler workqueue (which
+ * would risk use-after-free), the actual destruction is deferred to a separate
+ * work item queued on a dedicated destruction workqueue.
+ *
+ * GT resets:
+ * ----------
+ * GT resets are triggered by catastrophic errors (e.g., CT channel failure).
+ * The GuC is reset and all GuC-side submission state is lost. Recovery proceeds
+ * as follows:
+ *
+ * 1) Quiesce:
+ *    - Stop all queues (global submission stop). Per-queue workers finish any
+ *      in-flight item and then stop; newly created queues during the window
+ *      initialize in the stopped state.
+ *    - Abort any waits on CT/G2H to avoid deadlock.
+ *
+ * 2) Sanitize driver shadow state:
+ *    - For each queue, clear GuC-derived bits in the submission state machine
+ *      (e.g., registered/enabled) and mark in-flight H2G transitions as lost.
+ *    - Convert/flush any side effects of lost H2G.
+ *
+ * 3) Decide teardown vs. replay:
+ *    - If a queue's LRC seqno indicates that a job started but did not
+ *      complete, initiate teardown for that queue via the timeout path.
+ *    - If no job started, keep the queue for replay.
+ *
+ * 4) Resume:
+ *    - Start remaining queues; resubmit pending jobs.
+ *    - Queues marked for teardown remain stopped/destroyed.
+ *
+ * The entire sequence runs on the per-GT single-threaded recovery worker,
+ * ensuring only one recovery action executes at a time; a runtime PM reference
+ * is held for the duration.
+ *
+ * PM resume:
+ * ----------
+ * PM resume assumes all GuC state is lost (the device may have been powered
+ * down). It reuses the GT reset recovery path, but executes in the context of
+ * the caller that wakes the device (runtime PM or system resume).
+ *
+ * Suspend entry:
+ *  - Control-plane message work is quiesced; state toggles that require an
+ *    active device are not enqueued while suspended.
+ *  - Per-queue scheduler workers are stopped before the device is allowed to
+ *    suspend.
+ *  - Barring driver bugs, no queues should have in-flight jobs at
+ *    suspend/resume.
+ *
+ * On resume, run the GT reset recovery flow and then start eligible queues.
+ *
+ * Runtime PM and state-change ordering:
+ * -------------------------------------
+ * Runtime/system PM transitions must not race with per-queue submission and
+ * state updates.
+ *
+ * Execution contexts and RPM sources:
+ *  - Scheduler callbacks (->run_job(), ->free_job(), ->timeout_job()):
+ *    executed with an active RPM ref held by the in-flight job.
+ *  - Control-plane message work:
+ *    enqueued from IOCTL paths that already hold an RPM ref; the message path
+ *    itself does not get/put RPM. State toggles are only issued while active.
+ *    During suspend entry, message work is quiesced and no new toggles are
+ *    enqueued until after resume.
+ *  - G2H handlers:
+ *    dispatched with an RPM ref guaranteed by the CT layer.
+ *  - Recovery phases (GT reset/VF post migration recovery):
+ *    explicitly get/put an RPM ref for their duration on the per-GT recovery
+ *    worker.
+ *
+ * Consequence:
+ *  - All submission/state mutations run with an RPM reference. The PM core
+ *    cannot enter suspend while these updates are in progress, and resume is
+ *    complete before updates execute. This prevents PM state changes from
+ *    racing with queue state changes.
+ *
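+ * A minimal sketch of the recovery-phase contract (the same get/put pairing
+ * is visible in the VF recovery worker later in this series)::
+ *
+ *	xe_pm_runtime_get(xe);
+ *	/* stop queues, fix up state, restart queues */
+ *	xe_pm_runtime_put(xe);
+ *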
+ * VF post migration recovery:
+ * ---------------------------
+ * VF post migration recovery resembles a GT reset, but GuC submission state is
+ * expected to persist across migration; in-flight H2G commands may be lost, and
+ * GGTT base/offsets may change. Recovery proceeds as follows:
+ *
+ * 1) Quiesce:
+ *    - Stop all queues and abort waits (as with GT reset) to obtain a stable
+ *      snapshot.
+ *    - Queues created while VF post migration recovery is in-flight initialize
+ *      in the stopped state.
+ *
+ * 2) Treat H2G as lost and prepare in-place resubmission (GuC/CT down):
+ *    - Treat in-flight H2G (enable/disable, etc.) as dropped; update shadow
+ *      bits to a safe baseline and tag the ops as "needs replay".
+ *    - Quarantine device-visible submission state: set the GuC-visible LRC ring
+ *      tail equal to the head (and, for WQ-based submission, set the WQ
+ *      descriptor head == tail) so that when the GuC comes up it will not process
+ *      any entries that were built with stale GGTT addresses.
+ *    - Reset the software ring tail to the original value captured at the
+ *      submission of the oldest pending job, so the write pointer sits exactly
+ *      where that job was originally emitted (see the sketch below).
+ *
+ * 3) Replay and resubmit once GuC/CT is live:
+ *    - VF post migration recovery invokes ->run_job() for pending jobs;
+ *      ->emit_job() overwrites ring instructions in place, fixes GGTT fields,
+ *      then advances the LRC tail (and WQ descriptor for width > 1). Required
+ *      submission H2G(s) are reissued and fresh WQ entries are written.
+ *    - Queue lost control-plane operations (scheduling-state toggles, cleanup)
+ *      in order via the message path.
+ *    - Start the queues to process the queued control-plane operations and run
+ *      the resubmitted jobs.
+ *
+ * The goal is to preserve both job and queue state; no teardown is performed
+ * in this flow. The sequence runs on the per-GT single-threaded recovery
+ * worker with a held runtime PM reference.
+ *
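+ * A minimal sketch of step 2's quarantine/rewind for a single-LRC queue,
+ * built from helpers used by the resubmission path in this series (width > 1
+ * additionally rewrites the WQ descriptor)::
+ *
+ *	struct xe_sched_job *job = xe_sched_first_pending_job(sched);
+ *
+ *	if (job) {
+ *		/* rewind SW tail to where the oldest job was emitted */
+ *		q->lrc[0]->ring.tail = job->ptrs[0].head;
+ *		/* GuC-visible tail == head: nothing stale to consume */
+ *		xe_lrc_set_ring_tail(q->lrc[0], xe_lrc_ring_head(q->lrc[0]));
+ *	}
+ *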
+ * Waiters during VF post migration recovery
+ * -----------------------------------------
+ * The submission backend frequently uses wait_event_timeout() to wait on
+ * GuC-driven conditions. Across VF migration/recovery two issues arise:
+ * 1) The timeout does not account for migration downtime and may expire
+ *    prematurely, triggering undesired actions (e.g., GT reset, prematurely
+ *    signaling a fence).
+ * 2) Some waits target GuC work that cannot complete until VF recovery
+ *    finishes; these typically sit on the queue-stopping path.
+ *
+ * To handle this, all waiters must atomically test the "GuC down / VF-recovery
+ * in progress" condition (e.g., VF_RESFIX_BLOCKED) both before sleeping and
+ * after wakeup. The flag is coherent with VF migration: vCPUs observe it
+ * immediately on unhalt, and it is cleared only after the GuC/CT is live again.
+ * If set, the waiter must either (a) abort the wait without side effects, or
+ * (b) re-arm the wait with a fresh timeout once the GuC/CT is live. Timeouts
+ * that occur while GuC/CT are down are non-fatal—the VF-recovery path will
+ * rebuild state—and must not trigger recovery or teardown.
+ *
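+ * A minimal sketch of the waiter pattern (illustrative; it uses the
+ * xe_gt_recovery_inprogress() helper added by this series, while the wait
+ * queue and condition names are placeholders)::
+ *
+ *	ret = wait_event_timeout(wq, done || xe_gt_recovery_inprogress(gt),
+ *				 timeout);
+ *	if (!ret && xe_gt_recovery_inprogress(gt))
+ *		return;	/* non-fatal: re-arm or rebuild after recovery */
+ *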
+ * Relation to reclaim:
+ * --------------------
+ * Jobs signal dma-fences, and the MM may wait on those fences during reclaim.
+ * As a consequence, the entire GuC submission backend (DRM scheduler callbacks,
+ * message handling, and all recovery paths) lies on the reclaim path and must
+ * be reclaim-safe.
+ *
+ * Practical implications:
+ * - No memory allocations in these paths (avoid any allocation that could
+ *   recurse into reclaim or sleep).
+ * - The global submission-state lock may be taken from reclaim-tainted contexts
+ *   (timeout/recovery). Any path that acquires it (including queue init/destroy)
+ *   must not allocate or take locks that can recurse into reclaim while holding
+ *   it; keep the critical section to state/xarray updates.
+ */
+
 static struct xe_guc *
 exec_queue_to_guc(struct xe_exec_queue *q)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 12/36] drm/xe/vf: Add xe_gt_recovery_inprogress helper
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (10 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 11/36] drm/xe/guc: Document GuC submission backend Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  8:04   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 13/36] drm/xe/vf: Make VF recovery run on per-GT worker Matthew Brost
                   ` (26 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Add xe_gt_recovery_inprogress helper.

This helper serves as the singular point to determine whether a GT
recovery is currently in progress. Expected callers include the GuC CT
layer and the GuC submission layer. The condition is atomically visible as
soon as vCPUs are unhalted and remains set until VF recovery completes.
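
A minimal usage sketch (illustrative; real call sites gate timeout-driven
actions in the CT and submission layers):

	if (xe_gt_recovery_inprogress(gt))
		return;	/* defer: GuC/CT are down until recovery completes */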

v3:
 - Add GT layer xe_gt_recovery_inprogress (Michal)
 - Don't blow up in memirq not enabled (CI)
 - Add __memirq_received with clear argument (Michal)
 - xe_memirq_sw_int_0_irq_pending rename (Michal)
 - Use offset in xe_memirq_sw_int_0_irq_pending (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.h                | 13 ++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 25 ++++++++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 10 +++++
 drivers/gpu/drm/xe/xe_memirq.c            | 48 +++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_memirq.h            |  2 +
 6 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index 41880979f4de..ee0239b2f48c 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -12,6 +12,7 @@
 
 #include "xe_device.h"
 #include "xe_device_types.h"
+#include "xe_gt_sriov_vf.h"
 #include "xe_hw_engine.h"
 
 #define for_each_hw_engine(hwe__, gt__, id__) \
@@ -124,4 +125,16 @@ static inline bool xe_gt_is_usm_hwe(struct xe_gt *gt, struct xe_hw_engine *hwe)
 		hwe->instance == gt->usm.reserved_bcs_instance;
 }
 
+/**
+ * xe_gt_recovery_inprogress() - GT recovery in progress
+ * @gt: the &xe_gt
+ *
+ * Return: True if GT recovery in progress, False otherwise
+ */
+static inline bool xe_gt_recovery_inprogress(struct xe_gt *gt)
+{
+	return IS_SRIOV_VF(gt_to_xe(gt)) &&
+		xe_gt_sriov_vf_recovery_inprogress(gt);
+}
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 016c867e5e2b..71309219a4b7 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -26,6 +26,7 @@
 #include "xe_guc_hxg_helpers.h"
 #include "xe_guc_relay.h"
 #include "xe_lrc.h"
+#include "xe_memirq.h"
 #include "xe_mmio.h"
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
@@ -828,6 +829,7 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
 	struct xe_device *xe = gt_to_xe(gt);
 
 	xe_gt_assert(gt, IS_SRIOV_VF(xe));
+	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_inprogress(gt));
 
 	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
 	/*
@@ -1172,3 +1174,26 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 	drm_printf(p, "\thandshake:\t%u.%u\n",
 		   pf_version->major, pf_version->minor);
 }
+
+/**
+ * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
+ * @gt: the &xe_gt
+ *
+ * Return: True if VF post migration recovery in progress, False otherwise
+ */
+bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt)
+{
+	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
+
+	/*
+	 * In practice, VF migration will never be supported on platforms
+	 * without memirq; avoid CI blowing up on older VF platforms.
+	 */
+	if (!xe_device_uses_memirq(gt_to_xe(gt)))
+		return false;
+
+	return (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
+		READ_ONCE(gt->sriov.vf.migration.recovery_inprogress));
+}
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 0af1dc769fe0..bb5f8eace19b 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -25,6 +25,8 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
 int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
 
+bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
+
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
 u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
 u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index d95857bd789b..7b10b8e1e10e 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -49,6 +49,14 @@ struct xe_gt_sriov_vf_runtime {
 	} *regs;
 };
 
+/**
+ * struct xe_gt_sriov_vf_migration - VF migration data.
+ */
+struct xe_gt_sriov_vf_migration {
+	/** @recovery_inprogress: VF post migration recovery in progress */
+	bool recovery_inprogress;
+};
+
 /**
  * struct xe_gt_sriov_vf - GT level VF virtualization data.
  */
@@ -61,6 +69,8 @@ struct xe_gt_sriov_vf {
 	struct xe_gt_sriov_vf_selfconfig self_config;
 	/** @runtime: runtime data retrieved from the PF. */
 	struct xe_gt_sriov_vf_runtime runtime;
+	/** @migration: migration data for the VF. */
+	struct xe_gt_sriov_vf_migration migration;
 };
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
index 49c45ec3e83c..b681c67dcace 100644
--- a/drivers/gpu/drm/xe/xe_memirq.c
+++ b/drivers/gpu/drm/xe/xe_memirq.c
@@ -398,8 +398,9 @@ void xe_memirq_postinstall(struct xe_memirq *memirq)
 		memirq_set_enable(memirq, true);
 }
 
-static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
-			    u16 offset, const char *name)
+static bool __memirq_received(struct xe_memirq *memirq,
+			      struct iosys_map *vector, u16 offset,
+			      const char *name, bool clear)
 {
 	u8 value;
 
@@ -409,12 +410,26 @@ static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
 			memirq_err_ratelimited(memirq,
 					       "Unexpected memirq value %#x from %s at %u\n",
 					       value, name, offset);
-		iosys_map_wr(vector, offset, u8, 0x00);
+		if (clear)
+			iosys_map_wr(vector, offset, u8, 0x00);
 	}
 
 	return value;
 }
 
+static bool memirq_received_noclear(struct xe_memirq *memirq,
+				    struct iosys_map *vector,
+				    u16 offset, const char *name)
+{
+	return __memirq_received(memirq, vector, offset, name, false);
+}
+
+static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
+			    u16 offset, const char *name)
+{
+	return __memirq_received(memirq, vector, offset, name, true);
+}
+
 static void memirq_dispatch_engine(struct xe_memirq *memirq, struct iosys_map *status,
 				   struct xe_hw_engine *hwe)
 {
@@ -434,8 +449,16 @@ static void memirq_dispatch_guc(struct xe_memirq *memirq, struct iosys_map *stat
 	if (memirq_received(memirq, status, ilog2(GUC_INTR_GUC2HOST), name))
 		xe_guc_irq_handler(guc, GUC_INTR_GUC2HOST);
 
-	if (memirq_received(memirq, status, ilog2(GUC_INTR_SW_INT_0), name))
+	/*
+	 * We must wait to perform the clear operation until after
+	 * xe_gt_sriov_vf_start_migration_recovery() runs, to avoid race
+	 * conditions where xe_gt_sriov_vf_recovery_inprogress() returns false.
+	 */
+	if (memirq_received_noclear(memirq, status, ilog2(GUC_INTR_SW_INT_0),
+				    name)) {
 		xe_guc_irq_handler(guc, GUC_INTR_SW_INT_0);
+		iosys_map_wr(status, ilog2(GUC_INTR_SW_INT_0), u8, 0x00);
+	}
 }
 
 /**
@@ -460,6 +483,23 @@ void xe_memirq_hwe_handler(struct xe_memirq *memirq, struct xe_hw_engine *hwe)
 	}
 }
 
+/**
+ * xe_memirq_sw_int_0_irq_pending() - SW_INT_0 IRQ is pending
+ * @memirq: the &xe_memirq
+ * @guc: the &xe_guc to check for IRQ
+ *
+ * Return: True if SW_INT_0 IRQ is pending on @guc, False otherwise
+ */
+bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc)
+{
+	struct xe_gt *gt = guc_to_gt(guc);
+	u32 offset = xe_gt_is_media_type(gt) ? ilog2(INTR_MGUC) : ilog2(INTR_GUC);
+	struct iosys_map map = IOSYS_MAP_INIT_OFFSET(&memirq->status, offset * SZ_16);
+
+	return memirq_received_noclear(memirq, &map, ilog2(GUC_INTR_SW_INT_0),
+				       guc_name(guc));
+}
+
 /**
  * xe_memirq_handler - The `Memory Based Interrupts`_ Handler.
  * @memirq: the &xe_memirq
diff --git a/drivers/gpu/drm/xe/xe_memirq.h b/drivers/gpu/drm/xe/xe_memirq.h
index 06130650e9d6..f87e1274b730 100644
--- a/drivers/gpu/drm/xe/xe_memirq.h
+++ b/drivers/gpu/drm/xe/xe_memirq.h
@@ -25,4 +25,6 @@ void xe_memirq_handler(struct xe_memirq *memirq);
 
 int xe_memirq_init_guc(struct xe_memirq *memirq, struct xe_guc *guc);
 
+bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc);
+
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 13/36] drm/xe/vf: Make VF recovery run on per-GT worker
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (11 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 12/36] drm/xe/vf: Add xe_gt_recovery_inprogress helper Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30 14:47   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 14/36] drm/xe/vf: Abort H2G sends during VF post-migration recovery Matthew Brost
                   ` (25 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

VF recovery is a per-GT operation, so it makes sense to isolate it to a
per-GT queue. Scheduling this operation on the same worker as the GT
reset and TDR not only aligns with this design but also helps avoid race
conditions, as those operations can also modify the queue state.

v2:
 - Fix lockdep splat (Adam)
 - Use xe_sriov_vf_migration_supported helper
v3:
 - Drop xe_gt_sriov_ prefix for private functions (Michal)
 - Drop message in xe_gt_sriov_vf_migration_init_early (Michal)
 - Logic rework in vf_post_migration_notify_resfix_done (Michal)
 - Rework init sequence layering (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.c                |   6 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 179 +++++++++++++++-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |   3 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |   7 +
 drivers/gpu/drm/xe/xe_sriov_vf.c          | 246 ----------------------
 drivers/gpu/drm/xe/xe_sriov_vf.h          |   1 -
 drivers/gpu/drm/xe/xe_sriov_vf_types.h    |   4 -
 7 files changed, 182 insertions(+), 264 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 3e0ad7e5b5df..5f9ba4caf837 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -398,6 +398,12 @@ int xe_gt_init_early(struct xe_gt *gt)
 			return err;
 	}
 
+	if (IS_SRIOV_VF(gt_to_xe(gt))) {
+		err = xe_gt_sriov_vf_init_early(gt);
+		if (err)
+			return err;
+	}
+
 	xe_reg_sr_init(&gt->reg_sr, "GT", gt_to_xe(gt));
 
 	err = xe_wa_gt_init(gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 71309219a4b7..ae9df9c0876d 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -25,11 +25,15 @@
 #include "xe_guc.h"
 #include "xe_guc_hxg_helpers.h"
 #include "xe_guc_relay.h"
+#include "xe_guc_submit.h"
+#include "xe_irq.h"
 #include "xe_lrc.h"
 #include "xe_memirq.h"
 #include "xe_mmio.h"
+#include "xe_pm.h"
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
+#include "xe_tile_sriov_vf.h"
 #include "xe_uc_fw.h"
 #include "xe_wopcm.h"
 
@@ -308,13 +312,13 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
 }
 
 /**
- * xe_gt_sriov_vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
+ * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
  * @gt: the &xe_gt struct instance linked to target GuC
  *
  * Returns: 0 if the operation completed successfully, or a negative error
  * code otherwise.
  */
-int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
+static int vf_notify_resfix_done(struct xe_gt *gt)
 {
 	struct xe_guc *guc = &gt->uc.guc;
 	int err;
@@ -808,7 +812,7 @@ int xe_gt_sriov_vf_connect(struct xe_gt *gt)
  * xe_gt_sriov_vf_default_lrcs_hwsp_rebase - Update GGTT references in HWSP of default LRCs.
  * @gt: the &xe_gt struct instance
  */
-void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
+static void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
 {
 	struct xe_hw_engine *hwe;
 	enum xe_hw_engine_id id;
@@ -817,6 +821,26 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
 		xe_default_lrc_update_memirq_regs_with_address(hwe);
 }
 
+static void vf_start_migration_recovery(struct xe_gt *gt)
+{
+	bool started;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
+
+	spin_lock(&gt->sriov.vf.migration.lock);
+
+	if (!gt->sriov.vf.migration.recovery_queued) {
+		gt->sriov.vf.migration.recovery_queued = true;
+		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
+
+		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
+		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
+				 "scheduled" : "already in progress");
+	}
+
+	spin_unlock(&gt->sriov.vf.migration.lock);
+}
+
 /**
  * xe_gt_sriov_vf_migrated_event_handler - Start a VF migration recovery,
  *   or just mark that a GuC is ready for it.
@@ -831,15 +855,8 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
 	xe_gt_assert(gt, IS_SRIOV_VF(xe));
 	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_inprogress(gt));
 
-	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
-	/*
-	 * We need to be certain that if all flags were set, at least one
-	 * thread will notice that and schedule the recovery.
-	 */
-	smp_mb__after_atomic();
-
 	xe_gt_sriov_info(gt, "ready for recovery after migration\n");
-	xe_sriov_vf_start_migration_recovery(xe);
+	vf_start_migration_recovery(gt);
 }
 
 static bool vf_is_negotiated(struct xe_gt *gt, u16 major, u16 minor)
@@ -1175,6 +1192,146 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 		   pf_version->major, pf_version->minor);
 }
 
+static void vf_post_migration_shutdown(struct xe_gt *gt)
+{
+	int ret = 0;
+
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	gt->sriov.vf.migration.recovery_queued = false;
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	xe_guc_submit_pause(&gt->uc.guc);
+	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
+
+	if (ret)
+		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
+}
+
+static size_t post_migration_scratch_size(struct xe_device *xe)
+{
+	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
+}
+
+static int vf_post_migration_fixups(struct xe_gt *gt)
+{
+	s64 shift;
+	void *buf;
+	int err;
+
+	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_ATOMIC);
+	if (!buf)
+		return -ENOMEM;
+
+	err = xe_gt_sriov_vf_query_config(gt);
+	if (err)
+		goto out;
+
+	shift = xe_gt_sriov_vf_ggtt_shift(gt);
+	if (shift) {
+		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
+		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
+		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
+		if (err)
+			goto out;
+	}
+
+out:
+	kfree(buf);
+	return err;
+}
+
+static void vf_post_migration_kickstart(struct xe_gt *gt)
+{
+	/*
+	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
+	 * must be working at this point, since recovery has started, but the
+	 * rest were not enabled using the procedure from the spec.
+	 */
+	xe_irq_resume(gt_to_xe(gt));
+
+	xe_guc_submit_reset_unblock(&gt->uc.guc);
+	xe_guc_submit_unpause(&gt->uc.guc);
+}
+
+static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
+{
+	bool skip_resfix = false;
+
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	if (gt->sriov.vf.migration.recovery_queued) {
+		skip_resfix = true;
+		xe_gt_sriov_dbg(gt, "another recovery imminent, skipped some notifications\n");
+	} else {
+		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false);
+	}
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	if (skip_resfix)
+		return -EAGAIN;
+
+	return vf_notify_resfix_done(gt);
+}
+
+static void vf_post_migration_recovery(struct xe_gt *gt)
+{
+	struct xe_device *xe = gt_to_xe(gt);
+	int err;
+
+	xe_gt_sriov_dbg(gt, "migration recovery in progress\n");
+
+	xe_pm_runtime_get(xe);
+	vf_post_migration_shutdown(gt);
+
+	if (!xe_sriov_vf_migration_supported(xe)) {
+		xe_gt_sriov_err(gt, "migration is not supported\n");
+		err = -ENOTRECOVERABLE;
+		goto fail;
+	}
+
+	err = vf_post_migration_fixups(gt);
+	if (err)
+		goto fail;
+
+	vf_post_migration_kickstart(gt);
+	err = vf_post_migration_notify_resfix_done(gt);
+	if (err && err != -EAGAIN)
+		goto fail;
+
+	xe_pm_runtime_put(xe);
+	xe_gt_sriov_notice(gt, "migration recovery ended\n");
+	return;
+fail:
+	xe_pm_runtime_put(xe);
+	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
+	xe_device_declare_wedged(xe);
+}
+
+static void migration_worker_func(struct work_struct *w)
+{
+	struct xe_gt *gt = container_of(w, struct xe_gt,
+					sriov.vf.migration.worker);
+
+	vf_post_migration_recovery(gt);
+}
+
+/**
+ * xe_gt_sriov_vf_init_early() - GT VF init early
+ * @gt: the &xe_gt
+ *
+ * Return: 0 on success, errno on failure.
+ */
+int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
+{
+	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
+		return 0;
+
+	init_rwsem(&gt->sriov.vf.self_config.lock);
+	spin_lock_init(&gt->sriov.vf.migration.lock);
+	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
+
+	return 0;
+}
+
 /**
  * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
  * @gt: the &xe_gt
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index bb5f8eace19b..0b0f2a30e67c 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -21,10 +21,9 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
 int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
 int xe_gt_sriov_vf_connect(struct xe_gt *gt);
 int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
-void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
-int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
 
+int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
 bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 7b10b8e1e10e..53680a2f188a 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -8,6 +8,7 @@
 
 #include <linux/rwsem.h>
 #include <linux/types.h>
+#include <linux/workqueue.h>
 #include "xe_uc_fw_types.h"
 
 /**
@@ -53,6 +54,12 @@ struct xe_gt_sriov_vf_runtime {
  * struct xe_gt_sriov_vf_migration - VF migration data.
  */
 struct xe_gt_sriov_vf_migration {
+	/** @worker: VF migration recovery worker */
+	struct work_struct worker;
+	/** @lock: Protects recovery_queued */
+	spinlock_t lock;
+	/** @recovery_queued: VF post migration recovery is queued */
+	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
 	bool recovery_inprogress;
 };
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index da064a1e7419..911d5720917b 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -6,21 +6,12 @@
 #include <drm/drm_debugfs.h>
 #include <drm/drm_managed.h>
 
-#include "xe_assert.h"
-#include "xe_device.h"
 #include "xe_gt.h"
-#include "xe_gt_sriov_printk.h"
 #include "xe_gt_sriov_vf.h"
 #include "xe_guc.h"
-#include "xe_guc_submit.h"
-#include "xe_irq.h"
-#include "xe_lrc.h"
-#include "xe_pm.h"
-#include "xe_sriov.h"
 #include "xe_sriov_printk.h"
 #include "xe_sriov_vf.h"
 #include "xe_sriov_vf_ccs.h"
-#include "xe_tile_sriov_vf.h"
 
 /**
  * DOC: VF restore procedure in PF KMD and VF KMD
@@ -158,8 +149,6 @@ static void vf_disable_migration(struct xe_device *xe, const char *fmt, ...)
 	xe->sriov.vf.migration.enabled = false;
 }
 
-static void migration_worker_func(struct work_struct *w);
-
 static void vf_migration_init_early(struct xe_device *xe)
 {
 	/*
@@ -184,8 +173,6 @@ static void vf_migration_init_early(struct xe_device *xe)
 						    guc_version.major, guc_version.minor);
 	}
 
-	INIT_WORK(&xe->sriov.vf.migration.worker, migration_worker_func);
-
 	xe->sriov.vf.migration.enabled = true;
 	xe_sriov_dbg(xe, "migration support enabled\n");
 }
@@ -196,242 +183,9 @@ static void vf_migration_init_early(struct xe_device *xe)
  */
 void xe_sriov_vf_init_early(struct xe_device *xe)
 {
-	struct xe_gt *gt;
-	unsigned int id;
-
-	for_each_gt(gt, xe, id)
-		init_rwsem(&gt->sriov.vf.self_config.lock);
-
 	vf_migration_init_early(xe);
 }
 
-/**
- * vf_post_migration_shutdown - Stop the driver activities after VF migration.
- * @xe: the &xe_device struct instance
- *
- * After this VM is migrated and assigned to a new VF, it is running on a new
- * hardware, and therefore many hardware-dependent states and related structures
- * require fixups. Without fixups, the hardware cannot do any work, and therefore
- * all GPU pipelines are stalled.
- * Stop some of kernel activities to make the fixup process faster.
- */
-static void vf_post_migration_shutdown(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-	int ret = 0;
-
-	for_each_gt(gt, xe, id) {
-		xe_guc_submit_pause(&gt->uc.guc);
-		ret |= xe_guc_submit_reset_block(&gt->uc.guc);
-	}
-
-	if (ret)
-		drm_info(&xe->drm, "migration recovery encountered ongoing reset\n");
-}
-
-/**
- * vf_post_migration_kickstart - Re-start the driver activities under new hardware.
- * @xe: the &xe_device struct instance
- *
- * After we have finished with all post-migration fixups, restart the driver
- * activities to continue feeding the GPU with workloads.
- */
-static void vf_post_migration_kickstart(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-
-	/*
-	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
-	 * must be working at this point, since the recovery did started,
-	 * but the rest was not enabled using the procedure from spec.
-	 */
-	xe_irq_resume(xe);
-
-	for_each_gt(gt, xe, id) {
-		xe_guc_submit_reset_unblock(&gt->uc.guc);
-		xe_guc_submit_unpause(&gt->uc.guc);
-	}
-}
-
-static bool gt_vf_post_migration_needed(struct xe_gt *gt)
-{
-	return test_bit(gt->info.id, &gt_to_xe(gt)->sriov.vf.migration.gt_flags);
-}
-
-/*
- * Notify GuCs marked in flags about resource fixups apply finished.
- * @xe: the &xe_device struct instance
- * @gt_flags: flags marking to which GTs the notification shall be sent
- */
-static int vf_post_migration_notify_resfix_done(struct xe_device *xe, unsigned long gt_flags)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-	int err = 0;
-
-	for_each_gt(gt, xe, id) {
-		if (!test_bit(id, &gt_flags))
-			continue;
-		/* skip asking GuC for RESFIX exit if new recovery request arrived */
-		if (gt_vf_post_migration_needed(gt))
-			continue;
-		err = xe_gt_sriov_vf_notify_resfix_done(gt);
-		if (err)
-			break;
-		clear_bit(id, &gt_flags);
-	}
-
-	if (gt_flags && !err)
-		drm_dbg(&xe->drm, "another recovery imminent, skipped some notifications\n");
-	return err;
-}
-
-static int vf_get_next_migrated_gt_id(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-
-	for_each_gt(gt, xe, id) {
-		if (test_and_clear_bit(id, &xe->sriov.vf.migration.gt_flags))
-			return id;
-	}
-	return -1;
-}
-
-static size_t post_migration_scratch_size(struct xe_device *xe)
-{
-	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
-}
-
-/**
- * Perform post-migration fixups on a single GT.
- *
- * After migration, GuC needs to be re-queried for VF configuration to check
- * if it matches previous provisioning. Most of VF provisioning shall be the
- * same, except GGTT range, since GGTT is not virtualized per-VF. If GGTT
- * range has changed, we have to perform fixups - shift all GGTT references
- * used anywhere within the driver. After the fixups in this function succeed,
- * it is allowed to ask the GuC bound to this GT to continue normal operation.
- *
- * Returns: 0 if the operation completed successfully, or a negative error
- * code otherwise.
- */
-static int gt_vf_post_migration_fixups(struct xe_gt *gt)
-{
-	s64 shift;
-	void *buf;
-	int err;
-
-	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_KERNEL);
-	if (!buf)
-		return -ENOMEM;
-
-	err = xe_gt_sriov_vf_query_config(gt);
-	if (err)
-		goto out;
-
-	shift = xe_gt_sriov_vf_ggtt_shift(gt);
-	if (shift) {
-		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
-		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
-		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
-		if (err)
-			goto out;
-	}
-
-out:
-	kfree(buf);
-	return err;
-}
-
-static void vf_post_migration_recovery(struct xe_device *xe)
-{
-	unsigned long fixed_gts = 0;
-	int id, err;
-
-	drm_dbg(&xe->drm, "migration recovery in progress\n");
-	xe_pm_runtime_get(xe);
-	vf_post_migration_shutdown(xe);
-
-	if (!xe_sriov_vf_migration_supported(xe)) {
-		xe_sriov_err(xe, "migration is not supported\n");
-		err = -ENOTRECOVERABLE;
-		goto fail;
-	}
-
-	while (id = vf_get_next_migrated_gt_id(xe), id >= 0) {
-		struct xe_gt *gt = xe_device_get_gt(xe, id);
-
-		err = gt_vf_post_migration_fixups(gt);
-		if (err)
-			goto fail;
-
-		set_bit(id, &fixed_gts);
-	}
-
-	vf_post_migration_kickstart(xe);
-	err = vf_post_migration_notify_resfix_done(xe, fixed_gts);
-	if (err)
-		goto fail;
-
-	xe_pm_runtime_put(xe);
-	drm_notice(&xe->drm, "migration recovery ended\n");
-	return;
-fail:
-	xe_pm_runtime_put(xe);
-	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
-	xe_device_declare_wedged(xe);
-}
-
-static void migration_worker_func(struct work_struct *w)
-{
-	struct xe_device *xe = container_of(w, struct xe_device,
-					    sriov.vf.migration.worker);
-
-	vf_post_migration_recovery(xe);
-}
-
-/*
- * Check if post-restore recovery is coming on any of GTs.
- * @xe: the &xe_device struct instance
- *
- * Return: True if migration recovery worker will soon be running. Any worker currently
- * executing does not affect the result.
- */
-static bool vf_ready_to_recovery_on_any_gts(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-
-	for_each_gt(gt, xe, id) {
-		if (test_bit(id, &xe->sriov.vf.migration.gt_flags))
-			return true;
-	}
-	return false;
-}
-
-/**
- * xe_sriov_vf_start_migration_recovery - Start VF migration recovery.
- * @xe: the &xe_device to start recovery on
- *
- * This function shall be called only by VF.
- */
-void xe_sriov_vf_start_migration_recovery(struct xe_device *xe)
-{
-	bool started;
-
-	xe_assert(xe, IS_SRIOV_VF(xe));
-
-	if (!vf_ready_to_recovery_on_any_gts(xe))
-		return;
-
-	started = queue_work(xe->sriov.wq, &xe->sriov.vf.migration.worker);
-	drm_info(&xe->drm, "VF migration recovery %s\n", started ?
-		 "scheduled" : "already in progress");
-}
-
 /**
  * xe_sriov_vf_init_late() - SR-IOV VF late initialization functions.
  * @xe: the &xe_device to initialize
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.h b/drivers/gpu/drm/xe/xe_sriov_vf.h
index 9e752105ec2a..4df95266b261 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.h
@@ -13,7 +13,6 @@ struct xe_device;
 
 void xe_sriov_vf_init_early(struct xe_device *xe);
 int xe_sriov_vf_init_late(struct xe_device *xe);
-void xe_sriov_vf_start_migration_recovery(struct xe_device *xe);
 bool xe_sriov_vf_migration_supported(struct xe_device *xe);
 void xe_sriov_vf_debugfs_register(struct xe_device *xe, struct dentry *root);
 
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
index 426cc5841958..6a0fd0f5463e 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
@@ -33,10 +33,6 @@ struct xe_device_vf {
 
 	/** @migration: VF Migration state data */
 	struct {
-		/** @migration.worker: VF migration recovery worker */
-		struct work_struct worker;
-		/** @migration.gt_flags: Per-GT request flags for VF migration recovery */
-		unsigned long gt_flags;
 		/**
 		 * @migration.enabled: flag indicating if migration support
 		 * was enabled or not due to missing prerequisites
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 14/36] drm/xe/vf: Abort H2G sends during VF post-migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (12 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 13/36] drm/xe/vf: Make VF recovery run on per-GT worker Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  8:17   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 15/36] drm/xe/vf: Remove memory allocations from VF post migration recovery Matthew Brost
                   ` (24 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

While VF post-migration recovery is in progress, abort H2G sends with
-ECANCELED. These messages are treated as lost, and TLB invalidation
errors are suppressed. During this phase, the H2G channel is down, and
VF recovery requires the CT lock to proceed.

v3:
 - Use xe_gt_recovery_inprogress (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 47079ab9922c..d0fde371fae3 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -851,7 +851,7 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
 				u32 len, u32 g2h_len, u32 num_g2h,
 				struct g2h_fence *g2h_fence)
 {
-	struct xe_gt *gt __maybe_unused = ct_to_gt(ct);
+	struct xe_gt *gt = ct_to_gt(ct);
 	u16 seqno;
 	int ret;
 
@@ -872,7 +872,8 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
 		goto out;
 	}
 
-	if (ct->state == XE_GUC_CT_STATE_STOPPED) {
+	if (ct->state == XE_GUC_CT_STATE_STOPPED ||
+	    xe_gt_recovery_inprogress(gt)) {
 		ret = -ECANCELED;
 		goto out;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 15/36] drm/xe/vf: Remove memory allocations from VF post migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (13 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 14/36] drm/xe/vf: Abort H2G sends during VF post-migration recovery Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30 15:00   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 16/36] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
                   ` (23 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

VF post migration recovery is on the dma-fence signaling / reclaim path;
avoid memory allocations in this path.

v3:
 - s/lrc_wa_bb/scratch (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 23 +++++++++++++----------
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  2 ++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index ae9df9c0876d..6f15619efe01 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1214,17 +1214,13 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
 
 static int vf_post_migration_fixups(struct xe_gt *gt)
 {
+	void *buf = gt->sriov.vf.migration.scratch;
 	s64 shift;
-	void *buf;
 	int err;
 
-	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_ATOMIC);
-	if (!buf)
-		return -ENOMEM;
-
 	err = xe_gt_sriov_vf_query_config(gt);
 	if (err)
-		goto out;
+		return err;
 
 	shift = xe_gt_sriov_vf_ggtt_shift(gt);
 	if (shift) {
@@ -1232,12 +1228,10 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
 		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
 		if (err)
-			goto out;
+			return err;
 	}
 
-out:
-	kfree(buf);
-	return err;
+	return 0;
 }
 
 static void vf_post_migration_kickstart(struct xe_gt *gt)
@@ -1322,9 +1316,18 @@ static void migration_worker_func(struct work_struct *w)
  */
 int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 {
+	void *buf;
+
 	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
 		return 0;
 
+	buf = drmm_kmalloc(&gt_to_xe(gt)->drm,
+			   post_migration_scratch_size(gt_to_xe(gt)),
+			   GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	gt->sriov.vf.migration.scratch = buf;
 	init_rwsem(&gt->sriov.vf.self_config.lock);
 	spin_lock_init(&gt->sriov.vf.migration.lock);
 	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 53680a2f188a..a63b6004b0b7 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -58,6 +58,8 @@ struct xe_gt_sriov_vf_migration {
 	struct work_struct worker;
 	/** @lock: Protects recovery_queued */
 	spinlock_t lock;
+	/** @scratch: Scratch memory for VF recovery */
+	void *scratch;
 	/** @recovery_queued: VF post migration recovery is queued */
 	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 16/36] drm/xe/vf: Close multi-GT GGTT shift race
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (14 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 15/36] drm/xe/vf: Remove memory allocations from VF post migration recovery Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  8:44   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 17/36] drm/xe/vf: Teardown VF post migration worker on driver unload Matthew Brost
                   ` (22 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Multi-GT VF post-migration recovery can run in parallel on different
workqueues, but both GTs point to the same GGTT, so only one GT needs to
shift the GGTT. However, both GTs need to know when this step has
completed. To coordinate this, share the VF config lock among all GTs
that share a GGTT, and perform the GGTT shift under this lock. With the
shift done under the lock, storing the shift value becomes unnecessary.

v3:
 - Update commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 95 +++++++++--------------
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  3 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 11 ++-
 drivers/gpu/drm/xe/xe_guc.c               |  2 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.c     |  6 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.h     |  1 -
 6 files changed, 51 insertions(+), 67 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 6f15619efe01..ad1d63b5b8d1 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -436,16 +436,19 @@ u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt)
 	return value;
 }
 
-static int vf_get_ggtt_info(struct xe_gt *gt)
+static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
 {
 	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	struct xe_gt_sriov_vf_selfconfig *primary_config =
+		&gt_to_tile(gt)->primary_gt->sriov.vf.self_config;
 	struct xe_guc *guc = &gt->uc.guc;
 	u64 start, size;
+	s64 shift;
 	int err;
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
-	down_write(&config->lock);
+	down_write(config->lock);
 
 	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
 	if (unlikely(err))
@@ -465,13 +468,17 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
 	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
 				start, start + size - 1, size / SZ_1K);
 
-	config->ggtt_shift = start - (s64)config->ggtt_base;
+	shift = start - (s64)primary_config->ggtt_base;
 	config->ggtt_base = start;
 	config->ggtt_size = size;
+	if (recovery)
+		primary_config->ggtt_base = start;
 	err = config->ggtt_size ? 0 : -ENODATA;
 
+	if (!err && shift && recovery)
+		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
 out:
-	up_write(&config->lock);
+	up_write(config->lock);
 	return err;
 }
 
@@ -485,7 +492,7 @@ static int vf_get_lmem_info(struct xe_gt *gt)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
-	down_write(&config->lock);
+	down_write(config->lock);
 
 	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
 	if (unlikely(err))
@@ -505,7 +512,7 @@ static int vf_get_lmem_info(struct xe_gt *gt)
 	err = config->lmem_size ? 0 : -ENODATA;
 
 out:
-	up_write(&config->lock);
+	up_write(config->lock);
 	return err;
 }
 
@@ -518,7 +525,7 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
-	down_write(&config->lock);
+	down_write(config->lock);
 
 	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
 	if (unlikely(err))
@@ -549,7 +556,7 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
 	err = config->num_ctxs ? 0 : -ENODATA;
 
 out:
-	up_write(&config->lock);
+	up_write(config->lock);
 	return err;
 }
 
@@ -564,17 +571,18 @@ static void vf_cache_gmdid(struct xe_gt *gt)
 /**
  * xe_gt_sriov_vf_query_config - Query SR-IOV config data over MMIO.
  * @gt: the &xe_gt
+ * @recovery: VF post migration recovery path
  *
  * This function is for VF use only.
  *
  * Return: 0 on success or a negative error code on failure.
  */
-int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
+int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery)
 {
 	struct xe_device *xe = gt_to_xe(gt);
 	int err;
 
-	err = vf_get_ggtt_info(gt);
+	err = vf_get_ggtt_info(gt, recovery);
 	if (unlikely(err))
 		return err;
 
@@ -610,10 +618,10 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
 
-	down_read(&config->lock);
+	down_read(config->lock);
 	xe_gt_assert(gt, config->num_ctxs);
 	val = config->num_ctxs;
-	up_read(&config->lock);
+	up_read(config->lock);
 
 	return val;
 }
@@ -634,10 +642,10 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
 
-	down_read(&config->lock);
+	down_read(config->lock);
 	xe_gt_assert(gt, config->lmem_size);
 	val = config->lmem_size;
-	up_read(&config->lock);
+	up_read(config->lock);
 
 	return val;
 }
@@ -656,11 +664,9 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
 	u64 val;
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
-	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
+	lockdep_assert_held(config->lock);
 
-	down_read(&config->lock);
 	val = config->ggtt_size;
-	up_read(&config->lock);
 
 	return val;
 }
@@ -680,34 +686,10 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-
-	down_read(&config->lock);
 	xe_gt_assert(gt, config->ggtt_size);
-	val = config->ggtt_base;
-	up_read(&config->lock);
-
-	return val;
-}
+	lockdep_assert_held(config->lock);
 
-/**
- * xe_gt_sriov_vf_ggtt_shift - Return shift in GGTT range due to VF migration
- * @gt: the &xe_gt struct instance
- *
- * This function is for VF use only.
- *
- * Return: The shift value; could be negative
- */
-s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
-{
-	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
-	s64 val;
-
-	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
-	xe_gt_assert(gt, xe_gt_is_main_type(gt));
-
-	down_read(&config->lock);
-	val = config->ggtt_shift;
-	up_read(&config->lock);
+	val = config->ggtt_base;
 
 	return val;
 }
@@ -1115,7 +1097,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
-	down_read(&config->lock);
+	down_read(config->lock);
 	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
 		   config->ggtt_base,
 		   config->ggtt_base + config->ggtt_size - 1);
@@ -1123,8 +1105,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
 	string_get_size(config->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
 	drm_printf(p, "GGTT size:\t%llu (%s)\n", config->ggtt_size, buf);
 
-	drm_printf(p, "GGTT shift on last restore:\t%lld\n", config->ggtt_shift);
-
 	if (IS_DGFX(xe) && xe_gt_is_main_type(gt)) {
 		string_get_size(config->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
 		drm_printf(p, "LMEM size:\t%llu (%s)\n", config->lmem_size, buf);
@@ -1132,7 +1112,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
 
 	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
 	drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
-	up_read(&config->lock);
+	up_read(config->lock);
 }
 
 /**
@@ -1215,21 +1195,16 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
 static int vf_post_migration_fixups(struct xe_gt *gt)
 {
 	void *buf = gt->sriov.vf.migration.scratch;
-	s64 shift;
 	int err;
 
-	err = xe_gt_sriov_vf_query_config(gt);
+	err = xe_gt_sriov_vf_query_config(gt, true);
 	if (err)
 		return err;
 
-	shift = xe_gt_sriov_vf_ggtt_shift(gt);
-	if (shift) {
-		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
-		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
-		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
-		if (err)
-			return err;
-	}
+	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
+	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
+	if (err)
+		return err;
 
 	return 0;
 }
@@ -1316,6 +1291,7 @@ static void migration_worker_func(struct work_struct *w)
  */
 int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 {
+	struct xe_tile *tile = gt_to_tile(gt);
 	void *buf;
 
 	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
@@ -1328,7 +1304,10 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 		return -ENOMEM;
 
 	gt->sriov.vf.migration.scratch = buf;
-	init_rwsem(&gt->sriov.vf.self_config.lock);
+	if (xe_gt_is_main_type(gt))
+		init_rwsem(&gt->sriov.vf.self_config.__lock);
+	gt->sriov.vf.self_config.lock =
+		&tile->primary_gt->sriov.vf.self_config.__lock;
 	spin_lock_init(&gt->sriov.vf.migration.lock);
 	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
 
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 0b0f2a30e67c..ff3a0ce608cd 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -18,7 +18,7 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt);
 void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
 				 struct xe_uc_fw_version *wanted,
 				 struct xe_uc_fw_version *found);
-int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
+int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery);
 int xe_gt_sriov_vf_connect(struct xe_gt *gt);
 int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
@@ -31,7 +31,6 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
 u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
 u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt);
 u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt);
-s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_read32(struct xe_gt *gt, struct xe_reg reg);
 void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index a63b6004b0b7..6cbf8291a5ab 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -19,16 +19,19 @@ struct xe_gt_sriov_vf_selfconfig {
 	u64 ggtt_base;
 	/** @ggtt_size: assigned size of the GGTT region. */
 	u64 ggtt_size;
-	/** @ggtt_shift: difference in ggtt_base on last migration */
-	s64 ggtt_shift;
 	/** @lmem_size: assigned size of the LMEM. */
 	u64 lmem_size;
 	/** @num_ctxs: assigned number of GuC submission context IDs. */
 	u16 num_ctxs;
 	/** @num_dbs: assigned number of GuC doorbells IDs. */
 	u16 num_dbs;
-	/** @lock: lock for protecting access to all selfconfig fields. */
-	struct rw_semaphore lock;
+	/** @__lock: lock for protecting access to all selfconfig fields. */
+	struct rw_semaphore __lock;
+	/**
+	 * @lock: pointer to the lock protecting access to all selfconfig
+	 * fields; all GTs point to the primary GT's lock.
+	 */
+	struct rw_semaphore *lock;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index d5adbbb013ec..c016a11b6ab1 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -713,7 +713,7 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
 	if (err)
 		return err;
 
-	err = xe_gt_sriov_vf_query_config(gt);
+	err = xe_gt_sriov_vf_query_config(gt, false);
 	if (err)
 		return err;
 
diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
index f221dbed16f0..dc6221fc0520 100644
--- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
@@ -40,7 +40,7 @@ static int vf_init_ggtt_balloons(struct xe_tile *tile)
  *
  * Return: 0 on success or a negative error code on failure.
  */
-int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
+static int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
 {
 	u64 ggtt_base = xe_gt_sriov_vf_ggtt_base(tile->primary_gt);
 	u64 ggtt_size = xe_gt_sriov_vf_ggtt(tile->primary_gt);
@@ -100,12 +100,16 @@ int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
 
 static int vf_balloon_ggtt(struct xe_tile *tile)
 {
+	struct xe_gt_sriov_vf_selfconfig *config =
+		&tile->primary_gt->sriov.vf.self_config;
 	struct xe_ggtt *ggtt = tile->mem.ggtt;
 	int err;
 
+	down_read(config->lock);
 	mutex_lock(&ggtt->lock);
 	err = xe_tile_sriov_vf_balloon_ggtt_locked(tile);
 	mutex_unlock(&ggtt->lock);
+	up_read(config->lock);
 
 	return err;
 }
diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
index 93eb043171e8..4ee68d1fb28e 100644
--- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
@@ -11,7 +11,6 @@
 struct xe_tile;
 
 int xe_tile_sriov_vf_prepare_ggtt(struct xe_tile *tile);
-int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile);
 void xe_tile_sriov_vf_deballoon_ggtt_locked(struct xe_tile *tile);
 void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 17/36] drm/xe/vf: Teardown VF post migration worker on driver unload
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (15 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 16/36] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-30 16:24   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 18/36] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery Matthew Brost
                   ` (21 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Be cautious and ensure the VF post-migration worker is not running
during driver unload.

v3:
 - Move teardown later in driver init, use devm (Tomasz)
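
For reference, a minimal sketch of the devm property relied on here
(the action names are hypothetical): devm actions run in reverse
registration order on unload, so an action registered late in init
runs early in teardown, which is exactly what the worker cancellation
wants.

  static void runs_last_on_unload(void *arg)  { }  /* registered first */
  static void runs_first_on_unload(void *arg) { }  /* registered last */

  static int init_order_sketch(struct device *dev, struct xe_gt *gt)
  {
          int err;

          err = devm_add_action_or_reset(dev, runs_last_on_unload, gt);
          if (err)
                  return err;
          /* added later in init => invoked earlier during unload */
          return devm_add_action_or_reset(dev, runs_first_on_unload, gt);
  }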

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.c                |  6 +++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 31 ++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  1 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  4 ++-
 4 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 5f9ba4caf837..82be38c99205 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -663,6 +663,12 @@ int xe_gt_init(struct xe_gt *gt)
 	if (err)
 		return err;
 
+	if (IS_SRIOV_VF(gt_to_xe(gt))) {
+		err = xe_gt_sriov_vf_init(gt);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index ad1d63b5b8d1..cc5af19c1911 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -811,7 +811,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
 
 	spin_lock(&gt->sriov.vf.migration.lock);
 
-	if (!gt->sriov.vf.migration.recovery_queued) {
+	if (!gt->sriov.vf.migration.recovery_queued &&
+	    !gt->sriov.vf.migration.recovery_teardown) {
 		gt->sriov.vf.migration.recovery_queued = true;
 		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
 
@@ -1283,6 +1284,17 @@ static void migration_worker_func(struct work_struct *w)
 	vf_post_migration_recovery(gt);
 }
 
+static void vf_migration_fini(void *arg)
+{
+	struct xe_gt *gt = arg;
+
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	gt->sriov.vf.migration.recovery_teardown = true;
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	cancel_work_sync(&gt->sriov.vf.migration.worker);
+}
+
 /**
  * xe_gt_sriov_vf_init_early() - GT VF init early
  * @gt: the &xe_gt
@@ -1314,6 +1326,23 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 	return 0;
 }
 
+/**
+ * xe_gt_sriov_vf_init() - GT VF init
+ * @gt: the &xe_gt
+ *
+ * Return: 0 on success or a negative error code on failure.
+ */
+int xe_gt_sriov_vf_init(struct xe_gt *gt)
+{
+	/*
+	 * We want to tear down the VF post-migration worker early during
+	 * driver unload; therefore, we register this finalization action
+	 * late during driver load (devm actions run in reverse order).
+	 */
+	return devm_add_action_or_reset(gt_to_xe(gt)->drm.dev,
+					vf_migration_fini, gt);
+}
+
 /**
  * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
  * @gt: the &xe_gt
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index ff3a0ce608cd..71e1d566da81 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -24,6 +24,7 @@ int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
 
 int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
+int xe_gt_sriov_vf_init(struct xe_gt *gt);
 bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 6cbf8291a5ab..e135018cba1e 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -59,10 +59,12 @@ struct xe_gt_sriov_vf_runtime {
 struct xe_gt_sriov_vf_migration {
 	/** @migration: VF migration recovery worker */
 	struct work_struct worker;
-	/** @lock: Protects recovery_queued */
+	/** @lock: Protects recovery_queued, recovery_teardown */
 	spinlock_t lock;
 	/** @scratch: Scratch memory for VF recovery */
 	void *scratch;
+	/** @recovery_teardown: VF post migration recovery is being torn down */
+	bool recovery_teardown;
 	/** @recovery_queued: VF post migration recovery is queued */
 	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 18/36] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (16 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 17/36] drm/xe/vf: Teardown VF post migration worker on driver unload Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  9:17   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 19/36] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
                   ` (20 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

With well-behaved software, a GT reset should never occur, nor should it
happen during VF post-migration recovery. If it does, trigger a warning
but suppress the GT reset, as VF post-migration recovery is expected to
bring the VF back to a working state.

v3:
 - Better commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.c          |  9 -------
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 -----
 drivers/gpu/drm/xe/xe_guc_submit.c  | 41 +++--------------------------
 drivers/gpu/drm/xe/xe_guc_submit.h  |  3 ---
 4 files changed, 4 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 82be38c99205..5f04d562604b 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -815,11 +815,6 @@ static int do_gt_restart(struct xe_gt *gt)
 	return 0;
 }
 
-static int gt_wait_reset_unblock(struct xe_gt *gt)
-{
-	return xe_guc_wait_reset_unblock(&gt->uc.guc);
-}
-
 static int gt_reset(struct xe_gt *gt)
 {
 	unsigned int fw_ref;
@@ -834,10 +829,6 @@ static int gt_reset(struct xe_gt *gt)
 
 	xe_gt_info(gt, "reset started\n");
 
-	err = gt_wait_reset_unblock(gt);
-	if (!err)
-		xe_gt_warn(gt, "reset block failed to get lifted");
-
 	xe_pm_runtime_get(gt_to_xe(gt));
 
 	if (xe_fault_inject_gt_reset()) {
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index cc5af19c1911..b16e8fd271f8 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1175,17 +1175,11 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 
 static void vf_post_migration_shutdown(struct xe_gt *gt)
 {
-	int ret = 0;
-
 	spin_lock_irq(&gt->sriov.vf.migration.lock);
 	gt->sriov.vf.migration.recovery_queued = false;
 	spin_unlock_irq(&gt->sriov.vf.migration.lock);
 
 	xe_guc_submit_pause(&gt->uc.guc);
-	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
-
-	if (ret)
-		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
 }
 
 static size_t post_migration_scratch_size(struct xe_device *xe)
@@ -1219,7 +1213,6 @@ static void vf_post_migration_kickstart(struct xe_gt *gt)
 	 */
 	xe_irq_resume(gt_to_xe(gt));
 
-	xe_guc_submit_reset_unblock(&gt->uc.guc);
 	xe_guc_submit_unpause(&gt->uc.guc);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index cd5e506527fe..b82976f031e5 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -27,6 +27,7 @@
 #include "xe_gt.h"
 #include "xe_gt_clock.h"
 #include "xe_gt_printk.h"
+#include "xe_gt_sriov_vf.h"
 #include "xe_guc.h"
 #include "xe_guc_capture.h"
 #include "xe_guc_ct.h"
@@ -2182,47 +2183,13 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 	}
 }
 
-/**
- * xe_guc_submit_reset_block - Disallow reset calls on given GuC.
- * @guc: the &xe_guc struct instance
- */
-int xe_guc_submit_reset_block(struct xe_guc *guc)
-{
-	return atomic_fetch_or(1, &guc->submission_state.reset_blocked);
-}
-
-/**
- * xe_guc_submit_reset_unblock - Allow back reset calls on given GuC.
- * @guc: the &xe_guc struct instance
- */
-void xe_guc_submit_reset_unblock(struct xe_guc *guc)
-{
-	atomic_set_release(&guc->submission_state.reset_blocked, 0);
-	wake_up_all(&guc->ct.wq);
-}
-
-static int guc_submit_reset_is_blocked(struct xe_guc *guc)
-{
-	return atomic_read_acquire(&guc->submission_state.reset_blocked);
-}
-
-/* Maximum time of blocking reset */
-#define RESET_BLOCK_PERIOD_MAX (HZ * 5)
-
-/**
- * xe_guc_wait_reset_unblock - Wait until reset blocking flag is lifted, or timeout.
- * @guc: the &xe_guc struct instance
- */
-int xe_guc_wait_reset_unblock(struct xe_guc *guc)
-{
-	return wait_event_timeout(guc->ct.wq,
-				  !guc_submit_reset_is_blocked(guc), RESET_BLOCK_PERIOD_MAX);
-}
-
 int xe_guc_submit_reset_prepare(struct xe_guc *guc)
 {
 	int ret;
 
+	if (WARN_ON_ONCE(xe_gt_sriov_vf_recovery_inprogress(guc_to_gt(guc))))
+		return 0;
+
 	if (!guc->submission_state.initialized)
 		return 0;
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index 5b4a0a6fd818..f535fe3895e5 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -22,9 +22,6 @@ void xe_guc_submit_stop(struct xe_guc *guc);
 int xe_guc_submit_start(struct xe_guc *guc);
 void xe_guc_submit_pause(struct xe_guc *guc);
 void xe_guc_submit_unpause(struct xe_guc *guc);
-int xe_guc_submit_reset_block(struct xe_guc *guc);
-void xe_guc_submit_reset_unblock(struct xe_guc *guc);
-int xe_guc_wait_reset_unblock(struct xe_guc *guc);
 void xe_guc_submit_wedge(struct xe_guc *guc);
 
 int xe_guc_read_stopped(struct xe_guc *guc);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 19/36] drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (17 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 18/36] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 20/36] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration Matthew Brost
                   ` (19 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

If VF post-migration recovery is in progress, the recovery flow will
rebuild all GuC submission state. In this case, wake all waiters in the
GuC backend so they exit, ensuring that submission queue scheduling can
also be paused. Avoid taking any adverse actions after aborting a wait.

v3:
 - Don't block in preempt fence work queue as this can interfere with VF
   post-migration work queue scheduling leading to deadlock (Testing)
 - Use xe_gt_recovery_inprogress (Michal)
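
The pattern applied across the backend below, reduced to a sketch (the
helpers are the real ones from this series): fold the recovery flag
into each wait condition, and have recovery start wake the shared CT
waitqueue so every waiter re-evaluates.

  static void wait_sketch(struct xe_guc *guc, struct xe_exec_queue *q)
  {
          long ret;

          ret = wait_event_timeout(guc->ct.wq,
                                   !exec_queue_pending_enable(q) ||
                                   xe_guc_read_stopped(guc) ||
                                   xe_gt_recovery_inprogress(guc_to_gt(guc)),
                                   HZ * 5);
          if (xe_gt_recovery_inprogress(guc_to_gt(guc)))
                  return; /* recovery rebuilds this state; take no action */
          if (!ret)
                  xe_gt_warn(guc_to_gt(guc), "Pending enable failed to respond\n");
  }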

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c   |  3 +
 drivers/gpu/drm/xe/xe_guc_submit.c    | 79 ++++++++++++++++++++-------
 drivers/gpu/drm/xe/xe_preempt_fence.c | 11 ++++
 3 files changed, 73 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index b16e8fd271f8..46bba0feadd0 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -815,6 +815,9 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
 	    !gt->sriov.vf.migration.recovery_teardown) {
 		gt->sriov.vf.migration.recovery_queued = true;
 		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
+		smp_wmb();	/* Ensure above write visible before wake */
+
+		wake_up_all(&gt->uc.guc.ct.wq);
 
 		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
 		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index b82976f031e5..9320fe9fbb29 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -27,7 +27,6 @@
 #include "xe_gt.h"
 #include "xe_gt_clock.h"
 #include "xe_gt_printk.h"
-#include "xe_gt_sriov_vf.h"
 #include "xe_guc.h"
 #include "xe_guc_capture.h"
 #include "xe_guc_ct.h"
@@ -984,6 +983,9 @@ static u32 wq_space_until_wrap(struct xe_exec_queue *q)
 	return (WQ_SIZE - q->guc->wqi_tail);
 }
 
+#define vf_recovery(guc)	\
+	xe_gt_recovery_inprogress(guc_to_gt(guc))
+
 static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
 {
 	struct xe_guc *guc = exec_queue_to_guc(q);
@@ -993,7 +995,7 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
 
 #define AVAILABLE_SPACE \
 	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
-	if (wqi_size > AVAILABLE_SPACE) {
+	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
 try_again:
 		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
 		if (wqi_size > AVAILABLE_SPACE) {
@@ -1192,9 +1194,10 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
 	ret = wait_event_timeout(guc->ct.wq,
 				 (!exec_queue_pending_enable(q) &&
 				  !exec_queue_pending_disable(q)) ||
-					 xe_guc_read_stopped(guc),
+					 xe_guc_read_stopped(guc) ||
+					 vf_recovery(guc),
 				 HZ * 5);
-	if (!ret) {
+	if (!ret && !vf_recovery(guc)) {
 		struct xe_gpu_scheduler *sched = &q->guc->sched;
 
 		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
@@ -1297,6 +1300,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	bool wedged = false;
 
 	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
+
+	if (vf_recovery(guc))
+		return;
+
 	trace_xe_exec_queue_lr_cleanup(q);
 
 	if (!exec_queue_killed(q))
@@ -1329,7 +1336,11 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 		 */
 		ret = wait_event_timeout(guc->ct.wq,
 					 !exec_queue_pending_disable(q) ||
-					 xe_guc_read_stopped(guc), HZ * 5);
+					 xe_guc_read_stopped(guc) ||
+					 vf_recovery(guc), HZ * 5);
+		if (vf_recovery(guc))
+			return;
+
 		if (!ret) {
 			xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
 				   q->guc->id);
@@ -1419,8 +1430,9 @@ static void enable_scheduling(struct xe_exec_queue *q)
 
 	ret = wait_event_timeout(guc->ct.wq,
 				 !exec_queue_pending_enable(q) ||
-				 xe_guc_read_stopped(guc), HZ * 5);
-	if (!ret || xe_guc_read_stopped(guc)) {
+				 xe_guc_read_stopped(guc) ||
+				 vf_recovery(guc), HZ * 5);
+	if ((!ret && !vf_recovery(guc)) || xe_guc_read_stopped(guc)) {
 		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
 		set_exec_queue_banned(q);
 		xe_gt_reset_async(q->gt);
@@ -1491,7 +1503,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * list so job can be freed and kick scheduler ensuring free job is not
 	 * lost.
 	 */
-	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
+	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags) ||
+	    vf_recovery(guc))
 		return DRM_GPU_SCHED_STAT_NO_HANG;
 
 	/* Kill the run_job entry point */
@@ -1543,7 +1556,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 			ret = wait_event_timeout(guc->ct.wq,
 						 (!exec_queue_pending_enable(q) &&
 						  !exec_queue_pending_disable(q)) ||
-						 xe_guc_read_stopped(guc), HZ * 5);
+						 xe_guc_read_stopped(guc) ||
+						 vf_recovery(guc), HZ * 5);
+			if (vf_recovery(guc))
+				goto handle_vf_resume;
 			if (!ret || xe_guc_read_stopped(guc))
 				goto trigger_reset;
 
@@ -1568,7 +1584,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		smp_rmb();
 		ret = wait_event_timeout(guc->ct.wq,
 					 !exec_queue_pending_disable(q) ||
-					 xe_guc_read_stopped(guc), HZ * 5);
+					 xe_guc_read_stopped(guc) ||
+					 vf_recovery(guc), HZ * 5);
+		if (vf_recovery(guc))
+			goto handle_vf_resume;
 		if (!ret || xe_guc_read_stopped(guc)) {
 trigger_reset:
 			if (!ret)
@@ -1673,6 +1692,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * some thought, do this in a follow up.
 	 */
 	xe_sched_submission_start(sched);
+handle_vf_resume:
 	return DRM_GPU_SCHED_STAT_NO_HANG;
 }
 
@@ -1769,11 +1789,17 @@ static void __guc_exec_queue_process_msg_set_sched_props(struct xe_sched_msg *ms
 
 static void __suspend_fence_signal(struct xe_exec_queue *q)
 {
+	struct xe_guc *guc = exec_queue_to_guc(q);
+	struct xe_device *xe = guc_to_xe(guc);
+
 	if (!q->guc->suspend_pending)
 		return;
 
 	WRITE_ONCE(q->guc->suspend_pending, false);
-	wake_up(&q->guc->suspend_wait);
+	if (IS_SRIOV_VF(xe))
+		wake_up_all(&guc->ct.wq);
+	else
+		wake_up(&q->guc->suspend_wait);
 }
 
 static void suspend_fence_signal(struct xe_exec_queue *q)
@@ -1794,8 +1820,9 @@ static void __guc_exec_queue_process_msg_suspend(struct xe_sched_msg *msg)
 
 	if (guc_exec_queue_allowed_to_change_state(q) && !exec_queue_suspended(q) &&
 	    exec_queue_enabled(q)) {
-		wait_event(guc->ct.wq, (q->guc->resume_time != RESUME_PENDING ||
-			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q));
+		wait_event(guc->ct.wq, vf_recovery(guc) ||
+			   ((q->guc->resume_time != RESUME_PENDING ||
+			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q)));
 
 		if (!xe_guc_read_stopped(guc)) {
 			s64 since_resume_ms =
@@ -1922,7 +1949,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
 
 	q->entity = &ge->entity;
 
-	if (xe_guc_read_stopped(guc))
+	if (xe_guc_read_stopped(guc) || vf_recovery(guc))
 		xe_sched_stop(sched);
 
 	mutex_unlock(&guc->submission_state.lock);
@@ -2068,6 +2095,7 @@ static int guc_exec_queue_suspend(struct xe_exec_queue *q)
 static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
 {
 	struct xe_guc *guc = exec_queue_to_guc(q);
+	struct xe_device *xe = guc_to_xe(guc);
 	int ret;
 
 	/*
@@ -2075,11 +2103,22 @@ static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
 	 * suspend_pending upon kill but to be paranoid but races in which
 	 * suspend_pending is set after kill also check kill here.
 	 */
-	ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
-					       !READ_ONCE(q->guc->suspend_pending) ||
-					       exec_queue_killed(q) ||
-					       xe_guc_read_stopped(guc),
-					       HZ * 5);
+	if (IS_SRIOV_VF(xe))
+		ret = wait_event_interruptible_timeout(guc->ct.wq,
+						       !READ_ONCE(q->guc->suspend_pending) ||
+						       exec_queue_killed(q) ||
+						       xe_guc_read_stopped(guc) ||
+						       vf_recovery(guc),
+						       HZ * 5);
+	else
+		ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
+						       !READ_ONCE(q->guc->suspend_pending) ||
+						       exec_queue_killed(q) ||
+						       xe_guc_read_stopped(guc),
+						       HZ * 5);
+
+	if (vf_recovery(guc) && !xe_device_wedged(guc_to_xe(guc)))
+		return -EAGAIN;
 
 	if (!ret) {
 		xe_gt_warn(guc_to_gt(guc),
@@ -2187,7 +2226,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
 {
 	int ret;
 
-	if (WARN_ON_ONCE(xe_gt_sriov_vf_recovery_inprogress(guc_to_gt(guc))))
+	if (WARN_ON_ONCE(vf_recovery(guc)))
 		return 0;
 
 	if (!guc->submission_state.initialized)
diff --git a/drivers/gpu/drm/xe/xe_preempt_fence.c b/drivers/gpu/drm/xe/xe_preempt_fence.c
index 83fbeea5aa20..7f587ca3947d 100644
--- a/drivers/gpu/drm/xe/xe_preempt_fence.c
+++ b/drivers/gpu/drm/xe/xe_preempt_fence.c
@@ -8,6 +8,8 @@
 #include <linux/slab.h>
 
 #include "xe_exec_queue.h"
+#include "xe_gt_printk.h"
+#include "xe_guc_exec_queue_types.h"
 #include "xe_vm.h"
 
 static void preempt_fence_work_func(struct work_struct *w)
@@ -22,6 +24,15 @@ static void preempt_fence_work_func(struct work_struct *w)
 	} else if (!q->ops->reset_status(q)) {
 		int err = q->ops->suspend_wait(q);
 
+		if (err == -EAGAIN) {
+			xe_gt_dbg(q->gt, "PREEMPT FENCE RETRY guc_id=%d\n",
+				  q->guc->id);
+			queue_work(q->vm->xe->preempt_fence_wq,
+				   &pfence->preempt_work);
+			dma_fence_end_signalling(cookie);
+			return;
+		}
+
 		if (err)
 			dma_fence_set_error(&pfence->base, err);
 	} else {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 20/36] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (18 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 19/36] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-01 13:45   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 21/36] drm/xe/vf: Extra debug on GGTT shift Matthew Brost
                   ` (18 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Blocking in work queues on a hardware action that may never occur,
especially when it depends on a software fixup also scheduled on the
same work queue, is a recipe for deadlock. This situation arises with
the preempt rebind worker and VF post-migration recovery. To prevent
potential deadlocks, avoid indefinite blocking in the preempt rebind
worker for VFs that support migration.
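
Sketch of the hazard being avoided (illustrative only, not driver
code): two work items on the same ordered workqueue, where the first
blocks on a condition only the second can satisfy, so neither ever
completes.

  static DECLARE_COMPLETION(fixup_done);

  static void rebind_work_sketch(struct work_struct *w)
  {
          wait_for_completion(&fixup_done);       /* blocks forever ... */
  }

  static void recovery_work_sketch(struct work_struct *w)
  {
          complete(&fixup_done);                  /* ... as this never runs */
  }

Queue both on the same alloc_ordered_workqueue() and the second can
never start until the first finishes, which it never will; hence the
bounded wait and requeue below.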

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 80b7f13ecd80..b527ee2a5da5 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -35,6 +35,7 @@
 #include "xe_pt.h"
 #include "xe_pxp.h"
 #include "xe_res_cursor.h"
+#include "xe_sriov_vf.h"
 #include "xe_svm.h"
 #include "xe_sync.h"
 #include "xe_tile.h"
@@ -111,12 +112,25 @@ static int alloc_preempt_fences(struct xe_vm *vm, struct list_head *list,
 static int wait_for_existing_preempt_fences(struct xe_vm *vm)
 {
 	struct xe_exec_queue *q;
+	bool vf_migration = IS_SRIOV_VF(vm->xe) &&
+		xe_sriov_vf_migration_supported(vm->xe);
 
 	xe_vm_assert_held(vm);
 
 	list_for_each_entry(q, &vm->preempt.exec_queues, lr.link) {
 		if (q->lr.pfence) {
-			long timeout = dma_fence_wait(q->lr.pfence, false);
+			long timeout;
+
+			if (vf_migration)
+				timeout = dma_fence_wait_timeout(q->lr.pfence,
+								 false, HZ / 5);
+			else
+				timeout = dma_fence_wait(q->lr.pfence, false);
+
+			if (!timeout) {
+				xe_assert(vm->xe, vf_migration);
+				return -EAGAIN;
+			}
 
 			/* Only -ETIME on fence indicates VM needs to be killed */
 			if (timeout < 0 || q->lr.pfence->error == -ETIME)
@@ -541,6 +555,19 @@ static void preempt_rebind_work_func(struct work_struct *w)
 out_unlock_outer:
 	if (err == -EAGAIN) {
 		trace_xe_vm_rebind_worker_retry(vm);
+
+		/*
+		 * We can't block in workers on a VF which supports migration
+		 * given this can block the VF post-migration workers from
+		 * getting scheduled.
+		 */
+		if (IS_SRIOV_VF(vm->xe) &&
+		    xe_sriov_vf_migration_supported(vm->xe)) {
+			up_write(&vm->lock);
+			xe_vm_queue_rebind_worker(vm);
+			return;
+		}
+
 		goto retry;
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 21/36] drm/xe/vf: Extra debug on GGTT shift
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (19 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 20/36] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 22/36] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
                   ` (17 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Add a bit of extra debugging to GGTT shift handling, specifically
printing the GGTT shift value, which is helpful for VF post-migration
recovery.

v3:
 - Reword commit message (Michal)
 - Adjust debug message (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 46bba0feadd0..a564f296e4b9 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -475,8 +475,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
 		primary_config->ggtt_base = start;
 	err = config->ggtt_size ? 0 : -ENODATA;
 
-	if (!err && shift && recovery)
+	if (!err && shift && recovery) {
+		xe_gt_sriov_info(gt, "Shifting GGTT base by %lld to 0x%016llx\n",
+				 shift, config->ggtt_base);
 		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
+	}
 out:
 	up_write(config->lock);
 	return err;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 22/36] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (20 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 21/36] drm/xe/vf: Extra debug on GGTT shift Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 23/36] drm/xe/vf: Flush and stop CTs in VF post migration recovery Matthew Brost
                   ` (16 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

The only case where the GuC submission backend cannot reason 100%
correctly is when a GuC context is registered during VF post-migration
recovery. In this scenario, it's possible that the GuC context register
H2G is processed, but the immediately following schedule-enable H2G gets
lost.

A double register is harmless when using `GUC_HXG_TYPE_EVENT`, as GuC
simply drops the duplicate H2G. To keep things simple, use
`GUC_HXG_TYPE_EVENT` for all context registrations on VFs.
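
For reference, the distinction relied on here, sketched with a
hypothetical helper (semantics of the three HXG types as used
elsewhere in the driver): REQUEST gets a reply, FAST_REQUEST gets no
reply but reports failures, and EVENT is fire-and-forget, so a
duplicate EVENT dies silently.

  static u32 pick_hxg_type_sketch(bool want_response, bool may_be_duplicated)
  {
          if (want_response)
                  return GUC_HXG_TYPE_REQUEST;      /* reply expected */
          if (may_be_duplicated)
                  return GUC_HXG_TYPE_EVENT;        /* duplicate dropped silently */
          return GUC_HXG_TYPE_FAST_REQUEST;         /* no reply, failure reported */
  }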

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index d0fde371fae3..d84de8544532 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -736,6 +736,26 @@ static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
 	return seqno;
 }
 
+#define MAKE_ACTION(type, __action)				\
+({								\
+	FIELD_PREP(GUC_HXG_MSG_0_TYPE, type) |			\
+	FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |			\
+		   GUC_HXG_EVENT_MSG_0_DATA0, __action);	\
+})
+
+static bool vf_action_can_safely_fail(struct xe_device *xe, u32 action)
+{
+	/*
+	 * If a VF is resuming, we can't exactly track whether a context
+	 * registration has completed in the GuC state machine. Resending is
+	 * harmless, as the duplicate will simply fail silently when
+	 * GUC_HXG_TYPE_EVENT is used.
+	 */
+	return IS_SRIOV_VF(xe) &&
+		(action == XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC ||
+		 action == XE_GUC_ACTION_REGISTER_CONTEXT);
+}
+
 #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one DW HxG header */
 
 static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
@@ -807,18 +827,14 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
 		FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len) |
 		FIELD_PREP(GUC_CTB_MSG_0_FENCE, ct_fence_value);
 	if (want_response) {
-		cmd[1] =
-			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
-			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
-				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
+		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_REQUEST, action[0]);
+	} else if (vf_action_can_safely_fail(xe, action[0])) {
+		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_EVENT, action[0]);
 	} else {
 		fast_req_track(ct, ct_fence_value,
 			       FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
 
-		cmd[1] =
-			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_FAST_REQUEST) |
-			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
-				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
+		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_FAST_REQUEST, action[0]);
 	}
 
 	/* H2G header in cmd[1] replaces action[0] so: */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 23/36] drm/xe/vf: Flush and stop CTs in VF post migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (21 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 22/36] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29 21:31   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 24/36] drm/xe/vf: Reset TLB invalidations during " Matthew Brost
                   ` (15 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Flushing CTs (i.e., progressing all pending G2H messages) gives VF
post-migration recovery an accurate view of which H2G messages the GuC
has processed, enabling the GuC submission state machine to correctly
rebuild all state.

Also, stop all CT traffic, as the CT is not live during VF
post-migration recovery.

v3:
 - xe_guc_ct_flush_and_stop rename (Michal)
 - Drop extra GuC CT WQ wake up (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  2 ++
 drivers/gpu/drm/xe/xe_guc_ct.c      | 10 ++++++++++
 drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index a564f296e4b9..37ef1c42bacb 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -23,6 +23,7 @@
 #include "xe_gt_sriov_vf.h"
 #include "xe_gt_sriov_vf_types.h"
 #include "xe_guc.h"
+#include "xe_guc_ct.h"
 #include "xe_guc_hxg_helpers.h"
 #include "xe_guc_relay.h"
 #include "xe_guc_submit.h"
@@ -1185,6 +1186,7 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
 	gt->sriov.vf.migration.recovery_queued = false;
 	spin_unlock_irq(&gt->sriov.vf.migration.lock);
 
+	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
 	xe_guc_submit_pause(&gt->uc.guc);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index d84de8544532..fd6e731c0395 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -573,6 +573,16 @@ void xe_guc_ct_disable(struct xe_guc_ct *ct)
 	stop_g2h_handler(ct);
 }
 
+/**
+ * xe_guc_ct_flush_and_stop - Flush and stop all processing of G2H / H2G
+ * @ct: the &xe_guc_ct
+ */
+void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct)
+{
+	receive_g2h(ct);
+	xe_guc_ct_stop(ct);
+}
+
 /**
  * xe_guc_ct_stop - Set GuC to stopped state
  * @ct: the &xe_guc_ct
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index d6c81325a76c..0a88f4e447fa 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -17,6 +17,7 @@ int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
 int xe_guc_ct_enable(struct xe_guc_ct *ct);
 void xe_guc_ct_disable(struct xe_guc_ct *ct);
 void xe_guc_ct_stop(struct xe_guc_ct *ct);
+void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);
 void xe_guc_ct_fast_path(struct xe_guc_ct *ct);
 
 struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 24/36] drm/xe/vf: Reset TLB invalidations during VF post migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (22 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 23/36] drm/xe/vf: Flush and stop CTs in VF post migration recovery Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-01 13:53   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 25/36] drm/xe/vf: Kickstart after resfix in " Matthew Brost
                   ` (14 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

TLB invalidation requests can be lost during VF post-migration
recovery. Since the VF has migrated, these invalidations are no longer
needed.

Reset the TLB invalidation frontend, which will signal all pending
fences.

v3:
 - Move TLB invalidation reset after pausing submission (Tomasz)
 - Adjust commit message (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 37ef1c42bacb..c9d94620d197 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -35,6 +35,7 @@
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
 #include "xe_tile_sriov_vf.h"
+#include "xe_tlb_inval.h"
 #include "xe_uc_fw.h"
 #include "xe_wopcm.h"
 
@@ -1188,6 +1189,7 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
 
 	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
 	xe_guc_submit_pause(&gt->uc.guc);
+	xe_tlb_inval_reset(&gt->tlb_inval);
 }
 
 static size_t post_migration_scratch_size(struct xe_device *xe)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 25/36] drm/xe/vf: Kickstart after resfix in VF post migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (23 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 24/36] drm/xe/vf: Reset TLB invalidations during " Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 26/36] drm/xe/vf: Start CTs before resfix " Matthew Brost
                   ` (13 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

GuC needs to be live for the GuC submission state machine to resubmit
anything lost during VF post-migration recovery.  Therefore, move the
kickstart step after `resfix` to ensure proper resubmission.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index c9d94620d197..35de8977c6d0 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1216,13 +1216,6 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 
 static void vf_post_migration_kickstart(struct xe_gt *gt)
 {
-	/*
-	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
-	 * must be working at this point, since the recovery did started,
-	 * but the rest was not enabled using the procedure from spec.
-	 */
-	xe_irq_resume(gt_to_xe(gt));
-
 	xe_guc_submit_unpause(&gt->uc.guc);
 }
 
@@ -1242,6 +1235,13 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
 	if (skip_resfix)
 		return -EAGAIN;
 
+	/*
+	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
+	 * must be working at this point, since recovery has started, but
+	 * the rest was not enabled using the procedure from the spec.
+	 */
+	xe_irq_resume(gt_to_xe(gt));
+
 	return vf_notify_resfix_done(gt);
 }
 
@@ -1265,11 +1265,12 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	if (err)
 		goto fail;
 
-	vf_post_migration_kickstart(gt);
 	err = vf_post_migration_notify_resfix_done(gt);
 	if (err && err != -EAGAIN)
 		goto fail;
 
+	vf_post_migration_kickstart(gt);
+
 	xe_pm_runtime_put(xe);
 	xe_gt_sriov_notice(gt, "migration recovery ended\n");
 	return;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 26/36] drm/xe/vf: Start CTs before resfix VF post migration recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (24 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 25/36] drm/xe/vf: Kickstart after resfix in " Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29 21:49   ` Michal Wajdeczko
  2025-09-29  2:55 ` [PATCH v3 27/36] drm/xe/vf: Abort VF post migration recovery on failure Matthew Brost
                   ` (12 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Before `resfix`, all H2G messages stuck in the CT queue need to be
squashed, as they may contain stale or invalid data.

Restarting the CT clears all H2Gs in the queue. Any lost H2Gs are
resubmitted by the GuC submission state machine.

v3:
 - Don't mess with head / tail values (Michal)
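
Putting the series together, the recovery worker's ordering now looks
roughly like the sketch below (only helpers introduced so far; fixups
and error handling elided):

  static void recovery_flow_sketch(struct xe_gt *gt)
  {
          /* shutdown: drain G2H, pause submission, drop stale TLB fences */
          xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
          xe_guc_submit_pause(&gt->uc.guc);
          xe_tlb_inval_reset(&gt->tlb_inval);

          /* fixups: re-query config, shift GGTT, rebase HWSP / LRCs */

          /* rearm: squash stale H2Gs without re-registering the CTB */
          xe_guc_ct_restart(&gt->uc.guc.ct);

          /* notify GuC that fixups are done, then replay anything lost */
          xe_guc_submit_unpause(&gt->uc.guc);
  }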

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 ++++
 drivers/gpu/drm/xe/xe_guc_ct.c      | 59 ++++++++++++++++++++++-------
 drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
 3 files changed, 54 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 35de8977c6d0..cb3e9f6e83fa 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1214,6 +1214,11 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 	return 0;
 }
 
+static void vf_post_migration_rearm(struct xe_gt *gt)
+{
+	xe_guc_ct_restart(&gt->uc.guc.ct);
+}
+
 static void vf_post_migration_kickstart(struct xe_gt *gt)
 {
 	xe_guc_submit_unpause(&gt->uc.guc);
@@ -1265,6 +1270,8 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	if (err)
 		goto fail;
 
+	vf_post_migration_rearm(gt);
+
 	err = vf_post_migration_notify_resfix_done(gt);
 	if (err && err != -EAGAIN)
 		goto fail;
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index fd6e731c0395..25efc1f813ce 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -500,7 +500,7 @@ static void ct_exit_safe_mode(struct xe_guc_ct *ct)
 		xe_gt_dbg(ct_to_gt(ct), "GuC CT safe-mode disabled\n");
 }
 
-int xe_guc_ct_enable(struct xe_guc_ct *ct)
+static int __xe_guc_ct_start(struct xe_guc_ct *ct, bool needs_register)
 {
 	struct xe_device *xe = ct_to_xe(ct);
 	struct xe_gt *gt = ct_to_gt(ct);
@@ -508,21 +508,28 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
 
 	xe_gt_assert(gt, !xe_guc_ct_enabled(ct));
 
-	xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
-	guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
-	guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
+	if (needs_register) {
+		xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
+		guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
+		guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
 
-	err = guc_ct_ctb_h2g_register(ct);
-	if (err)
-		goto err_out;
+		err = guc_ct_ctb_h2g_register(ct);
+		if (err)
+			goto err_out;
 
-	err = guc_ct_ctb_g2h_register(ct);
-	if (err)
-		goto err_out;
+		err = guc_ct_ctb_g2h_register(ct);
+		if (err)
+			goto err_out;
 
-	err = guc_ct_control_toggle(ct, true);
-	if (err)
-		goto err_out;
+		err = guc_ct_control_toggle(ct, true);
+		if (err)
+			goto err_out;
+	} else {
+		ct->ctbs.h2g.info.broken = false;
+		ct->ctbs.g2h.info.broken = false;
+		xe_map_memset(xe, &ct->bo->vmap, CTB_DESC_SIZE * 2, 0,
+			      CTB_H2G_BUFFER_SIZE);
+	}
 
 	guc_ct_change_state(ct, XE_GUC_CT_STATE_ENABLED);
 
@@ -554,6 +561,32 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
 	return err;
 }
 
+/**
+ * xe_guc_ct_restart() - Restart GuC CT
+ * @ct: the &xe_guc_ct
+ *
+ * Restart GuC CT to an empty state without issuing a CT register MMIO command.
+ *
+ * Return: 0 on success, or a negative errno on failure.
+ */
+int xe_guc_ct_restart(struct xe_guc_ct *ct)
+{
+	return __xe_guc_ct_start(ct, false);
+}
+
+/**
+ * xe_guc_ct_enable() - Enable GuC CT
+ * @ct: the &xe_guc_ct
+ *
+ * Enable GuC CT to an empty state and issue a CT register MMIO command.
+ *
+ * Return: 0 on success, or a negative errno on failure.
+ */
+int xe_guc_ct_enable(struct xe_guc_ct *ct)
+{
+	return __xe_guc_ct_start(ct, true);
+}
+
 static void stop_g2h_handler(struct xe_guc_ct *ct)
 {
 	cancel_work_sync(&ct->g2h_worker);
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index 0a88f4e447fa..b1cba250c51c 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -15,6 +15,7 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct);
 int xe_guc_ct_init(struct xe_guc_ct *ct);
 int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
 int xe_guc_ct_enable(struct xe_guc_ct *ct);
+int xe_guc_ct_restart(struct xe_guc_ct *ct);
 void xe_guc_ct_disable(struct xe_guc_ct *ct);
 void xe_guc_ct_stop(struct xe_guc_ct *ct);
 void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 27/36] drm/xe/vf: Abort VF post migration recovery on failure
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (25 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 26/36] drm/xe/vf: Start CTs before resfix " Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-01 14:06   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 28/36] drm/xe/vf: Replay GuC submission state on pause / unpause Matthew Brost
                   ` (11 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

If VF post-migration recovery fails, the device is wedged. However,
submission queues still need to be enabled for proper cleanup. In such
cases, call into the GuC submission backend to restart all queues that
were previously paused.

v3:
 - s/Avort/Abort (Tomasz)
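
A sketch of the failure path this patch creates (names from this
series): even a device about to be wedged needs its schedulers
restarted so cleanup messages can still be processed.

  static void recovery_fail_sketch(struct xe_gt *gt)
  {
          vf_post_migration_abort(gt);            /* clear flag, restart schedulers */
          xe_gt_sriov_err(gt, "migration recovery failed\n");
          xe_device_declare_wedged(gt_to_xe(gt)); /* cleanup can now complete */
  }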

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 10 ++++++++++
 drivers/gpu/drm/xe/xe_guc_submit.c  | 20 ++++++++++++++++++++
 drivers/gpu/drm/xe/xe_guc_submit.h  |  1 +
 3 files changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index cb3e9f6e83fa..9f33561b91c6 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1224,6 +1224,15 @@ static void vf_post_migration_kickstart(struct xe_gt *gt)
 	xe_guc_submit_unpause(&gt->uc.guc);
 }
 
+static void vf_post_migration_abort(struct xe_gt *gt)
+{
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false);
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	xe_guc_submit_pause_abort(&gt->uc.guc);
+}
+
 static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
 {
 	bool skip_resfix = false;
@@ -1282,6 +1291,7 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	xe_gt_sriov_notice(gt, "migration recovery ended\n");
 	return;
 fail:
+	vf_post_migration_abort(gt);
 	xe_pm_runtime_put(xe);
 	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
 	xe_device_declare_wedged(xe);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 9320fe9fbb29..99ea9b3507cd 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2359,6 +2359,26 @@ void xe_guc_submit_unpause(struct xe_guc *guc)
 	wake_up_all(&guc->ct.wq);
 }
 
+/**
+ * xe_guc_submit_pause_abort - Abort all paused submission tasks on given GuC.
+ * @guc: the &xe_guc struct instance whose scheduler is to be aborted
+ */
+void xe_guc_submit_pause_abort(struct xe_guc *guc)
+{
+	struct xe_exec_queue *q;
+	unsigned long index;
+
+	mutex_lock(&guc->submission_state.lock);
+	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+		struct xe_gpu_scheduler *sched = &q->guc->sched;
+
+		xe_sched_submission_start(sched);
+		if (exec_queue_killed_or_banned_or_wedged(q))
+			xe_guc_exec_queue_trigger_cleanup(q);
+	}
+	mutex_unlock(&guc->submission_state.lock);
+}
+
 static struct xe_exec_queue *
 g2h_exec_queue_lookup(struct xe_guc *guc, u32 guc_id)
 {
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index f535fe3895e5..fe82c317048e 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -22,6 +22,7 @@ void xe_guc_submit_stop(struct xe_guc *guc);
 int xe_guc_submit_start(struct xe_guc *guc);
 void xe_guc_submit_pause(struct xe_guc *guc);
 void xe_guc_submit_unpause(struct xe_guc *guc);
+void xe_guc_submit_pause_abort(struct xe_guc *guc);
 void xe_guc_submit_wedge(struct xe_guc *guc);
 
 int xe_guc_read_stopped(struct xe_guc *guc);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 28/36] drm/xe/vf: Replay GuC submission state on pause / unpause
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (26 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 27/36] drm/xe/vf: Abort VF post migration recovery on failure Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-01 14:37   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 29/36] drm/xe: Move queue init before LRC creation Matthew Brost
                   ` (10 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Fixup GuC submission pause / unpause functions to properly replay any
possible state lost during VF post migration recovery.

v3:
 - Add helpers for revert / replay (Tomasz)
 - Add comment around WQ NOPs (Tomasz)
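
A rough sketch of how the new flags might be consumed (flag names are
from the diff below; the exact replay wiring lives in the patch body,
and the xe_sched_msg_unlock() counterpart to xe_sched_msg_lock() is
assumed): pause records what each queue lost, and unpause pushes the
matching message to the head of the scheduler's message list so it
runs before any new work.

  static void replay_sketch(struct xe_exec_queue *q, struct xe_sched_msg *msg)
  {
          struct xe_gpu_scheduler *sched = &q->guc->sched;

          xe_sched_msg_lock(sched);
          if (q->guc->needs_resume || q->guc->needs_suspend ||
              q->guc->needs_cleanup)
                  xe_sched_add_msg_head(sched, msg);      /* ahead of new work */
          xe_sched_msg_unlock(sched);
  }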

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gpu_scheduler.c        |  14 ++
 drivers/gpu/drm/xe/xe_gpu_scheduler.h        |   2 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c          |   1 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |  15 ++
 drivers/gpu/drm/xe/xe_guc_submit.c           | 242 +++++++++++++++++--
 drivers/gpu/drm/xe/xe_guc_submit.h           |   1 +
 drivers/gpu/drm/xe/xe_sched_job_types.h      |   4 +
 7 files changed, 264 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
index 455ccaf17314..af300adc7e1a 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
@@ -135,3 +135,17 @@ void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
 	list_add_tail(&msg->link, &sched->msgs);
 	xe_sched_process_msg_queue(sched);
 }
+
+/**
+ * xe_sched_add_msg_head() - Xe GPU scheduler add message to head of list
+ * @sched: Xe GPU scheduler
+ * @msg: Message to add
+ */
+void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
+			   struct xe_sched_msg *msg)
+{
+	lockdep_assert_held(&sched->base.job_list_lock);
+
+	list_add(&msg->link, &sched->msgs);
+	xe_sched_process_msg_queue(sched);
+}
diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
index e548b2aed95a..010003a6103a 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
@@ -29,6 +29,8 @@ void xe_sched_add_msg(struct xe_gpu_scheduler *sched,
 		      struct xe_sched_msg *msg);
 void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
 			     struct xe_sched_msg *msg);
+void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
+			   struct xe_sched_msg *msg);
 
 static inline void xe_sched_msg_lock(struct xe_gpu_scheduler *sched)
 {
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 9f33561b91c6..0d94867dce8e 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1217,6 +1217,7 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 static void vf_post_migration_rearm(struct xe_gt *gt)
 {
 	xe_guc_ct_restart(&gt->uc.guc.ct);
+	xe_guc_submit_unpause_prepare(&gt->uc.guc);
 }
 
 static void vf_post_migration_kickstart(struct xe_gt *gt)
diff --git a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
index c30c0e3ccbbb..a3b034e4b205 100644
--- a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
@@ -51,6 +51,21 @@ struct xe_guc_exec_queue {
 	wait_queue_head_t suspend_wait;
 	/** @suspend_pending: a suspend of the exec_queue is pending */
 	bool suspend_pending;
+	/**
+	 * @needs_cleanup: Needs a cleanup message during VF post migration
+	 * recovery.
+	 */
+	bool needs_cleanup;
+	/**
+	 * @needs_suspend: Needs a suspend message during VF post migration
+	 * recovery.
+	 */
+	bool needs_suspend;
+	/**
+	 * @needs_resume: Needs a resume message during VF post migration
+	 * recovery.
+	 */
+	bool needs_resume;
 };
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 99ea9b3507cd..497a736c23c3 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -424,6 +424,11 @@ static void set_exec_queue_destroyed(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_DESTROYED, &q->guc->state);
 }
 
+static void clear_exec_queue_destroyed(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_DESTROYED, &q->guc->state);
+}
+
 static bool exec_queue_banned(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_BANNED;
@@ -504,7 +509,12 @@ static void set_exec_queue_extra_ref(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
 }
 
-static bool __maybe_unused exec_queue_pending_resume(struct xe_exec_queue *q)
+static void clear_exec_queue_extra_ref(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
+}
+
+static bool exec_queue_pending_resume(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
 }
@@ -519,7 +529,7 @@ static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
 	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
 }
 
-static bool __maybe_unused exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+static bool exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
 }
@@ -1079,7 +1089,7 @@ static void wq_item_append(struct xe_exec_queue *q)
 }
 
 #define RESUME_PENDING	~0x0ull
-static void submit_exec_queue(struct xe_exec_queue *q)
+static void submit_exec_queue(struct xe_exec_queue *q, struct xe_sched_job *job)
 {
 	struct xe_guc *guc = exec_queue_to_guc(q);
 	struct xe_lrc *lrc = q->lrc[0];
@@ -1091,10 +1101,13 @@ static void submit_exec_queue(struct xe_exec_queue *q)
 
 	xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q));
 
-	if (xe_exec_queue_is_parallel(q))
-		wq_item_append(q);
-	else
-		xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
+	if (!job->skip_emit || job->last_replay) {
+		if (xe_exec_queue_is_parallel(q))
+			wq_item_append(q);
+		else
+			xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
+		job->last_replay = false;
+	}
 
 	if (exec_queue_suspended(q) && !xe_exec_queue_is_parallel(q))
 		return;
@@ -1147,8 +1160,10 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	if (!killed_or_banned_or_wedged && !xe_sched_job_is_error(job)) {
 		if (!exec_queue_registered(q))
 			register_exec_queue(q, GUC_CONTEXT_NORMAL);
-		q->ring_ops->emit_job(job);
-		submit_exec_queue(q);
+		if (!job->skip_emit)
+			q->ring_ops->emit_job(job);
+		submit_exec_queue(q, job);
+		job->skip_emit = false;
 	}
 
 	/*
@@ -1865,6 +1880,7 @@ static void __guc_exec_queue_process_msg_resume(struct xe_sched_msg *msg)
 #define RESUME		4
 #define OPCODE_MASK	0xf
 #define MSG_LOCKED	BIT(8)
+#define MSG_HEAD	BIT(9)
 
 static void guc_exec_queue_process_msg(struct xe_sched_msg *msg)
 {
@@ -1989,12 +2005,24 @@ static void guc_exec_queue_add_msg(struct xe_exec_queue *q, struct xe_sched_msg
 	msg->private_data = q;
 
 	trace_xe_sched_msg_add(msg);
-	if (opcode & MSG_LOCKED)
+	if (opcode & MSG_HEAD)
+		xe_sched_add_msg_head(&q->guc->sched, msg);
+	else if (opcode & MSG_LOCKED)
 		xe_sched_add_msg_locked(&q->guc->sched, msg);
 	else
 		xe_sched_add_msg(&q->guc->sched, msg);
 }
 
+static void guc_exec_queue_try_add_msg_head(struct xe_exec_queue *q,
+					    struct xe_sched_msg *msg,
+					    u32 opcode)
+{
+	if (!list_empty(&msg->link))
+		return;
+
+	guc_exec_queue_add_msg(q, msg, opcode | MSG_LOCKED | MSG_HEAD);
+}
+
 static bool guc_exec_queue_try_add_msg(struct xe_exec_queue *q,
 				       struct xe_sched_msg *msg,
 				       u32 opcode)
@@ -2278,6 +2306,105 @@ void xe_guc_submit_stop(struct xe_guc *guc)
 
 }
 
+static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
+{
+	bool pending_enable, pending_disable, pending_resume;
+
+	pending_enable = exec_queue_pending_enable(q);
+	pending_resume = exec_queue_pending_resume(q);
+
+	if (pending_enable && pending_resume)
+		q->guc->needs_resume = true;
+
+	if (pending_enable && !pending_resume &&
+	    !exec_queue_pending_tdr_exit(q)) {
+		clear_exec_queue_registered(q);
+		if (xe_exec_queue_is_lr(q))
+			xe_exec_queue_put(q);
+	}
+
+	if (pending_enable) {
+		clear_exec_queue_enabled(q);
+		clear_exec_queue_pending_resume(q);
+		clear_exec_queue_pending_tdr_exit(q);
+		clear_exec_queue_pending_enable(q);
+	}
+
+	if (exec_queue_destroyed(q) && exec_queue_registered(q)) {
+		clear_exec_queue_destroyed(q);
+		if (exec_queue_extra_ref(q))
+			xe_exec_queue_put(q);
+		else
+			q->guc->needs_cleanup = true;
+		clear_exec_queue_extra_ref(q);
+	}
+
+	pending_disable = exec_queue_pending_disable(q);
+
+	if (pending_disable && exec_queue_suspended(q)) {
+		clear_exec_queue_suspended(q);
+		q->guc->needs_suspend = true;
+	}
+
+	if (pending_disable) {
+		if (!pending_enable)
+			set_exec_queue_enabled(q);
+		clear_exec_queue_pending_disable(q);
+		clear_exec_queue_check_timeout(q);
+	}
+
+	q->guc->resume_time = 0;
+}
+
+/*
+ * This function is quite complex, but it is the only real way to ensure no
+ * state is lost during VF resume flows. The function scans the queue state,
+ * makes adjustments as needed, and queues jobs / messages which are
+ * replayed upon unpause.
+ */
+static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
+{
+	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	struct xe_sched_job *job;
+	int i;
+
+	lockdep_assert_held(&guc->submission_state.lock);
+
+	/* Stop scheduling + flush any DRM scheduler operations */
+	xe_sched_submission_stop(sched);
+	if (xe_exec_queue_is_lr(q))
+		cancel_work_sync(&q->guc->lr_tdr);
+	else
+		cancel_delayed_work_sync(&sched->base.work_tdr);
+
+	guc_exec_queue_revert_pending_state_change(q);
+
+	if (xe_exec_queue_is_parallel(q)) {
+		struct xe_device *xe = guc_to_xe(guc);
+		struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
+
+		/*
+		 * NOP existing WQ commands that may contain stale GGTT
+		 * addresses. These will be replayed upon unpause. The hardware
+		 * seems to get confused if the WQ head/tail pointers are
+		 * adjusted.
+		 */
+		for (i = 0; i < WQ_SIZE / sizeof(u32); ++i)
+			parallel_write(xe, map, wq[i],
+				       FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
+				       FIELD_PREP(WQ_LEN_MASK, 0));
+	}
+
+	job = xe_sched_first_pending_job(sched);
+	if (job) {
+		/*
+		 * Adjust the software tail so resubmitted jobs overwrite their
+		 * previous position in the ring buffer with new GGTT addresses.
+		 */
+		for (i = 0; i < q->width; ++i)
+			q->lrc[i]->ring.tail = job->ptrs[i].head;
+	}
+}
+
 /**
  * xe_guc_submit_pause - Stop further runs of submission tasks on given GuC.
  * @guc: the &xe_guc struct instance whose scheduler is to be disabled
@@ -2287,8 +2414,12 @@ void xe_guc_submit_pause(struct xe_guc *guc)
 	struct xe_exec_queue *q;
 	unsigned long index;
 
+	xe_gt_assert(guc_to_gt(guc), vf_recovery(guc));
+
+	mutex_lock(&guc->submission_state.lock);
 	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
-		xe_sched_submission_stop_async(&q->guc->sched);
+		guc_exec_queue_pause(guc, q);
+	mutex_unlock(&guc->submission_state.lock);
 }
 
 static void guc_exec_queue_start(struct xe_exec_queue *q)
@@ -2337,11 +2468,92 @@ int xe_guc_submit_start(struct xe_guc *guc)
 	return 0;
 }
 
-static void guc_exec_queue_unpause(struct xe_exec_queue *q)
+static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
+					   struct xe_exec_queue *q)
 {
 	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	struct drm_sched_job *s_job;
+	struct xe_sched_job *job = NULL;
+
+	list_for_each_entry(s_job, &sched->base.pending_list, list) {
+		job = to_xe_sched_job(s_job);
+
+		q->ring_ops->emit_job(job);
+		job->skip_emit = true;
+	}
 
+	if (job)
+		job->last_replay = true;
+}
+
+/**
+ * xe_guc_submit_unpause_prepare - Prepare unpause submission tasks on given GuC.
+ * @guc: the &xe_guc struct instance whose scheduler is to be prepared for unpause
+ */
+void xe_guc_submit_unpause_prepare(struct xe_guc *guc)
+{
+	struct xe_exec_queue *q;
+	unsigned long index;
+
+	xe_gt_assert(guc_to_gt(guc), vf_recovery(guc));
+
+	mutex_lock(&guc->submission_state.lock);
+	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
+		guc_exec_queue_unpause_prepare(guc, q);
+	mutex_unlock(&guc->submission_state.lock);
+}
+
+static void guc_exec_queue_replay_pending_state_change(struct xe_exec_queue *q)
+{
+	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	struct xe_sched_msg *msg;
+
+	if (q->guc->needs_cleanup) {
+		msg = q->guc->static_msgs + STATIC_MSG_CLEANUP;
+
+		guc_exec_queue_add_msg(q, msg, CLEANUP);
+		q->guc->needs_cleanup = false;
+	}
+
+	if (q->guc->needs_suspend) {
+		msg = q->guc->static_msgs + STATIC_MSG_SUSPEND;
+
+		xe_sched_msg_lock(sched);
+		guc_exec_queue_try_add_msg_head(q, msg, SUSPEND);
+		xe_sched_msg_unlock(sched);
+
+		q->guc->needs_suspend = false;
+	}
+
+	/*
+	 * The resume must be in the message queue before the suspend, as it is
+	 * not possible for a resume to be issued while a suspend is pending,
+	 * but the inverse is possible.
+	 */
+	if (q->guc->needs_resume) {
+		msg = q->guc->static_msgs + STATIC_MSG_RESUME;
+
+		xe_sched_msg_lock(sched);
+		guc_exec_queue_try_add_msg_head(q, msg, RESUME);
+		xe_sched_msg_unlock(sched);
+
+		q->guc->needs_resume = false;
+	}
+}
+
+static void guc_exec_queue_unpause(struct xe_guc *guc, struct xe_exec_queue *q)
+{
+	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	bool needs_tdr = exec_queue_killed_or_banned_or_wedged(q);
+
+	lockdep_assert_held(&guc->submission_state.lock);
+
+	xe_sched_resubmit_jobs(sched);
+	guc_exec_queue_replay_pending_state_change(q);
 	xe_sched_submission_start(sched);
+	if (needs_tdr)
+		xe_guc_exec_queue_trigger_cleanup(q);
+	xe_sched_submission_resume_tdr(sched);
 }
 
 /**
@@ -2353,10 +2565,10 @@ void xe_guc_submit_unpause(struct xe_guc *guc)
 	struct xe_exec_queue *q;
 	unsigned long index;
 
+	mutex_lock(&guc->submission_state.lock);
 	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
-		guc_exec_queue_unpause(q);
-
-	wake_up_all(&guc->ct.wq);
+		guc_exec_queue_unpause(guc, q);
+	mutex_unlock(&guc->submission_state.lock);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index fe82c317048e..b49a2748ec46 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -22,6 +22,7 @@ void xe_guc_submit_stop(struct xe_guc *guc);
 int xe_guc_submit_start(struct xe_guc *guc);
 void xe_guc_submit_pause(struct xe_guc *guc);
 void xe_guc_submit_unpause(struct xe_guc *guc);
+void xe_guc_submit_unpause_prepare(struct xe_guc *guc);
 void xe_guc_submit_pause_abort(struct xe_guc *guc);
 void xe_guc_submit_wedge(struct xe_guc *guc);
 
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index 7ce58765a34a..13e7a12b03ad 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -63,6 +63,10 @@ struct xe_sched_job {
 	bool ring_ops_flush_tlb;
 	/** @ggtt: mapped in ggtt. */
 	bool ggtt;
+	/** @skip_emit: skip emitting the job */
+	bool skip_emit;
+	/** @last_replay: last job being replayed */
+	bool last_replay;
 	/** @ptrs: per instance pointers. */
 	struct xe_job_ptrs ptrs[];
 };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 29/36] drm/xe: Move queue init before LRC creation
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (27 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 28/36] drm/xe/vf: Replay GuC submission state on pause / unpause Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-02  0:44   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 30/36] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery Matthew Brost
                   ` (9 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

A queue must be in the submission backend's tracking state before the
LRC is created to avoid a race condition where the LRC's GGTT addresses
are not properly fixed up during VF post-migration recovery.

Move the queue initialization, which adds the queue to the submission
backend's tracking state, before LRC creation.
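
To see the race this closes, consider the old ordering (LRC created
before q->ops->init), sketched here as comments under the assumption
that recovery fixups only walk queues already in the backend tracking:

	/*
	 * CPU0 (queue creation)    CPU1 (VF post-migration recovery)
	 *
	 * xe_lrc_create()          <- LRC holds pre-migration GGTT addrs
	 *                          walk backend tracking: q not present,
	 *                          so its LRCs are never fixed up
	 * q->ops->init()           <- q finally visible, but too late
	 */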

v2:
 - Wait on VF GGTT fixes before creating LRC (testing)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_exec_queue.c        | 43 +++++++++++++++++------
 drivers/gpu/drm/xe/xe_execlist.c          |  2 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 39 +++++++++++++++++++-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 ++
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  5 +++
 drivers/gpu/drm/xe/xe_guc_submit.c        |  2 +-
 drivers/gpu/drm/xe/xe_lrc.h               | 10 ++++++
 7 files changed, 90 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 81f707d2c388..3db8e64d9d13 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -15,6 +15,7 @@
 #include "xe_dep_scheduler.h"
 #include "xe_device.h"
 #include "xe_gt.h"
+#include "xe_gt_sriov_vf.h"
 #include "xe_hw_engine_class_sysfs.h"
 #include "xe_hw_engine_group.h"
 #include "xe_hw_fence.h"
@@ -179,17 +180,32 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q)
 			flags |= XE_LRC_CREATE_RUNALONE;
 	}
 
+	err = q->ops->init(q);
+	if (err)
+		return err;
+
+	/*
+	 * This must occur after q->ops->init to avoid race conditions during VF
+	 * post-migration recovery, as the fixups for the LRC GGTT addresses
+	 * depend on the queue being present in the backend tracking structure.
+	 *
+	 * In addition to the above, we must wait on in-flight GGTT changes to
+	 * avoid writing out stale values here.
+	 */
+	xe_gt_sriov_vf_wait_valid_ggtt(q->gt);
 	for (i = 0; i < q->width; ++i) {
-		q->lrc[i] = xe_lrc_create(q->hwe, q->vm, SZ_16K, q->msix_vec, flags);
-		if (IS_ERR(q->lrc[i])) {
-			err = PTR_ERR(q->lrc[i]);
+		struct xe_lrc *lrc;
+
+		lrc = xe_lrc_create(q->hwe, q->vm, xe_lrc_ring_size(),
+				    q->msix_vec, flags);
+		if (IS_ERR(lrc)) {
+			err = PTR_ERR(lrc);
 			goto err_lrc;
 		}
-	}
 
-	err = q->ops->init(q);
-	if (err)
-		goto err_lrc;
+		/* Pairs with READ_ONCE in xe_exec_queue_contexts_hwsp_rebase */
+		WRITE_ONCE(q->lrc[i], lrc);
+	}
 
 	return 0;
 
@@ -1095,9 +1111,16 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
 	int err = 0;
 
 	for (i = 0; i < q->width; ++i) {
-		xe_lrc_update_memirq_regs_with_address(q->lrc[i], q->hwe, scratch);
-		xe_lrc_update_hwctx_regs_with_address(q->lrc[i]);
-		err = xe_lrc_setup_wa_bb_with_scratch(q->lrc[i], q->hwe, scratch);
+		struct xe_lrc *lrc;
+
+		/* Pairs with WRITE_ONCE in __xe_exec_queue_init */
+		lrc = READ_ONCE(q->lrc[i]);
+		if (!lrc)
+			continue;
+
+		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
+		xe_lrc_update_hwctx_regs_with_address(lrc);
+		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
 		if (err)
 			break;
 	}
diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
index f83d421ac9d3..769d05517f93 100644
--- a/drivers/gpu/drm/xe/xe_execlist.c
+++ b/drivers/gpu/drm/xe/xe_execlist.c
@@ -339,7 +339,7 @@ static int execlist_exec_queue_init(struct xe_exec_queue *q)
 	const struct drm_sched_init_args args = {
 		.ops = &drm_sched_ops,
 		.num_rqs = 1,
-		.credit_limit = q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES,
+		.credit_limit = xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES,
 		.hang_limit = XE_SCHED_HANG_LIMIT,
 		.timeout = XE_SCHED_JOB_TIMEOUT,
 		.name = q->hwe->name,
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 0d94867dce8e..42f9fd43b436 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -482,6 +482,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
 				 shift, config->ggtt_base);
 		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
 	}
+
+	WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
+	smp_wmb();	/* Ensure above write visible before wake */
+	wake_up_all(&gt->sriov.vf.migration.wq);
+
 out:
 	up_write(config->lock);
 	return err;
@@ -820,7 +825,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
 	    !gt->sriov.vf.migration.recovery_teardown) {
 		gt->sriov.vf.migration.recovery_queued = true;
 		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
-		smp_wmb();	/* Ensure above write visable before wake */
+		WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, true);
+	smp_wmb();	/* Ensure above writes visible before wake */
 
 		wake_up_all(&gt->uc.guc.ct.wq);
 
@@ -1344,6 +1350,7 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 		&tile->primary_gt->sriov.vf.self_config.__lock;
 	spin_lock_init(&gt->sriov.vf.migration.lock);
 	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
+	init_waitqueue_head(&gt->sriov.vf.migration.wq);
 
 	return 0;
 }
@@ -1387,3 +1394,33 @@ bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt)
 	return (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
 		READ_ONCE(gt->sriov.vf.migration.recovery_inprogress));
 }
+
+static bool vf_valid_ggtt(struct xe_gt *gt)
+{
+	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
+
+	if (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
+	    READ_ONCE(gt->sriov.vf.migration.ggtt_need_fixes))
+		return false;
+
+	return true;
+}
+
+/**
+ * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
+ * @gt: the &xe_gt
+ */
+void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
+{
+	int ret;
+
+	if (!IS_SRIOV_VF(gt_to_xe(gt)))
+		return;
+
+	ret = wait_event_interruptible_timeout(gt->sriov.vf.migration.wq,
+					       vf_valid_ggtt(gt),
+					       HZ * 5);
+	XE_WARN_ON(!ret);
+}
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 71e1d566da81..20cc0c4c32e3 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -40,4 +40,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p);
 void xe_gt_sriov_vf_print_runtime(struct xe_gt *gt, struct drm_printer *p);
 void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
 
+void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index e135018cba1e..3c3e415199d1 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -8,6 +8,7 @@
 
 #include <linux/rwsem.h>
 #include <linux/types.h>
+#include <linux/wait.h>
 #include <linux/workqueue.h>
 #include "xe_uc_fw_types.h"
 
@@ -61,6 +62,8 @@ struct xe_gt_sriov_vf_migration {
 	struct work_struct worker;
 	/** @lock: Protects recovery_queued, teardown */
 	spinlock_t lock;
+	/** @wq: wait queue for migration fixes */
+	wait_queue_head_t wq;
 	/** @scratch: Scratch memory for VF recovery */
 	void *scratch;
 	/** @recovery_teardown: VF post migration recovery is being torn down */
@@ -69,6 +72,8 @@ struct xe_gt_sriov_vf_migration {
 	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
 	bool recovery_inprogress;
+	/** @ggtt_need_fixes: VF GGTT needs fixes */
+	bool ggtt_need_fixes;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 497a736c23c3..7fe3fb07e35e 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1943,7 +1943,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
 	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
 		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
 	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
-			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
+			    NULL, xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES, 64,
 			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
 			    q->name, gt_to_xe(q->gt)->drm.dev);
 	if (err)
diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
index 188565465779..5fb6c74bdab5 100644
--- a/drivers/gpu/drm/xe/xe_lrc.h
+++ b/drivers/gpu/drm/xe/xe_lrc.h
@@ -74,6 +74,16 @@ static inline void xe_lrc_put(struct xe_lrc *lrc)
 	kref_put(&lrc->refcount, xe_lrc_destroy);
 }
 
+/**
+ * xe_lrc_ring_size() - Xe LRC ring size
+ *
+ * Return: Size of the LRC ring
+ */
+static inline size_t xe_lrc_ring_size(void)
+{
+	return SZ_16K;
+}
+
 size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class);
 u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc);
 u32 xe_lrc_regs_offset(struct xe_lrc *lrc);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 30/36] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (28 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 29/36] drm/xe: Move queue init before LRC creation Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-02  1:02   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 31/36] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause Matthew Brost
                   ` (8 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

These debug prints are helpful to manually verify that the GuC state machine
can correctly replay state during a VF post-migration recovery. All replay
paths have been manually verified as triggered and working during testing.
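
With these prints enabled, a recovery cycle shows which replay paths
fired, e.g. (guc_id/seqno values are illustrative; the exact line
prefix depends on how xe_gt_dbg decorates messages):

	Replay SUSPEND - guc_id=2
	Replay JOB - guc_id=2, seqno=17
	Replay RESUME - guc_id=2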

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 7fe3fb07e35e..bc717403740c 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2306,21 +2306,27 @@ void xe_guc_submit_stop(struct xe_guc *guc)
 
 }
 
-static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
+static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
+						       struct xe_exec_queue *q)
 {
 	bool pending_enable, pending_disable, pending_resume;
 
 	pending_enable = exec_queue_pending_enable(q);
 	pending_resume = exec_queue_pending_resume(q);
 
-	if (pending_enable && pending_resume)
+	if (pending_enable && pending_resume) {
 		q->guc->needs_resume = true;
+		xe_gt_dbg(guc_to_gt(guc), "Replay RESUME - guc_id=%d",
+			  q->guc->id);
+	}
 
 	if (pending_enable && !pending_resume &&
 	    !exec_queue_pending_tdr_exit(q)) {
 		clear_exec_queue_registered(q);
 		if (xe_exec_queue_is_lr(q))
 			xe_exec_queue_put(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay REGISTER - guc_id=%d",
+			  q->guc->id);
 	}
 
 	if (pending_enable) {
@@ -2328,6 +2334,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 		clear_exec_queue_pending_resume(q);
 		clear_exec_queue_pending_tdr_exit(q);
 		clear_exec_queue_pending_enable(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay ENABLE - guc_id=%d",
+			  q->guc->id);
 	}
 
 	if (exec_queue_destroyed(q) && exec_queue_registered(q)) {
@@ -2337,6 +2345,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 		else
 			q->guc->needs_cleanup = true;
 		clear_exec_queue_extra_ref(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay CLEANUP - guc_id=%d",
+			  q->guc->id);
 	}
 
 	pending_disable = exec_queue_pending_disable(q);
@@ -2344,6 +2354,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 	if (pending_disable && exec_queue_suspended(q)) {
 		clear_exec_queue_suspended(q);
 		q->guc->needs_suspend = true;
+		xe_gt_dbg(guc_to_gt(guc), "Replay SUSPEND - guc_id=%d",
+			  q->guc->id);
 	}
 
 	if (pending_disable) {
@@ -2351,6 +2363,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 			set_exec_queue_enabled(q);
 		clear_exec_queue_pending_disable(q);
 		clear_exec_queue_check_timeout(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay DISABLE - guc_id=%d",
+			  q->guc->id);
 	}
 
 	q->guc->resume_time = 0;
@@ -2376,7 +2390,7 @@ static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
 	else
 		cancel_delayed_work_sync(&sched->base.work_tdr);
 
-	guc_exec_queue_revert_pending_state_change(q);
+	guc_exec_queue_revert_pending_state_change(guc, q);
 
 	if (xe_exec_queue_is_parallel(q)) {
 		struct xe_device *xe = guc_to_xe(guc);
@@ -2478,6 +2492,9 @@ static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
 	list_for_each_entry(s_job, &sched->base.pending_list, list) {
 		job = to_xe_sched_job(s_job);
 
+		xe_gt_dbg(guc_to_gt(guc), "Replay JOB - guc_id=%d, seqno=%d",
+			  q->guc->id, xe_sched_job_seqno(job));
+
 		q->ring_ops->emit_job(job);
 		job->skip_emit = true;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 31/36] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (29 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 30/36] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-02  1:09   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 32/36] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups Matthew Brost
                   ` (7 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

A race condition exists where a paused VF's H2G request can be processed
and subsequently rejected. This rejection results in a FAST_REQ failure
being delivered to the KMD, which then terminates the CT via a dead
worker and triggers a GT reset, an undesirable outcome.

This workaround mitigates the issue by checking if a VF post-migration
recovery is in progress and aborting these adverse actions accordingly.
The GuC firmware will address this bug in an upcoming release. Once that
version is available and VF migration depends on it, this workaround can
be safely removed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 25efc1f813ce..89ee68828f07 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1394,6 +1394,10 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
 
 		fast_req_report(ct, fence);
 
+		/* FIXME: W/A race in the GuC, will get in firmware soon */
+		if (xe_gt_recovery_inprogress(gt))
+			return 0;
+
 		CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
 
 		return -EPROTO;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 32/36] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (30 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 31/36] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-10-02  1:25   ` Lis, Tomasz
  2025-09-29  2:55 ` [PATCH v3 33/36] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF Matthew Brost
                   ` (6 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>

The migrate VM builds the CCS metadata save/restore batch buffer (BB) in
advance and retains it so the GuC can submit it directly when saving a
VM’s state.

When a VM migrates between VFs, the GGTT base can change. Any GGTT-based
addresses embedded in the BB would then have to be parsed and patched.

Use PPGTT addresses in the BB (including for TLB invalidation) so the BB
remains GGTT-agnostic and requires no address fixups during migration.
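
For example, with NUM_KERNEL_PDE = 15 (per the comment below) and
assuming a 4 KiB XE_PAGE_SIZE, the BB's invalidation write lands at a
fixed PPGTT offset:

	addr = (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE
	     = 13 * 4096
	     = 0xd000

which, unlike a GGTT address, is stable across migrations.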

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 1d667fa36cf3..ad03afb5145f 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -980,15 +980,27 @@ struct xe_lrc *xe_migrate_lrc(struct xe_migrate *migrate)
 	return migrate->q->lrc[0];
 }
 
-static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *dw, int i,
-				 u32 flags)
+static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
 {
-	struct xe_lrc *lrc = xe_exec_queue_lrc(q);
+	/*
+	 * The migrate VM is self-referential so it can modify its own PTEs (see
+	 * pte_update_size() or emit_pte() functions). We reserve NUM_KERNEL_PDE
+	 * entries for kernel operations (copies, clears, CCS migrate), and
+	 * suballocate the rest to user operations (binds/unbinds). With
+	 * NUM_KERNEL_PDE = 15, NUM_KERNEL_PDE - 1 is already used for PTE updates,
+	 * so assign NUM_KERNEL_PDE - 2 for TLB invalidation.
+	 */
+	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
+}
+
+static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
+{
+	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
+
 	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
 		  MI_FLUSH_IMM_DW | flags;
-	dw[i++] = lower_32_bits(xe_lrc_start_seqno_ggtt_addr(lrc)) |
-		  MI_FLUSH_DW_USE_GTT;
-	dw[i++] = upper_32_bits(xe_lrc_start_seqno_ggtt_addr(lrc));
+	dw[i++] = lower_32_bits(addr);
+	dw[i++] = upper_32_bits(addr);
 	dw[i++] = MI_NOOP;
 	dw[i++] = MI_NOOP;
 
@@ -1101,11 +1113,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
 
 		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
 
-		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
+		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
 		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
 						  src_L0_ofs, dst_is_pltt,
 						  src_L0, ccs_ofs, true);
-		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
+		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
 
 		size -= src_L0;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 33/36] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (31 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 32/36] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 34/36] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL Matthew Brost
                   ` (5 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

VF CCS restore is a primary GT operation on which the media GT depends.
Therefore, it doesn't make much sense to run these operations in
parallel. To address this, point the media GT's ordered work queue to
the primary GT's ordered work queue on platforms that require CCS
restore as part of VF post-migration recovery (PTL VFs).
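
An ordered workqueue executes at most one work item at a time, in
queueing order, so sharing the primary GT's queue serializes the two
recoveries. A minimal sketch of the resulting behavior (the
primary_gt/media_gt names are illustrative; the worker is queued as
elsewhere in this series):

	/* On PTL VFs the media GT inherits the primary GT's ordered wq */
	queue_work(primary_gt->ordered_wq, &primary_gt->sriov.vf.migration.worker);
	queue_work(media_gt->ordered_wq, &media_gt->sriov.vf.migration.worker);

	/*
	 * Both items land on the same ordered wq, so they can never run
	 * concurrently; queueing order still decides which runs first,
	 * which a later patch handles by re-queuing the media GT.
	 */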

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h |  2 ++
 drivers/gpu/drm/xe/xe_gt.c           | 16 ++++++++++------
 drivers/gpu/drm/xe/xe_gt.h           |  2 +-
 drivers/gpu/drm/xe/xe_pci.c          |  6 +++++-
 drivers/gpu/drm/xe/xe_pci_types.h    |  1 +
 drivers/gpu/drm/xe/xe_tile.c         |  2 +-
 6 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index a6c361db11d9..af92c8cd43b6 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -318,6 +318,8 @@ struct xe_device {
 		u8 skip_mtcfg:1;
 		/** @info.skip_pcode: skip access to PCODE uC */
 		u8 skip_pcode:1;
+		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
+		u8 needs_shared_vf_gt_wq:1;
 	} info;
 
 	/** @wa_active: keep track of active workarounds */
diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 5f04d562604b..0c38cd30143c 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -72,7 +72,7 @@ static void gt_fini(struct drm_device *drm, void *arg)
 	destroy_workqueue(gt->ordered_wq);
 }
 
-struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
+struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)
 {
 	struct xe_gt *gt;
 	int err;
@@ -82,12 +82,16 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
 		return ERR_PTR(-ENOMEM);
 
 	gt->tile = tile;
-	gt->ordered_wq = alloc_ordered_workqueue("gt-ordered-wq",
-						 WQ_MEM_RECLAIM);
+	if (use_primary_wq) {
+		gt->ordered_wq = tile->primary_gt->ordered_wq;
+	} else {
+		gt->ordered_wq = alloc_ordered_workqueue("gt-ordered-wq",
+							 WQ_MEM_RECLAIM);
 
-	err = drmm_add_action_or_reset(&gt_to_xe(gt)->drm, gt_fini, gt);
-	if (err)
-		return ERR_PTR(err);
+		err = drmm_add_action_or_reset(&gt_to_xe(gt)->drm, gt_fini, gt);
+		if (err)
+			return ERR_PTR(err);
+	}
 
 	return gt;
 }
diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index ee0239b2f48c..2e3898c18746 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
 }
 
-struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
+struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
 int xe_gt_init_early(struct xe_gt *gt);
 int xe_gt_init(struct xe_gt *gt);
 void xe_gt_mmio_init(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 3f42b91efa28..25a1d96a68e7 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
 	.has_sriov = true,
 	.max_gt_per_tile = 2,
 	.needs_scratch = true,
+	.needs_shared_vf_gt_wq = true,
 };
 
 #undef PLATFORM
@@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
 	xe->info.skip_mtcfg = desc->skip_mtcfg;
 	xe->info.skip_pcode = desc->skip_pcode;
 	xe->info.needs_scratch = desc->needs_scratch;
+	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
 
 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
 				 xe_modparam.probe_display &&
@@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
 		 * Allocate and setup media GT for platforms with standalone
 		 * media.
 		 */
-		tile->media_gt = xe_gt_alloc(tile);
+		tile->media_gt = xe_gt_alloc(tile,
+					     xe->info.needs_shared_vf_gt_wq &&
+					     IS_SRIOV_VF(xe));
 		if (IS_ERR(tile->media_gt))
 			return PTR_ERR(tile->media_gt);
 
diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
index 9b9766a3baa3..b11bf6abda5b 100644
--- a/drivers/gpu/drm/xe/xe_pci_types.h
+++ b/drivers/gpu/drm/xe/xe_pci_types.h
@@ -48,6 +48,7 @@ struct xe_device_desc {
 	u8 skip_guc_pc:1;
 	u8 skip_mtcfg:1;
 	u8 skip_pcode:1;
+	u8 needs_shared_vf_gt_wq:1;
 };
 
 struct xe_graphics_desc {
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index d49ba3401963..a982732a8056 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -149,7 +149,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
 	if (err)
 		return err;
 
-	tile->primary_gt = xe_gt_alloc(tile);
+	tile->primary_gt = xe_gt_alloc(tile, false);
 	if (IS_ERR(tile->primary_gt))
 		return PTR_ERR(tile->primary_gt);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 34/36] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (32 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 33/36] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 35/36] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses Matthew Brost
                   ` (4 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

It is possible that the media GT's VF post-migration recovery work item
gets scheduled before the primary GT's work item. Since the media GT
depends on the primary GT's work item to complete CCS restore, detect
this condition and re-queue the media GT's work item for a later time.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 42f9fd43b436..9c1fea9f65d2 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1187,8 +1187,22 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 		   pf_version->major, pf_version->minor);
 }
 
-static void vf_post_migration_shutdown(struct xe_gt *gt)
+static bool vf_post_migration_shutdown(struct xe_gt *gt)
 {
+	struct xe_device *xe = gt_to_xe(gt);
+
+	/*
+	 * On platforms where CCS must be restored by the primary GT, the media
+	 * GT's VF post-migration recovery must run afterward. Detect this case
+	 * and re-queue the media GT's restore work item if necessary.
+	 */
+	if (xe->info.needs_shared_vf_gt_wq && xe_gt_is_media_type(gt)) {
+		struct xe_gt *primary_gt = gt_to_tile(gt)->primary_gt;
+
+		if (xe_gt_sriov_vf_recovery_inprogress(primary_gt))
+			return true;
+	}
+
 	spin_lock_irq(&gt->sriov.vf.migration.lock);
 	gt->sriov.vf.migration.recovery_queued = false;
 	spin_unlock_irq(&gt->sriov.vf.migration.lock);
@@ -1196,6 +1210,8 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
 	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
 	xe_guc_submit_pause(&gt->uc.guc);
 	xe_tlb_inval_reset(&gt->tlb_inval);
+
+	return false;
 }
 
 static size_t post_migration_scratch_size(struct xe_device *xe)
@@ -1270,11 +1286,14 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 {
 	struct xe_device *xe = gt_to_xe(gt);
 	int err;
+	bool retry;
 
 	xe_gt_sriov_dbg(gt, "migration recovery in progress\n");
 
 	xe_pm_runtime_get(xe);
-	vf_post_migration_shutdown(gt);
+	retry = vf_post_migration_shutdown(gt);
+	if (retry)
+		goto queue;
 
 	if (!xe_sriov_vf_migration_supported(xe)) {
 		xe_gt_sriov_err(gt, "migration is not supported\n");
@@ -1302,6 +1321,12 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	xe_pm_runtime_put(xe);
 	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
 	xe_device_declare_wedged(xe);
+	return;
+
+queue:
+	xe_gt_sriov_info(gt, "Re-queuing GT recovery\n");
+	queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
+	xe_pm_runtime_put(xe);
 }
 
 static void migration_worker_func(struct work_struct *w)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 35/36] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (33 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 34/36] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29  2:55 ` [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Matthew Brost
                   ` (3 subsequent siblings)
  38 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

Rebase the CCS save/restore BB's GGTT addresses during VF post-migration
recovery by setting the software ring tail to zero, the LRC ring head to
zero, and rewriting the jump-to-BB instructions.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c  |  4 ++++
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.c | 28 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 9c1fea9f65d2..d711301936b9 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -34,6 +34,7 @@
 #include "xe_pm.h"
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
+#include "xe_sriov_vf_ccs.h"
 #include "xe_tile_sriov_vf.h"
 #include "xe_tlb_inval.h"
 #include "xe_uc_fw.h"
@@ -1228,6 +1229,9 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 	if (err)
 		return err;
 
+	if (xe_gt_is_main_type(gt))
+		xe_sriov_vf_ccs_rebase(gt_to_xe(gt));
+
 	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
 	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
 	if (err)
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
index 8dec616c37c9..790249801364 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
@@ -175,6 +175,15 @@ static void ccs_rw_update_ring(struct xe_sriov_vf_ccs_ctx *ctx)
 	struct xe_lrc *lrc = xe_exec_queue_lrc(ctx->mig_q);
 	u32 dw[10], i = 0;
 
+	/*
+	 * XXX: Save/restore fixes: for some reason, the GuC only accepts the
+	 * save/restore context if the LRC head pointer is zero. This is evident
+	 * from repeated VF migrations failing when the LRC head pointer is
+	 * non-zero.
+	 */
+	lrc->ring.tail = 0;
+	xe_lrc_set_ring_head(lrc, 0);
+
 	dw[i++] = MI_ARB_ON_OFF | MI_ARB_ENABLE;
 	dw[i++] = MI_BATCH_BUFFER_START | XE_INSTR_NUM_DW(3);
 	dw[i++] = lower_32_bits(addr);
@@ -186,6 +195,25 @@ static void ccs_rw_update_ring(struct xe_sriov_vf_ccs_ctx *ctx)
 	xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
 }
 
+/**
+ * xe_sriov_vf_ccs_rebase - Rebase GGTT addresses for CCS save / restore
+ * @xe: the &xe_device.
+ */
+void xe_sriov_vf_ccs_rebase(struct xe_device *xe)
+{
+	enum xe_sriov_vf_ccs_rw_ctxs ctx_id;
+
+	if (!IS_VF_CCS_READY(xe))
+		return;
+
+	for_each_ccs_rw_ctx(ctx_id) {
+		struct xe_sriov_vf_ccs_ctx *ctx =
+			&xe->sriov.vf.ccs.contexts[ctx_id];
+
+		ccs_rw_update_ring(ctx);
+	}
+}
+
 static int register_save_restore_context(struct xe_sriov_vf_ccs_ctx *ctx)
 {
 	int ctx_type;
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h
index 0745c0ff0228..f8ca6efce9ee 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h
+++ b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h
@@ -18,6 +18,7 @@ int xe_sriov_vf_ccs_init(struct xe_device *xe);
 int xe_sriov_vf_ccs_attach_bo(struct xe_bo *bo);
 int xe_sriov_vf_ccs_detach_bo(struct xe_bo *bo);
 int xe_sriov_vf_ccs_register_context(struct xe_device *xe);
+void xe_sriov_vf_ccs_rebase(struct xe_device *xe);
 void xe_sriov_vf_ccs_print(struct xe_device *xe, struct drm_printer *p);
 
 static inline bool xe_sriov_vf_ccs_ready(struct xe_device *xe)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (34 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 35/36] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses Matthew Brost
@ 2025-09-29  2:55 ` Matthew Brost
  2025-09-29 15:17   ` K V P, Satyanarayana
  2025-09-29  3:06 ` ✗ CI.checkpatch: warning for VF migration redesign (rev3) Patchwork
                   ` (2 subsequent siblings)
  38 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  2:55 UTC (permalink / raw)
  To: intel-xe

From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>

Some VF2GUC actions may take longer to process. Increase the default timeout
after a received BUSY indication to 2 seconds to cover all worst-case
scenarios.
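
Assuming the timeout argument of xe_mmio_wait32() is in microseconds
(consistent with reading the previous 1000000 as 1 second), the new
2000000 value corresponds to the 2 seconds in the subject line.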

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
---
 drivers/gpu/drm/xe/xe_guc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index c016a11b6ab1..f0de1fa61898 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -1439,7 +1439,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
 		BUILD_BUG_ON((GUC_HXG_TYPE_RESPONSE_SUCCESS ^ GUC_HXG_TYPE_RESPONSE_FAILURE) != 1);
 
 		ret = xe_mmio_wait32(mmio, reply_reg, resp_mask, resp_mask,
-				     1000000, &header, false);
+				     2000000, &header, false);
 
 		if (unlikely(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, header) !=
 			     GUC_HXG_ORIGIN_GUC))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* ✗ CI.checkpatch: warning for VF migration redesign (rev3)
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (35 preceding siblings ...)
  2025-09-29  2:55 ` [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Matthew Brost
@ 2025-09-29  3:06 ` Patchwork
  2025-09-29  3:08 ` ✓ CI.KUnit: success " Patchwork
  2025-09-29  6:28 ` ✗ Xe.CI.Full: failure " Patchwork
  38 siblings, 0 replies; 83+ messages in thread
From: Patchwork @ 2025-09-29  3:06 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: VF migration redesign (rev3)
URL   : https://patchwork.freedesktop.org/series/154627/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
fbd08a78c3a3bb17964db2a326514c69c1dca660
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 6b1ab8ae0216a43239004cea48375d7e8a922d7c
Author: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Date:   Sun Sep 28 19:55:42 2025 -0700

    drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
    
    Some VF2GUC actions may take longer to process. Increase default timeout
    after received BUSY indication to 2sec to cover all worst case scenarios.
    
    Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
+ /mt/dim checkpatch e2a896e95ea5f65aa137dcf117bfd0d61176c8ce drm-intel
d8c50815e346 drm/xe: Add NULL checks to scratch LRC allocation
1ce14b575b00 drm/xe/vf: Lock querying GGTT config during driver init
0d1ce9c6e840 Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery"
1a9c9b825f32 Revert "drm/xe/vf: Post migration, repopulate ring area for pending request"
f7e68c9bef59 Revert "drm/xe/vf: Fixup CTB send buffer messages after migration"
a93a79b3496a drm/xe: Save off position in ring in which a job was programmed
4a81e99e537f drm/xe/guc: Track pending-enable source in submission state
d98ee26c8248 drm/xe: Track LR jobs in DRM scheduler pending list
d596834a320e drm/xe: Don't change LRC ring head on job resubmission
5e635a71b24c drm/xe: Make LRC W/A scratch buffer usage consistent
05b22c7143ef drm/xe/guc: Document GuC submission backend
ce7a7f8ac168 drm/xe/vf: Add xe_gt_recovery_inprogress helper
-:92: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 15)
#92: FILE: drivers/gpu/drm/xe/xe_gt_sriov_vf.c:1194:
+	if (!xe_device_uses_memirq(gt_to_xe(gt)))
+	       return false;

-:93: WARNING:TABSTOP: Statements should start on a tabstop
#93: FILE: drivers/gpu/drm/xe/xe_gt_sriov_vf.c:1195:
+	       return false;

-:212: ERROR:SPACING: space required after that ',' (ctx:VxV)
#212: FILE: drivers/gpu/drm/xe/xe_memirq.c:493:
+bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq,struct xe_guc *guc)
                                                             ^

-:233: ERROR:SPACING: space required after that ',' (ctx:VxV)
#233: FILE: drivers/gpu/drm/xe/xe_memirq.h:28:
+bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq,struct xe_guc *guc);
                                                             ^

total: 2 errors, 2 warnings, 0 checks, 177 lines checked
f8fa62ab6977 drm/xe/vf: Make VF recovery run on per-GT worker
31240bc5e7d0 drm/xe/vf: Abort H2G sends during VF post-migration recovery
f5af6ebcc032 drm/xe/vf: Remove memory allocations from VF post migration recovery
e3109acd36e4 drm/xe/vf: Close multi-GT GGTT shift race
959deb2cc97c drm/xe/vf: Teardown VF post migration worker on driver unload
1a83fbbc58b1 drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery
0cb2953404dc drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
ae59ef915b18 drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration
08245628e1f1 drm/xe/vf: Extra debug on GGTT shift
d28467c0bdc3 drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
599ba8a32549 drm/xe/vf: Flush and stop CTs in VF post migration recovery
4b0790dfe8ab drm/xe/vf: Reset TLB invalidations during VF post migration recovery
ba34c61d3cd9 drm/xe/vf: Kickstart after resfix in VF post migration recovery
252e2cbc6cac drm/xe/vf: Start CTs before resfix VF post migration recovery
bfce4740236c drm/xe/vf: Abort VF post migration recovery on failure
80bfb5d66feb drm/xe/vf: Replay GuC submission state on pause / unpause
897771368173 drm/xe: Move queue init before LRC creation
ea1294fd8f11 drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
2485424c37eb drm/xe/vf: Workaround for race condition in GuC firmware during VF pause
8c73148241b3 drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
27d17569dc05 drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
85dafe50dc85 drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
54be57cf3366 drm/xe/vf: Rebase CCS save/restore BB GGTT addresses
6b1ab8ae0216 drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC



^ permalink raw reply	[flat|nested] 83+ messages in thread

* ✓ CI.KUnit: success for VF migration redesign (rev3)
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (36 preceding siblings ...)
  2025-09-29  3:06 ` ✗ CI.checkpatch: warning for VF migration redesign (rev3) Patchwork
@ 2025-09-29  3:08 ` Patchwork
  2025-09-29  6:28 ` ✗ Xe.CI.Full: failure " Patchwork
  38 siblings, 0 replies; 83+ messages in thread
From: Patchwork @ 2025-09-29  3:08 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: VF migration redesign (rev3)
URL   : https://patchwork.freedesktop.org/series/154627/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[03:06:49] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:06:54] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[03:07:22] Starting KUnit Kernel (1/1)...
[03:07:22] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[03:07:23] ================== guc_buf (11 subtests) ===================
[03:07:23] [PASSED] test_smallest
[03:07:23] [PASSED] test_largest
[03:07:23] [PASSED] test_granular
[03:07:23] [PASSED] test_unique
[03:07:23] [PASSED] test_overlap
[03:07:23] [PASSED] test_reusable
[03:07:23] [PASSED] test_too_big
[03:07:23] [PASSED] test_flush
[03:07:23] [PASSED] test_lookup
[03:07:23] [PASSED] test_data
[03:07:23] [PASSED] test_class
[03:07:23] ===================== [PASSED] guc_buf =====================
[03:07:23] =================== guc_dbm (7 subtests) ===================
[03:07:23] [PASSED] test_empty
[03:07:23] [PASSED] test_default
[03:07:23] ======================== test_size  ========================
[03:07:23] [PASSED] 4
[03:07:23] [PASSED] 8
[03:07:23] [PASSED] 32
[03:07:23] [PASSED] 256
[03:07:23] ==================== [PASSED] test_size ====================
[03:07:23] ======================= test_reuse  ========================
[03:07:23] [PASSED] 4
[03:07:23] [PASSED] 8
[03:07:23] [PASSED] 32
[03:07:23] [PASSED] 256
[03:07:23] =================== [PASSED] test_reuse ====================
[03:07:23] =================== test_range_overlap  ====================
[03:07:23] [PASSED] 4
[03:07:23] [PASSED] 8
[03:07:23] [PASSED] 32
[03:07:23] [PASSED] 256
[03:07:23] =============== [PASSED] test_range_overlap ================
[03:07:23] =================== test_range_compact  ====================
[03:07:23] [PASSED] 4
[03:07:23] [PASSED] 8
[03:07:23] [PASSED] 32
[03:07:23] [PASSED] 256
[03:07:23] =============== [PASSED] test_range_compact ================
[03:07:23] ==================== test_range_spare  =====================
[03:07:23] [PASSED] 4
[03:07:23] [PASSED] 8
[03:07:23] [PASSED] 32
[03:07:23] [PASSED] 256
[03:07:23] ================ [PASSED] test_range_spare =================
[03:07:23] ===================== [PASSED] guc_dbm =====================
[03:07:23] =================== guc_idm (6 subtests) ===================
[03:07:23] [PASSED] bad_init
[03:07:23] [PASSED] no_init
[03:07:23] [PASSED] init_fini
[03:07:23] [PASSED] check_used
[03:07:23] [PASSED] check_quota
[03:07:23] [PASSED] check_all
[03:07:23] ===================== [PASSED] guc_idm =====================
[03:07:23] ================== no_relay (3 subtests) ===================
[03:07:23] [PASSED] xe_drops_guc2pf_if_not_ready
[03:07:23] [PASSED] xe_drops_guc2vf_if_not_ready
[03:07:23] [PASSED] xe_rejects_send_if_not_ready
[03:07:23] ==================== [PASSED] no_relay =====================
[03:07:23] ================== pf_relay (14 subtests) ==================
[03:07:23] [PASSED] pf_rejects_guc2pf_too_short
[03:07:23] [PASSED] pf_rejects_guc2pf_too_long
[03:07:23] [PASSED] pf_rejects_guc2pf_no_payload
[03:07:23] [PASSED] pf_fails_no_payload
[03:07:23] [PASSED] pf_fails_bad_origin
[03:07:23] [PASSED] pf_fails_bad_type
[03:07:23] [PASSED] pf_txn_reports_error
[03:07:23] [PASSED] pf_txn_sends_pf2guc
[03:07:23] [PASSED] pf_sends_pf2guc
[03:07:23] [SKIPPED] pf_loopback_nop
[03:07:23] [SKIPPED] pf_loopback_echo
[03:07:23] [SKIPPED] pf_loopback_fail
[03:07:23] [SKIPPED] pf_loopback_busy
[03:07:23] [SKIPPED] pf_loopback_retry
[03:07:23] ==================== [PASSED] pf_relay =====================
[03:07:23] ================== vf_relay (3 subtests) ===================
[03:07:23] [PASSED] vf_rejects_guc2vf_too_short
[03:07:23] [PASSED] vf_rejects_guc2vf_too_long
[03:07:23] [PASSED] vf_rejects_guc2vf_no_payload
[03:07:23] ==================== [PASSED] vf_relay =====================
[03:07:23] ===================== lmtt (1 subtest) =====================
[03:07:23] ======================== test_ops  =========================
[03:07:23] [PASSED] 2-level
[03:07:23] [PASSED] multi-level
[03:07:23] ==================== [PASSED] test_ops =====================
[03:07:23] ====================== [PASSED] lmtt =======================
[03:07:23] ================= pf_service (11 subtests) =================
[03:07:23] [PASSED] pf_negotiate_any
[03:07:23] [PASSED] pf_negotiate_base_match
[03:07:23] [PASSED] pf_negotiate_base_newer
[03:07:23] [PASSED] pf_negotiate_base_next
[03:07:23] [SKIPPED] pf_negotiate_base_older
[03:07:23] [PASSED] pf_negotiate_base_prev
[03:07:23] [PASSED] pf_negotiate_latest_match
[03:07:23] [PASSED] pf_negotiate_latest_newer
[03:07:23] [PASSED] pf_negotiate_latest_next
[03:07:23] [SKIPPED] pf_negotiate_latest_older
[03:07:23] [SKIPPED] pf_negotiate_latest_prev
[03:07:23] =================== [PASSED] pf_service ====================
[03:07:23] ================= xe_guc_g2g (2 subtests) ==================
[03:07:23] ============== xe_live_guc_g2g_kunit_default  ==============
[03:07:23] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[03:07:23] ============== xe_live_guc_g2g_kunit_allmem  ===============
[03:07:23] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[03:07:23] =================== [SKIPPED] xe_guc_g2g ===================
[03:07:23] =================== xe_mocs (2 subtests) ===================
[03:07:23] ================ xe_live_mocs_kernel_kunit  ================
[03:07:23] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[03:07:23] ================ xe_live_mocs_reset_kunit  =================
[03:07:23] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[03:07:23] ==================== [SKIPPED] xe_mocs =====================
[03:07:23] ================= xe_migrate (2 subtests) ==================
[03:07:23] ================= xe_migrate_sanity_kunit  =================
[03:07:23] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[03:07:23] ================== xe_validate_ccs_kunit  ==================
[03:07:23] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[03:07:23] =================== [SKIPPED] xe_migrate ===================
[03:07:23] ================== xe_dma_buf (1 subtest) ==================
[03:07:23] ==================== xe_dma_buf_kunit  =====================
[03:07:23] ================ [SKIPPED] xe_dma_buf_kunit ================
[03:07:23] =================== [SKIPPED] xe_dma_buf ===================
[03:07:23] ================= xe_bo_shrink (1 subtest) =================
[03:07:23] =================== xe_bo_shrink_kunit  ====================
[03:07:23] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[03:07:23] ================== [SKIPPED] xe_bo_shrink ==================
[03:07:23] ==================== xe_bo (2 subtests) ====================
[03:07:23] ================== xe_ccs_migrate_kunit  ===================
[03:07:23] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[03:07:23] ==================== xe_bo_evict_kunit  ====================
[03:07:23] =============== [SKIPPED] xe_bo_evict_kunit ================
[03:07:23] ===================== [SKIPPED] xe_bo ======================
[03:07:23] ==================== args (11 subtests) ====================
[03:07:23] [PASSED] count_args_test
[03:07:23] [PASSED] call_args_example
[03:07:23] [PASSED] call_args_test
[03:07:23] [PASSED] drop_first_arg_example
[03:07:23] [PASSED] drop_first_arg_test
[03:07:23] [PASSED] first_arg_example
[03:07:23] [PASSED] first_arg_test
[03:07:23] [PASSED] last_arg_example
[03:07:23] [PASSED] last_arg_test
[03:07:23] [PASSED] pick_arg_example
[03:07:23] [PASSED] sep_comma_example
[03:07:23] ====================== [PASSED] args =======================
[03:07:23] =================== xe_pci (3 subtests) ====================
[03:07:23] ==================== check_graphics_ip  ====================
[03:07:23] [PASSED] 12.00 Xe_LP
[03:07:23] [PASSED] 12.10 Xe_LP+
[03:07:23] [PASSED] 12.55 Xe_HPG
[03:07:23] [PASSED] 12.60 Xe_HPC
[03:07:23] [PASSED] 12.70 Xe_LPG
[03:07:23] [PASSED] 12.71 Xe_LPG
[03:07:23] [PASSED] 12.74 Xe_LPG+
[03:07:23] [PASSED] 20.01 Xe2_HPG
[03:07:23] [PASSED] 20.02 Xe2_HPG
[03:07:23] [PASSED] 20.04 Xe2_LPG
[03:07:23] [PASSED] 30.00 Xe3_LPG
[03:07:23] [PASSED] 30.01 Xe3_LPG
[03:07:23] [PASSED] 30.03 Xe3_LPG
[03:07:23] ================ [PASSED] check_graphics_ip ================
[03:07:23] ===================== check_media_ip  ======================
[03:07:23] [PASSED] 12.00 Xe_M
[03:07:23] [PASSED] 12.55 Xe_HPM
[03:07:23] [PASSED] 13.00 Xe_LPM+
[03:07:23] [PASSED] 13.01 Xe2_HPM
[03:07:23] [PASSED] 20.00 Xe2_LPM
[03:07:23] [PASSED] 30.00 Xe3_LPM
[03:07:23] [PASSED] 30.02 Xe3_LPM
[03:07:23] ================= [PASSED] check_media_ip ==================
[03:07:23] ================= check_platform_gt_count  =================
[03:07:23] [PASSED] 0x9A60 (TIGERLAKE)
[03:07:23] [PASSED] 0x9A68 (TIGERLAKE)
[03:07:23] [PASSED] 0x9A70 (TIGERLAKE)
[03:07:23] [PASSED] 0x9A40 (TIGERLAKE)
[03:07:23] [PASSED] 0x9A49 (TIGERLAKE)
[03:07:23] [PASSED] 0x9A59 (TIGERLAKE)
[03:07:23] [PASSED] 0x9A78 (TIGERLAKE)
[03:07:23] [PASSED] 0x9AC0 (TIGERLAKE)
[03:07:23] [PASSED] 0x9AC9 (TIGERLAKE)
[03:07:23] [PASSED] 0x9AD9 (TIGERLAKE)
[03:07:23] [PASSED] 0x9AF8 (TIGERLAKE)
[03:07:23] [PASSED] 0x4C80 (ROCKETLAKE)
[03:07:23] [PASSED] 0x4C8A (ROCKETLAKE)
[03:07:23] [PASSED] 0x4C8B (ROCKETLAKE)
[03:07:23] [PASSED] 0x4C8C (ROCKETLAKE)
[03:07:23] [PASSED] 0x4C90 (ROCKETLAKE)
[03:07:23] [PASSED] 0x4C9A (ROCKETLAKE)
[03:07:23] [PASSED] 0x4680 (ALDERLAKE_S)
[03:07:23] [PASSED] 0x4682 (ALDERLAKE_S)
[03:07:23] [PASSED] 0x4688 (ALDERLAKE_S)
[03:07:23] [PASSED] 0x468A (ALDERLAKE_S)
[03:07:23] [PASSED] 0x468B (ALDERLAKE_S)
[03:07:23] [PASSED] 0x4690 (ALDERLAKE_S)
[03:07:23] [PASSED] 0x4692 (ALDERLAKE_S)
[03:07:23] [PASSED] 0x4693 (ALDERLAKE_S)
[03:07:23] [PASSED] 0x46A0 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46A1 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46A2 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46A3 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46A6 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46A8 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46AA (ALDERLAKE_P)
[03:07:23] [PASSED] 0x462A (ALDERLAKE_P)
[03:07:23] [PASSED] 0x4626 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x4628 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46B0 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46B1 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46B2 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46B3 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46C0 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46C1 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46C2 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46C3 (ALDERLAKE_P)
[03:07:23] [PASSED] 0x46D0 (ALDERLAKE_N)
[03:07:23] [PASSED] 0x46D1 (ALDERLAKE_N)
[03:07:23] [PASSED] 0x46D2 (ALDERLAKE_N)
[03:07:23] [PASSED] 0x46D3 (ALDERLAKE_N)
[03:07:23] [PASSED] 0x46D4 (ALDERLAKE_N)
[03:07:23] [PASSED] 0xA721 (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7A1 (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7A9 (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7AC (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7AD (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA720 (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7A0 (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7A8 (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7AA (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA7AB (ALDERLAKE_P)
[03:07:23] [PASSED] 0xA780 (ALDERLAKE_S)
[03:07:23] [PASSED] 0xA781 (ALDERLAKE_S)
[03:07:23] [PASSED] 0xA782 (ALDERLAKE_S)
[03:07:23] [PASSED] 0xA783 (ALDERLAKE_S)
[03:07:23] [PASSED] 0xA788 (ALDERLAKE_S)
[03:07:23] [PASSED] 0xA789 (ALDERLAKE_S)
[03:07:23] [PASSED] 0xA78A (ALDERLAKE_S)
[03:07:23] [PASSED] 0xA78B (ALDERLAKE_S)
[03:07:23] [PASSED] 0x4905 (DG1)
[03:07:23] [PASSED] 0x4906 (DG1)
[03:07:23] [PASSED] 0x4907 (DG1)
[03:07:23] [PASSED] 0x4908 (DG1)
[03:07:23] [PASSED] 0x4909 (DG1)
[03:07:23] [PASSED] 0x56C0 (DG2)
[03:07:23] [PASSED] 0x56C2 (DG2)
[03:07:23] [PASSED] 0x56C1 (DG2)
[03:07:23] [PASSED] 0x7D51 (METEORLAKE)
[03:07:23] [PASSED] 0x7DD1 (METEORLAKE)
[03:07:23] [PASSED] 0x7D41 (METEORLAKE)
[03:07:23] [PASSED] 0x7D67 (METEORLAKE)
[03:07:23] [PASSED] 0xB640 (METEORLAKE)
[03:07:23] [PASSED] 0x56A0 (DG2)
[03:07:23] [PASSED] 0x56A1 (DG2)
[03:07:23] [PASSED] 0x56A2 (DG2)
[03:07:23] [PASSED] 0x56BE (DG2)
[03:07:23] [PASSED] 0x56BF (DG2)
[03:07:23] [PASSED] 0x5690 (DG2)
[03:07:23] [PASSED] 0x5691 (DG2)
[03:07:23] [PASSED] 0x5692 (DG2)
[03:07:23] [PASSED] 0x56A5 (DG2)
[03:07:23] [PASSED] 0x56A6 (DG2)
[03:07:23] [PASSED] 0x56B0 (DG2)
[03:07:23] [PASSED] 0x56B1 (DG2)
[03:07:23] [PASSED] 0x56BA (DG2)
[03:07:23] [PASSED] 0x56BB (DG2)
[03:07:23] [PASSED] 0x56BC (DG2)
[03:07:23] [PASSED] 0x56BD (DG2)
[03:07:23] [PASSED] 0x5693 (DG2)
[03:07:23] [PASSED] 0x5694 (DG2)
[03:07:23] [PASSED] 0x5695 (DG2)
[03:07:23] [PASSED] 0x56A3 (DG2)
[03:07:23] [PASSED] 0x56A4 (DG2)
[03:07:23] [PASSED] 0x56B2 (DG2)
[03:07:23] [PASSED] 0x56B3 (DG2)
[03:07:23] [PASSED] 0x5696 (DG2)
[03:07:23] [PASSED] 0x5697 (DG2)
[03:07:23] [PASSED] 0xB69 (PVC)
[03:07:23] [PASSED] 0xB6E (PVC)
[03:07:23] [PASSED] 0xBD4 (PVC)
[03:07:23] [PASSED] 0xBD5 (PVC)
[03:07:23] [PASSED] 0xBD6 (PVC)
[03:07:23] [PASSED] 0xBD7 (PVC)
[03:07:23] [PASSED] 0xBD8 (PVC)
[03:07:23] [PASSED] 0xBD9 (PVC)
[03:07:23] [PASSED] 0xBDA (PVC)
[03:07:23] [PASSED] 0xBDB (PVC)
[03:07:23] [PASSED] 0xBE0 (PVC)
[03:07:23] [PASSED] 0xBE1 (PVC)
[03:07:23] [PASSED] 0xBE5 (PVC)
[03:07:23] [PASSED] 0x7D40 (METEORLAKE)
[03:07:23] [PASSED] 0x7D45 (METEORLAKE)
[03:07:23] [PASSED] 0x7D55 (METEORLAKE)
[03:07:23] [PASSED] 0x7D60 (METEORLAKE)
[03:07:23] [PASSED] 0x7DD5 (METEORLAKE)
[03:07:23] [PASSED] 0x6420 (LUNARLAKE)
[03:07:23] [PASSED] 0x64A0 (LUNARLAKE)
[03:07:23] [PASSED] 0x64B0 (LUNARLAKE)
[03:07:23] [PASSED] 0xE202 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE209 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE20B (BATTLEMAGE)
[03:07:23] [PASSED] 0xE20C (BATTLEMAGE)
[03:07:23] [PASSED] 0xE20D (BATTLEMAGE)
[03:07:23] [PASSED] 0xE210 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE211 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE212 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE216 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE220 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE221 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE222 (BATTLEMAGE)
[03:07:23] [PASSED] 0xE223 (BATTLEMAGE)
[03:07:23] [PASSED] 0xB080 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB081 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB082 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB083 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB084 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB085 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB086 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB087 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB08F (PANTHERLAKE)
[03:07:23] [PASSED] 0xB090 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB0A0 (PANTHERLAKE)
[03:07:23] [PASSED] 0xB0B0 (PANTHERLAKE)
[03:07:23] [PASSED] 0xFD80 (PANTHERLAKE)
[03:07:23] [PASSED] 0xFD81 (PANTHERLAKE)
[03:07:23] ============= [PASSED] check_platform_gt_count =============
[03:07:23] ===================== [PASSED] xe_pci ======================
[03:07:23] =================== xe_rtp (2 subtests) ====================
[03:07:23] =============== xe_rtp_process_to_sr_tests  ================
[03:07:23] [PASSED] coalesce-same-reg
[03:07:23] [PASSED] no-match-no-add
[03:07:23] [PASSED] match-or
[03:07:23] [PASSED] match-or-xfail
[03:07:23] [PASSED] no-match-no-add-multiple-rules
[03:07:23] [PASSED] two-regs-two-entries
[03:07:23] [PASSED] clr-one-set-other
[03:07:23] [PASSED] set-field
[03:07:23] [PASSED] conflict-duplicate
[03:07:23] [PASSED] conflict-not-disjoint
[03:07:23] [PASSED] conflict-reg-type
[03:07:23] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[03:07:23] ================== xe_rtp_process_tests  ===================
[03:07:23] [PASSED] active1
[03:07:23] [PASSED] active2
[03:07:23] [PASSED] active-inactive
[03:07:23] [PASSED] inactive-active
[03:07:23] [PASSED] inactive-1st_or_active-inactive
[03:07:23] [PASSED] inactive-2nd_or_active-inactive
[03:07:23] [PASSED] inactive-last_or_active-inactive
[03:07:23] [PASSED] inactive-no_or_active-inactive
[03:07:23] ============== [PASSED] xe_rtp_process_tests ===============
[03:07:23] ===================== [PASSED] xe_rtp ======================
[03:07:23] ==================== xe_wa (1 subtest) =====================
[03:07:23] ======================== xe_wa_gt  =========================
[03:07:23] [PASSED] TIGERLAKE B0
[03:07:23] [PASSED] DG1 A0
[03:07:23] [PASSED] DG1 B0
[03:07:23] [PASSED] ALDERLAKE_S A0
[03:07:23] [PASSED] ALDERLAKE_S B0
[03:07:23] [PASSED] ALDERLAKE_S C0
[03:07:23] [PASSED] ALDERLAKE_S D0
[03:07:23] [PASSED] ALDERLAKE_P A0
[03:07:23] [PASSED] ALDERLAKE_P B0
[03:07:23] [PASSED] ALDERLAKE_P C0
[03:07:23] [PASSED] ALDERLAKE_S RPLS D0
[03:07:23] [PASSED] ALDERLAKE_P RPLU E0
[03:07:23] [PASSED] DG2 G10 C0
[03:07:23] [PASSED] DG2 G11 B1
[03:07:23] [PASSED] DG2 G12 A1
[03:07:23] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[03:07:23] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[03:07:23] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[03:07:23] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[03:07:23] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[03:07:23] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[03:07:23] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[03:07:23] ==================== [PASSED] xe_wa_gt =====================
[03:07:23] ====================== [PASSED] xe_wa ======================
[03:07:23] ============================================================
[03:07:23] Testing complete. Ran 306 tests: passed: 288, skipped: 18
[03:07:23] Elapsed time: 33.543s total, 4.249s configuring, 28.928s building, 0.319s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[03:07:23] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:07:25] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[03:07:48] Starting KUnit Kernel (1/1)...
[03:07:48] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[03:07:48] ============ drm_test_pick_cmdline (2 subtests) ============
[03:07:48] [PASSED] drm_test_pick_cmdline_res_1920_1080_60
[03:07:48] =============== drm_test_pick_cmdline_named  ===============
[03:07:48] [PASSED] NTSC
[03:07:48] [PASSED] NTSC-J
[03:07:48] [PASSED] PAL
[03:07:48] [PASSED] PAL-M
[03:07:48] =========== [PASSED] drm_test_pick_cmdline_named ===========
[03:07:48] ============== [PASSED] drm_test_pick_cmdline ==============
[03:07:48] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[03:07:48] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[03:07:48] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[03:07:48] =========== drm_validate_clone_mode (2 subtests) ===========
[03:07:48] ============== drm_test_check_in_clone_mode  ===============
[03:07:48] [PASSED] in_clone_mode
[03:07:48] [PASSED] not_in_clone_mode
[03:07:48] ========== [PASSED] drm_test_check_in_clone_mode ===========
[03:07:48] =============== drm_test_check_valid_clones  ===============
[03:07:48] [PASSED] not_in_clone_mode
[03:07:48] [PASSED] valid_clone
[03:07:48] [PASSED] invalid_clone
[03:07:48] =========== [PASSED] drm_test_check_valid_clones ===========
[03:07:48] ============= [PASSED] drm_validate_clone_mode =============
[03:07:48] ============= drm_validate_modeset (1 subtest) =============
[03:07:48] [PASSED] drm_test_check_connector_changed_modeset
[03:07:48] ============== [PASSED] drm_validate_modeset ===============
[03:07:48] ====== drm_test_bridge_get_current_state (2 subtests) ======
[03:07:48] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[03:07:48] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[03:07:48] ======== [PASSED] drm_test_bridge_get_current_state ========
[03:07:48] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[03:07:48] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[03:07:48] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[03:07:48] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[03:07:48] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[03:07:48] ============== drm_bridge_alloc (2 subtests) ===============
[03:07:48] [PASSED] drm_test_drm_bridge_alloc_basic
[03:07:48] [PASSED] drm_test_drm_bridge_alloc_get_put
[03:07:48] ================ [PASSED] drm_bridge_alloc =================
[03:07:48] ================== drm_buddy (7 subtests) ==================
[03:07:48] [PASSED] drm_test_buddy_alloc_limit
[03:07:48] [PASSED] drm_test_buddy_alloc_optimistic
[03:07:48] [PASSED] drm_test_buddy_alloc_pessimistic
[03:07:48] [PASSED] drm_test_buddy_alloc_pathological
[03:07:48] [PASSED] drm_test_buddy_alloc_contiguous
[03:07:48] [PASSED] drm_test_buddy_alloc_clear
[03:07:48] [PASSED] drm_test_buddy_alloc_range_bias
[03:07:48] ==================== [PASSED] drm_buddy ====================
[03:07:48] ============= drm_cmdline_parser (40 subtests) =============
[03:07:48] [PASSED] drm_test_cmdline_force_d_only
[03:07:48] [PASSED] drm_test_cmdline_force_D_only_dvi
[03:07:48] [PASSED] drm_test_cmdline_force_D_only_hdmi
[03:07:48] [PASSED] drm_test_cmdline_force_D_only_not_digital
[03:07:48] [PASSED] drm_test_cmdline_force_e_only
[03:07:48] [PASSED] drm_test_cmdline_res
[03:07:48] [PASSED] drm_test_cmdline_res_vesa
[03:07:48] [PASSED] drm_test_cmdline_res_vesa_rblank
[03:07:48] [PASSED] drm_test_cmdline_res_rblank
[03:07:48] [PASSED] drm_test_cmdline_res_bpp
[03:07:48] [PASSED] drm_test_cmdline_res_refresh
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[03:07:48] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[03:07:48] [PASSED] drm_test_cmdline_res_margins_force_on
[03:07:48] [PASSED] drm_test_cmdline_res_vesa_margins
[03:07:48] [PASSED] drm_test_cmdline_name
[03:07:48] [PASSED] drm_test_cmdline_name_bpp
[03:07:48] [PASSED] drm_test_cmdline_name_option
[03:07:48] [PASSED] drm_test_cmdline_name_bpp_option
[03:07:48] [PASSED] drm_test_cmdline_rotate_0
[03:07:48] [PASSED] drm_test_cmdline_rotate_90
[03:07:48] [PASSED] drm_test_cmdline_rotate_180
[03:07:48] [PASSED] drm_test_cmdline_rotate_270
[03:07:48] [PASSED] drm_test_cmdline_hmirror
[03:07:48] [PASSED] drm_test_cmdline_vmirror
[03:07:48] [PASSED] drm_test_cmdline_margin_options
[03:07:48] [PASSED] drm_test_cmdline_multiple_options
[03:07:48] [PASSED] drm_test_cmdline_bpp_extra_and_option
[03:07:48] [PASSED] drm_test_cmdline_extra_and_option
[03:07:48] [PASSED] drm_test_cmdline_freestanding_options
[03:07:48] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[03:07:48] [PASSED] drm_test_cmdline_panel_orientation
[03:07:48] ================ drm_test_cmdline_invalid  =================
[03:07:48] [PASSED] margin_only
[03:07:48] [PASSED] interlace_only
[03:07:48] [PASSED] res_missing_x
[03:07:48] [PASSED] res_missing_y
[03:07:48] [PASSED] res_bad_y
[03:07:48] [PASSED] res_missing_y_bpp
[03:07:48] [PASSED] res_bad_bpp
[03:07:48] [PASSED] res_bad_refresh
[03:07:48] [PASSED] res_bpp_refresh_force_on_off
[03:07:48] [PASSED] res_invalid_mode
[03:07:48] [PASSED] res_bpp_wrong_place_mode
[03:07:48] [PASSED] name_bpp_refresh
[03:07:48] [PASSED] name_refresh
[03:07:48] [PASSED] name_refresh_wrong_mode
[03:07:48] [PASSED] name_refresh_invalid_mode
[03:07:48] [PASSED] rotate_multiple
[03:07:48] [PASSED] rotate_invalid_val
[03:07:48] [PASSED] rotate_truncated
[03:07:48] [PASSED] invalid_option
[03:07:48] [PASSED] invalid_tv_option
[03:07:48] [PASSED] truncated_tv_option
[03:07:48] ============ [PASSED] drm_test_cmdline_invalid =============
[03:07:48] =============== drm_test_cmdline_tv_options  ===============
[03:07:48] [PASSED] NTSC
[03:07:48] [PASSED] NTSC_443
[03:07:48] [PASSED] NTSC_J
[03:07:48] [PASSED] PAL
[03:07:48] [PASSED] PAL_M
[03:07:48] [PASSED] PAL_N
[03:07:48] [PASSED] SECAM
[03:07:48] [PASSED] MONO_525
[03:07:48] [PASSED] MONO_625
[03:07:48] =========== [PASSED] drm_test_cmdline_tv_options ===========
[03:07:48] =============== [PASSED] drm_cmdline_parser ================
[03:07:48] ========== drmm_connector_hdmi_init (20 subtests) ==========
[03:07:48] [PASSED] drm_test_connector_hdmi_init_valid
[03:07:48] [PASSED] drm_test_connector_hdmi_init_bpc_8
[03:07:48] [PASSED] drm_test_connector_hdmi_init_bpc_10
[03:07:48] [PASSED] drm_test_connector_hdmi_init_bpc_12
[03:07:48] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[03:07:48] [PASSED] drm_test_connector_hdmi_init_bpc_null
[03:07:48] [PASSED] drm_test_connector_hdmi_init_formats_empty
[03:07:48] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[03:07:48] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[03:07:48] [PASSED] supported_formats=0x9 yuv420_allowed=1
[03:07:48] [PASSED] supported_formats=0x9 yuv420_allowed=0
[03:07:48] [PASSED] supported_formats=0x3 yuv420_allowed=1
[03:07:48] [PASSED] supported_formats=0x3 yuv420_allowed=0
[03:07:48] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[03:07:48] [PASSED] drm_test_connector_hdmi_init_null_ddc
[03:07:48] [PASSED] drm_test_connector_hdmi_init_null_product
[03:07:48] [PASSED] drm_test_connector_hdmi_init_null_vendor
[03:07:48] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[03:07:48] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[03:07:48] [PASSED] drm_test_connector_hdmi_init_product_valid
[03:07:48] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[03:07:48] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[03:07:48] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[03:07:48] ========= drm_test_connector_hdmi_init_type_valid  =========
[03:07:48] [PASSED] HDMI-A
[03:07:48] [PASSED] HDMI-B
[03:07:48] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[03:07:48] ======== drm_test_connector_hdmi_init_type_invalid  ========
[03:07:48] [PASSED] Unknown
[03:07:48] [PASSED] VGA
[03:07:48] [PASSED] DVI-I
[03:07:48] [PASSED] DVI-D
[03:07:48] [PASSED] DVI-A
[03:07:48] [PASSED] Composite
[03:07:48] [PASSED] SVIDEO
[03:07:48] [PASSED] LVDS
[03:07:48] [PASSED] Component
[03:07:48] [PASSED] DIN
[03:07:48] [PASSED] DP
[03:07:48] [PASSED] TV
[03:07:48] [PASSED] eDP
[03:07:48] [PASSED] Virtual
[03:07:48] [PASSED] DSI
[03:07:48] [PASSED] DPI
[03:07:48] [PASSED] Writeback
[03:07:48] [PASSED] SPI
[03:07:48] [PASSED] USB
[03:07:48] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[03:07:48] ============ [PASSED] drmm_connector_hdmi_init =============
[03:07:48] ============= drmm_connector_init (3 subtests) =============
[03:07:48] [PASSED] drm_test_drmm_connector_init
[03:07:48] [PASSED] drm_test_drmm_connector_init_null_ddc
[03:07:48] ========= drm_test_drmm_connector_init_type_valid  =========
[03:07:48] [PASSED] Unknown
[03:07:48] [PASSED] VGA
[03:07:48] [PASSED] DVI-I
[03:07:48] [PASSED] DVI-D
[03:07:48] [PASSED] DVI-A
[03:07:48] [PASSED] Composite
[03:07:48] [PASSED] SVIDEO
[03:07:48] [PASSED] LVDS
[03:07:48] [PASSED] Component
[03:07:48] [PASSED] DIN
[03:07:48] [PASSED] DP
[03:07:48] [PASSED] HDMI-A
[03:07:48] [PASSED] HDMI-B
[03:07:48] [PASSED] TV
[03:07:48] [PASSED] eDP
[03:07:48] [PASSED] Virtual
[03:07:48] [PASSED] DSI
[03:07:48] [PASSED] DPI
[03:07:48] [PASSED] Writeback
[03:07:48] [PASSED] SPI
[03:07:48] [PASSED] USB
[03:07:48] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[03:07:48] =============== [PASSED] drmm_connector_init ===============
[03:07:48] ========= drm_connector_dynamic_init (6 subtests) ==========
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_init
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_init_properties
[03:07:48] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[03:07:48] [PASSED] Unknown
[03:07:48] [PASSED] VGA
[03:07:48] [PASSED] DVI-I
[03:07:48] [PASSED] DVI-D
[03:07:48] [PASSED] DVI-A
[03:07:48] [PASSED] Composite
[03:07:48] [PASSED] SVIDEO
[03:07:48] [PASSED] LVDS
[03:07:48] [PASSED] Component
[03:07:48] [PASSED] DIN
[03:07:48] [PASSED] DP
[03:07:48] [PASSED] HDMI-A
[03:07:48] [PASSED] HDMI-B
[03:07:48] [PASSED] TV
[03:07:48] [PASSED] eDP
[03:07:48] [PASSED] Virtual
[03:07:48] [PASSED] DSI
[03:07:48] [PASSED] DPI
[03:07:48] [PASSED] Writeback
[03:07:48] [PASSED] SPI
[03:07:48] [PASSED] USB
[03:07:48] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[03:07:48] ======== drm_test_drm_connector_dynamic_init_name  =========
[03:07:48] [PASSED] Unknown
[03:07:48] [PASSED] VGA
[03:07:48] [PASSED] DVI-I
[03:07:48] [PASSED] DVI-D
[03:07:48] [PASSED] DVI-A
[03:07:48] [PASSED] Composite
[03:07:48] [PASSED] SVIDEO
[03:07:48] [PASSED] LVDS
[03:07:48] [PASSED] Component
[03:07:48] [PASSED] DIN
[03:07:48] [PASSED] DP
[03:07:48] [PASSED] HDMI-A
[03:07:48] [PASSED] HDMI-B
[03:07:48] [PASSED] TV
[03:07:48] [PASSED] eDP
[03:07:48] [PASSED] Virtual
[03:07:48] [PASSED] DSI
[03:07:48] [PASSED] DPI
[03:07:48] [PASSED] Writeback
[03:07:48] [PASSED] SPI
[03:07:48] [PASSED] USB
[03:07:48] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[03:07:48] =========== [PASSED] drm_connector_dynamic_init ============
[03:07:48] ==== drm_connector_dynamic_register_early (4 subtests) =====
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[03:07:48] ====== [PASSED] drm_connector_dynamic_register_early =======
[03:07:48] ======= drm_connector_dynamic_register (7 subtests) ========
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[03:07:48] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[03:07:48] ========= [PASSED] drm_connector_dynamic_register ==========
[03:07:48] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[03:07:48] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[03:07:48] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[03:07:48] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[03:07:48] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[03:07:48] ========== drm_test_get_tv_mode_from_name_valid  ===========
[03:07:48] [PASSED] NTSC
[03:07:48] [PASSED] NTSC-443
[03:07:48] [PASSED] NTSC-J
[03:07:48] [PASSED] PAL
[03:07:48] [PASSED] PAL-M
[03:07:48] [PASSED] PAL-N
[03:07:48] [PASSED] SECAM
[03:07:48] [PASSED] Mono
[03:07:48] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[03:07:48] [PASSED] drm_test_get_tv_mode_from_name_truncated
[03:07:48] ============ [PASSED] drm_get_tv_mode_from_name ============
[03:07:48] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[03:07:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[03:07:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[03:07:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[03:07:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[03:07:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[03:07:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[03:07:48] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[03:07:48] [PASSED] VIC 96
[03:07:48] [PASSED] VIC 97
[03:07:48] [PASSED] VIC 101
[03:07:48] [PASSED] VIC 102
[03:07:48] [PASSED] VIC 106
[03:07:48] [PASSED] VIC 107
[03:07:48] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[03:07:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[03:07:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[03:07:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[03:07:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[03:07:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[03:07:48] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[03:07:48] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[03:07:48] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[03:07:48] [PASSED] Automatic
[03:07:48] [PASSED] Full
[03:07:48] [PASSED] Limited 16:235
[03:07:48] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[03:07:48] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[03:07:48] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[03:07:48] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[03:07:48] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[03:07:48] [PASSED] RGB
[03:07:48] [PASSED] YUV 4:2:0
[03:07:48] [PASSED] YUV 4:2:2
[03:07:48] [PASSED] YUV 4:4:4
[03:07:48] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[03:07:48] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[03:07:48] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[03:07:48] ============= drm_damage_helper (21 subtests) ==============
[03:07:48] [PASSED] drm_test_damage_iter_no_damage
[03:07:48] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[03:07:48] [PASSED] drm_test_damage_iter_no_damage_src_moved
[03:07:48] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[03:07:48] [PASSED] drm_test_damage_iter_no_damage_not_visible
[03:07:48] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[03:07:48] [PASSED] drm_test_damage_iter_no_damage_no_fb
[03:07:48] [PASSED] drm_test_damage_iter_simple_damage
[03:07:48] [PASSED] drm_test_damage_iter_single_damage
[03:07:48] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[03:07:48] [PASSED] drm_test_damage_iter_single_damage_outside_src
[03:07:48] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[03:07:48] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[03:07:48] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[03:07:48] [PASSED] drm_test_damage_iter_single_damage_src_moved
[03:07:48] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[03:07:48] [PASSED] drm_test_damage_iter_damage
[03:07:48] [PASSED] drm_test_damage_iter_damage_one_intersect
[03:07:48] [PASSED] drm_test_damage_iter_damage_one_outside
[03:07:48] [PASSED] drm_test_damage_iter_damage_src_moved
[03:07:48] [PASSED] drm_test_damage_iter_damage_not_visible
[03:07:48] ================ [PASSED] drm_damage_helper ================
[03:07:48] ============== drm_dp_mst_helper (3 subtests) ==============
[03:07:48] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[03:07:48] [PASSED] Clock 154000 BPP 30 DSC disabled
[03:07:48] [PASSED] Clock 234000 BPP 30 DSC disabled
[03:07:48] [PASSED] Clock 297000 BPP 24 DSC disabled
[03:07:48] [PASSED] Clock 332880 BPP 24 DSC enabled
[03:07:48] [PASSED] Clock 324540 BPP 24 DSC enabled
[03:07:48] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[03:07:48] ============== drm_test_dp_mst_calc_pbn_div  ===============
[03:07:48] [PASSED] Link rate 2000000 lane count 4
[03:07:48] [PASSED] Link rate 2000000 lane count 2
[03:07:48] [PASSED] Link rate 2000000 lane count 1
[03:07:48] [PASSED] Link rate 1350000 lane count 4
[03:07:48] [PASSED] Link rate 1350000 lane count 2
[03:07:48] [PASSED] Link rate 1350000 lane count 1
[03:07:48] [PASSED] Link rate 1000000 lane count 4
[03:07:48] [PASSED] Link rate 1000000 lane count 2
[03:07:48] [PASSED] Link rate 1000000 lane count 1
[03:07:48] [PASSED] Link rate 810000 lane count 4
[03:07:48] [PASSED] Link rate 810000 lane count 2
[03:07:48] [PASSED] Link rate 810000 lane count 1
[03:07:48] [PASSED] Link rate 540000 lane count 4
[03:07:48] [PASSED] Link rate 540000 lane count 2
[03:07:48] [PASSED] Link rate 540000 lane count 1
[03:07:48] [PASSED] Link rate 270000 lane count 4
[03:07:48] [PASSED] Link rate 270000 lane count 2
[03:07:48] [PASSED] Link rate 270000 lane count 1
[03:07:48] [PASSED] Link rate 162000 lane count 4
[03:07:48] [PASSED] Link rate 162000 lane count 2
[03:07:48] [PASSED] Link rate 162000 lane count 1
[03:07:48] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[03:07:48] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[03:07:48] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[03:07:48] [PASSED] DP_POWER_UP_PHY with port number
[03:07:48] [PASSED] DP_POWER_DOWN_PHY with port number
[03:07:48] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[03:07:48] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[03:07:48] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[03:07:48] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[03:07:48] [PASSED] DP_QUERY_PAYLOAD with port number
[03:07:48] [PASSED] DP_QUERY_PAYLOAD with VCPI
[03:07:48] [PASSED] DP_REMOTE_DPCD_READ with port number
[03:07:48] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[03:07:48] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[03:07:48] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[03:07:48] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[03:07:48] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[03:07:48] [PASSED] DP_REMOTE_I2C_READ with port number
[03:07:48] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[03:07:48] [PASSED] DP_REMOTE_I2C_READ with transactions array
[03:07:48] [PASSED] DP_REMOTE_I2C_WRITE with port number
[03:07:48] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[03:07:48] [PASSED] DP_REMOTE_I2C_WRITE with data array
[03:07:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[03:07:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[03:07:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[03:07:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[03:07:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[03:07:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[03:07:48] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[03:07:48] ================ [PASSED] drm_dp_mst_helper ================
[03:07:48] ================== drm_exec (7 subtests) ===================
[03:07:48] [PASSED] sanitycheck
[03:07:48] [PASSED] test_lock
[03:07:48] [PASSED] test_lock_unlock
[03:07:48] [PASSED] test_duplicates
[03:07:48] [PASSED] test_prepare
[03:07:48] [PASSED] test_prepare_array
[03:07:48] [PASSED] test_multiple_loops
[03:07:48] ==================== [PASSED] drm_exec =====================
[03:07:48] =========== drm_format_helper_test (17 subtests) ===========
[03:07:48] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[03:07:48] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[03:07:48] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[03:07:48] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[03:07:48] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[03:07:48] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[03:07:48] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[03:07:48] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[03:07:48] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[03:07:48] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[03:07:48] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[03:07:48] ============== drm_test_fb_xrgb8888_to_mono  ===============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[03:07:48] ==================== drm_test_fb_swab  =====================
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ================ [PASSED] drm_test_fb_swab =================
[03:07:48] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[03:07:48] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[03:07:48] [PASSED] single_pixel_source_buffer
[03:07:48] [PASSED] single_pixel_clip_rectangle
[03:07:48] [PASSED] well_known_colors
[03:07:48] [PASSED] destination_pitch
[03:07:48] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[03:07:48] ================= drm_test_fb_clip_offset  =================
[03:07:48] [PASSED] pass through
[03:07:48] [PASSED] horizontal offset
[03:07:48] [PASSED] vertical offset
[03:07:48] [PASSED] horizontal and vertical offset
[03:07:48] [PASSED] horizontal offset (custom pitch)
[03:07:48] [PASSED] vertical offset (custom pitch)
[03:07:48] [PASSED] horizontal and vertical offset (custom pitch)
[03:07:48] ============= [PASSED] drm_test_fb_clip_offset =============
[03:07:48] =================== drm_test_fb_memcpy  ====================
[03:07:48] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[03:07:48] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[03:07:48] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[03:07:48] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[03:07:48] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[03:07:48] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[03:07:48] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[03:07:48] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[03:07:48] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[03:07:48] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[03:07:48] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[03:07:48] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[03:07:48] =============== [PASSED] drm_test_fb_memcpy ================
[03:07:48] ============= [PASSED] drm_format_helper_test ==============
[03:07:48] ================= drm_format (18 subtests) =================
[03:07:48] [PASSED] drm_test_format_block_width_invalid
[03:07:48] [PASSED] drm_test_format_block_width_one_plane
[03:07:48] [PASSED] drm_test_format_block_width_two_plane
[03:07:48] [PASSED] drm_test_format_block_width_three_plane
[03:07:48] [PASSED] drm_test_format_block_width_tiled
[03:07:48] [PASSED] drm_test_format_block_height_invalid
[03:07:48] [PASSED] drm_test_format_block_height_one_plane
[03:07:48] [PASSED] drm_test_format_block_height_two_plane
[03:07:48] [PASSED] drm_test_format_block_height_three_plane
[03:07:48] [PASSED] drm_test_format_block_height_tiled
[03:07:48] [PASSED] drm_test_format_min_pitch_invalid
[03:07:48] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[03:07:48] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[03:07:48] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[03:07:48] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[03:07:48] [PASSED] drm_test_format_min_pitch_two_plane
[03:07:48] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[03:07:48] [PASSED] drm_test_format_min_pitch_tiled
[03:07:48] =================== [PASSED] drm_format ====================
[03:07:48] ============== drm_framebuffer (10 subtests) ===============
[03:07:48] ========== drm_test_framebuffer_check_src_coords  ==========
[03:07:48] [PASSED] Success: source fits into fb
[03:07:48] [PASSED] Fail: overflowing fb with x-axis coordinate
[03:07:48] [PASSED] Fail: overflowing fb with y-axis coordinate
[03:07:48] [PASSED] Fail: overflowing fb with source width
[03:07:48] [PASSED] Fail: overflowing fb with source height
[03:07:48] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[03:07:48] [PASSED] drm_test_framebuffer_cleanup
[03:07:48] =============== drm_test_framebuffer_create  ===============
[03:07:48] [PASSED] ABGR8888 normal sizes
[03:07:48] [PASSED] ABGR8888 max sizes
[03:07:48] [PASSED] ABGR8888 pitch greater than min required
[03:07:48] [PASSED] ABGR8888 pitch less than min required
[03:07:48] [PASSED] ABGR8888 Invalid width
[03:07:48] [PASSED] ABGR8888 Invalid buffer handle
[03:07:48] [PASSED] No pixel format
[03:07:48] [PASSED] ABGR8888 Width 0
[03:07:48] [PASSED] ABGR8888 Height 0
[03:07:48] [PASSED] ABGR8888 Out of bound height * pitch combination
[03:07:48] [PASSED] ABGR8888 Large buffer offset
[03:07:48] [PASSED] ABGR8888 Buffer offset for inexistent plane
[03:07:48] [PASSED] ABGR8888 Invalid flag
[03:07:48] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[03:07:48] [PASSED] ABGR8888 Valid buffer modifier
[03:07:48] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[03:07:48] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[03:07:48] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[03:07:48] [PASSED] NV12 Normal sizes
[03:07:48] [PASSED] NV12 Max sizes
[03:07:48] [PASSED] NV12 Invalid pitch
[03:07:48] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[03:07:48] [PASSED] NV12 different  modifier per-plane
[03:07:48] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[03:07:48] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[03:07:48] [PASSED] NV12 Modifier for inexistent plane
[03:07:48] [PASSED] NV12 Handle for inexistent plane
[03:07:48] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[03:07:48] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[03:07:48] [PASSED] YVU420 Normal sizes
[03:07:48] [PASSED] YVU420 Max sizes
[03:07:48] [PASSED] YVU420 Invalid pitch
[03:07:48] [PASSED] YVU420 Different pitches
[03:07:48] [PASSED] YVU420 Different buffer offsets/pitches
[03:07:48] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[03:07:48] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[03:07:48] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[03:07:48] [PASSED] YVU420 Valid modifier
[03:07:48] [PASSED] YVU420 Different modifiers per plane
[03:07:48] [PASSED] YVU420 Modifier for inexistent plane
[03:07:48] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[03:07:48] [PASSED] X0L2 Normal sizes
[03:07:48] [PASSED] X0L2 Max sizes
[03:07:48] [PASSED] X0L2 Invalid pitch
[03:07:48] [PASSED] X0L2 Pitch greater than minimum required
[03:07:48] [PASSED] X0L2 Handle for inexistent plane
[03:07:48] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[03:07:48] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[03:07:48] [PASSED] X0L2 Valid modifier
[03:07:48] [PASSED] X0L2 Modifier for inexistent plane
[03:07:48] =========== [PASSED] drm_test_framebuffer_create ===========
[03:07:48] [PASSED] drm_test_framebuffer_free
[03:07:48] [PASSED] drm_test_framebuffer_init
[03:07:48] [PASSED] drm_test_framebuffer_init_bad_format
[03:07:48] [PASSED] drm_test_framebuffer_init_dev_mismatch
[03:07:48] [PASSED] drm_test_framebuffer_lookup
[03:07:48] [PASSED] drm_test_framebuffer_lookup_inexistent
[03:07:48] [PASSED] drm_test_framebuffer_modifiers_not_supported
[03:07:48] ================= [PASSED] drm_framebuffer =================
[03:07:48] ================ drm_gem_shmem (8 subtests) ================
[03:07:48] [PASSED] drm_gem_shmem_test_obj_create
[03:07:48] [PASSED] drm_gem_shmem_test_obj_create_private
[03:07:48] [PASSED] drm_gem_shmem_test_pin_pages
[03:07:48] [PASSED] drm_gem_shmem_test_vmap
[03:07:48] [PASSED] drm_gem_shmem_test_get_pages_sgt
[03:07:48] [PASSED] drm_gem_shmem_test_get_sg_table
[03:07:48] [PASSED] drm_gem_shmem_test_madvise
[03:07:48] [PASSED] drm_gem_shmem_test_purge
[03:07:48] ================== [PASSED] drm_gem_shmem ==================
[03:07:48] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[03:07:48] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[03:07:48] [PASSED] Automatic
[03:07:48] [PASSED] Full
[03:07:48] [PASSED] Limited 16:235
[03:07:48] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[03:07:48] [PASSED] drm_test_check_disable_connector
[03:07:48] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[03:07:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[03:07:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[03:07:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[03:07:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[03:07:48] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[03:07:48] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[03:07:48] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[03:07:48] [PASSED] drm_test_check_output_bpc_dvi
[03:07:48] [PASSED] drm_test_check_output_bpc_format_vic_1
[03:07:48] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[03:07:48] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[03:07:48] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[03:07:48] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[03:07:48] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[03:07:48] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[03:07:48] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[03:07:48] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[03:07:48] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[03:07:48] [PASSED] drm_test_check_broadcast_rgb_value
[03:07:48] [PASSED] drm_test_check_bpc_8_value
[03:07:48] [PASSED] drm_test_check_bpc_10_value
[03:07:48] [PASSED] drm_test_check_bpc_12_value
[03:07:48] [PASSED] drm_test_check_format_value
[03:07:48] [PASSED] drm_test_check_tmds_char_value
[03:07:48] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[03:07:48] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[03:07:48] [PASSED] drm_test_check_mode_valid
[03:07:48] [PASSED] drm_test_check_mode_valid_reject
[03:07:48] [PASSED] drm_test_check_mode_valid_reject_rate
[03:07:48] [PASSED] drm_test_check_mode_valid_reject_max_clock
[03:07:48] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[03:07:48] ================= drm_managed (2 subtests) =================
[03:07:48] [PASSED] drm_test_managed_release_action
[03:07:48] [PASSED] drm_test_managed_run_action
[03:07:48] =================== [PASSED] drm_managed ===================
[03:07:48] =================== drm_mm (6 subtests) ====================
[03:07:48] [PASSED] drm_test_mm_init
[03:07:48] [PASSED] drm_test_mm_debug
[03:07:48] [PASSED] drm_test_mm_align32
[03:07:48] [PASSED] drm_test_mm_align64
[03:07:48] [PASSED] drm_test_mm_lowest
[03:07:48] [PASSED] drm_test_mm_highest
[03:07:48] ===================== [PASSED] drm_mm ======================
[03:07:48] ============= drm_modes_analog_tv (5 subtests) =============
[03:07:48] [PASSED] drm_test_modes_analog_tv_mono_576i
[03:07:48] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[03:07:48] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[03:07:48] [PASSED] drm_test_modes_analog_tv_pal_576i
[03:07:48] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[03:07:48] =============== [PASSED] drm_modes_analog_tv ===============
[03:07:48] ============== drm_plane_helper (2 subtests) ===============
[03:07:48] =============== drm_test_check_plane_state  ================
[03:07:48] [PASSED] clipping_simple
[03:07:48] [PASSED] clipping_rotate_reflect
[03:07:48] [PASSED] positioning_simple
[03:07:48] [PASSED] upscaling
[03:07:48] [PASSED] downscaling
[03:07:48] [PASSED] rounding1
[03:07:48] [PASSED] rounding2
[03:07:48] [PASSED] rounding3
[03:07:48] [PASSED] rounding4
[03:07:48] =========== [PASSED] drm_test_check_plane_state ============
[03:07:48] =========== drm_test_check_invalid_plane_state  ============
[03:07:48] [PASSED] positioning_invalid
[03:07:48] [PASSED] upscaling_invalid
[03:07:48] [PASSED] downscaling_invalid
[03:07:48] ======= [PASSED] drm_test_check_invalid_plane_state ========
[03:07:48] ================ [PASSED] drm_plane_helper =================
[03:07:48] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[03:07:48] ====== drm_test_connector_helper_tv_get_modes_check  =======
[03:07:48] [PASSED] None
[03:07:48] [PASSED] PAL
[03:07:48] [PASSED] NTSC
[03:07:48] [PASSED] Both, NTSC Default
[03:07:48] [PASSED] Both, PAL Default
[03:07:48] [PASSED] Both, NTSC Default, with PAL on command-line
[03:07:48] [PASSED] Both, PAL Default, with NTSC on command-line
[03:07:48] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[03:07:48] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[03:07:48] ================== drm_rect (9 subtests) ===================
[03:07:48] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[03:07:48] [PASSED] drm_test_rect_clip_scaled_not_clipped
[03:07:48] [PASSED] drm_test_rect_clip_scaled_clipped
[03:07:48] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[03:07:48] ================= drm_test_rect_intersect  =================
[03:07:48] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[03:07:48] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[03:07:48] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[03:07:48] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[03:07:48] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[03:07:48] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[03:07:48] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[03:07:48] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[03:07:48] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[03:07:48] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[03:07:48] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[03:07:48] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[03:07:48] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[03:07:48] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[03:07:48] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[03:07:48] ============= [PASSED] drm_test_rect_intersect =============
[03:07:48] ================ drm_test_rect_calc_hscale  ================
[03:07:48] [PASSED] normal use
[03:07:48] [PASSED] out of max range
[03:07:48] [PASSED] out of min range
[03:07:48] [PASSED] zero dst
[03:07:48] [PASSED] negative src
[03:07:48] [PASSED] negative dst
[03:07:48] ============ [PASSED] drm_test_rect_calc_hscale ============
[03:07:48] ================ drm_test_rect_calc_vscale  ================
[03:07:48] [PASSED] normal use
[03:07:48] [PASSED] out of max range
[03:07:48] [PASSED] out of min range
[03:07:48] [PASSED] zero dst
[03:07:48] [PASSED] negative src
[03:07:48] [PASSED] negative dst
[03:07:48] ============ [PASSED] drm_test_rect_calc_vscale ============
[03:07:48] ================== drm_test_rect_rotate  ===================
[03:07:48] [PASSED] reflect-x
[03:07:48] [PASSED] reflect-y
[03:07:48] [PASSED] rotate-0
[03:07:48] [PASSED] rotate-90
[03:07:48] [PASSED] rotate-180
[03:07:48] [PASSED] rotate-270
[03:07:48] ============== [PASSED] drm_test_rect_rotate ===============
[03:07:48] ================ drm_test_rect_rotate_inv  =================
[03:07:48] [PASSED] reflect-x
[03:07:48] [PASSED] reflect-y
[03:07:48] [PASSED] rotate-0
[03:07:48] [PASSED] rotate-90
[03:07:49] [PASSED] rotate-180
[03:07:49] [PASSED] rotate-270
[03:07:49] ============ [PASSED] drm_test_rect_rotate_inv =============
[03:07:49] ==================== [PASSED] drm_rect =====================
[03:07:49] ============ drm_sysfb_modeset_test (1 subtest) ============
[03:07:49] ============ drm_test_sysfb_build_fourcc_list  =============
[03:07:49] [PASSED] no native formats
[03:07:49] [PASSED] XRGB8888 as native format
[03:07:49] [PASSED] remove duplicates
[03:07:49] [PASSED] convert alpha formats
[03:07:49] [PASSED] random formats
[03:07:49] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[03:07:49] ============= [PASSED] drm_sysfb_modeset_test ==============
[03:07:49] ============================================================
[03:07:49] Testing complete. Ran 621 tests: passed: 621
[03:07:49] Elapsed time: 25.593s total, 1.721s configuring, 23.705s building, 0.149s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[03:07:49] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:07:50] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[03:08:00] Starting KUnit Kernel (1/1)...
[03:08:00] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
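
For anyone reproducing this run: the --kunitconfig argument passed to
kunit.py above points at a Kconfig fragment that selects just the TTM
test suite. A minimal sketch of such a fragment (illustrative only; the
authoritative contents live in drivers/gpu/drm/ttm/tests/.kunitconfig in
the kernel tree):

  # Sketch of a .kunitconfig fragment enabling the TTM KUnit tests
  CONFIG_KUNIT=y
  CONFIG_DRM=y
  CONFIG_DRM_KUNIT_TEST_HELPERS=y
  CONFIG_DRM_TTM_KUNIT_TEST=y
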
[03:08:00] ================= ttm_device (5 subtests) ==================
[03:08:00] [PASSED] ttm_device_init_basic
[03:08:00] [PASSED] ttm_device_init_multiple
[03:08:00] [PASSED] ttm_device_fini_basic
[03:08:00] [PASSED] ttm_device_init_no_vma_man
[03:08:00] ================== ttm_device_init_pools  ==================
[03:08:00] [PASSED] No DMA allocations, no DMA32 required
[03:08:00] [PASSED] DMA allocations, DMA32 required
[03:08:00] [PASSED] No DMA allocations, DMA32 required
[03:08:00] [PASSED] DMA allocations, no DMA32 required
[03:08:00] ============== [PASSED] ttm_device_init_pools ==============
[03:08:00] =================== [PASSED] ttm_device ====================
[03:08:00] ================== ttm_pool (8 subtests) ===================
[03:08:00] ================== ttm_pool_alloc_basic  ===================
[03:08:00] [PASSED] One page
[03:08:00] [PASSED] More than one page
[03:08:00] [PASSED] Above the allocation limit
[03:08:00] [PASSED] One page, with coherent DMA mappings enabled
[03:08:00] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[03:08:00] ============== [PASSED] ttm_pool_alloc_basic ===============
[03:08:00] ============== ttm_pool_alloc_basic_dma_addr  ==============
[03:08:00] [PASSED] One page
[03:08:00] [PASSED] More than one page
[03:08:00] [PASSED] Above the allocation limit
[03:08:00] [PASSED] One page, with coherent DMA mappings enabled
[03:08:00] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[03:08:00] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[03:08:00] [PASSED] ttm_pool_alloc_order_caching_match
[03:08:00] [PASSED] ttm_pool_alloc_caching_mismatch
[03:08:00] [PASSED] ttm_pool_alloc_order_mismatch
[03:08:00] [PASSED] ttm_pool_free_dma_alloc
[03:08:00] [PASSED] ttm_pool_free_no_dma_alloc
[03:08:00] [PASSED] ttm_pool_fini_basic
[03:08:00] ==================== [PASSED] ttm_pool =====================
[03:08:00] ================ ttm_resource (8 subtests) =================
[03:08:00] ================= ttm_resource_init_basic  =================
[03:08:00] [PASSED] Init resource in TTM_PL_SYSTEM
[03:08:00] [PASSED] Init resource in TTM_PL_VRAM
[03:08:00] [PASSED] Init resource in a private placement
[03:08:00] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[03:08:00] ============= [PASSED] ttm_resource_init_basic =============
[03:08:00] [PASSED] ttm_resource_init_pinned
[03:08:00] [PASSED] ttm_resource_fini_basic
[03:08:00] [PASSED] ttm_resource_manager_init_basic
[03:08:00] [PASSED] ttm_resource_manager_usage_basic
[03:08:00] [PASSED] ttm_resource_manager_set_used_basic
[03:08:00] [PASSED] ttm_sys_man_alloc_basic
[03:08:00] [PASSED] ttm_sys_man_free_basic
[03:08:00] ================== [PASSED] ttm_resource ===================
[03:08:00] =================== ttm_tt (15 subtests) ===================
[03:08:00] ==================== ttm_tt_init_basic  ====================
[03:08:00] [PASSED] Page-aligned size
[03:08:00] [PASSED] Extra pages requested
[03:08:00] ================ [PASSED] ttm_tt_init_basic ================
[03:08:00] [PASSED] ttm_tt_init_misaligned
[03:08:00] [PASSED] ttm_tt_fini_basic
[03:08:00] [PASSED] ttm_tt_fini_sg
[03:08:00] [PASSED] ttm_tt_fini_shmem
[03:08:00] [PASSED] ttm_tt_create_basic
[03:08:00] [PASSED] ttm_tt_create_invalid_bo_type
[03:08:00] [PASSED] ttm_tt_create_ttm_exists
[03:08:00] [PASSED] ttm_tt_create_failed
[03:08:00] [PASSED] ttm_tt_destroy_basic
[03:08:00] [PASSED] ttm_tt_populate_null_ttm
[03:08:00] [PASSED] ttm_tt_populate_populated_ttm
[03:08:00] [PASSED] ttm_tt_unpopulate_basic
[03:08:00] [PASSED] ttm_tt_unpopulate_empty_ttm
[03:08:00] [PASSED] ttm_tt_swapin_basic
[03:08:00] ===================== [PASSED] ttm_tt ======================
[03:08:00] =================== ttm_bo (14 subtests) ===================
[03:08:00] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[03:08:00] [PASSED] Cannot be interrupted and sleeps
[03:08:00] [PASSED] Cannot be interrupted, locks straight away
[03:08:00] [PASSED] Can be interrupted, sleeps
[03:08:00] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[03:08:00] [PASSED] ttm_bo_reserve_locked_no_sleep
[03:08:00] [PASSED] ttm_bo_reserve_no_wait_ticket
[03:08:00] [PASSED] ttm_bo_reserve_double_resv
[03:08:00] [PASSED] ttm_bo_reserve_interrupted
[03:08:00] [PASSED] ttm_bo_reserve_deadlock
[03:08:00] [PASSED] ttm_bo_unreserve_basic
[03:08:00] [PASSED] ttm_bo_unreserve_pinned
[03:08:00] [PASSED] ttm_bo_unreserve_bulk
[03:08:00] [PASSED] ttm_bo_fini_basic
[03:08:00] [PASSED] ttm_bo_fini_shared_resv
[03:08:00] [PASSED] ttm_bo_pin_basic
[03:08:00] [PASSED] ttm_bo_pin_unpin_resource
[03:08:00] [PASSED] ttm_bo_multiple_pin_one_unpin
[03:08:00] ===================== [PASSED] ttm_bo ======================
[03:08:00] ============== ttm_bo_validate (21 subtests) ===============
[03:08:00] ============== ttm_bo_init_reserved_sys_man  ===============
[03:08:00] [PASSED] Buffer object for userspace
[03:08:00] [PASSED] Kernel buffer object
[03:08:00] [PASSED] Shared buffer object
[03:08:00] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[03:08:00] ============== ttm_bo_init_reserved_mock_man  ==============
[03:08:00] [PASSED] Buffer object for userspace
[03:08:00] [PASSED] Kernel buffer object
[03:08:00] [PASSED] Shared buffer object
[03:08:00] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[03:08:00] [PASSED] ttm_bo_init_reserved_resv
[03:08:00] ================== ttm_bo_validate_basic  ==================
[03:08:00] [PASSED] Buffer object for userspace
[03:08:00] [PASSED] Kernel buffer object
[03:08:00] [PASSED] Shared buffer object
[03:08:00] ============== [PASSED] ttm_bo_validate_basic ==============
[03:08:00] [PASSED] ttm_bo_validate_invalid_placement
[03:08:00] ============= ttm_bo_validate_same_placement  ==============
[03:08:00] [PASSED] System manager
[03:08:00] [PASSED] VRAM manager
[03:08:00] ========= [PASSED] ttm_bo_validate_same_placement ==========
[03:08:00] [PASSED] ttm_bo_validate_failed_alloc
[03:08:00] [PASSED] ttm_bo_validate_pinned
[03:08:00] [PASSED] ttm_bo_validate_busy_placement
[03:08:00] ================ ttm_bo_validate_multihop  =================
[03:08:00] [PASSED] Buffer object for userspace
[03:08:00] [PASSED] Kernel buffer object
[03:08:00] [PASSED] Shared buffer object
[03:08:00] ============ [PASSED] ttm_bo_validate_multihop =============
[03:08:00] ========== ttm_bo_validate_no_placement_signaled  ==========
[03:08:00] [PASSED] Buffer object in system domain, no page vector
[03:08:00] [PASSED] Buffer object in system domain with an existing page vector
[03:08:00] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[03:08:00] ======== ttm_bo_validate_no_placement_not_signaled  ========
[03:08:00] [PASSED] Buffer object for userspace
[03:08:00] [PASSED] Kernel buffer object
[03:08:00] [PASSED] Shared buffer object
[03:08:00] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[03:08:00] [PASSED] ttm_bo_validate_move_fence_signaled
[03:08:00] ========= ttm_bo_validate_move_fence_not_signaled  =========
[03:08:00] [PASSED] Waits for GPU
[03:08:00] [PASSED] Tries to lock straight away
[03:08:00] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[03:08:00] [PASSED] ttm_bo_validate_happy_evict
[03:08:00] [PASSED] ttm_bo_validate_all_pinned_evict
[03:08:00] [PASSED] ttm_bo_validate_allowed_only_evict
[03:08:00] [PASSED] ttm_bo_validate_deleted_evict
[03:08:00] [PASSED] ttm_bo_validate_busy_domain_evict
[03:08:00] [PASSED] ttm_bo_validate_evict_gutting
[03:08:00] [PASSED] ttm_bo_validate_recrusive_evict
[03:08:00] ================= [PASSED] ttm_bo_validate =================
[03:08:00] ============================================================
[03:08:00] Testing complete. Ran 101 tests: passed: 101
[03:08:00] Elapsed time: 11.152s total, 1.737s configuring, 9.148s building, 0.228s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel
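
The xtrace above shows the harness handing ownership of /kernel back to
the uid:gid recorded on the top-level directory, which files created by
the root-run build would otherwise not carry. A minimal sketch of such a
cleanup hook, assuming a bash script with the tree at /kernel (the
function name and trap wiring are illustrative, not taken from this
log):

  cleanup() {
      # Recursively restore /kernel to the owner of its top-level
      # directory, undoing root-owned files left by the build
      chown -R "$(stat -c %u:%g /kernel)" /kernel
  }
  trap cleanup EXIT
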




* ✗ Xe.CI.Full: failure for VF migration redesign (rev3)
  2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
                   ` (37 preceding siblings ...)
  2025-09-29  3:08 ` ✓ CI.KUnit: success " Patchwork
@ 2025-09-29  6:28 ` Patchwork
  38 siblings, 0 replies; 83+ messages in thread
From: Patchwork @ 2025-09-29  6:28 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 64554 bytes --]

== Series Details ==

Series: VF migration redesign (rev3)
URL   : https://patchwork.freedesktop.org/series/154627/
State : failure

== Summary ==

CI Bug Log - changes from xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce_FULL -> xe-pw-154627v3_FULL
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes introduced with xe-pw-154627v3_FULL need to be
  verified manually.

  If you believe the reported changes are unrelated to the changes
  introduced in xe-pw-154627v3_FULL, please notify your bug team
  (I915-ci-infra@lists.freedesktop.org) so they can document this new
  failure mode, which will reduce false positives in CI.

  

Participating hosts (4 -> 4)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in xe-pw-154627v3_FULL:

### IGT changes ###

#### Possible regressions ####

  * igt@xe_evict@evict-large-external-cm:
    - shard-bmg:          [PASS][1] -> [ABORT][2] +3 other tests abort
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-7/igt@xe_evict@evict-large-external-cm.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@xe_evict@evict-large-external-cm.html

  * igt@xe_exec_balancer@many-cm-parallel-userptr-invalidate:
    - shard-bmg:          [PASS][3] -> [INCOMPLETE][4] +15 other tests incomplete
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-7/igt@xe_exec_balancer@many-cm-parallel-userptr-invalidate.html
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-3/igt@xe_exec_balancer@many-cm-parallel-userptr-invalidate.html

  * igt@xe_exec_balancer@many-execqueues-cm-parallel-userptr-invalidate:
    - shard-dg2-set2:     [PASS][5] -> [ABORT][6] +4 other tests abort
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-463/igt@xe_exec_balancer@many-execqueues-cm-parallel-userptr-invalidate.html
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-435/igt@xe_exec_balancer@many-execqueues-cm-parallel-userptr-invalidate.html

  * igt@xe_exec_compute_mode@twice-bindexecqueue-userptr-invalidate:
    - shard-dg2-set2:     [PASS][7] -> [INCOMPLETE][8] +8 other tests incomplete
   [7]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-432/igt@xe_exec_compute_mode@twice-bindexecqueue-userptr-invalidate.html
   [8]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-466/igt@xe_exec_compute_mode@twice-bindexecqueue-userptr-invalidate.html

  * igt@xe_exec_threads@threads-cm-fd-userptr-invalidate:
    - shard-adlp:         [PASS][9] -> [INCOMPLETE][10] +14 other tests incomplete
   [9]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-9/igt@xe_exec_threads@threads-cm-fd-userptr-invalidate.html
   [10]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-9/igt@xe_exec_threads@threads-cm-fd-userptr-invalidate.html

  * igt@xe_exec_threads@threads-cm-shared-vm-userptr-invalidate-race:
    - shard-lnl:          [PASS][11] -> [INCOMPLETE][12] +14 other tests incomplete
   [11]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-2/igt@xe_exec_threads@threads-cm-shared-vm-userptr-invalidate-race.html
   [12]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-7/igt@xe_exec_threads@threads-cm-shared-vm-userptr-invalidate-race.html

  * igt@xe_sriov_auto_provisioning@selfconfig-reprovision-increase-numvfs@vf-random:
    - shard-adlp:         [PASS][13] -> [ABORT][14] +8 other tests abort
   [13]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-3/igt@xe_sriov_auto_provisioning@selfconfig-reprovision-increase-numvfs@vf-random.html
   [14]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-4/igt@xe_sriov_auto_provisioning@selfconfig-reprovision-increase-numvfs@vf-random.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * {igt@xe_compute_preempt@compute-preempt-many-vram-evict@engine-drm_xe_engine_class_compute}:
    - shard-bmg:          [PASS][15] -> [INCOMPLETE][16] +1 other test incomplete
   [15]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-6/igt@xe_compute_preempt@compute-preempt-many-vram-evict@engine-drm_xe_engine_class_compute.html
   [16]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@xe_compute_preempt@compute-preempt-many-vram-evict@engine-drm_xe_engine_class_compute.html

  * {igt@xe_pmu@engine-activity-suspend}:
    - shard-adlp:         [PASS][17] -> [ABORT][18] +5 other tests abort
   [17]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-4/igt@xe_pmu@engine-activity-suspend.html
   [18]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-1/igt@xe_pmu@engine-activity-suspend.html

  
New tests
---------

  New tests have been introduced between xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce_FULL and xe-pw-154627v3_FULL:

### New IGT tests (8) ###

  * igt@kms_lease@cursor-implicit-plane@pipe-a-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.26] s

  * igt@kms_lease@cursor-implicit-plane@pipe-b-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.25] s

  * igt@kms_lease@cursor-implicit-plane@pipe-c-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.26] s

  * igt@kms_lease@cursor-implicit-plane@pipe-d-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.24] s

  * igt@kms_lease@lease-revoke@pipe-a-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.06] s

  * igt@kms_lease@lease-revoke@pipe-b-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.06] s

  * igt@kms_lease@lease-revoke@pipe-c-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.06] s

  * igt@kms_lease@lease-revoke@pipe-d-dp-4:
    - Statuses : 1 pass(s)
    - Exec time: [0.06] s

  

Known issues
------------

  Here are the changes found in xe-pw-154627v3_FULL that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_big_fb@linear-64bpp-rotate-90:
    - shard-bmg:          NOTRUN -> [SKIP][19] ([Intel XE#2327])
   [19]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_big_fb@linear-64bpp-rotate-90.html

  * igt@kms_big_fb@linear-8bpp-rotate-270:
    - shard-dg2-set2:     NOTRUN -> [SKIP][20] ([Intel XE#316])
   [20]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_big_fb@linear-8bpp-rotate-270.html

  * igt@kms_big_fb@y-tiled-16bpp-rotate-0:
    - shard-dg2-set2:     NOTRUN -> [SKIP][21] ([Intel XE#1124])
   [21]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_big_fb@y-tiled-16bpp-rotate-0.html

  * igt@kms_big_fb@yf-tiled-8bpp-rotate-0:
    - shard-bmg:          NOTRUN -> [SKIP][22] ([Intel XE#1124])
   [22]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_big_fb@yf-tiled-8bpp-rotate-0.html

  * igt@kms_bw@connected-linear-tiling-2-displays-3840x2160p:
    - shard-bmg:          [PASS][23] -> [SKIP][24] ([Intel XE#2314] / [Intel XE#2894])
   [23]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@kms_bw@connected-linear-tiling-2-displays-3840x2160p.html
   [24]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@kms_bw@connected-linear-tiling-2-displays-3840x2160p.html

  * igt@kms_bw@connected-linear-tiling-3-displays-2560x1440p:
    - shard-bmg:          NOTRUN -> [SKIP][25] ([Intel XE#2314] / [Intel XE#2894])
   [25]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_bw@connected-linear-tiling-3-displays-2560x1440p.html

  * igt@kms_bw@linear-tiling-1-displays-2560x1440p:
    - shard-dg2-set2:     NOTRUN -> [SKIP][26] ([Intel XE#367])
   [26]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_bw@linear-tiling-1-displays-2560x1440p.html

  * igt@kms_bw@linear-tiling-4-displays-3840x2160p:
    - shard-bmg:          NOTRUN -> [SKIP][27] ([Intel XE#367])
   [27]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_bw@linear-tiling-4-displays-3840x2160p.html

  * igt@kms_ccs@bad-rotation-90-4-tiled-lnl-ccs@pipe-c-dp-2:
    - shard-bmg:          NOTRUN -> [SKIP][28] ([Intel XE#2652] / [Intel XE#787]) +3 other tests skip
   [28]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-3/igt@kms_ccs@bad-rotation-90-4-tiled-lnl-ccs@pipe-c-dp-2.html

  * igt@kms_ccs@crc-primary-basic-y-tiled-gen12-mc-ccs@pipe-d-dp-2:
    - shard-dg2-set2:     NOTRUN -> [SKIP][29] ([Intel XE#455] / [Intel XE#787]) +7 other tests skip
   [29]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_ccs@crc-primary-basic-y-tiled-gen12-mc-ccs@pipe-d-dp-2.html

  * igt@kms_ccs@crc-primary-basic-y-tiled-gen12-rc-ccs:
    - shard-lnl:          NOTRUN -> [SKIP][30] ([Intel XE#2887]) +1 other test skip
   [30]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_ccs@crc-primary-basic-y-tiled-gen12-rc-ccs.html

  * igt@kms_ccs@crc-sprite-planes-basic-4-tiled-mtl-rc-ccs-cc@pipe-c-hdmi-a-6:
    - shard-dg2-set2:     NOTRUN -> [SKIP][31] ([Intel XE#787]) +48 other tests skip
   [31]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-436/igt@kms_ccs@crc-sprite-planes-basic-4-tiled-mtl-rc-ccs-cc@pipe-c-hdmi-a-6.html

  * igt@kms_ccs@crc-sprite-planes-basic-y-tiled-gen12-rc-ccs:
    - shard-bmg:          NOTRUN -> [SKIP][32] ([Intel XE#2887]) +4 other tests skip
   [32]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_ccs@crc-sprite-planes-basic-y-tiled-gen12-rc-ccs.html

  * igt@kms_chamelium_frames@vga-frame-dump:
    - shard-dg2-set2:     NOTRUN -> [SKIP][33] ([Intel XE#373]) +1 other test skip
   [33]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_chamelium_frames@vga-frame-dump.html

  * igt@kms_chamelium_hpd@hdmi-hpd-fast:
    - shard-lnl:          NOTRUN -> [SKIP][34] ([Intel XE#373])
   [34]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_chamelium_hpd@hdmi-hpd-fast.html

  * igt@kms_content_protection@atomic@pipe-a-dp-2:
    - shard-dg2-set2:     NOTRUN -> [FAIL][35] ([Intel XE#1178])
   [35]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_content_protection@atomic@pipe-a-dp-2.html

  * igt@kms_content_protection@dp-mst-lic-type-0:
    - shard-dg2-set2:     NOTRUN -> [SKIP][36] ([Intel XE#307])
   [36]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_content_protection@dp-mst-lic-type-0.html

  * igt@kms_content_protection@type1:
    - shard-lnl:          NOTRUN -> [SKIP][37] ([Intel XE#3278])
   [37]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_content_protection@type1.html

  * igt@kms_cursor_crc@cursor-random-256x256:
    - shard-dg2-set2:     [PASS][38] -> [INCOMPLETE][39] ([Intel XE#4842]) +1 other test incomplete
   [38]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-434/igt@kms_cursor_crc@cursor-random-256x256.html
   [39]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-463/igt@kms_cursor_crc@cursor-random-256x256.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-varying-size:
    - shard-bmg:          [PASS][40] -> [SKIP][41] ([Intel XE#2291])
   [40]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@kms_cursor_legacy@cursora-vs-flipb-varying-size.html
   [41]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@kms_cursor_legacy@cursora-vs-flipb-varying-size.html

  * igt@kms_flip@2x-flip-vs-modeset-vs-hang:
    - shard-lnl:          NOTRUN -> [SKIP][42] ([Intel XE#1421])
   [42]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_flip@2x-flip-vs-modeset-vs-hang.html

  * igt@kms_flip@2x-flip-vs-rmfb:
    - shard-bmg:          [PASS][43] -> [SKIP][44] ([Intel XE#2316])
   [43]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@kms_flip@2x-flip-vs-rmfb.html
   [44]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@kms_flip@2x-flip-vs-rmfb.html

  * igt@kms_flip@flip-vs-absolute-wf_vblank-interruptible:
    - shard-adlp:         [PASS][45] -> [DMESG-WARN][46] ([Intel XE#4543]) +3 other tests dmesg-warn
   [45]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-6/igt@kms_flip@flip-vs-absolute-wf_vblank-interruptible.html
   [46]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-4/igt@kms_flip@flip-vs-absolute-wf_vblank-interruptible.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling:
    - shard-lnl:          NOTRUN -> [SKIP][47] ([Intel XE#1401] / [Intel XE#1745])
   [47]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling@pipe-a-default-mode:
    - shard-lnl:          NOTRUN -> [SKIP][48] ([Intel XE#1401])
   [48]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling@pipe-a-default-mode.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilegen12rcccs-upscaling:
    - shard-bmg:          NOTRUN -> [SKIP][49] ([Intel XE#2293] / [Intel XE#2380])
   [49]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilegen12rcccs-upscaling.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilegen12rcccs-upscaling@pipe-a-valid-mode:
    - shard-bmg:          NOTRUN -> [SKIP][50] ([Intel XE#2293])
   [50]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilegen12rcccs-upscaling@pipe-a-valid-mode.html

  * igt@kms_frontbuffer_tracking@drrs-2p-primscrn-indfb-pgflip-blt:
    - shard-bmg:          NOTRUN -> [SKIP][51] ([Intel XE#2311]) +3 other tests skip
   [51]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_frontbuffer_tracking@drrs-2p-primscrn-indfb-pgflip-blt.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-indfb-pgflip-blt:
    - shard-lnl:          NOTRUN -> [SKIP][52] ([Intel XE#656]) +2 other tests skip
   [52]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-indfb-pgflip-blt.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt:
    - shard-bmg:          NOTRUN -> [SKIP][53] ([Intel XE#5390]) +1 other test skip
   [53]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-shrfb-plflip-blt:
    - shard-dg2-set2:     NOTRUN -> [SKIP][54] ([Intel XE#651]) +5 other tests skip
   [54]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-shrfb-plflip-blt.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-draw-mmap-wc:
    - shard-dg2-set2:     NOTRUN -> [SKIP][55] ([Intel XE#653]) +5 other tests skip
   [55]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@psr-rgb101010-draw-mmap-wc:
    - shard-bmg:          NOTRUN -> [SKIP][56] ([Intel XE#2313]) +3 other tests skip
   [56]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_frontbuffer_tracking@psr-rgb101010-draw-mmap-wc.html

  * igt@kms_hdr@brightness-with-hdr:
    - shard-dg2-set2:     NOTRUN -> [SKIP][57] ([Intel XE#455]) +3 other tests skip
   [57]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_hdr@brightness-with-hdr.html

  * igt@kms_pm_dc@dc6-psr:
    - shard-dg2-set2:     NOTRUN -> [SKIP][58] ([Intel XE#1129])
   [58]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_pm_dc@dc6-psr.html

  * igt@kms_psr2_sf@fbc-pr-overlay-plane-move-continuous-exceed-fully-sf:
    - shard-dg2-set2:     NOTRUN -> [SKIP][59] ([Intel XE#1406] / [Intel XE#1489]) +2 other tests skip
   [59]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_psr2_sf@fbc-pr-overlay-plane-move-continuous-exceed-fully-sf.html

  * igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-sf:
    - shard-lnl:          NOTRUN -> [SKIP][60] ([Intel XE#1406] / [Intel XE#2893] / [Intel XE#4608])
   [60]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-sf.html

  * igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-sf@pipe-b-edp-1:
    - shard-lnl:          NOTRUN -> [SKIP][61] ([Intel XE#1406] / [Intel XE#4608]) +1 other test skip
   [61]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-sf@pipe-b-edp-1.html

  * igt@kms_psr2_sf@fbc-psr2-cursor-plane-update-sf:
    - shard-bmg:          NOTRUN -> [SKIP][62] ([Intel XE#1406] / [Intel XE#1489])
   [62]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_psr2_sf@fbc-psr2-cursor-plane-update-sf.html

  * igt@kms_psr@fbc-psr-sprite-plane-move:
    - shard-dg2-set2:     NOTRUN -> [SKIP][63] ([Intel XE#1406] / [Intel XE#2850] / [Intel XE#929]) +1 other test skip
   [63]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_psr@fbc-psr-sprite-plane-move.html

  * igt@kms_psr@pr-suspend:
    - shard-lnl:          NOTRUN -> [SKIP][64] ([Intel XE#1406])
   [64]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@kms_psr@pr-suspend.html

  * igt@kms_psr@psr2-cursor-plane-move:
    - shard-bmg:          NOTRUN -> [SKIP][65] ([Intel XE#1406] / [Intel XE#2234] / [Intel XE#2850])
   [65]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_psr@psr2-cursor-plane-move.html

  * igt@kms_vrr@max-min:
    - shard-lnl:          [PASS][66] -> [FAIL][67] ([Intel XE#4227]) +1 other test fail
   [66]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-4/igt@kms_vrr@max-min.html
   [67]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-4/igt@kms_vrr@max-min.html

  * igt@xe_eudebug@basic-vm-access-parameters-userptr-faultable:
    - shard-dg2-set2:     NOTRUN -> [SKIP][68] ([Intel XE#4837])
   [68]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_eudebug@basic-vm-access-parameters-userptr-faultable.html

  * igt@xe_eudebug_online@preempt-breakpoint:
    - shard-lnl:          NOTRUN -> [SKIP][69] ([Intel XE#4837])
   [69]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@xe_eudebug_online@preempt-breakpoint.html

  * igt@xe_eudebug_online@writes-caching-vram-bb-vram-target-vram:
    - shard-bmg:          NOTRUN -> [SKIP][70] ([Intel XE#4837]) +1 other test skip
   [70]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@xe_eudebug_online@writes-caching-vram-bb-vram-target-vram.html

  * igt@xe_eudebug_sriov@deny-sriov:
    - shard-lnl:          NOTRUN -> [SKIP][71] ([Intel XE#4518])
   [71]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@xe_eudebug_sriov@deny-sriov.html

  * igt@xe_exec_basic@multigpu-no-exec-basic-defer-mmap:
    - shard-dg2-set2:     NOTRUN -> [SKIP][72] ([Intel XE#1392]) +1 other test skip
   [72]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_exec_basic@multigpu-no-exec-basic-defer-mmap.html

  * igt@xe_exec_basic@multigpu-once-basic-defer-bind:
    - shard-bmg:          NOTRUN -> [SKIP][73] ([Intel XE#2322])
   [73]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@xe_exec_basic@multigpu-once-basic-defer-bind.html

  * igt@xe_exec_fault_mode@once-rebind-imm:
    - shard-dg2-set2:     NOTRUN -> [SKIP][74] ([Intel XE#288]) +5 other tests skip
   [74]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_exec_fault_mode@once-rebind-imm.html

  * igt@xe_exec_system_allocator@fault-threads-same-page-benchmark:
    - shard-dg2-set2:     NOTRUN -> [SKIP][75] ([Intel XE#4915]) +55 other tests skip
   [75]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_exec_system_allocator@fault-threads-same-page-benchmark.html

  * igt@xe_exec_system_allocator@process-many-stride-mmap-huge:
    - shard-bmg:          NOTRUN -> [SKIP][76] ([Intel XE#4943]) +2 other tests skip
   [76]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@xe_exec_system_allocator@process-many-stride-mmap-huge.html

  * igt@xe_exec_threads@threads-bal-mixed-fd-userptr-invalidate:
    - shard-dg2-set2:     [PASS][77] -> [INCOMPLETE][78] ([Intel XE#6134]) +6 other tests incomplete
   [77]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-432/igt@xe_exec_threads@threads-bal-mixed-fd-userptr-invalidate.html
   [78]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-436/igt@xe_exec_threads@threads-bal-mixed-fd-userptr-invalidate.html

  * igt@xe_fault_injection@probe-fail-guc-xe_guc_mmio_send_recv:
    - shard-dg2-set2:     [PASS][79] -> [DMESG-WARN][80] ([Intel XE#5893])
   [79]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-434/igt@xe_fault_injection@probe-fail-guc-xe_guc_mmio_send_recv.html
   [80]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-463/igt@xe_fault_injection@probe-fail-guc-xe_guc_mmio_send_recv.html
    - shard-lnl:          NOTRUN -> [ABORT][81] ([Intel XE#4757])
   [81]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@xe_fault_injection@probe-fail-guc-xe_guc_mmio_send_recv.html

  * igt@xe_mmap@small-bar:
    - shard-dg2-set2:     NOTRUN -> [SKIP][82] ([Intel XE#512])
   [82]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_mmap@small-bar.html

  * igt@xe_oa@polling-small-buf:
    - shard-dg2-set2:     NOTRUN -> [SKIP][83] ([Intel XE#3573]) +1 other test skip
   [83]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_oa@polling-small-buf.html

  * igt@xe_pmu@engine-activity-all-load-idle:
    - shard-bmg:          NOTRUN -> [DMESG-WARN][84] ([Intel XE#6190])
   [84]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@xe_pmu@engine-activity-all-load-idle.html

  * igt@xe_pxp@pxp-stale-bo-exec-post-rpm:
    - shard-dg2-set2:     NOTRUN -> [SKIP][85] ([Intel XE#4733])
   [85]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_pxp@pxp-stale-bo-exec-post-rpm.html

  * igt@xe_query@multigpu-query-uc-fw-version-huc:
    - shard-dg2-set2:     NOTRUN -> [SKIP][86] ([Intel XE#944])
   [86]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_query@multigpu-query-uc-fw-version-huc.html

  * igt@xe_sriov_scheduling@equal-throughput:
    - shard-dg2-set2:     NOTRUN -> [SKIP][87] ([Intel XE#4351])
   [87]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_sriov_scheduling@equal-throughput.html

  
#### Possible fixes ####

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs:
    - shard-dg2-set2:     [INCOMPLETE][88] ([Intel XE#2705] / [Intel XE#4212] / [Intel XE#4345]) -> [PASS][89]
   [88]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-466/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs.html
   [89]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs.html

  * igt@kms_flip@flip-vs-rmfb-interruptible@d-hdmi-a1:
    - shard-adlp:         [DMESG-WARN][90] ([Intel XE#4543]) -> [PASS][91] +2 other tests pass
   [90]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-1/igt@kms_flip@flip-vs-rmfb-interruptible@d-hdmi-a1.html
   [91]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-3/igt@kms_flip@flip-vs-rmfb-interruptible@d-hdmi-a1.html

  * igt@kms_pm_rpm@basic-pci-d3-state:
    - shard-dg2-set2:     [FAIL][92] ([Intel XE#4741]) -> [PASS][93]
   [92]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-436/igt@kms_pm_rpm@basic-pci-d3-state.html
   [93]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-466/igt@kms_pm_rpm@basic-pci-d3-state.html

  * igt@kms_setmode@basic@pipe-b-edp-1:
    - shard-lnl:          [FAIL][94] ([Intel XE#2883]) -> [PASS][95] +2 other tests pass
   [94]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-4/igt@kms_setmode@basic@pipe-b-edp-1.html
   [95]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-5/igt@kms_setmode@basic@pipe-b-edp-1.html

  * igt@xe_exec_basic@multigpu-no-exec-basic:
    - shard-dg2-set2:     [SKIP][96] ([Intel XE#1392]) -> [PASS][97] +3 other tests pass
   [96]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-432/igt@xe_exec_basic@multigpu-no-exec-basic.html
   [97]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-436/igt@xe_exec_basic@multigpu-no-exec-basic.html

  * {igt@xe_exec_system_allocator@many-64k-malloc-prefetch}:
    - shard-lnl:          [CRASH][98] ([Intel XE#6192]) -> [PASS][99] +1 other test pass
   [98]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-5/igt@xe_exec_system_allocator@many-64k-malloc-prefetch.html
   [99]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-1/igt@xe_exec_system_allocator@many-64k-malloc-prefetch.html

  * igt@xe_module_load@load:
    - shard-lnl:          ([PASS][100], [PASS][101], [PASS][102], [PASS][103], [PASS][104], [PASS][105], [PASS][106], [PASS][107], [PASS][108], [PASS][109], [PASS][110], [PASS][111], [PASS][112], [PASS][113], [PASS][114], [PASS][115], [PASS][116], [PASS][117], [PASS][118], [PASS][119], [PASS][120], [PASS][121], [SKIP][122], [PASS][123], [PASS][124]) ([Intel XE#378]) -> ([PASS][125], [PASS][126], [PASS][127], [PASS][128], [PASS][129], [PASS][130], [PASS][131], [PASS][132], [PASS][133], [PASS][134], [PASS][135], [PASS][136], [PASS][137], [PASS][138], [PASS][139], [PASS][140], [PASS][141], [PASS][142], [PASS][143], [PASS][144], [PASS][145], [PASS][146], [PASS][147], [PASS][148], [PASS][149])
   [100]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-4/igt@xe_module_load@load.html
   [101]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-2/igt@xe_module_load@load.html
   [102]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-7/igt@xe_module_load@load.html
   [103]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-1/igt@xe_module_load@load.html
   [104]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-5/igt@xe_module_load@load.html
   [105]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-5/igt@xe_module_load@load.html
   [106]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-5/igt@xe_module_load@load.html
   [107]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-5/igt@xe_module_load@load.html
   [108]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-1/igt@xe_module_load@load.html
   [109]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-8/igt@xe_module_load@load.html
   [110]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-4/igt@xe_module_load@load.html
   [111]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-1/igt@xe_module_load@load.html
   [112]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-1/igt@xe_module_load@load.html
   [113]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-8/igt@xe_module_load@load.html
   [114]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-8/igt@xe_module_load@load.html
   [115]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-3/igt@xe_module_load@load.html
   [116]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-3/igt@xe_module_load@load.html
   [117]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-3/igt@xe_module_load@load.html
   [118]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-4/igt@xe_module_load@load.html
   [119]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-3/igt@xe_module_load@load.html
   [120]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-7/igt@xe_module_load@load.html
   [121]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-7/igt@xe_module_load@load.html
   [122]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-8/igt@xe_module_load@load.html
   [123]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-2/igt@xe_module_load@load.html
   [124]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-2/igt@xe_module_load@load.html
   [125]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@xe_module_load@load.html
   [126]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@xe_module_load@load.html
   [127]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-8/igt@xe_module_load@load.html
   [128]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-3/igt@xe_module_load@load.html
   [129]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-1/igt@xe_module_load@load.html
   [130]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-3/igt@xe_module_load@load.html
   [131]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-5/igt@xe_module_load@load.html
   [132]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-5/igt@xe_module_load@load.html
   [133]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@xe_module_load@load.html
   [134]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-2/igt@xe_module_load@load.html
   [135]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-7/igt@xe_module_load@load.html
   [136]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-7/igt@xe_module_load@load.html
   [137]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-7/igt@xe_module_load@load.html
   [138]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-7/igt@xe_module_load@load.html
   [139]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-1/igt@xe_module_load@load.html
   [140]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-4/igt@xe_module_load@load.html
   [141]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-8/igt@xe_module_load@load.html
   [142]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-5/igt@xe_module_load@load.html
   [143]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-8/igt@xe_module_load@load.html
   [144]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-4/igt@xe_module_load@load.html
   [145]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-4/igt@xe_module_load@load.html
   [146]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-3/igt@xe_module_load@load.html
   [147]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-3/igt@xe_module_load@load.html
   [148]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-1/igt@xe_module_load@load.html
   [149]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-1/igt@xe_module_load@load.html
    - shard-bmg:          ([PASS][150], [PASS][151], [PASS][152], [PASS][153], [PASS][154], [PASS][155], [PASS][156], [PASS][157], [PASS][158], [PASS][159], [PASS][160], [PASS][161], [PASS][162], [PASS][163], [PASS][164], [SKIP][165], [PASS][166], [PASS][167], [PASS][168], [PASS][169], [PASS][170], [PASS][171], [PASS][172], [PASS][173], [PASS][174]) ([Intel XE#2457]) -> ([PASS][175], [PASS][176], [PASS][177], [PASS][178], [PASS][179], [PASS][180], [PASS][181], [PASS][182], [PASS][183], [PASS][184], [PASS][185], [PASS][186], [PASS][187], [PASS][188], [PASS][189], [PASS][190], [PASS][191], [PASS][192], [PASS][193], [PASS][194], [PASS][195], [PASS][196], [PASS][197], [PASS][198], [PASS][199])
   [150]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-3/igt@xe_module_load@load.html
   [151]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-3/igt@xe_module_load@load.html
   [152]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-8/igt@xe_module_load@load.html
   [153]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@xe_module_load@load.html
   [154]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-8/igt@xe_module_load@load.html
   [155]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-8/igt@xe_module_load@load.html
   [156]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@xe_module_load@load.html
   [157]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-1/igt@xe_module_load@load.html
   [158]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-1/igt@xe_module_load@load.html
   [159]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-1/igt@xe_module_load@load.html
   [160]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-8/igt@xe_module_load@load.html
   [161]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-6/igt@xe_module_load@load.html
   [162]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-6/igt@xe_module_load@load.html
   [163]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-7/igt@xe_module_load@load.html
   [164]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-3/igt@xe_module_load@load.html
   [165]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-6/igt@xe_module_load@load.html
   [166]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@xe_module_load@load.html
   [167]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-3/igt@xe_module_load@load.html
   [168]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-6/igt@xe_module_load@load.html
   [169]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-5/igt@xe_module_load@load.html
   [170]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-7/igt@xe_module_load@load.html
   [171]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-7/igt@xe_module_load@load.html
   [172]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@xe_module_load@load.html
   [173]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-5/igt@xe_module_load@load.html
   [174]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-5/igt@xe_module_load@load.html
   [175]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-4/igt@xe_module_load@load.html
   [176]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-4/igt@xe_module_load@load.html
   [177]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@xe_module_load@load.html
   [178]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@xe_module_load@load.html
   [179]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-8/igt@xe_module_load@load.html
   [180]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-8/igt@xe_module_load@load.html
   [181]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-3/igt@xe_module_load@load.html
   [182]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-4/igt@xe_module_load@load.html
   [183]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-4/igt@xe_module_load@load.html
   [184]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@xe_module_load@load.html
   [185]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@xe_module_load@load.html
   [186]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@xe_module_load@load.html
   [187]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@xe_module_load@load.html
   [188]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-8/igt@xe_module_load@load.html
   [189]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-1/igt@xe_module_load@load.html
   [190]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-1/igt@xe_module_load@load.html
   [191]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@xe_module_load@load.html
   [192]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@xe_module_load@load.html
   [193]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@xe_module_load@load.html
   [194]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-3/igt@xe_module_load@load.html
   [195]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-3/igt@xe_module_load@load.html
   [196]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@xe_module_load@load.html
   [197]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@xe_module_load@load.html
   [198]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@xe_module_load@load.html
   [199]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-8/igt@xe_module_load@load.html
    - shard-adlp:         ([PASS][200], [PASS][201], [PASS][202], [PASS][203], [PASS][204], [PASS][205], [PASS][206], [PASS][207], [PASS][208], [SKIP][209], [PASS][210], [PASS][211], [PASS][212], [PASS][213], [PASS][214], [PASS][215], [PASS][216], [PASS][217], [PASS][218], [PASS][219], [PASS][220], [PASS][221], [PASS][222], [PASS][223], [PASS][224], [PASS][225]) ([Intel XE#378] / [Intel XE#5612]) -> ([PASS][226], [PASS][227], [PASS][228], [PASS][229], [PASS][230], [PASS][231], [PASS][232], [PASS][233], [PASS][234], [PASS][235], [PASS][236], [PASS][237], [PASS][238], [PASS][239], [PASS][240], [PASS][241], [PASS][242], [PASS][243], [PASS][244], [PASS][245], [PASS][246], [PASS][247], [PASS][248], [PASS][249], [PASS][250])
   [200]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-8/igt@xe_module_load@load.html
   [201]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-3/igt@xe_module_load@load.html
   [202]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-8/igt@xe_module_load@load.html
   [203]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-3/igt@xe_module_load@load.html
   [204]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-9/igt@xe_module_load@load.html
   [205]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-9/igt@xe_module_load@load.html
   [206]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-2/igt@xe_module_load@load.html
   [207]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-4/igt@xe_module_load@load.html
   [208]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-1/igt@xe_module_load@load.html
   [209]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-8/igt@xe_module_load@load.html
   [210]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-4/igt@xe_module_load@load.html
   [211]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-9/igt@xe_module_load@load.html
   [212]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-4/igt@xe_module_load@load.html
   [213]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-6/igt@xe_module_load@load.html
   [214]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-8/igt@xe_module_load@load.html
   [215]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-6/igt@xe_module_load@load.html
   [216]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-6/igt@xe_module_load@load.html
   [217]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-6/igt@xe_module_load@load.html
   [218]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-8/igt@xe_module_load@load.html
   [219]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-1/igt@xe_module_load@load.html
   [220]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-1/igt@xe_module_load@load.html
   [221]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-2/igt@xe_module_load@load.html
   [222]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-2/igt@xe_module_load@load.html
   [223]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-2/igt@xe_module_load@load.html
   [224]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-3/igt@xe_module_load@load.html
   [225]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-3/igt@xe_module_load@load.html
   [226]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-8/igt@xe_module_load@load.html
   [227]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-9/igt@xe_module_load@load.html
   [228]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-9/igt@xe_module_load@load.html
   [229]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-6/igt@xe_module_load@load.html
   [230]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-8/igt@xe_module_load@load.html
   [231]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-8/igt@xe_module_load@load.html
   [232]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-3/igt@xe_module_load@load.html
   [233]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-6/igt@xe_module_load@load.html
   [234]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-3/igt@xe_module_load@load.html
   [235]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-3/igt@xe_module_load@load.html
   [236]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-1/igt@xe_module_load@load.html
   [237]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-6/igt@xe_module_load@load.html
   [238]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-1/igt@xe_module_load@load.html
   [239]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-1/igt@xe_module_load@load.html
   [240]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-2/igt@xe_module_load@load.html
   [241]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-6/igt@xe_module_load@load.html
   [242]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-4/igt@xe_module_load@load.html
   [243]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-2/igt@xe_module_load@load.html
   [244]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-2/igt@xe_module_load@load.html
   [245]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-2/igt@xe_module_load@load.html
   [246]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-4/igt@xe_module_load@load.html
   [247]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-4/igt@xe_module_load@load.html
   [248]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-9/igt@xe_module_load@load.html
   [249]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-9/igt@xe_module_load@load.html
   [250]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-8/igt@xe_module_load@load.html
    - shard-dg2-set2:     ([PASS][251], [PASS][252], [PASS][253], [PASS][254], [PASS][255], [PASS][256], [SKIP][257], [PASS][258], [PASS][259], [PASS][260], [PASS][261], [PASS][262], [PASS][263], [PASS][264], [PASS][265], [PASS][266], [PASS][267], [PASS][268], [PASS][269], [PASS][270], [PASS][271], [PASS][272], [PASS][273], [PASS][274], [PASS][275], [PASS][276]) ([Intel XE#378]) -> ([PASS][277], [PASS][278], [PASS][279], [PASS][280], [PASS][281], [PASS][282], [PASS][283], [PASS][284], [PASS][285], [PASS][286], [PASS][287], [PASS][288], [PASS][289], [PASS][290], [PASS][291], [PASS][292], [PASS][293], [PASS][294], [PASS][295], [PASS][296], [PASS][297], [PASS][298], [PASS][299], [PASS][300], [PASS][301])
   [251]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-434/igt@xe_module_load@load.html
   [252]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-436/igt@xe_module_load@load.html
   [253]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-434/igt@xe_module_load@load.html
   [254]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-436/igt@xe_module_load@load.html
   [255]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-432/igt@xe_module_load@load.html
   [256]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-432/igt@xe_module_load@load.html
   [257]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-463/igt@xe_module_load@load.html
   [258]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-466/igt@xe_module_load@load.html
   [259]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-466/igt@xe_module_load@load.html
   [260]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-435/igt@xe_module_load@load.html
   [261]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-463/igt@xe_module_load@load.html
   [262]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-463/igt@xe_module_load@load.html
   [263]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-433/igt@xe_module_load@load.html
   [264]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-463/igt@xe_module_load@load.html
   [265]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-434/igt@xe_module_load@load.html
   [266]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-433/igt@xe_module_load@load.html
   [267]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-435/igt@xe_module_load@load.html
   [268]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-463/igt@xe_module_load@load.html
   [269]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-464/igt@xe_module_load@load.html
   [270]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-464/igt@xe_module_load@load.html
   [271]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-464/igt@xe_module_load@load.html
   [272]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-433/igt@xe_module_load@load.html
   [273]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-436/igt@xe_module_load@load.html
   [274]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-432/igt@xe_module_load@load.html
   [275]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-432/igt@xe_module_load@load.html
   [276]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-dg2-433/igt@xe_module_load@load.html
   [277]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-436/igt@xe_module_load@load.html
   [278]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-436/igt@xe_module_load@load.html
   [279]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-464/igt@xe_module_load@load.html
   [280]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-434/igt@xe_module_load@load.html
   [281]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-435/igt@xe_module_load@load.html
   [282]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-435/igt@xe_module_load@load.html
   [283]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-435/igt@xe_module_load@load.html
   [284]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_module_load@load.html
   [285]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_module_load@load.html
   [286]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-464/igt@xe_module_load@load.html
   [287]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-464/igt@xe_module_load@load.html
   [288]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-432/igt@xe_module_load@load.html
   [289]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-436/igt@xe_module_load@load.html
   [290]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-434/igt@xe_module_load@load.html
   [291]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-463/igt@xe_module_load@load.html
   [292]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-463/igt@xe_module_load@load.html
   [293]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-433/igt@xe_module_load@load.html
   [294]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-466/igt@xe_module_load@load.html
   [295]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-466/igt@xe_module_load@load.html
   [296]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-434/igt@xe_module_load@load.html
   [297]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-433/igt@xe_module_load@load.html
   [298]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-466/igt@xe_module_load@load.html
   [299]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-466/igt@xe_module_load@load.html
   [300]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-463/igt@xe_module_load@load.html
   [301]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-dg2-433/igt@xe_module_load@load.html

  * igt@xe_pmu@gt-frequency:
    - shard-lnl:          [FAIL][302] ([Intel XE#5166]) -> [PASS][303] +1 other test pass
   [302]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-lnl-2/igt@xe_pmu@gt-frequency.html
   [303]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-lnl-4/igt@xe_pmu@gt-frequency.html

  
#### Warnings ####

  * igt@kms_frontbuffer_tracking@drrs-2p-primscrn-spr-indfb-draw-render:
    - shard-bmg:          [SKIP][304] ([Intel XE#2312]) -> [SKIP][305] ([Intel XE#2311]) +1 other test skip
   [304]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-6/igt@kms_frontbuffer_tracking@drrs-2p-primscrn-spr-indfb-draw-render.html
   [305]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@kms_frontbuffer_tracking@drrs-2p-primscrn-spr-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-pri-shrfb-draw-mmap-wc:
    - shard-bmg:          [SKIP][306] ([Intel XE#2311]) -> [SKIP][307] ([Intel XE#2312]) +3 other tests skip
   [306]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-pri-shrfb-draw-mmap-wc.html
   [307]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-pri-shrfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-onoff:
    - shard-bmg:          [SKIP][308] ([Intel XE#5390]) -> [SKIP][309] ([Intel XE#2312]) +1 other test skip
   [308]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-onoff.html
   [309]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-onoff.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-pri-indfb-multidraw:
    - shard-bmg:          [SKIP][310] ([Intel XE#2313]) -> [SKIP][311] ([Intel XE#2312]) +4 other tests skip
   [310]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-4/igt@kms_frontbuffer_tracking@fbcpsr-2p-pri-indfb-multidraw.html
   [311]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-6/igt@kms_frontbuffer_tracking@fbcpsr-2p-pri-indfb-multidraw.html

  * igt@kms_frontbuffer_tracking@psr-2p-scndscrn-pri-indfb-draw-mmap-wc:
    - shard-bmg:          [SKIP][312] ([Intel XE#2312]) -> [SKIP][313] ([Intel XE#2313]) +1 other test skip
   [312]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-6/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-pri-indfb-draw-mmap-wc.html
   [313]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-5/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-pri-indfb-draw-mmap-wc.html

  * igt@kms_hdr@brightness-with-hdr:
    - shard-bmg:          [SKIP][314] ([Intel XE#3544]) -> [SKIP][315] ([Intel XE#3374] / [Intel XE#3544])
   [314]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-bmg-5/igt@kms_hdr@brightness-with-hdr.html
   [315]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-bmg-7/igt@kms_hdr@brightness-with-hdr.html

  * igt@xe_exec_reset@cm-cat-error:
    - shard-adlp:         [DMESG-FAIL][316] ([Intel XE#3868]) -> [DMESG-WARN][317] ([Intel XE#3868])
   [316]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce/shard-adlp-3/igt@xe_exec_reset@cm-cat-error.html
   [317]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/shard-adlp-4/igt@xe_exec_reset@cm-cat-error.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [Intel XE#1124]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1124
  [Intel XE#1129]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1129
  [Intel XE#1178]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1178
  [Intel XE#1392]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1392
  [Intel XE#1401]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1401
  [Intel XE#1406]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1406
  [Intel XE#1421]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1421
  [Intel XE#1489]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1489
  [Intel XE#1745]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1745
  [Intel XE#2234]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2234
  [Intel XE#2291]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2291
  [Intel XE#2293]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2293
  [Intel XE#2311]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2311
  [Intel XE#2312]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2312
  [Intel XE#2313]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2313
  [Intel XE#2314]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2314
  [Intel XE#2316]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2316
  [Intel XE#2322]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2322
  [Intel XE#2327]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2327
  [Intel XE#2380]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2380
  [Intel XE#2457]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2457
  [Intel XE#2652]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2652
  [Intel XE#2705]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2705
  [Intel XE#2850]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2850
  [Intel XE#288]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/288
  [Intel XE#2883]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2883
  [Intel XE#2887]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2887
  [Intel XE#2893]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2893
  [Intel XE#2894]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2894
  [Intel XE#307]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/307
  [Intel XE#316]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/316
  [Intel XE#3278]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3278
  [Intel XE#3374]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3374
  [Intel XE#3544]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3544
  [Intel XE#3573]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3573
  [Intel XE#367]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/367
  [Intel XE#373]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/373
  [Intel XE#378]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/378
  [Intel XE#3868]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3868
  [Intel XE#4212]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4212
  [Intel XE#4227]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4227
  [Intel XE#4345]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4345
  [Intel XE#4351]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4351
  [Intel XE#4518]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4518
  [Intel XE#4543]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4543
  [Intel XE#455]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/455
  [Intel XE#4608]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4608
  [Intel XE#4733]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4733
  [Intel XE#4741]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4741
  [Intel XE#4757]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4757
  [Intel XE#4837]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4837
  [Intel XE#4842]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4842
  [Intel XE#4915]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4915
  [Intel XE#4943]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4943
  [Intel XE#512]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/512
  [Intel XE#5166]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5166
  [Intel XE#5390]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5390
  [Intel XE#5612]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5612
  [Intel XE#5786]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5786
  [Intel XE#5893]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5893
  [Intel XE#6134]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6134
  [Intel XE#6190]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6190
  [Intel XE#6192]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6192
  [Intel XE#651]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/651
  [Intel XE#653]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/653
  [Intel XE#656]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/656
  [Intel XE#787]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/787
  [Intel XE#929]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/929
  [Intel XE#944]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/944


Build changes
-------------

  * Linux: xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce -> xe-pw-154627v3

  IGT_8555: 8555
  xe-3836-e2a896e95ea5f65aa137dcf117bfd0d61176c8ce: e2a896e95ea5f65aa137dcf117bfd0d61176c8ce
  xe-pw-154627v3: 154627v3

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v3/index.html

[-- Attachment #2: Type: text/html, Size: 69863 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init
  2025-09-29  2:55 ` [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init Matthew Brost
@ 2025-09-29  7:42   ` Michal Wajdeczko
  2025-09-29 12:15     ` Matthew Brost
  2025-09-29  8:13   ` Ville Syrjälä
  1 sibling, 1 reply; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-29  7:42 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> From: Tomasz Lis <tomasz.lis@intel.com>
> 
> Protect access to GGTT config as this is non-static information.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 96 ++++++++++++++++++-----
>  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  3 +
>  drivers/gpu/drm/xe/xe_sriov_vf.c          |  6 ++
>  3 files changed, 84 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 0461d5513487..016c867e5e2b 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -440,18 +440,21 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	down_write(&config->lock);
> +

I still didn't get an answer to my earlier question [1]

[1] https://patchwork.freedesktop.org/patch/676375/?series=154627&rev=2#comment_1240924

>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	if (config->ggtt_size && config->ggtt_size != size) {
>  		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
>  				size / SZ_1K, config->ggtt_size / SZ_1K);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  
>  	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
> @@ -460,8 +463,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>  	config->ggtt_shift = start - (s64)config->ggtt_base;
>  	config->ggtt_base = start;
>  	config->ggtt_size = size;
> +	err = config->ggtt_size ? 0 : -ENODATA;
>  
> -	return config->ggtt_size ? 0 : -ENODATA;
> +out:
> +	up_write(&config->lock);
> +	return err;
>  }
>  
>  static int vf_get_lmem_info(struct xe_gt *gt)
> @@ -474,22 +480,28 @@ static int vf_get_lmem_info(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	down_write(&config->lock);
> +

also, the commit message says "Protect access to GGTT config"
while the patch seems to apply locking to the whole config ...

what's the rationale for extending this protection?
just unification?

>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	if (config->lmem_size && config->lmem_size != size) {
>  		xe_gt_sriov_err(gt, "Unexpected LMEM reassignment: %lluM != %lluM\n",
>  				size / SZ_1M, config->lmem_size / SZ_1M);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  
>  	string_get_size(size, 1, STRING_UNITS_2, size_str, sizeof(size_str));
>  	xe_gt_sriov_dbg_verbose(gt, "LMEM %lluM %s\n", size / SZ_1M, size_str);
>  
>  	config->lmem_size = size;
> +	err = config->lmem_size ? 0 : -ENODATA;
>  
> -	return config->lmem_size ? 0 : -ENODATA;
> +out:
> +	up_write(&config->lock);
> +	return err;
>  }
>  
>  static int vf_get_submission_cfg(struct xe_gt *gt)
> @@ -501,23 +513,27 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	down_write(&config->lock);
> +
>  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_DOORBELLS_KEY, &num_dbs);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	if (config->num_ctxs && config->num_ctxs != num_ctxs) {
>  		xe_gt_sriov_err(gt, "Unexpected CTXs reassignment: %u != %u\n",
>  				num_ctxs, config->num_ctxs);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  	if (config->num_dbs && config->num_dbs != num_dbs) {
>  		xe_gt_sriov_err(gt, "Unexpected DBs reassignment: %u != %u\n",
>  				num_dbs, config->num_dbs);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  
>  	xe_gt_sriov_dbg_verbose(gt, "CTXs %u DBs %u\n", num_ctxs, num_dbs);
> @@ -525,7 +541,11 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>  	config->num_ctxs = num_ctxs;
>  	config->num_dbs = num_dbs;
>  
> -	return config->num_ctxs ? 0 : -ENODATA;
> +	err = config->num_ctxs ? 0 : -ENODATA;
> +
> +out:
> +	up_write(&config->lock);
> +	return err;
>  }
>  
>  static void vf_cache_gmdid(struct xe_gt *gt)
> @@ -579,11 +599,18 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
>   */
>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>  {
> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	u16 val;
> +
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
>  
> -	return gt->sriov.vf.self_config.num_ctxs;
> +	down_read(&config->lock);
> +	xe_gt_assert(gt, config->num_ctxs);
> +	val = config->num_ctxs;
> +	up_read(&config->lock);
> +
> +	return val;
>  }
>  
>  /**
> @@ -596,11 +623,18 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>   */
>  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>  {
> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	u64 val;
> +
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
>  
> -	return gt->sriov.vf.self_config.lmem_size;
> +	down_read(&config->lock);
> +	xe_gt_assert(gt, config->lmem_size);
> +	val = config->lmem_size;
> +	up_read(&config->lock);
> +
> +	return val;
>  }
>  
>  /**
> @@ -613,11 +647,17 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>   */
>  u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
>  {
> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	u64 val;
> +
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
>  
> -	return gt->sriov.vf.self_config.ggtt_size;
> +	down_read(&config->lock);
> +	val = config->ggtt_size;
> +	up_read(&config->lock);
> +
> +	return val;
>  }
>  
>  /**
> @@ -630,11 +670,18 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
>   */
>  u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
>  {
> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	u64 val;
> +
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
>  
> -	return gt->sriov.vf.self_config.ggtt_base;
> +	down_read(&config->lock);
> +	xe_gt_assert(gt, config->ggtt_size);
> +	val = config->ggtt_base;
> +	up_read(&config->lock);
> +
> +	return val;
>  }
>  
>  /**
> @@ -648,11 +695,16 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
>  s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
>  {
>  	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	s64 val;
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, xe_gt_is_main_type(gt));
>  
> -	return config->ggtt_shift;
> +	down_read(&config->lock);
> +	val = config->ggtt_shift;
> +	up_read(&config->lock);
> +
> +	return val;
>  }
>  
>  static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
> @@ -1044,6 +1096,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	down_read(&config->lock);
>  	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
>  		   config->ggtt_base,
>  		   config->ggtt_base + config->ggtt_size - 1);
> @@ -1060,6 +1113,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>  
>  	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
>  	drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
> +	up_read(&config->lock);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 298dedf4b009..d95857bd789b 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -6,6 +6,7 @@
>  #ifndef _XE_GT_SRIOV_VF_TYPES_H_
>  #define _XE_GT_SRIOV_VF_TYPES_H_
>  
> +#include <linux/rwsem.h>
>  #include <linux/types.h>
>  #include "xe_uc_fw_types.h"
>  
> @@ -25,6 +26,8 @@ struct xe_gt_sriov_vf_selfconfig {
>  	u16 num_ctxs;
>  	/** @num_dbs: assigned number of GuC doorbells IDs. */
>  	u16 num_dbs;
> +	/** @lock: lock for protecting access to all selfconfig fields. */
> +	struct rw_semaphore lock;
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index cdd9f8e78b2a..d6e2ed9b9bbc 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -197,6 +197,12 @@ static void vf_migration_init_early(struct xe_device *xe)
>   */
>  void xe_sriov_vf_init_early(struct xe_device *xe)
>  {
> +	struct xe_gt *gt;
> +	unsigned int id;
> +
> +	for_each_gt(gt, xe, id)
> +		init_rwsem(&gt->sriov.vf.self_config.lock);

as before, this should be done in

	xe_gt_sriov_vf_init_early
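
e.g., a minimal sketch of that placement, assuming the lock stays per-GT in
self_config (illustration only, not actual series code):

	int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
	{
		/* ... existing early init ... */

		/* init together with the rest of the GT-level VF data */
		init_rwsem(&gt->sriov.vf.self_config.lock);

		return 0;
	}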

> +
>  	vf_migration_init_early(xe);
>  }
>  


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 12/36] drm/xe/vf: Add xe_gt_recovery_inprogress helper
  2025-09-29  2:55 ` [PATCH v3 12/36] drm/xe/vf: Add xe_gt_recovery_inprogress helper Matthew Brost
@ 2025-09-29  8:04   ` Michal Wajdeczko
  2025-09-29  8:52     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-29  8:04 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Add xe_gt_recovery_inprogress helper.
> 
> This helper serves as the singular point to determine whether a GT
> recovery is currently in progress. Expected callers include the GuC CT
> layer and the GuC submission layer. Atomically visible as soon as vCPUs
> are unhalted until VF recovery completes.
> 
> v3:
>  - Add GT layer xe_gt_recovery_inprogress (Michal)
>  - Don't blow up if memirq is not enabled (CI)
>  - Add __memirq_received with clear argument (Michal)
>  - xe_memirq_sw_int_0_irq_pending rename (Michal)
>  - Use offset in xe_memirq_sw_int_0_irq_pending (Michal)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt.h                | 13 ++++++
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 25 ++++++++++++
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
>  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 10 +++++
>  drivers/gpu/drm/xe/xe_memirq.c            | 48 +++++++++++++++++++++--
>  drivers/gpu/drm/xe/xe_memirq.h            |  2 +
>  6 files changed, 96 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> index 41880979f4de..ee0239b2f48c 100644
> --- a/drivers/gpu/drm/xe/xe_gt.h
> +++ b/drivers/gpu/drm/xe/xe_gt.h
> @@ -12,6 +12,7 @@
>  
>  #include "xe_device.h"
>  #include "xe_device_types.h"
> +#include "xe_gt_sriov_vf.h"
>  #include "xe_hw_engine.h"
>  
>  #define for_each_hw_engine(hwe__, gt__, id__) \
> @@ -124,4 +125,16 @@ static inline bool xe_gt_is_usm_hwe(struct xe_gt *gt, struct xe_hw_engine *hwe)
>  		hwe->instance == gt->usm.reserved_bcs_instance;
>  }
>  
> +/**
> + * xe_gt_recovery_inprogress() - GT recovery in progress
> + * @gt: the &xe_gt
> + *
> + * Return: True if GT recovery in progress, False otherwise
> + */
> +static inline bool xe_gt_recovery_inprogress(struct xe_gt *gt)
> +{
> +	return IS_SRIOV_VF(gt_to_xe(gt)) &&
> +		xe_gt_sriov_vf_recovery_inprogress(gt);
> +}
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 016c867e5e2b..71309219a4b7 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -26,6 +26,7 @@
>  #include "xe_guc_hxg_helpers.h"
>  #include "xe_guc_relay.h"
>  #include "xe_lrc.h"
> +#include "xe_memirq.h"
>  #include "xe_mmio.h"
>  #include "xe_sriov.h"
>  #include "xe_sriov_vf.h"
> @@ -828,6 +829,7 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
>  	struct xe_device *xe = gt_to_xe(gt);
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(xe));
> +	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_inprogress(gt));

do we really need this?
with the current code this function will be limited to memirq platforms only

>  
>  	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
>  	/*
> @@ -1172,3 +1174,26 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
>  	drm_printf(p, "\thandshake:\t%u.%u\n",
>  		   pf_version->major, pf_version->minor);
>  }
> +
> +/**
> + * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
> + * @gt: the &xe_gt
> + *
> + * Return: True if VF post migration recovery in progress, False otherwise
> + */
> +bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt)
> +{
> +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
> +
> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> +	/*
> +	 * In practice, VF migration will never be supported on platforms
> +	 * without memirq, avoid CI blowing up on older VF platforms.
> +	 */

maybe instead of closing that door, simply code this as:

	/* early detection until recovery starts */
	if (xe_device_uses_memirq(gt_to_xe(gt)) &&
	    xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc))
		return true;

	return READ_ONCE(gt->sriov.vf.migration.recovery_inprogress);

> +	if (!xe_device_uses_memirq(gt_to_xe(gt)))
> +	       return false;
> +
> +	return (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
> +		READ_ONCE(gt->sriov.vf.migration.recovery_inprogress));
> +}
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index 0af1dc769fe0..bb5f8eace19b 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -25,6 +25,8 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
>  int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
>  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
>  
> +bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
> +
>  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
>  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index d95857bd789b..7b10b8e1e10e 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -49,6 +49,14 @@ struct xe_gt_sriov_vf_runtime {
>  	} *regs;
>  };
>  
> +/**
> + * xe_gt_sriov_vf_migration - VF migration data.
> + */
> +struct xe_gt_sriov_vf_migration {
> +	/** @recovery_inprogress: VF post migration recovery in progress */
> +	bool recovery_inprogress;
> +};
> +
>  /**
>   * struct xe_gt_sriov_vf - GT level VF virtualization data.
>   */
> @@ -61,6 +69,8 @@ struct xe_gt_sriov_vf {
>  	struct xe_gt_sriov_vf_selfconfig self_config;
>  	/** @runtime: runtime data retrieved from the PF. */
>  	struct xe_gt_sriov_vf_runtime runtime;
> +	/** @migration: migration data for the VF. */
> +	struct xe_gt_sriov_vf_migration migration;
>  };
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
> index 49c45ec3e83c..b681c67dcace 100644
> --- a/drivers/gpu/drm/xe/xe_memirq.c
> +++ b/drivers/gpu/drm/xe/xe_memirq.c
> @@ -398,8 +398,9 @@ void xe_memirq_postinstall(struct xe_memirq *memirq)
>  		memirq_set_enable(memirq, true);
>  }
>  
> -static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
> -			    u16 offset, const char *name)
> +static bool __memirq_received(struct xe_memirq *memirq,
> +			      struct iosys_map *vector, u16 offset,
> +			      const char *name, bool clear)
>  {
>  	u8 value;
>  
> @@ -409,12 +410,26 @@ static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
>  			memirq_err_ratelimited(memirq,
>  					       "Unexpected memirq value %#x from %s at %u\n",
>  					       value, name, offset);
> -		iosys_map_wr(vector, offset, u8, 0x00);
> +		if (clear)
> +			iosys_map_wr(vector, offset, u8, 0x00);
>  	}
>  
>  	return value;
>  }
>  
> +static bool memirq_received_noclear(struct xe_memirq *memirq,
> +				    struct iosys_map *vector,
> +				    u16 offset, const char *name)
> +{
> +	return __memirq_received(memirq, vector, offset, name, false);
> +}
> +
> +static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
> +			    u16 offset, const char *name)
> +{
> +	return __memirq_received(memirq, vector, offset, name, true);
> +}
> +
>  static void memirq_dispatch_engine(struct xe_memirq *memirq, struct iosys_map *status,
>  				   struct xe_hw_engine *hwe)
>  {
> @@ -434,8 +449,16 @@ static void memirq_dispatch_guc(struct xe_memirq *memirq, struct iosys_map *stat
>  	if (memirq_received(memirq, status, ilog2(GUC_INTR_GUC2HOST), name))
>  		xe_guc_irq_handler(guc, GUC_INTR_GUC2HOST);
>  
> -	if (memirq_received(memirq, status, ilog2(GUC_INTR_SW_INT_0), name))
> +	/*
> +	 * We must wait to perform the clear operation until after
> +	 * xe_gt_sriov_vf_start_migration_recovery() runs, to avoid race
> +	 * conditions where xe_gt_sriov_vf_recovery_inprogress() returns false.
> +	 */
> +	if (memirq_received_noclear(memirq, status, ilog2(GUC_INTR_SW_INT_0),
> +				    name)) {
>  		xe_guc_irq_handler(guc, GUC_INTR_SW_INT_0);
> +		iosys_map_wr(status, ilog2(GUC_INTR_SW_INT_0), u8, 0x00);
> +	}
>  }
>  
>  /**
> @@ -460,6 +483,23 @@ void xe_memirq_hwe_handler(struct xe_memirq *memirq, struct xe_hw_engine *hwe)
>  	}
>  }
>  
> +/**
> + * xe_memirq_sw_int_0_irq_pending() - SW_INT_0 IRQ is pending
> + * @memirq: the &xe_memirq
> + * @guc: the &xe_guc to check for IRQ
> + *
> + * Return: True if SW_INT_0 IRQ is pending on @guc, False otherwise
> + */
> +bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc)
> +{
> +	struct xe_gt *gt = guc_to_gt(guc);
> +	u32 offset = xe_gt_is_media_type(gt) ? ilog2(INTR_MGUC) : ilog2(INTR_GUC);
> +	struct iosys_map map = IOSYS_MAP_INIT_OFFSET(&memirq->status, offset * SZ_16);
> +
> +	return memirq_received_noclear(memirq, &map, ilog2(GUC_INTR_SW_INT_0),
> +				       guc_name(guc));
> +}
> +
>  /**
>   * xe_memirq_handler - The `Memory Based Interrupts`_ Handler.
>   * @memirq: the &xe_memirq
> diff --git a/drivers/gpu/drm/xe/xe_memirq.h b/drivers/gpu/drm/xe/xe_memirq.h
> index 06130650e9d6..f87e1274b730 100644
> --- a/drivers/gpu/drm/xe/xe_memirq.h
> +++ b/drivers/gpu/drm/xe/xe_memirq.h
> @@ -25,4 +25,6 @@ void xe_memirq_handler(struct xe_memirq *memirq);
>  
>  int xe_memirq_init_guc(struct xe_memirq *memirq, struct xe_guc *guc);
>  
> +bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc);
> +
>  #endif


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init
  2025-09-29  2:55 ` [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init Matthew Brost
  2025-09-29  7:42   ` Michal Wajdeczko
@ 2025-09-29  8:13   ` Ville Syrjälä
  2025-09-30 13:22     ` Lis, Tomasz
  1 sibling, 1 reply; 83+ messages in thread
From: Ville Syrjälä @ 2025-09-29  8:13 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

On Sun, Sep 28, 2025 at 07:55:08PM -0700, Matthew Brost wrote:
> From: Tomasz Lis <tomasz.lis@intel.com>
> 
> Protect access to GGTT config as this is non-static information.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 96 ++++++++++++++++++-----
>  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  3 +
>  drivers/gpu/drm/xe/xe_sriov_vf.c          |  6 ++
>  3 files changed, 84 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 0461d5513487..016c867e5e2b 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -440,18 +440,21 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	down_write(&config->lock);
> +
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	if (config->ggtt_size && config->ggtt_size != size) {
>  		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
>  				size / SZ_1K, config->ggtt_size / SZ_1K);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  
>  	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
> @@ -460,8 +463,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>  	config->ggtt_shift = start - (s64)config->ggtt_base;
>  	config->ggtt_base = start;
>  	config->ggtt_size = size;
> +	err = config->ggtt_size ? 0 : -ENODATA;
>  
> -	return config->ggtt_size ? 0 : -ENODATA;
> +out:
> +	up_write(&config->lock);
> +	return err;
>  }
>  
>  static int vf_get_lmem_info(struct xe_gt *gt)
> @@ -474,22 +480,28 @@ static int vf_get_lmem_info(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	down_write(&config->lock);
> +
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	if (config->lmem_size && config->lmem_size != size) {
>  		xe_gt_sriov_err(gt, "Unexpected LMEM reassignment: %lluM != %lluM\n",
>  				size / SZ_1M, config->lmem_size / SZ_1M);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  
>  	string_get_size(size, 1, STRING_UNITS_2, size_str, sizeof(size_str));
>  	xe_gt_sriov_dbg_verbose(gt, "LMEM %lluM %s\n", size / SZ_1M, size_str);
>  
>  	config->lmem_size = size;
> +	err = config->lmem_size ? 0 : -ENODATA;
>  
> -	return config->lmem_size ? 0 : -ENODATA;
> +out:
> +	up_write(&config->lock);
> +	return err;
>  }
>  
>  static int vf_get_submission_cfg(struct xe_gt *gt)
> @@ -501,23 +513,27 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	down_write(&config->lock);
> +
>  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_DOORBELLS_KEY, &num_dbs);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	if (config->num_ctxs && config->num_ctxs != num_ctxs) {
>  		xe_gt_sriov_err(gt, "Unexpected CTXs reassignment: %u != %u\n",
>  				num_ctxs, config->num_ctxs);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  	if (config->num_dbs && config->num_dbs != num_dbs) {
>  		xe_gt_sriov_err(gt, "Unexpected DBs reassignment: %u != %u\n",
>  				num_dbs, config->num_dbs);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  
>  	xe_gt_sriov_dbg_verbose(gt, "CTXs %u DBs %u\n", num_ctxs, num_dbs);
> @@ -525,7 +541,11 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>  	config->num_ctxs = num_ctxs;
>  	config->num_dbs = num_dbs;
>  
> -	return config->num_ctxs ? 0 : -ENODATA;
> +	err = config->num_ctxs ? 0 : -ENODATA;
> +
> +out:
> +	up_write(&config->lock);
> +	return err;
>  }
>  
>  static void vf_cache_gmdid(struct xe_gt *gt)
> @@ -579,11 +599,18 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
>   */
>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>  {
> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	u16 val;
> +
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
>  
> -	return gt->sriov.vf.self_config.num_ctxs;
> +	down_read(&config->lock);
> +	xe_gt_assert(gt, config->num_ctxs);
> +	val = config->num_ctxs;
> +	up_read(&config->lock);
> +
> +	return val;
>  }
>  
>  /**
> @@ -596,11 +623,18 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>   */
>  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>  {
> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	u64 val;
> +
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
>  
> -	return gt->sriov.vf.self_config.lmem_size;
> +	down_read(&config->lock);
> +	xe_gt_assert(gt, config->lmem_size);
> +	val = config->lmem_size;
> +	up_read(&config->lock);

Why is someone mutating this sort of information at runtime?
Sounds pretty crazy to me.

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 14/36] drm/xe/vf: Abort H2G sends during VF post-migration recovery
  2025-09-29  2:55 ` [PATCH v3 14/36] drm/xe/vf: Abort H2G sends during VF post-migration recovery Matthew Brost
@ 2025-09-29  8:17   ` Michal Wajdeczko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-29  8:17 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> While VF post-migration recovery is in progress, abort H2G sends with
> -ECANCEL. 

-ECANCELED

I'm still not 100% convinced that we should reuse the same error code that
we were using to indicate a more or less final CT state (STOPPED, DISABLED),
as while it might make no difference for the submission/TLB invalidation code,
it might not be the same for other code that uses the CTB (relays),
but I don't have proof now, so

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>


> These messages are treated as lost, and TLB invalidation
> errors are suppressed. During this phase, the H2G channel is down, and
> VF recovery requires the CT lock to proceed.
> 
> v3:
>  - Use xe_gt_recovery_inprogress (Michal)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_ct.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 47079ab9922c..d0fde371fae3 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -851,7 +851,7 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
>  				u32 len, u32 g2h_len, u32 num_g2h,
>  				struct g2h_fence *g2h_fence)
>  {
> -	struct xe_gt *gt __maybe_unused = ct_to_gt(ct);
> +	struct xe_gt *gt = ct_to_gt(ct);
>  	u16 seqno;
>  	int ret;
>  
> @@ -872,7 +872,8 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
>  		goto out;
>  	}
>  
> -	if (ct->state == XE_GUC_CT_STATE_STOPPED) {
> +	if (ct->state == XE_GUC_CT_STATE_STOPPED ||
> +	    xe_gt_recovery_inprogress(gt)) {
>  		ret = -ECANCELED;
>  		goto out;
>  	}


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 16/36] drm/xe/vf: Close multi-GT GGTT shift race
  2025-09-29  2:55 ` [PATCH v3 16/36] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
@ 2025-09-29  8:44   ` Michal Wajdeczko
  2025-09-29 12:31     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-29  8:44 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> As multi-GT VF post-migration recovery can run in parallel on different
> workqueues, but both GTs point to the same GGTT, only one GT needs to
> shift the GGTT. However, both GTs need to know when this step has
> completed. To coordinate this, share the VF config lock among all GTs
> that share a GGTT, and perform the GGTT shift under this lock. With
> shift being done under the lock, storing the shift value becomes
> unnecessary.

maybe a better (and more natural) option would be to move the VF GGTT config
from the GT (xe_gt_sriov_vf_config) to the Tile (xe_tile_sriov_vf_config)?

and protect it there with a single lock, also defined there?

I'm doing similar changes on the PF provisioning side...
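
e.g., something along these lines (only a sketch of the direction; the type
and field names are tentative, not from this series):

	/* xe_tile_sriov_vf_types.h -- tentative */
	struct xe_tile_sriov_vf_selfconfig {
		/** @ggtt_base: assigned base offset of the GGTT region. */
		u64 ggtt_base;
		/** @ggtt_size: assigned size of the GGTT region. */
		u64 ggtt_size;
		/** @lock: protects all selfconfig fields of this tile. */
		struct rw_semaphore lock;
	};

with both GTs of a tile taking tile->sriov.vf.self_config.lock directly,
instead of redirecting to the primary GT.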

> 
> v3:
>  - Update commit message (Tomasz)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 95 +++++++++--------------
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  3 +-
>  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 11 ++-
>  drivers/gpu/drm/xe/xe_guc.c               |  2 +-
>  drivers/gpu/drm/xe/xe_tile_sriov_vf.c     |  6 +-
>  drivers/gpu/drm/xe/xe_tile_sriov_vf.h     |  1 -
>  6 files changed, 51 insertions(+), 67 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 6f15619efe01..ad1d63b5b8d1 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -436,16 +436,19 @@ u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt)
>  	return value;
>  }
>  
> -static int vf_get_ggtt_info(struct xe_gt *gt)
> +static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
>  {
>  	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	struct xe_gt_sriov_vf_selfconfig *primary_config =
> +		&gt_to_tile(gt)->primary_gt->sriov.vf.self_config;
>  	struct xe_guc *guc = &gt->uc.guc;
>  	u64 start, size;
> +	s64 shift;
>  	int err;
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> -	down_write(&config->lock);
> +	down_write(config->lock);
>  
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
>  	if (unlikely(err))
> @@ -465,13 +468,17 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>  	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
>  				start, start + size - 1, size / SZ_1K);
>  
> -	config->ggtt_shift = start - (s64)config->ggtt_base;
> +	shift = start - (s64)primary_config->ggtt_base;
>  	config->ggtt_base = start;
>  	config->ggtt_size = size;
> +	if (recovery)
> +		primary_config->ggtt_base = start;
>  	err = config->ggtt_size ? 0 : -ENODATA;
>  
> +	if (!err && shift && recovery)
> +		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
>  out:
> -	up_write(&config->lock);
> +	up_write(config->lock);
>  	return err;
>  }
>  
> @@ -485,7 +492,7 @@ static int vf_get_lmem_info(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> -	down_write(&config->lock);
> +	down_write(config->lock);
>  
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
>  	if (unlikely(err))
> @@ -505,7 +512,7 @@ static int vf_get_lmem_info(struct xe_gt *gt)
>  	err = config->lmem_size ? 0 : -ENODATA;
>  
>  out:
> -	up_write(&config->lock);
> +	up_write(config->lock);
>  	return err;
>  }
>  
> @@ -518,7 +525,7 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> -	down_write(&config->lock);
> +	down_write(config->lock);
>  
>  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
>  	if (unlikely(err))
> @@ -549,7 +556,7 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>  	err = config->num_ctxs ? 0 : -ENODATA;
>  
>  out:
> -	up_write(&config->lock);
> +	up_write(config->lock);
>  	return err;
>  }
>  
> @@ -564,17 +571,18 @@ static void vf_cache_gmdid(struct xe_gt *gt)
>  /**
>   * xe_gt_sriov_vf_query_config - Query SR-IOV config data over MMIO.
>   * @gt: the &xe_gt
> + * @recovery: VF post migration recovery path
>   *
>   * This function is for VF use only.
>   *
>   * Return: 0 on success or a negative error code on failure.
>   */
> -int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
> +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery)
>  {
>  	struct xe_device *xe = gt_to_xe(gt);
>  	int err;
>  
> -	err = vf_get_ggtt_info(gt);
> +	err = vf_get_ggtt_info(gt, recovery);
>  	if (unlikely(err))
>  		return err;
>  
> @@ -610,10 +618,10 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>  
> -	down_read(&config->lock);
> +	down_read(config->lock);
>  	xe_gt_assert(gt, config->num_ctxs);
>  	val = config->num_ctxs;
> -	up_read(&config->lock);
> +	up_read(config->lock);
>  
>  	return val;
>  }
> @@ -634,10 +642,10 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>  
> -	down_read(&config->lock);
> +	down_read(config->lock);
>  	xe_gt_assert(gt, config->lmem_size);
>  	val = config->lmem_size;
> -	up_read(&config->lock);
> +	up_read(config->lock);
>  
>  	return val;
>  }
> @@ -656,11 +664,9 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
>  	u64 val;
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> +	lockdep_assert_held(config->lock);
>  
> -	down_read(&config->lock);
>  	val = config->ggtt_size;
> -	up_read(&config->lock);
>  
>  	return val;
>  }
> @@ -680,34 +686,10 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -
> -	down_read(&config->lock);
>  	xe_gt_assert(gt, config->ggtt_size);
> -	val = config->ggtt_base;
> -	up_read(&config->lock);
> -
> -	return val;
> -}
> +	lockdep_assert_held(config->lock);
>  
> -/**
> - * xe_gt_sriov_vf_ggtt_shift - Return shift in GGTT range due to VF migration
> - * @gt: the &xe_gt struct instance
> - *
> - * This function is for VF use only.
> - *
> - * Return: The shift value; could be negative
> - */
> -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
> -{
> -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> -	s64 val;
> -
> -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> -	xe_gt_assert(gt, xe_gt_is_main_type(gt));
> -
> -	down_read(&config->lock);
> -	val = config->ggtt_shift;
> -	up_read(&config->lock);
> +	val = config->ggtt_base;
>  
>  	return val;
>  }
> @@ -1115,7 +1097,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> -	down_read(&config->lock);
> +	down_read(config->lock);
>  	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
>  		   config->ggtt_base,
>  		   config->ggtt_base + config->ggtt_size - 1);
> @@ -1123,8 +1105,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>  	string_get_size(config->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
>  	drm_printf(p, "GGTT size:\t%llu (%s)\n", config->ggtt_size, buf);
>  
> -	drm_printf(p, "GGTT shift on last restore:\t%lld\n", config->ggtt_shift);
> -
>  	if (IS_DGFX(xe) && xe_gt_is_main_type(gt)) {
>  		string_get_size(config->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
>  		drm_printf(p, "LMEM size:\t%llu (%s)\n", config->lmem_size, buf);
> @@ -1132,7 +1112,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>  
>  	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
>  	drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
> -	up_read(&config->lock);
> +	up_read(config->lock);
>  }
>  
>  /**
> @@ -1215,21 +1195,16 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
>  static int vf_post_migration_fixups(struct xe_gt *gt)
>  {
>  	void *buf = gt->sriov.vf.migration.scratch;
> -	s64 shift;
>  	int err;
>  
> -	err = xe_gt_sriov_vf_query_config(gt);
> +	err = xe_gt_sriov_vf_query_config(gt, true);
>  	if (err)
>  		return err;
>  
> -	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> -	if (shift) {
> -		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> -		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> -		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> -		if (err)
> -			return err;
> -	}
> +	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> +	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> +	if (err)
> +		return err;
>  
>  	return 0;
>  }
> @@ -1316,6 +1291,7 @@ static void migration_worker_func(struct work_struct *w)
>   */
>  int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>  {
> +	struct xe_tile *tile = gt_to_tile(gt);
>  	void *buf;
>  
>  	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
> @@ -1328,7 +1304,10 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>  		return -ENOMEM;
>  
>  	gt->sriov.vf.migration.scratch = buf;
> -	init_rwsem(&gt->sriov.vf.self_config.lock);
> +	if (xe_gt_is_main_type(gt))
> +		init_rwsem(&gt->sriov.vf.self_config.__lock);
> +	gt->sriov.vf.self_config.lock =
> +		&tile->primary_gt->sriov.vf.self_config.__lock;
>  	spin_lock_init(&gt->sriov.vf.migration.lock);
>  	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
>  
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index 0b0f2a30e67c..ff3a0ce608cd 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -18,7 +18,7 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt);
>  void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
>  				 struct xe_uc_fw_version *wanted,
>  				 struct xe_uc_fw_version *found);
> -int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
> +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery);
>  int xe_gt_sriov_vf_connect(struct xe_gt *gt);
>  int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
>  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
> @@ -31,7 +31,6 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
>  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
>  u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt);
>  u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt);
> -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt);
>  
>  u32 xe_gt_sriov_vf_read32(struct xe_gt *gt, struct xe_reg reg);
>  void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index a63b6004b0b7..6cbf8291a5ab 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -19,16 +19,19 @@ struct xe_gt_sriov_vf_selfconfig {
>  	u64 ggtt_base;
>  	/** @ggtt_size: assigned size of the GGTT region. */
>  	u64 ggtt_size;
> -	/** @ggtt_shift: difference in ggtt_base on last migration */
> -	s64 ggtt_shift;
>  	/** @lmem_size: assigned size of the LMEM. */
>  	u64 lmem_size;
>  	/** @num_ctxs: assigned number of GuC submission context IDs. */
>  	u16 num_ctxs;
>  	/** @num_dbs: assigned number of GuC doorbells IDs. */
>  	u16 num_dbs;
> -	/** @lock: lock for protecting access to all selfconfig fields. */
> -	struct rw_semaphore lock;
> +	/** @__lock: lock for protecting access to all selfconfig fields. */
> +	struct rw_semaphore __lock;
> +	/**
> +	 * @lock: pointer to lock for protecting access to all selfconfig
> +	 * fields, all GTs point to primary GT.
> +	 */
> +	struct rw_semaphore *lock;

this could be placed in tile.sriov.vf
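
A rough sketch of what that could look like (hypothetical; the names
below are made up here to mirror the existing GT-level struct and are
not from this series):

	/* Hypothetical tile-level config, shared by all GTs on a tile */
	struct xe_tile_sriov_vf_selfconfig {
		/** @ggtt_base: assigned base offset of the GGTT region. */
		u64 ggtt_base;
		/** @ggtt_size: assigned size of the GGTT region. */
		u64 ggtt_size;
		/** @lock: protects the fields above for all GTs. */
		struct rw_semaphore lock;
	};

with accessors taking gt_to_tile(gt)->sriov.vf.self_config instead of
chasing tile->primary_gt.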

>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index d5adbbb013ec..c016a11b6ab1 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -713,7 +713,7 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
>  	if (err)
>  		return err;
>  
> -	err = xe_gt_sriov_vf_query_config(gt);
> +	err = xe_gt_sriov_vf_query_config(gt, false);
>  	if (err)
>  		return err;
>  
> diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> index f221dbed16f0..dc6221fc0520 100644
> --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> @@ -40,7 +40,7 @@ static int vf_init_ggtt_balloons(struct xe_tile *tile)
>   *
>   * Return: 0 on success or a negative error code on failure.
>   */
> -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
> +static int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
>  {
>  	u64 ggtt_base = xe_gt_sriov_vf_ggtt_base(tile->primary_gt);
>  	u64 ggtt_size = xe_gt_sriov_vf_ggtt(tile->primary_gt);
> @@ -100,12 +100,16 @@ int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
>  
>  static int vf_balloon_ggtt(struct xe_tile *tile)
>  {
> +	struct xe_gt_sriov_vf_selfconfig *config =
> +		&tile->primary_gt->sriov.vf.self_config;

with GGTT (and its lock) stored at tile level we will not be forced
to look at the primary-gt any more


>  	struct xe_ggtt *ggtt = tile->mem.ggtt;
>  	int err;
>  
> +	down_read(config->lock);
>  	mutex_lock(&ggtt->lock);
>  	err = xe_tile_sriov_vf_balloon_ggtt_locked(tile);
>  	mutex_unlock(&ggtt->lock);
> +	up_read(config->lock);
>  
>  	return err;
>  }
> diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> index 93eb043171e8..4ee68d1fb28e 100644
> --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> @@ -11,7 +11,6 @@
>  struct xe_tile;
>  
>  int xe_tile_sriov_vf_prepare_ggtt(struct xe_tile *tile);
> -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile);
>  void xe_tile_sriov_vf_deballoon_ggtt_locked(struct xe_tile *tile);
>  void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift);
>  


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 12/36] drm/xe/vf: Add xe_gt_recovery_inprogress helper
  2025-09-29  8:04   ` Michal Wajdeczko
@ 2025-09-29  8:52     ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29  8:52 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Sep 29, 2025 at 10:04:51AM +0200, Michal Wajdeczko wrote:
> 
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > Add xe_gt_recovery_inprogress helper.
> > 
> > This helper serves as the singular point to determine whether a GT
> > recovery is currently in progress. Expected callers include the GuC CT
> > layer and the GuC submission layer. Atomically visible as soon as vCPUs
> > are unhalted until VF recovery completes.
> > 
> > v3:
> >  - Add GT layer xe_gt_recovery_inprogress (Michal)
> >  - Don't blow up in memirq not enabled (CI)
> >  - Add __memirq_received with clear argument (Michal)
> >  - xe_memirq_sw_int_0_irq_pending rename (Michal)
> >  - Use offset in xe_memirq_sw_int_0_irq_pending (Michal)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt.h                | 13 ++++++
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 25 ++++++++++++
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 10 +++++
> >  drivers/gpu/drm/xe/xe_memirq.c            | 48 +++++++++++++++++++++--
> >  drivers/gpu/drm/xe/xe_memirq.h            |  2 +
> >  6 files changed, 96 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> > index 41880979f4de..ee0239b2f48c 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.h
> > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > @@ -12,6 +12,7 @@
> >  
> >  #include "xe_device.h"
> >  #include "xe_device_types.h"
> > +#include "xe_gt_sriov_vf.h"
> >  #include "xe_hw_engine.h"
> >  
> >  #define for_each_hw_engine(hwe__, gt__, id__) \
> > @@ -124,4 +125,16 @@ static inline bool xe_gt_is_usm_hwe(struct xe_gt *gt, struct xe_hw_engine *hwe)
> >  		hwe->instance == gt->usm.reserved_bcs_instance;
> >  }
> >  
> > +/**
> > + * xe_gt_recovery_inprogress() - GT recovery in progress
> > + * @gt: the &xe_gt
> > + *
> > + * Return: True if GT recovery in progress, False otherwise
> > + */
> > +static inline bool xe_gt_recovery_inprogress(struct xe_gt *gt)
> > +{
> > +	return IS_SRIOV_VF(gt_to_xe(gt)) &&
> > +		xe_gt_sriov_vf_recovery_inprogress(gt);
> > +}
> > +
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 016c867e5e2b..71309219a4b7 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -26,6 +26,7 @@
> >  #include "xe_guc_hxg_helpers.h"
> >  #include "xe_guc_relay.h"
> >  #include "xe_lrc.h"
> > +#include "xe_memirq.h"
> >  #include "xe_mmio.h"
> >  #include "xe_sriov.h"
> >  #include "xe_sriov_vf.h"
> > @@ -828,6 +829,7 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
> >  	struct xe_device *xe = gt_to_xe(gt);
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(xe));
> > +	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_inprogress(gt));
> 
> do we really need this?
> with current code this function will be limited to memirq platforms only
> 

Yes, this helps prove that what I document in [1] in section 'Waiters during
VF post migration recovery' is true - that is, it is immediately visible
after vCPU unhalt.

[1] https://patchwork.freedesktop.org/patch/677309/?series=154627&rev=3
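
Roughly, the intended ordering (sketched from the patch, not code from
the series):

	/*
	 * PF migrates VF; GuC posts SW_INT_0 in the memirq page
	 * vCPUs unhalt
	 *   -> xe_memirq_sw_int_0_irq_pending() reads true
	 * IRQ handler kicks recovery, recovery_inprogress is set,
	 * and only then is the SW_INT_0 bit cleared
	 *   -> READ_ONCE(migration.recovery_inprogress) reads true
	 * recovery completes, flag cleared
	 *   -> xe_gt_recovery_inprogress() reads false
	 *
	 * One of the two conditions holds at every point after unhalt,
	 * so the helper never reads false mid-recovery.
	 */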

> >  
> >  	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
> >  	/*
> > @@ -1172,3 +1174,26 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
> >  	drm_printf(p, "\thandshake:\t%u.%u\n",
> >  		   pf_version->major, pf_version->minor);
> >  }
> > +
> > +/**
> > + * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
> > + * @gt: the &xe_gt
> > + *
> > + * Return: True if VF post migration recovery in progress, False otherwise
> > + */
> > +bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt)
> > +{
> > +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
> > +
> > +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > +
> > +	/*
> > +	 * In practice, VF migration will never be supported on platforms
> > +	 * without memirq, avoid CI blowing up on older VF platforms.
> > +	 */
> 
> maybe instead of closing that door simply code this as:
> 
> 	/* early detection until recovery starts */
> 	if (xe_device_uses_memirq(gt_to_xe(gt)) &&
> 	    xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc))
> 		return true;
> 
> 	return READ_ONCE(gt->sriov.vf.migration.recovery_inprogress);
> 

Sure. But...

This will likely break the above assert, but VF migration will not
work 100% reliably if that assert fails.

Matt

> > +	if (!xe_device_uses_memirq(gt_to_xe(gt)))
> > +	       return false;
> > +
> > +	return (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
> > +		READ_ONCE(gt->sriov.vf.migration.recovery_inprogress));
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > index 0af1dc769fe0..bb5f8eace19b 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > @@ -25,6 +25,8 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
> >  int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
> >  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
> >  
> > +bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
> > +
> >  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
> >  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
> >  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index d95857bd789b..7b10b8e1e10e 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -49,6 +49,14 @@ struct xe_gt_sriov_vf_runtime {
> >  	} *regs;
> >  };
> >  
> > +/**
> > + * xe_gt_sriov_vf_migration - VF migration data.
> > + */
> > +struct xe_gt_sriov_vf_migration {
> > +	/** @recovery_inprogress: VF post migration recovery in progress */
> > +	bool recovery_inprogress;
> > +};
> > +
> >  /**
> >   * struct xe_gt_sriov_vf - GT level VF virtualization data.
> >   */
> > @@ -61,6 +69,8 @@ struct xe_gt_sriov_vf {
> >  	struct xe_gt_sriov_vf_selfconfig self_config;
> >  	/** @runtime: runtime data retrieved from the PF. */
> >  	struct xe_gt_sriov_vf_runtime runtime;
> > +	/** @migration: migration data for the VF. */
> > +	struct xe_gt_sriov_vf_migration migration;
> >  };
> >  
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
> > index 49c45ec3e83c..b681c67dcace 100644
> > --- a/drivers/gpu/drm/xe/xe_memirq.c
> > +++ b/drivers/gpu/drm/xe/xe_memirq.c
> > @@ -398,8 +398,9 @@ void xe_memirq_postinstall(struct xe_memirq *memirq)
> >  		memirq_set_enable(memirq, true);
> >  }
> >  
> > -static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
> > -			    u16 offset, const char *name)
> > +static bool __memirq_received(struct xe_memirq *memirq,
> > +			      struct iosys_map *vector, u16 offset,
> > +			      const char *name, bool clear)
> >  {
> >  	u8 value;
> >  
> > @@ -409,12 +410,26 @@ static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
> >  			memirq_err_ratelimited(memirq,
> >  					       "Unexpected memirq value %#x from %s at %u\n",
> >  					       value, name, offset);
> > -		iosys_map_wr(vector, offset, u8, 0x00);
> > +		if (clear)
> > +			iosys_map_wr(vector, offset, u8, 0x00);
> >  	}
> >  
> >  	return value;
> >  }
> >  
> > +static bool memirq_received_noclear(struct xe_memirq *memirq,
> > +				    struct iosys_map *vector,
> > +				    u16 offset, const char *name)
> > +{
> > +	return __memirq_received(memirq, vector, offset, name, false);
> > +}
> > +
> > +static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
> > +			    u16 offset, const char *name)
> > +{
> > +	return __memirq_received(memirq, vector, offset, name, true);
> > +}
> > +
> >  static void memirq_dispatch_engine(struct xe_memirq *memirq, struct iosys_map *status,
> >  				   struct xe_hw_engine *hwe)
> >  {
> > @@ -434,8 +449,16 @@ static void memirq_dispatch_guc(struct xe_memirq *memirq, struct iosys_map *stat
> >  	if (memirq_received(memirq, status, ilog2(GUC_INTR_GUC2HOST), name))
> >  		xe_guc_irq_handler(guc, GUC_INTR_GUC2HOST);
> >  
> > -	if (memirq_received(memirq, status, ilog2(GUC_INTR_SW_INT_0), name))
> > +	/*
> > +	 * We must wait to perform the clear operation until after
> > +	 * xe_gt_sriov_vf_start_migration_recovery() runs, to avoid race
> > +	 * conditions where xe_gt_sriov_vf_recovery_inprogress() returns false.
> > +	 */
> > +	if (memirq_received_noclear(memirq, status, ilog2(GUC_INTR_SW_INT_0),
> > +				    name)) {
> >  		xe_guc_irq_handler(guc, GUC_INTR_SW_INT_0);
> > +		iosys_map_wr(status, ilog2(GUC_INTR_SW_INT_0), u8, 0x00);
> > +	}
> >  }
> >  
> >  /**
> > @@ -460,6 +483,23 @@ void xe_memirq_hwe_handler(struct xe_memirq *memirq, struct xe_hw_engine *hwe)
> >  	}
> >  }
> >  
> > +/**
> > + * xe_memirq_sw_int_0_irq_pending() - SW_INT_0 IRQ is pending
> > + * @memirq: the &xe_memirq
> > + * @guc: the &xe_guc to check for IRQ
> > + *
> > + * Return: True if SW_INT_0 IRQ is pending on @guc, False otherwise
> > + */
> > +bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq,struct xe_guc *guc)
> > +{
> > +	struct xe_gt *gt = guc_to_gt(guc);
> > +	u32 offset = xe_gt_is_media_type(gt) ? ilog2(INTR_MGUC) : ilog2(INTR_GUC);
> > +	struct iosys_map map = IOSYS_MAP_INIT_OFFSET(&memirq->status, offset * SZ_16);
> > +
> > +	return memirq_received_noclear(memirq, &map, ilog2(GUC_INTR_SW_INT_0),
> > +				       guc_name(guc));
> > +}
> > +
> >  /**
> >   * xe_memirq_handler - The `Memory Based Interrupts`_ Handler.
> >   * @memirq: the &xe_memirq
> > diff --git a/drivers/gpu/drm/xe/xe_memirq.h b/drivers/gpu/drm/xe/xe_memirq.h
> > index 06130650e9d6..f87e1274b730 100644
> > --- a/drivers/gpu/drm/xe/xe_memirq.h
> > +++ b/drivers/gpu/drm/xe/xe_memirq.h
> > @@ -25,4 +25,6 @@ void xe_memirq_handler(struct xe_memirq *memirq);
> >  
> >  int xe_memirq_init_guc(struct xe_memirq *memirq, struct xe_guc *guc);
> >  
> > +bool xe_memirq_sw_int_0_irq_pending(struct xe_memirq *memirq,struct xe_guc *guc);
> > +
> >  #endif
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 18/36] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery
  2025-09-29  2:55 ` [PATCH v3 18/36] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery Matthew Brost
@ 2025-09-29  9:17   ` Michal Wajdeczko
  2025-09-29 12:50     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-29  9:17 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> With well-behaved software, a GT reset should never occur, nor should it
> happen during VF post-migration recovery. If it does, trigger a warning

hmm, I'm not sure that GT-resets depend only on the SW side, nor that
reasons for one couldn't happen just before the VF was migrated

> but suppress the GT reset, as VF post-migration recovery is expected to
> bring the VF back to a working state.

can't we just say this last sentence, that "there is no need to run an
explicit VF-reset sequence during recovery as VF-recovery is equivalent
and is also expected to bring the VF back to a working state"?

also, since the patch is a refactor, it should mention that "instead of
blocking resets, just rely on the recovery" 
> 
> v3:
>  - Better commit message (Tomasz)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt.c          |  9 -------
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 -----
>  drivers/gpu/drm/xe/xe_guc_submit.c  | 41 +++--------------------------
>  drivers/gpu/drm/xe/xe_guc_submit.h  |  3 ---
>  4 files changed, 4 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 82be38c99205..5f04d562604b 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -815,11 +815,6 @@ static int do_gt_restart(struct xe_gt *gt)
>  	return 0;
>  }
>  
> -static int gt_wait_reset_unblock(struct xe_gt *gt)
> -{
> -	return xe_guc_wait_reset_unblock(&gt->uc.guc);
> -}
> -
>  static int gt_reset(struct xe_gt *gt)
>  {
>  	unsigned int fw_ref;
> @@ -834,10 +829,6 @@ static int gt_reset(struct xe_gt *gt)
>  
>  	xe_gt_info(gt, "reset started\n");
>  
> -	err = gt_wait_reset_unblock(gt);
> -	if (!err)
> -		xe_gt_warn(gt, "reset block failed to get lifted");
> -
>  	xe_pm_runtime_get(gt_to_xe(gt));
>  
>  	if (xe_fault_inject_gt_reset()) {
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index cc5af19c1911..b16e8fd271f8 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -1175,17 +1175,11 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
>  
>  static void vf_post_migration_shutdown(struct xe_gt *gt)
>  {
> -	int ret = 0;
> -
>  	spin_lock_irq(&gt->sriov.vf.migration.lock);
>  	gt->sriov.vf.migration.recovery_queued = false;
>  	spin_unlock_irq(&gt->sriov.vf.migration.lock);
>  
>  	xe_guc_submit_pause(&gt->uc.guc);
> -	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
> -
> -	if (ret)
> -		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
>  }
>  
>  static size_t post_migration_scratch_size(struct xe_device *xe)
> @@ -1219,7 +1213,6 @@ static void vf_post_migration_kickstart(struct xe_gt *gt)
>  	 */
>  	xe_irq_resume(gt_to_xe(gt));
>  
> -	xe_guc_submit_reset_unblock(&gt->uc.guc);
>  	xe_guc_submit_unpause(&gt->uc.guc);
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index cd5e506527fe..b82976f031e5 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -27,6 +27,7 @@
>  #include "xe_gt.h"
>  #include "xe_gt_clock.h"
>  #include "xe_gt_printk.h"
> +#include "xe_gt_sriov_vf.h"
>  #include "xe_guc.h"
>  #include "xe_guc_capture.h"
>  #include "xe_guc_ct.h"
> @@ -2182,47 +2183,13 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
>  	}
>  }
>  
> -/**
> - * xe_guc_submit_reset_block - Disallow reset calls on given GuC.
> - * @guc: the &xe_guc struct instance
> - */
> -int xe_guc_submit_reset_block(struct xe_guc *guc)
> -{
> -	return atomic_fetch_or(1, &guc->submission_state.reset_blocked);
> -}
> -
> -/**
> - * xe_guc_submit_reset_unblock - Allow back reset calls on given GuC.
> - * @guc: the &xe_guc struct instance
> - */
> -void xe_guc_submit_reset_unblock(struct xe_guc *guc)
> -{
> -	atomic_set_release(&guc->submission_state.reset_blocked, 0);
> -	wake_up_all(&guc->ct.wq);
> -}
> -
> -static int guc_submit_reset_is_blocked(struct xe_guc *guc)
> -{
> -	return atomic_read_acquire(&guc->submission_state.reset_blocked);
> -}
> -
> -/* Maximum time of blocking reset */
> -#define RESET_BLOCK_PERIOD_MAX (HZ * 5)
> -
> -/**
> - * xe_guc_wait_reset_unblock - Wait until reset blocking flag is lifted, or timeout.
> - * @guc: the &xe_guc struct instance
> - */
> -int xe_guc_wait_reset_unblock(struct xe_guc *guc)
> -{
> -	return wait_event_timeout(guc->ct.wq,
> -				  !guc_submit_reset_is_blocked(guc), RESET_BLOCK_PERIOD_MAX);
> -}
> -
>  int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>  {
>  	int ret;
>  
> +	if (WARN_ON_ONCE(xe_gt_sriov_vf_recovery_inprogress(guc_to_gt(guc))))

we have a gt-oriented variant:

	xe_gt_WARN_ON

and likely we should not use the _ONCE variant as resets could happen a few
times in the VF lifetime

and since we are sure that "recovery" has the same power as "reset",
does it really need to be a full WARN? maybe just "info" or "notice"?

also if we want to skip real GT resets, shouldn't we have this check
higher in the reset stack, at the GT level, not just under gt.uc.guc.submit?

> +		return 0;
> +
>  	if (!guc->submission_state.initialized)
>  		return 0;
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> index 5b4a0a6fd818..f535fe3895e5 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> @@ -22,9 +22,6 @@ void xe_guc_submit_stop(struct xe_guc *guc);
>  int xe_guc_submit_start(struct xe_guc *guc);
>  void xe_guc_submit_pause(struct xe_guc *guc);
>  void xe_guc_submit_unpause(struct xe_guc *guc);
> -int xe_guc_submit_reset_block(struct xe_guc *guc);
> -void xe_guc_submit_reset_unblock(struct xe_guc *guc);
> -int xe_guc_wait_reset_unblock(struct xe_guc *guc);
>  void xe_guc_submit_wedge(struct xe_guc *guc);
>  
>  int xe_guc_read_stopped(struct xe_guc *guc);


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init
  2025-09-29  7:42   ` Michal Wajdeczko
@ 2025-09-29 12:15     ` Matthew Brost
  2025-09-30  0:42       ` Lis, Tomasz
  0 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-29 12:15 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Sep 29, 2025 at 09:42:55AM +0200, Michal Wajdeczko wrote:
> 
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > From: Tomasz Lis <tomasz.lis@intel.com>
> > 
> > Protect access to GGTT config as this is non-static information.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 96 ++++++++++++++++++-----
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  3 +
> >  drivers/gpu/drm/xe/xe_sriov_vf.c          |  6 ++
> >  3 files changed, 84 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 0461d5513487..016c867e5e2b 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -440,18 +440,21 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > +	down_write(&config->lock);
> > +
> 
> still didn't get an answer to my earlier question [1]
> 
> [1] https://patchwork.freedesktop.org/patch/676375/?series=154627&rev=2#comment_1240924
> 

Again, this isn't my patch, so I believe Tomasz will need to chime in to
provide an answer or conclusion here.

> >  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
> >  	if (unlikely(err))
> > -		return err;
> > +		goto out;
> >  
> >  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
> >  	if (unlikely(err))
> > -		return err;
> > +		goto out;
> >  
> >  	if (config->ggtt_size && config->ggtt_size != size) {
> >  		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
> >  				size / SZ_1K, config->ggtt_size / SZ_1K);
> > -		return -EREMCHG;
> > +		err = -EREMCHG;
> > +		goto out;
> >  	}
> >  
> >  	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
> > @@ -460,8 +463,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
> >  	config->ggtt_shift = start - (s64)config->ggtt_base;
> >  	config->ggtt_base = start;
> >  	config->ggtt_size = size;
> > +	err = config->ggtt_size ? 0 : -ENODATA;
> >  
> > -	return config->ggtt_size ? 0 : -ENODATA;
> > +out:
> > +	up_write(&config->lock);
> > +	return err;
> >  }
> >  
> >  static int vf_get_lmem_info(struct xe_gt *gt)
> > @@ -474,22 +480,28 @@ static int vf_get_lmem_info(struct xe_gt *gt)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > +	down_write(&config->lock);
> > +
> 
> also, commit message says "Protect access to GGTT config "
> while the patch seems to apply locking to the whole config ...
> 
> what's the rationale to extend this protection?
> just unification?
> 
> >  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
> >  	if (unlikely(err))
> > -		return err;
> > +		goto out;
> >  
> >  	if (config->lmem_size && config->lmem_size != size) {
> >  		xe_gt_sriov_err(gt, "Unexpected LMEM reassignment: %lluM != %lluM\n",
> >  				size / SZ_1M, config->lmem_size / SZ_1M);
> > -		return -EREMCHG;
> > +		err = -EREMCHG;
> > +		goto out;
> >  	}
> >  
> >  	string_get_size(size, 1, STRING_UNITS_2, size_str, sizeof(size_str));
> >  	xe_gt_sriov_dbg_verbose(gt, "LMEM %lluM %s\n", size / SZ_1M, size_str);
> >  
> >  	config->lmem_size = size;
> > +	err = config->lmem_size ? 0 : -ENODATA;
> >  
> > -	return config->lmem_size ? 0 : -ENODATA;
> > +out:
> > +	up_write(&config->lock);
> > +	return err;
> >  }
> >  
> >  static int vf_get_submission_cfg(struct xe_gt *gt)
> > @@ -501,23 +513,27 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > +	down_write(&config->lock);
> > +
> >  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
> >  	if (unlikely(err))
> > -		return err;
> > +		goto out;
> >  
> >  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_DOORBELLS_KEY, &num_dbs);
> >  	if (unlikely(err))
> > -		return err;
> > +		goto out;
> >  
> >  	if (config->num_ctxs && config->num_ctxs != num_ctxs) {
> >  		xe_gt_sriov_err(gt, "Unexpected CTXs reassignment: %u != %u\n",
> >  				num_ctxs, config->num_ctxs);
> > -		return -EREMCHG;
> > +		err = -EREMCHG;
> > +		goto out;
> >  	}
> >  	if (config->num_dbs && config->num_dbs != num_dbs) {
> >  		xe_gt_sriov_err(gt, "Unexpected DBs reassignment: %u != %u\n",
> >  				num_dbs, config->num_dbs);
> > -		return -EREMCHG;
> > +		err = -EREMCHG;
> > +		goto out;
> >  	}
> >  
> >  	xe_gt_sriov_dbg_verbose(gt, "CTXs %u DBs %u\n", num_ctxs, num_dbs);
> > @@ -525,7 +541,11 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
> >  	config->num_ctxs = num_ctxs;
> >  	config->num_dbs = num_dbs;
> >  
> > -	return config->num_ctxs ? 0 : -ENODATA;
> > +	err = config->num_ctxs ? 0 : -ENODATA;
> > +
> > +out:
> > +	up_write(&config->lock);
> > +	return err;
> >  }
> >  
> >  static void vf_cache_gmdid(struct xe_gt *gt)
> > @@ -579,11 +599,18 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
> >   */
> >  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
> >  {
> > +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	u16 val;
> > +
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
> >  
> > -	return gt->sriov.vf.self_config.num_ctxs;
> > +	down_read(&config->lock);
> > +	xe_gt_assert(gt, config->num_ctxs);
> > +	val = config->num_ctxs;
> > +	up_read(&config->lock);
> > +
> > +	return val;
> >  }
> >  
> >  /**
> > @@ -596,11 +623,18 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
> >   */
> >  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
> >  {
> > +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	u64 val;
> > +
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
> >  
> > -	return gt->sriov.vf.self_config.lmem_size;
> > +	down_read(&config->lock);
> > +	xe_gt_assert(gt, config->lmem_size);
> > +	val = config->lmem_size;
> > +	up_read(&config->lock);
> > +
> > +	return val;
> >  }
> >  
> >  /**
> > @@ -613,11 +647,17 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
> >   */
> >  u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
> >  {
> > +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	u64 val;
> > +
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
> >  
> > -	return gt->sriov.vf.self_config.ggtt_size;
> > +	down_read(&config->lock);
> > +	val = config->ggtt_size;
> > +	up_read(&config->lock);
> > +
> > +	return val;
> >  }
> >  
> >  /**
> > @@ -630,11 +670,18 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
> >   */
> >  u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
> >  {
> > +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	u64 val;
> > +
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
> >  
> > -	return gt->sriov.vf.self_config.ggtt_base;
> > +	down_read(&config->lock);
> > +	xe_gt_assert(gt, config->ggtt_size);
> > +	val = config->ggtt_base;
> > +	up_read(&config->lock);
> > +
> > +	return val;
> >  }
> >  
> >  /**
> > @@ -648,11 +695,16 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
> >  s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
> >  {
> >  	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	s64 val;
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, xe_gt_is_main_type(gt));
> >  
> > -	return config->ggtt_shift;
> > +	down_read(&config->lock);
> > +	val = config->ggtt_shift;
> > +	up_read(&config->lock);
> > +
> > +	return val;
> >  }
> >  
> >  static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
> > @@ -1044,6 +1096,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > +	down_read(&config->lock);
> >  	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
> >  		   config->ggtt_base,
> >  		   config->ggtt_base + config->ggtt_size - 1);
> > @@ -1060,6 +1113,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
> >  
> >  	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
> >  	drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
> > +	up_read(&config->lock);
> >  }
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index 298dedf4b009..d95857bd789b 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -6,6 +6,7 @@
> >  #ifndef _XE_GT_SRIOV_VF_TYPES_H_
> >  #define _XE_GT_SRIOV_VF_TYPES_H_
> >  
> > +#include <linux/rwsem.h>
> >  #include <linux/types.h>
> >  #include "xe_uc_fw_types.h"
> >  
> > @@ -25,6 +26,8 @@ struct xe_gt_sriov_vf_selfconfig {
> >  	u16 num_ctxs;
> >  	/** @num_dbs: assigned number of GuC doorbells IDs. */
> >  	u16 num_dbs;
> > +	/** @lock: lock for protecting access to all selfconfig fields. */
> > +	struct rw_semaphore lock;
> >  };
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > index cdd9f8e78b2a..d6e2ed9b9bbc 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > @@ -197,6 +197,12 @@ static void vf_migration_init_early(struct xe_device *xe)
> >   */
> >  void xe_sriov_vf_init_early(struct xe_device *xe)
> >  {
> > +	struct xe_gt *gt;
> > +	unsigned int id;
> > +
> > +	for_each_gt(gt, xe, id)
> > +		init_rwsem(&gt->sriov.vf.self_config.lock);
> 
> as before, this should be done in
> 
> 	xe_gt_sriov_vf_init_early
> 

I'll pick up this change, but again I think Michal needs some answers to
his questions.

Matt

> > +
> >  	vf_migration_init_early(xe);
> >  }
> >  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 16/36] drm/xe/vf: Close multi-GT GGTT shift race
  2025-09-29  8:44   ` Michal Wajdeczko
@ 2025-09-29 12:31     ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29 12:31 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Sep 29, 2025 at 10:44:07AM +0200, Michal Wajdeczko wrote:
> 
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > As multi-GT VF post-migration recovery can run in parallel on different
> > workqueues, but both GTs point to the same GGTT, only one GT needs to
> > shift the GGTT. However, both GTs need to know when this step has
> > completed. To coordinate this, share the VF config lock among all GTs
> > that share a GGTT, and perform the GGTT shift under this lock. With
> > shift being done under the lock, storing the shift value becomes
> > unnecessary.
> 
> maybe better (and more natural) option would be to move VF GGTT config
> from GT (xe_gt_sriov_vf_config) to Tile (xe_tile_sriov_vf_config) ? 
> 
> and protect it there with single lock also defined there ?
> 
> I'm doing similar changes on the PF provisioning side...
> 

Yes, that is a better place. I was just being a bit lazy, but I can fix
this one way or another in the next revision.

I think we need to answer why the config lock [1] needs to protect more
than just the GGTT. If only the GGTT needs protection, we might be able
to move the entire GGTT config lookup + shift under the GGTT lock. That
might be considered a bit of a layering violation, but if we could
export a well-defined GGTT lock helper, it could help mitigate any
layering concerns.

[1] https://patchwork.freedesktop.org/patch/677296/?series=154627&rev=3
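
For the sake of discussion, the exported helper could be as small as
this (hypothetical sketch, not part of this series):

	/* Hypothetical: take/release the GGTT lock without reaching
	 * into struct xe_ggtt internals, keeping the layering clean.
	 */
	void xe_ggtt_lock(struct xe_ggtt *ggtt)
	{
		mutex_lock(&ggtt->lock);
	}

	void xe_ggtt_unlock(struct xe_ggtt *ggtt)
	{
		mutex_unlock(&ggtt->lock);
	}

which would let the GGTT config lookup + shift serialize on the GGTT
itself rather than on the config lock.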

> > 
> > v3:
> >  - Update commit message (Tomasz)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 95 +++++++++--------------
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  3 +-
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 11 ++-
> >  drivers/gpu/drm/xe/xe_guc.c               |  2 +-
> >  drivers/gpu/drm/xe/xe_tile_sriov_vf.c     |  6 +-
> >  drivers/gpu/drm/xe/xe_tile_sriov_vf.h     |  1 -
> >  6 files changed, 51 insertions(+), 67 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 6f15619efe01..ad1d63b5b8d1 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -436,16 +436,19 @@ u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt)
> >  	return value;
> >  }
> >  
> > -static int vf_get_ggtt_info(struct xe_gt *gt)
> > +static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
> >  {
> >  	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	struct xe_gt_sriov_vf_selfconfig *primary_config =
> > +		&gt_to_tile(gt)->primary_gt->sriov.vf.self_config;
> >  	struct xe_guc *guc = &gt->uc.guc;
> >  	u64 start, size;
> > +	s64 shift;
> >  	int err;
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > -	down_write(&config->lock);
> > +	down_write(config->lock);
> >  
> >  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
> >  	if (unlikely(err))
> > @@ -465,13 +468,17 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
> >  	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
> >  				start, start + size - 1, size / SZ_1K);
> >  
> > -	config->ggtt_shift = start - (s64)config->ggtt_base;
> > +	shift = start - (s64)primary_config->ggtt_base;
> >  	config->ggtt_base = start;
> >  	config->ggtt_size = size;
> > +	if (recovery)
> > +		primary_config->ggtt_base = start;
> >  	err = config->ggtt_size ? 0 : -ENODATA;
> >  
> > +	if (!err && shift && recovery)
> > +		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> >  out:
> > -	up_write(&config->lock);
> > +	up_write(config->lock);
> >  	return err;
> >  }
> >  
> > @@ -485,7 +492,7 @@ static int vf_get_lmem_info(struct xe_gt *gt)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > -	down_write(&config->lock);
> > +	down_write(config->lock);
> >  
> >  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
> >  	if (unlikely(err))
> > @@ -505,7 +512,7 @@ static int vf_get_lmem_info(struct xe_gt *gt)
> >  	err = config->lmem_size ? 0 : -ENODATA;
> >  
> >  out:
> > -	up_write(&config->lock);
> > +	up_write(config->lock);
> >  	return err;
> >  }
> >  
> > @@ -518,7 +525,7 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > -	down_write(&config->lock);
> > +	down_write(config->lock);
> >  
> >  	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
> >  	if (unlikely(err))
> > @@ -549,7 +556,7 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
> >  	err = config->num_ctxs ? 0 : -ENODATA;
> >  
> >  out:
> > -	up_write(&config->lock);
> > +	up_write(config->lock);
> >  	return err;
> >  }
> >  
> > @@ -564,17 +571,18 @@ static void vf_cache_gmdid(struct xe_gt *gt)
> >  /**
> >   * xe_gt_sriov_vf_query_config - Query SR-IOV config data over MMIO.
> >   * @gt: the &xe_gt
> > + * @recovery: VF post migration recovery path
> >   *
> >   * This function is for VF use only.
> >   *
> >   * Return: 0 on success or a negative error code on failure.
> >   */
> > -int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
> > +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery)
> >  {
> >  	struct xe_device *xe = gt_to_xe(gt);
> >  	int err;
> >  
> > -	err = vf_get_ggtt_info(gt);
> > +	err = vf_get_ggtt_info(gt, recovery);
> >  	if (unlikely(err))
> >  		return err;
> >  
> > @@ -610,10 +618,10 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> >  
> > -	down_read(&config->lock);
> > +	down_read(config->lock);
> >  	xe_gt_assert(gt, config->num_ctxs);
> >  	val = config->num_ctxs;
> > -	up_read(&config->lock);
> > +	up_read(config->lock);
> >  
> >  	return val;
> >  }
> > @@ -634,10 +642,10 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> >  
> > -	down_read(&config->lock);
> > +	down_read(config->lock);
> >  	xe_gt_assert(gt, config->lmem_size);
> >  	val = config->lmem_size;
> > -	up_read(&config->lock);
> > +	up_read(config->lock);
> >  
> >  	return val;
> >  }
> > @@ -656,11 +664,9 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
> >  	u64 val;
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > +	lockdep_assert_held(config->lock);
> >  
> > -	down_read(&config->lock);
> >  	val = config->ggtt_size;
> > -	up_read(&config->lock);
> >  
> >  	return val;
> >  }
> > @@ -680,34 +686,10 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -
> > -	down_read(&config->lock);
> >  	xe_gt_assert(gt, config->ggtt_size);
> > -	val = config->ggtt_base;
> > -	up_read(&config->lock);
> > -
> > -	return val;
> > -}
> > +	lockdep_assert_held(config->lock);
> >  
> > -/**
> > - * xe_gt_sriov_vf_ggtt_shift - Return shift in GGTT range due to VF migration
> > - * @gt: the &xe_gt struct instance
> > - *
> > - * This function is for VF use only.
> > - *
> > - * Return: The shift value; could be negative
> > - */
> > -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
> > -{
> > -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > -	s64 val;
> > -
> > -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > -	xe_gt_assert(gt, xe_gt_is_main_type(gt));
> > -
> > -	down_read(&config->lock);
> > -	val = config->ggtt_shift;
> > -	up_read(&config->lock);
> > +	val = config->ggtt_base;
> >  
> >  	return val;
> >  }
> > @@ -1115,7 +1097,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > -	down_read(&config->lock);
> > +	down_read(config->lock);
> >  	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
> >  		   config->ggtt_base,
> >  		   config->ggtt_base + config->ggtt_size - 1);
> > @@ -1123,8 +1105,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
> >  	string_get_size(config->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> >  	drm_printf(p, "GGTT size:\t%llu (%s)\n", config->ggtt_size, buf);
> >  
> > -	drm_printf(p, "GGTT shift on last restore:\t%lld\n", config->ggtt_shift);
> > -
> >  	if (IS_DGFX(xe) && xe_gt_is_main_type(gt)) {
> >  		string_get_size(config->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> >  		drm_printf(p, "LMEM size:\t%llu (%s)\n", config->lmem_size, buf);
> > @@ -1132,7 +1112,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
> >  
> >  	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
> >  	drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
> > -	up_read(&config->lock);
> > +	up_read(config->lock);
> >  }
> >  
> >  /**
> > @@ -1215,21 +1195,16 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
> >  static int vf_post_migration_fixups(struct xe_gt *gt)
> >  {
> >  	void *buf = gt->sriov.vf.migration.scratch;
> > -	s64 shift;
> >  	int err;
> >  
> > -	err = xe_gt_sriov_vf_query_config(gt);
> > +	err = xe_gt_sriov_vf_query_config(gt, true);
> >  	if (err)
> >  		return err;
> >  
> > -	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> > -	if (shift) {
> > -		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> > -		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> > -		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> > -		if (err)
> > -			return err;
> > -	}
> > +	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> > +	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> > +	if (err)
> > +		return err;
> >  
> >  	return 0;
> >  }
> > @@ -1316,6 +1291,7 @@ static void migration_worker_func(struct work_struct *w)
> >   */
> >  int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
> >  {
> > +	struct xe_tile *tile = gt_to_tile(gt);
> >  	void *buf;
> >  
> >  	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
> > @@ -1328,7 +1304,10 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
> >  		return -ENOMEM;
> >  
> >  	gt->sriov.vf.migration.scratch = buf;
> > -	init_rwsem(&gt->sriov.vf.self_config.lock);
> > +	if (xe_gt_is_main_type(gt))
> > +		init_rwsem(&gt->sriov.vf.self_config.__lock);
> > +	gt->sriov.vf.self_config.lock =
> > +		&tile->primary_gt->sriov.vf.self_config.__lock;
> >  	spin_lock_init(&gt->sriov.vf.migration.lock);
> >  	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > index 0b0f2a30e67c..ff3a0ce608cd 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > @@ -18,7 +18,7 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt);
> >  void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> >  				 struct xe_uc_fw_version *wanted,
> >  				 struct xe_uc_fw_version *found);
> > -int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
> > +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery);
> >  int xe_gt_sriov_vf_connect(struct xe_gt *gt);
> >  int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
> >  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
> > @@ -31,7 +31,6 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
> >  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
> >  u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt);
> >  u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt);
> > -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt);
> >  
> >  u32 xe_gt_sriov_vf_read32(struct xe_gt *gt, struct xe_reg reg);
> >  void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val);
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index a63b6004b0b7..6cbf8291a5ab 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -19,16 +19,19 @@ struct xe_gt_sriov_vf_selfconfig {
> >  	u64 ggtt_base;
> >  	/** @ggtt_size: assigned size of the GGTT region. */
> >  	u64 ggtt_size;
> > -	/** @ggtt_shift: difference in ggtt_base on last migration */
> > -	s64 ggtt_shift;
> >  	/** @lmem_size: assigned size of the LMEM. */
> >  	u64 lmem_size;
> >  	/** @num_ctxs: assigned number of GuC submission context IDs. */
> >  	u16 num_ctxs;
> >  	/** @num_dbs: assigned number of GuC doorbells IDs. */
> >  	u16 num_dbs;
> > -	/** @lock: lock for protecting access to all selfconfig fields. */
> > -	struct rw_semaphore lock;
> > +	/** @__lock: lock for protecting access to all selfconfig fields. */
> > +	struct rw_semaphore __lock;
> > +	/**
> > +	 * @lock: pointer to lock for protecting access to all selfconfig
> > +	 * fields, all GTs point to primary GT.
> > +	 */
> > +	struct rw_semaphore *lock;
> 
> this could be placed in tile.sriov.vf
> 
> >  };
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > index d5adbbb013ec..c016a11b6ab1 100644
> > --- a/drivers/gpu/drm/xe/xe_guc.c
> > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > @@ -713,7 +713,7 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
> >  	if (err)
> >  		return err;
> >  
> > -	err = xe_gt_sriov_vf_query_config(gt);
> > +	err = xe_gt_sriov_vf_query_config(gt, false);
> >  	if (err)
> >  		return err;
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> > index f221dbed16f0..dc6221fc0520 100644
> > --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> > @@ -40,7 +40,7 @@ static int vf_init_ggtt_balloons(struct xe_tile *tile)
> >   *
> >   * Return: 0 on success or a negative error code on failure.
> >   */
> > -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
> > +static int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
> >  {
> >  	u64 ggtt_base = xe_gt_sriov_vf_ggtt_base(tile->primary_gt);
> >  	u64 ggtt_size = xe_gt_sriov_vf_ggtt(tile->primary_gt);
> > @@ -100,12 +100,16 @@ int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
> >  
> >  static int vf_balloon_ggtt(struct xe_tile *tile)
> >  {
> > +	struct xe_gt_sriov_vf_selfconfig *config =
> > +		&tile->primary_gt->sriov.vf.self_config;
> 
> with GGTT (and its lock) stored at tile level we will not be forced
> to look at the primary-gt any more
> 

See above for my response to both subsequent replies.

Matt

> 
> >  	struct xe_ggtt *ggtt = tile->mem.ggtt;
> >  	int err;
> >  
> > +	down_read(config->lock);
> >  	mutex_lock(&ggtt->lock);
> >  	err = xe_tile_sriov_vf_balloon_ggtt_locked(tile);
> >  	mutex_unlock(&ggtt->lock);
> > +	up_read(config->lock);
> >  
> >  	return err;
> >  }
> > diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> > index 93eb043171e8..4ee68d1fb28e 100644
> > --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> > @@ -11,7 +11,6 @@
> >  struct xe_tile;
> >  
> >  int xe_tile_sriov_vf_prepare_ggtt(struct xe_tile *tile);
> > -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile);
> >  void xe_tile_sriov_vf_deballoon_ggtt_locked(struct xe_tile *tile);
> >  void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift);
> >  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 18/36] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery
  2025-09-29  9:17   ` Michal Wajdeczko
@ 2025-09-29 12:50     ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-29 12:50 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Sep 29, 2025 at 11:17:27AM +0200, Michal Wajdeczko wrote:
> 
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > With well-behaved software, a GT reset should never occur, nor should it
> > happen during VF post-migration recovery. If it does, trigger a warning
> 
> hmm, I'm not sure that GT-resets depend only on the SW side, nor that
> reasons for one couldn't happen just before the VF was migrated
> 

GT resets should ideally only occur due to hardware, GuC, or KMD bugs—in
practice, they should never happen. Of course, we’ve encountered bugs in
all three components at various times, and GT resets help recover the
KMD/GPU.

It’s possible that a GT reset could be queued before VF migration, but
that’s highly unlikely. What this warning is really about is broken VF
migration code that accidentally triggers a GT reset—for example, a WAT
queue timing out during submission and causing a reset. That would
indicate a bug in VF migration that we need to fix.

> > but suppress the GT reset, as VF post-migration recovery is expected to
> > bring the VF back to a working state.
> 
> can't we just say this last sentence, that "there is no need to run an
> explicit VF-reset sequence during recovery as VF-recovery is equivalent
> and is also expected to bring the VF back to a working state"?
> 

See above; I think the commit message is largely correct.

> also, since the patch is a refactor, it should mention that "instead of
> blocking resets, just rely on the recovery" 

I can add a snippet saying this.

> > 
> > v3:
> >  - Better commit message (Tomasz)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt.c          |  9 -------
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 -----
> >  drivers/gpu/drm/xe/xe_guc_submit.c  | 41 +++--------------------------
> >  drivers/gpu/drm/xe/xe_guc_submit.h  |  3 ---
> >  4 files changed, 4 insertions(+), 56 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 82be38c99205..5f04d562604b 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -815,11 +815,6 @@ static int do_gt_restart(struct xe_gt *gt)
> >  	return 0;
> >  }
> >  
> > -static int gt_wait_reset_unblock(struct xe_gt *gt)
> > -{
> > -	return xe_guc_wait_reset_unblock(&gt->uc.guc);
> > -}
> > -
> >  static int gt_reset(struct xe_gt *gt)
> >  {
> >  	unsigned int fw_ref;
> > @@ -834,10 +829,6 @@ static int gt_reset(struct xe_gt *gt)
> >  
> >  	xe_gt_info(gt, "reset started\n");
> >  
> > -	err = gt_wait_reset_unblock(gt);
> > -	if (!err)
> > -		xe_gt_warn(gt, "reset block failed to get lifted");
> > -
> >  	xe_pm_runtime_get(gt_to_xe(gt));
> >  
> >  	if (xe_fault_inject_gt_reset()) {
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index cc5af19c1911..b16e8fd271f8 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -1175,17 +1175,11 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
> >  
> >  static void vf_post_migration_shutdown(struct xe_gt *gt)
> >  {
> > -	int ret = 0;
> > -
> >  	spin_lock_irq(&gt->sriov.vf.migration.lock);
> >  	gt->sriov.vf.migration.recovery_queued = false;
> >  	spin_unlock_irq(&gt->sriov.vf.migration.lock);
> >  
> >  	xe_guc_submit_pause(&gt->uc.guc);
> > -	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
> > -
> > -	if (ret)
> > -		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
> >  }
> >  
> >  static size_t post_migration_scratch_size(struct xe_device *xe)
> > @@ -1219,7 +1213,6 @@ static void vf_post_migration_kickstart(struct xe_gt *gt)
> >  	 */
> >  	xe_irq_resume(gt_to_xe(gt));
> >  
> > -	xe_guc_submit_reset_unblock(&gt->uc.guc);
> >  	xe_guc_submit_unpause(&gt->uc.guc);
> >  }
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index cd5e506527fe..b82976f031e5 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -27,6 +27,7 @@
> >  #include "xe_gt.h"
> >  #include "xe_gt_clock.h"
> >  #include "xe_gt_printk.h"
> > +#include "xe_gt_sriov_vf.h"
> >  #include "xe_guc.h"
> >  #include "xe_guc_capture.h"
> >  #include "xe_guc_ct.h"
> > @@ -2182,47 +2183,13 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
> >  	}
> >  }
> >  
> > -/**
> > - * xe_guc_submit_reset_block - Disallow reset calls on given GuC.
> > - * @guc: the &xe_guc struct instance
> > - */
> > -int xe_guc_submit_reset_block(struct xe_guc *guc)
> > -{
> > -	return atomic_fetch_or(1, &guc->submission_state.reset_blocked);
> > -}
> > -
> > -/**
> > - * xe_guc_submit_reset_unblock - Allow back reset calls on given GuC.
> > - * @guc: the &xe_guc struct instance
> > - */
> > -void xe_guc_submit_reset_unblock(struct xe_guc *guc)
> > -{
> > -	atomic_set_release(&guc->submission_state.reset_blocked, 0);
> > -	wake_up_all(&guc->ct.wq);
> > -}
> > -
> > -static int guc_submit_reset_is_blocked(struct xe_guc *guc)
> > -{
> > -	return atomic_read_acquire(&guc->submission_state.reset_blocked);
> > -}
> > -
> > -/* Maximum time of blocking reset */
> > -#define RESET_BLOCK_PERIOD_MAX (HZ * 5)
> > -
> > -/**
> > - * xe_guc_wait_reset_unblock - Wait until reset blocking flag is lifted, or timeout.
> > - * @guc: the &xe_guc struct instance
> > - */
> > -int xe_guc_wait_reset_unblock(struct xe_guc *guc)
> > -{
> > -	return wait_event_timeout(guc->ct.wq,
> > -				  !guc_submit_reset_is_blocked(guc), RESET_BLOCK_PERIOD_MAX);
> > -}
> > -
> >  int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> >  {
> >  	int ret;
> >  
> > +	if (WARN_ON_ONCE(xe_gt_sriov_vf_recovery_inprogress(guc_to_gt(guc))))
> 
> we have a gt-oriented variant:
> 
> 	xe_gt_WARN_ON
> 
> and likely we should not use the _ONCE variant as resets could happen a few
> times in the VF lifetime
> 

IMO once is enough, as we really only care in CI if it pops the first time,
but sure, I can switch to xe_gt_WARN_ON.

> and since we are sure that "recovery" has the same power as "reset",
> does it really need to be a full WARN? maybe just "info" or "notice"?
> 

I think we should complain as loudly as possible and capture a stack
trace if this occurs.

> also if we want to skip real GT resets, shouldn't we have this check
> higher in the reset stack, at the GT level, not just under gt.uc.guc.submit?
> 

This code is already somewhat mislayered, but in its current state, I
believe this is the correct location. If we refactor the layering later,
we might be able to move the warning further up the stack. For now, I
think it's best to leave it here.
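
For reference, hoisting it would presumably reduce to a GT-level gate
along these lines (hypothetical sketch, not what this series does):

	/* Hypothetical: gate gt_reset() at the GT level; recovery is
	 * expected to bring the VF back, so a reset queued during
	 * recovery is unnecessary and a loud warning is warranted.
	 */
	static bool gt_reset_allowed(struct xe_gt *gt)
	{
		return !xe_gt_WARN_ON(gt, xe_gt_recovery_inprogress(gt));
	}

called from gt_reset() before any teardown starts.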

Matt

> > +		return 0;
> > +
> >  	if (!guc->submission_state.initialized)
> >  		return 0;
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> > index 5b4a0a6fd818..f535fe3895e5 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> > @@ -22,9 +22,6 @@ void xe_guc_submit_stop(struct xe_guc *guc);
> >  int xe_guc_submit_start(struct xe_guc *guc);
> >  void xe_guc_submit_pause(struct xe_guc *guc);
> >  void xe_guc_submit_unpause(struct xe_guc *guc);
> > -int xe_guc_submit_reset_block(struct xe_guc *guc);
> > -void xe_guc_submit_reset_unblock(struct xe_guc *guc);
> > -int xe_guc_wait_reset_unblock(struct xe_guc *guc);
> >  void xe_guc_submit_wedge(struct xe_guc *guc);
> >  
> >  int xe_guc_read_stopped(struct xe_guc *guc);
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
  2025-09-29  2:55 ` [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Matthew Brost
@ 2025-09-29 15:17   ` K V P, Satyanarayana
  2025-09-30 12:39     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: K V P, Satyanarayana @ 2025-09-29 15:17 UTC (permalink / raw)
  To: intel-xe



On 29-09-2025 08:25, Matthew Brost wrote:
> From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> 
> Some VF2GUC actions may take longer to process. Increase the default timeout
> after a BUSY indication is received to 2sec to cover all worst-case scenarios.
> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_guc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index c016a11b6ab1..f0de1fa61898 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -1439,7 +1439,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>   		BUILD_BUG_ON((GUC_HXG_TYPE_RESPONSE_SUCCESS ^ GUC_HXG_TYPE_RESPONSE_FAILURE) != 1);
>   
>   		ret = xe_mmio_wait32(mmio, reply_reg, resp_mask, resp_mask,
> -				     1000000, &header, false);
> +				     2000000, &header, false);
>   
>   		if (unlikely(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, header) !=
>   			     GUC_HXG_ORIGIN_GUC))

LGTM.
Acked-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 23/36] drm/xe/vf: Flush and stop CTs in VF post migration recovery
  2025-09-29  2:55 ` [PATCH v3 23/36] drm/xe/vf: Flush and stop CTs in VF post migration recovery Matthew Brost
@ 2025-09-29 21:31   ` Michal Wajdeczko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-29 21:31 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Flushing CTs (i.e., progressing all pending G2H messages) gives VF
> post-migration recovery an accurate view of which H2G messages the GuC
> has processed, enabling the GuC submission state machine to correctly
> rebuild all state.
> 
> Also, stop all CT traffic, as the CT is not live during VF
> post-migration recovery.
> 
> v3:
>  - xe_guc_ct_flush_and_stop rename (Michal)
>  - Drop extra GuC CT WQ wake up (Michal)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  2 ++
>  drivers/gpu/drm/xe/xe_guc_ct.c      | 10 ++++++++++
>  drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
>  3 files changed, 13 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index a564f296e4b9..37ef1c42bacb 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -23,6 +23,7 @@
>  #include "xe_gt_sriov_vf.h"
>  #include "xe_gt_sriov_vf_types.h"
>  #include "xe_guc.h"
> +#include "xe_guc_ct.h"
>  #include "xe_guc_hxg_helpers.h"
>  #include "xe_guc_relay.h"
>  #include "xe_guc_submit.h"
> @@ -1185,6 +1186,7 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
>  	gt->sriov.vf.migration.recovery_queued = false;
>  	spin_unlock_irq(&gt->sriov.vf.migration.lock);
>  
> +	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
>  	xe_guc_submit_pause(&gt->uc.guc);
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index d84de8544532..fd6e731c0395 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -573,6 +573,16 @@ void xe_guc_ct_disable(struct xe_guc_ct *ct)
>  	stop_g2h_handler(ct);
>  }
>  
> +/**
> + * xe_guc_ct_flush_and_stop - Flush and stop all processing of G2H / H2G
> + * @ct: the &xe_guc_ct
> + */
> +void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct)
> +{
> +	receive_g2h(ct);
> +	xe_guc_ct_stop(ct);
> +}
> +
>  /**
>   * xe_guc_ct_stop - Set GuC to stopped state
>   * @ct: the &xe_guc_ct
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index d6c81325a76c..0a88f4e447fa 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -17,6 +17,7 @@ int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
>  int xe_guc_ct_enable(struct xe_guc_ct *ct);
>  void xe_guc_ct_disable(struct xe_guc_ct *ct);
>  void xe_guc_ct_stop(struct xe_guc_ct *ct);
> +void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);
>  void xe_guc_ct_fast_path(struct xe_guc_ct *ct);
>  
>  struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct);
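
For reference, the resulting shutdown ordering is (sketch of
vf_post_migration_shutdown() after this patch, bookkeeping abbreviated):

	static void vf_post_migration_shutdown(struct xe_gt *gt)
	{
		/* ... recovery_queued bookkeeping under migration.lock ... */

		xe_guc_ct_flush_and_stop(&gt->uc.guc.ct); /* drain G2H, stop CT */
		xe_guc_submit_pause(&gt->uc.guc);
	}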


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 26/36] drm/xe/vf: Start CTs before resfix VF post migration recovery
  2025-09-29  2:55 ` [PATCH v3 26/36] drm/xe/vf: Start CTs before resfix " Matthew Brost
@ 2025-09-29 21:49   ` Michal Wajdeczko
  2025-09-30  6:26     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-29 21:49 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Before `resfix`, all H2Gs stuck in the CT queue need to be squashed, as
> they may contain stale or invalid data.
> 
> Starting the CTs clears all H2Gs in the queue. Any lost H2Gs are
> resubmitted by the GuC submission state machine.
> 
> v3:
>  - Don't mess with head / tail values (Michal)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 ++++
>  drivers/gpu/drm/xe/xe_guc_ct.c      | 59 ++++++++++++++++++++++-------
>  drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
>  3 files changed, 54 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 35de8977c6d0..cb3e9f6e83fa 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -1214,6 +1214,11 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
>  	return 0;
>  }
>  
> +static void vf_post_migration_rearm(struct xe_gt *gt)
> +{
> +	xe_guc_ct_restart(&gt->uc.guc.ct);
> +}
> +
>  static void vf_post_migration_kickstart(struct xe_gt *gt)
>  {
>  	xe_guc_submit_unpause(&gt->uc.guc);
> @@ -1265,6 +1270,8 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
>  	if (err)
>  		goto fail;
>  
> +	vf_post_migration_rearm(gt);
> +
>  	err = vf_post_migration_notify_resfix_done(gt);
>  	if (err && err != -EAGAIN)
>  		goto fail;
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index fd6e731c0395..25efc1f813ce 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -500,7 +500,7 @@ static void ct_exit_safe_mode(struct xe_guc_ct *ct)
>  		xe_gt_dbg(ct_to_gt(ct), "GuC CT safe-mode disabled\n");
>  }
>  
> -int xe_guc_ct_enable(struct xe_guc_ct *ct)
> +static int __xe_guc_ct_start(struct xe_guc_ct *ct, bool needs_register)
>  {
>  	struct xe_device *xe = ct_to_xe(ct);
>  	struct xe_gt *gt = ct_to_gt(ct);
> @@ -508,21 +508,28 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>  
>  	xe_gt_assert(gt, !xe_guc_ct_enabled(ct));
>  
> -	xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
> -	guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
> -	guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
> +	if (needs_register) {
> +		xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
> +		guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
> +		guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
>  
> -	err = guc_ct_ctb_h2g_register(ct);
> -	if (err)
> -		goto err_out;
> +		err = guc_ct_ctb_h2g_register(ct);
> +		if (err)
> +			goto err_out;
>  
> -	err = guc_ct_ctb_g2h_register(ct);
> -	if (err)
> -		goto err_out;
> +		err = guc_ct_ctb_g2h_register(ct);
> +		if (err)
> +			goto err_out;
>  
> -	err = guc_ct_control_toggle(ct, true);
> -	if (err)
> -		goto err_out;
> +		err = guc_ct_control_toggle(ct, true);
> +		if (err)
> +			goto err_out;
> +	} else {
> +		ct->ctbs.h2g.info.broken = false;
> +		ct->ctbs.g2h.info.broken = false;

If the CTB was broken before migration, shouldn't we leave it as such?

IMO it should be cleared only by a normal reset that involves CT
re-registration, not just by our recovery, as the GuC may continue to
ignore a broken CTB.

> +		xe_map_memset(xe, &ct->bo->vmap, CTB_DESC_SIZE * 2, 0,
> +			      CTB_H2G_BUFFER_SIZE);

now it's better, but maybe we should introduce

#define CTB_H2G_BUFFER_OFFSET (CTB_DESC_SIZE * 2)
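
so that the memset above would read (sketch):

	xe_map_memset(xe, &ct->bo->vmap, CTB_H2G_BUFFER_OFFSET, 0,
		      CTB_H2G_BUFFER_SIZE);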

> +	}
>  
>  	guc_ct_change_state(ct, XE_GUC_CT_STATE_ENABLED);
>  
> @@ -554,6 +561,32 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>  	return err;
>  }
>  
> +/**
> + * xe_guc_ct_restart() - Restart GuC CT
> + * @ct: the &xe_guc_ct
> + *
> + * Restart GuC CT to an empty state without issuing a CT register MMIO command.
> + *
> + * Return: 0 on success, or a negative errno on failure.
> + */
> +int xe_guc_ct_restart(struct xe_guc_ct *ct)
> +{
> +	return __xe_guc_ct_start(ct, false);
> +}
> +
> +/**
> + * xe_guc_ct_enable() - Enable GuC CT
> + * @ct: the &xe_guc_ct
> + *
> + * Enable GuC CT to an empty state and issue a CT register MMIO command.
> + *
> + * Return: 0 on success, or a negative errno on failure.
> + */
> +int xe_guc_ct_enable(struct xe_guc_ct *ct)
> +{
> +	return __xe_guc_ct_start(ct, true);
> +}
> +
>  static void stop_g2h_handler(struct xe_guc_ct *ct)
>  {
>  	cancel_work_sync(&ct->g2h_worker);
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index 0a88f4e447fa..b1cba250c51c 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -15,6 +15,7 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct);
>  int xe_guc_ct_init(struct xe_guc_ct *ct);
>  int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
>  int xe_guc_ct_enable(struct xe_guc_ct *ct);
> +int xe_guc_ct_restart(struct xe_guc_ct *ct);
>  void xe_guc_ct_disable(struct xe_guc_ct *ct);
>  void xe_guc_ct_stop(struct xe_guc_ct *ct);
>  void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init
  2025-09-29 12:15     ` Matthew Brost
@ 2025-09-30  0:42       ` Lis, Tomasz
  2025-09-30 10:25         ` Michal Wajdeczko
  0 siblings, 1 reply; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30  0:42 UTC (permalink / raw)
  To: Matthew Brost, Michal Wajdeczko; +Cc: intel-xe


On 9/29/2025 2:15 PM, Matthew Brost wrote:
> On Mon, Sep 29, 2025 at 09:42:55AM +0200, Michal Wajdeczko wrote:
>>
>> On 9/29/2025 4:55 AM, Matthew Brost wrote:
>>> From: Tomasz Lis <tomasz.lis@intel.com>
>>>
>>> Protect access to GGTT config as this is non-static information.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 96 ++++++++++++++++++-----
>>>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  3 +
>>>   drivers/gpu/drm/xe/xe_sriov_vf.c          |  6 ++
>>>   3 files changed, 84 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> index 0461d5513487..016c867e5e2b 100644
>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> @@ -440,18 +440,21 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>>>   
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   
>>> +	down_write(&config->lock);
>>> +
>> still didn't get answer to my earlier question [1]
>>
>> [1] https://patchwork.freedesktop.org/patch/676375/?series=154627&rev=2#comment_1240924
>>
> Again, this isn't my patch, so I believe Tomasz will need to chime in
> to provide an answer or conclusion here.
Now answered.
>
>>>   	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
>>>   	if (unlikely(err))
>>> -		return err;
>>> +		goto out;
>>>   
>>>   	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
>>>   	if (unlikely(err))
>>> -		return err;
>>> +		goto out;
>>>   
>>>   	if (config->ggtt_size && config->ggtt_size != size) {
>>>   		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
>>>   				size / SZ_1K, config->ggtt_size / SZ_1K);
>>> -		return -EREMCHG;
>>> +		err = -EREMCHG;
>>> +		goto out;
>>>   	}
>>>   
>>>   	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
>>> @@ -460,8 +463,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>>>   	config->ggtt_shift = start - (s64)config->ggtt_base;
>>>   	config->ggtt_base = start;
>>>   	config->ggtt_size = size;
>>> +	err = config->ggtt_size ? 0 : -ENODATA;
>>>   
>>> -	return config->ggtt_size ? 0 : -ENODATA;
>>> +out:
>>> +	up_write(&config->lock);
>>> +	return err;
>>>   }
>>>   
>>>   static int vf_get_lmem_info(struct xe_gt *gt)
>>> @@ -474,22 +480,28 @@ static int vf_get_lmem_info(struct xe_gt *gt)
>>>   
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   
>>> +	down_write(&config->lock);
>>> +
>> also, commit message says "Protect access to GGTT config "
>> while the patch seems to apply locking to the whole config ...
>>
>> what's the rationale to extend this protection?
>> just unification?

That's true, the title doesn't match.

During post-migration recovery we call the whole 
`xe_gt_sriov_vf_query_config()`, which was the main reason I went for 
protecting the whole provisioning.

Since we can query the GuC multiple times, narrowing the protection 
would work as well - it would just be a bit unusual to allow two 
threads to race over writing the provisioning info.

But since these values never change, both would write the same, so 
there is no visible problem.

If you want, I can remove the protection from anything other than GGTT. 
The solution would then be a little confusing, maybe, but would work 
the same.

So, which way do we go? Fix the patch name+comment, or narrow the 
locking range?
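
(For comparison, narrowing to GGTT-only would look roughly like this,
sketch only:

	static int vf_get_ggtt_info(struct xe_gt *gt)
	{
		...
		/* unlocked GuC KLV queries fill start/size, then: */
		down_write(&config->lock);
		config->ggtt_shift = start - (s64)config->ggtt_base;
		config->ggtt_base = start;
		config->ggtt_size = size;
		up_write(&config->lock);

		return size ? 0 : -ENODATA;
	}

with vf_get_lmem_info() and vf_get_submission_cfg() left unlocked.)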

>>
>>>   	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
>>>   	if (unlikely(err))
>>> -		return err;
>>> +		goto out;
>>>   
>>>   	if (config->lmem_size && config->lmem_size != size) {
>>>   		xe_gt_sriov_err(gt, "Unexpected LMEM reassignment: %lluM != %lluM\n",
>>>   				size / SZ_1M, config->lmem_size / SZ_1M);
>>> -		return -EREMCHG;
>>> +		err = -EREMCHG;
>>> +		goto out;
>>>   	}
>>>   
>>>   	string_get_size(size, 1, STRING_UNITS_2, size_str, sizeof(size_str));
>>>   	xe_gt_sriov_dbg_verbose(gt, "LMEM %lluM %s\n", size / SZ_1M, size_str);
>>>   
>>>   	config->lmem_size = size;
>>> +	err = config->lmem_size ? 0 : -ENODATA;
>>>   
>>> -	return config->lmem_size ? 0 : -ENODATA;
>>> +out:
>>> +	up_write(&config->lock);
>>> +	return err;
>>>   }
>>>   
>>>   static int vf_get_submission_cfg(struct xe_gt *gt)
>>> @@ -501,23 +513,27 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>>>   
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   
>>> +	down_write(&config->lock);
>>> +
>>>   	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
>>>   	if (unlikely(err))
>>> -		return err;
>>> +		goto out;
>>>   
>>>   	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_DOORBELLS_KEY, &num_dbs);
>>>   	if (unlikely(err))
>>> -		return err;
>>> +		goto out;
>>>   
>>>   	if (config->num_ctxs && config->num_ctxs != num_ctxs) {
>>>   		xe_gt_sriov_err(gt, "Unexpected CTXs reassignment: %u != %u\n",
>>>   				num_ctxs, config->num_ctxs);
>>> -		return -EREMCHG;
>>> +		err = -EREMCHG;
>>> +		goto out;
>>>   	}
>>>   	if (config->num_dbs && config->num_dbs != num_dbs) {
>>>   		xe_gt_sriov_err(gt, "Unexpected DBs reassignment: %u != %u\n",
>>>   				num_dbs, config->num_dbs);
>>> -		return -EREMCHG;
>>> +		err = -EREMCHG;
>>> +		goto out;
>>>   	}
>>>   
>>>   	xe_gt_sriov_dbg_verbose(gt, "CTXs %u DBs %u\n", num_ctxs, num_dbs);
>>> @@ -525,7 +541,11 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>>>   	config->num_ctxs = num_ctxs;
>>>   	config->num_dbs = num_dbs;
>>>   
>>> -	return config->num_ctxs ? 0 : -ENODATA;
>>> +	err = config->num_ctxs ? 0 : -ENODATA;
>>> +
>>> +out:
>>> +	up_write(&config->lock);
>>> +	return err;
>>>   }
>>>   
>>>   static void vf_cache_gmdid(struct xe_gt *gt)
>>> @@ -579,11 +599,18 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
>>>    */
>>>   u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>>>   {
>>> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>> +	u16 val;
>>> +
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>> -	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
>>>   
>>> -	return gt->sriov.vf.self_config.num_ctxs;
>>> +	down_read(&config->lock);
>>> +	xe_gt_assert(gt, config->num_ctxs);
>>> +	val = config->num_ctxs;
>>> +	up_read(&config->lock);
>>> +
>>> +	return val;
>>>   }
>>>   
>>>   /**
>>> @@ -596,11 +623,18 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>>>    */
>>>   u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>>>   {
>>> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>> +	u64 val;
>>> +
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>> -	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
>>>   
>>> -	return gt->sriov.vf.self_config.lmem_size;
>>> +	down_read(&config->lock);
>>> +	xe_gt_assert(gt, config->lmem_size);
>>> +	val = config->lmem_size;
>>> +	up_read(&config->lock);
>>> +
>>> +	return val;
>>>   }
>>>   
>>>   /**
>>> @@ -613,11 +647,17 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>>>    */
>>>   u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
>>>   {
>>> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>> +	u64 val;
>>> +
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>> -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
>>>   
>>> -	return gt->sriov.vf.self_config.ggtt_size;
>>> +	down_read(&config->lock);
>>> +	val = config->ggtt_size;
>>> +	up_read(&config->lock);
>>> +
>>> +	return val;
>>>   }
>>>   
>>>   /**
>>> @@ -630,11 +670,18 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
>>>    */
>>>   u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
>>>   {
>>> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>> +	u64 val;
>>> +
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>> -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
>>>   
>>> -	return gt->sriov.vf.self_config.ggtt_base;
>>> +	down_read(&config->lock);
>>> +	xe_gt_assert(gt, config->ggtt_size);
>>> +	val = config->ggtt_base;
>>> +	up_read(&config->lock);
>>> +
>>> +	return val;
>>>   }
>>>   
>>>   /**
>>> @@ -648,11 +695,16 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
>>>   s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
>>>   {
>>>   	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>> +	s64 val;
>>>   
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   	xe_gt_assert(gt, xe_gt_is_main_type(gt));
>>>   
>>> -	return config->ggtt_shift;
>>> +	down_read(&config->lock);
>>> +	val = config->ggtt_shift;
>>> +	up_read(&config->lock);
>>> +
>>> +	return val;
>>>   }
>>>   
>>>   static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
>>> @@ -1044,6 +1096,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>>>   
>>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>   
>>> +	down_read(&config->lock);
>>>   	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
>>>   		   config->ggtt_base,
>>>   		   config->ggtt_base + config->ggtt_size - 1);
>>> @@ -1060,6 +1113,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>>>   
>>>   	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
>>>   	drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
>>> +	up_read(&config->lock);
>>>   }
>>>   
>>>   /**
>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>> index 298dedf4b009..d95857bd789b 100644
>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>> @@ -6,6 +6,7 @@
>>>   #ifndef _XE_GT_SRIOV_VF_TYPES_H_
>>>   #define _XE_GT_SRIOV_VF_TYPES_H_
>>>   
>>> +#include <linux/rwsem.h>
>>>   #include <linux/types.h>
>>>   #include "xe_uc_fw_types.h"
>>>   
>>> @@ -25,6 +26,8 @@ struct xe_gt_sriov_vf_selfconfig {
>>>   	u16 num_ctxs;
>>>   	/** @num_dbs: assigned number of GuC doorbells IDs. */
>>>   	u16 num_dbs;
>>> +	/** @lock: lock for protecting access to all selfconfig fields. */
>>> +	struct rw_semaphore lock;
>>>   };
>>>   
>>>   /**
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>>> index cdd9f8e78b2a..d6e2ed9b9bbc 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>>> @@ -197,6 +197,12 @@ static void vf_migration_init_early(struct xe_device *xe)
>>>    */
>>>   void xe_sriov_vf_init_early(struct xe_device *xe)
>>>   {
>>> +	struct xe_gt *gt;
>>> +	unsigned int id;
>>> +
>>> +	for_each_gt(gt, xe, id)
>>> +		init_rwsem(&gt->sriov.vf.self_config.lock);
>> as before, this should be done in
>>
>> 	xe_gt_sriov_vf_init_early
>>
> I'll pick up this change, but again I think Michal needs some answers
> to his questions.
>
> Matt

Sure, that is also a very early call, so we can move it there. (As Matt 
actually did, just not in this patch.)
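
i.e., roughly (sketch, assuming the xe_gt_sriov_vf_init_early() Michal 
points at):

	void xe_gt_sriov_vf_init_early(struct xe_gt *gt)
	{
		init_rwsem(&gt->sriov.vf.self_config.lock);
	}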

-Tomasz

>
>>> +
>>>   	vf_migration_init_early(xe);
>>>   }
>>>   

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation
  2025-09-29  2:55 ` [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
@ 2025-09-30  2:06   ` Lis, Tomasz
  2025-09-30 22:53     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30  2:06 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> kmalloc() can fail, so the returned value must have a NULL check.
>
> Fixes: 168b5867318b ("drm/xe/vf: Refresh utilization buffer during migration recovery")
> Signed-off-by: Matthew Brost<matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_lrc.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> index 47e9df775072..e1bc102a6cae 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.c
> +++ b/drivers/gpu/drm/xe/xe_lrc.c
> @@ -1303,8 +1303,11 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
>   	u32 *buf = NULL;
>   	int ret;
>   
> -	if (lrc->bo->vmap.is_iomem)
> +	if (lrc->bo->vmap.is_iomem) {
>   		buf = kmalloc(LRC_WA_BB_SIZE, GFP_KERNEL);
> +		if (!buf)
> +			return -ENOMEM;
> +	}
>   
>   	ret = xe_lrc_setup_wa_bb_with_scratch(lrc, hwe, buf);

xe_lrc_setup_wa_bb_with_scratch()->setup_bo() already handles this with an -ENOMEM return, so there was no bug.
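
Presumably the guard in setup_bo() is something like (sketch only; the
actual check lives in xe_lrc.c):

	if (state->lrc->bo->vmap.is_iomem && !state->buffer)
		return -ENOMEM;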

>   
> @@ -1347,8 +1350,11 @@ setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
>   	if (xe_gt_WARN_ON(lrc->gt, !state.funcs))
>   		return 0;
>   
> -	if (lrc->bo->vmap.is_iomem)
> +	if (lrc->bo->vmap.is_iomem) {
>   		state.buffer = kmalloc(state.max_size, GFP_KERNEL);
> +		if (!state.buffer)
> +			return -ENOMEM;
> +	}
>   
>   	ret = setup_bo(&state);

setup_bo() does the same check and returns -ENOMEM, so no bug here 
either. Also, with how setup_bo() exits, it ignores a missing 
allocation in case the buffer won't be used anyway.

-Tomasz

>   	if (ret) {


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 09/36] drm/xe: Don't change LRC ring head on job resubmission
  2025-09-29  2:55 ` [PATCH v3 09/36] drm/xe: Don't change LRC ring head on job resubmission Matthew Brost
@ 2025-09-30  2:38   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30  2:38 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Now that we save the job's head during submission, it's no longer
> necessary to adjust the LRC ring head during resubmission. Instead, a
> software-based adjustment of the tail will overwrite the old jobs in
> place. For some odd reason, adjusting the LRC ring head didn't work on
> parallel queues, which was causing issues in our CI.
>
> v6:
>   - Also set LRC tail to head so queue is idle coming out of reset
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_guc_submit.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 3a534d93505f..70306f902ba5 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -2008,11 +2008,17 @@ static void guc_exec_queue_start(struct xe_exec_queue *q)
>   	struct xe_gpu_scheduler *sched = &q->guc->sched;
>   
>   	if (!exec_queue_killed_or_banned_or_wedged(q)) {
> +		struct xe_sched_job *job = xe_sched_first_pending_job(sched);
>   		int i;
>   
>   		trace_xe_exec_queue_resubmit(q);
> -		for (i = 0; i < q->width; ++i)
> -			xe_lrc_set_ring_head(q->lrc[i], q->lrc[i]->ring.tail);
> +		if (job) {
> +			for (i = 0; i < q->width; ++i) {
> +				q->lrc[i]->ring.tail = job->ptrs[i].head;
> +				xe_lrc_set_ring_tail(q->lrc[i],
> +						     xe_lrc_ring_head(q->lrc[i]));
> +			}
> +		}
>   		xe_sched_resubmit_jobs(sched);
>   	}
>   
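
In other words (annotated restatement of the hunk above):

	for (i = 0; i < q->width; ++i) {
		/* rewind the SW write pointer to the oldest pending job */
		q->lrc[i]->ring.tail = job->ptrs[i].head;
		/* GuC-visible tail == head: the ring looks idle until the
		 * pending jobs are re-emitted by xe_sched_resubmit_jobs() */
		xe_lrc_set_ring_tail(q->lrc[i],
				     xe_lrc_ring_head(q->lrc[i]));
	}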

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 11/36] drm/xe/guc: Document GuC submission backend
  2025-09-29  2:55 ` [PATCH v3 11/36] drm/xe/guc: Document GuC submission backend Matthew Brost
@ 2025-09-30  3:28   ` Lis, Tomasz
  2025-09-30  6:30     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30  3:28 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Add kernel-doc to xe_guc_submit.c describing the submission path,
> the per-queue single-threaded model with pause/resume, the driver shadow
> state machine and lost-H2G replay, job timeout handling, recovery flows
> (GT reset, PM resume, VF resume), and reclaim constraints.
>
> v2:
>   - Minor tweaks for clarity
>   - Add new doc to Xe rst files
> v3:
>   - Clarify global vs per-queue stop / start
>   - Clarify VF resume flow
>   - Add section for 'Waiters during VF resume'
>   - Add section for 'Page-faulting queues during VF migration'
>   - Add section for 'GuC-ID assignment'
>   - Add section for 'Reference counting and final queue destruction'
> v4:
>   - s/VF resume/VF post migration recovery (Tomasz)
>
> Signed-off-by: Matthew Brost<matthew.brost@intel.com>
> ---
>   Documentation/gpu/xe/index.rst     |   1 +
>   drivers/gpu/drm/xe/xe_guc_submit.c | 282 +++++++++++++++++++++++++++++
>   2 files changed, 283 insertions(+)
>
> diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
> index 88b22fad880e..692c544b164c 100644
> --- a/Documentation/gpu/xe/index.rst
> +++ b/Documentation/gpu/xe/index.rst
> @@ -28,3 +28,4 @@ DG2, etc is provided to prototype the driver.
>      xe_device
>      xe-drm-usage-stats.rst
>      xe_configfs
> +   xe_guc_submit
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 70306f902ba5..cd5e506527fe 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -46,6 +46,288 @@
>   #include "xe_trace.h"
>   #include "xe_vm.h"
>   
> +/*
> + * DOC: Overview
> + *
> + * The GuC submission backend is responsible for submitting GPU jobs to the GuC
> + * firmware, assigning per-queue GuC IDs, tracking submission state via a
> + * driver-side state machine, handling GuC-to-host (G2H) messages, tracking
> + * outstanding jobs, managing job timeouts and queue teardown, and providing
> + * recovery when GuC state is lost. It is built on top of the DRM scheduler
> + * (drm_sched).
> + *
> + * GuC ID assignment:
> + * ------------------
> + * Each queue is assigned a unique GuC ID at queue init. The ID is used in all
> + * H2G/G2H to identify the queue and remains reserved until final destruction,
> + * when the GuC is known to hold no references to it.
> + *
> + * The backend maintains a reverse map GuC-ID -> queue to resolve targets for
> + * G2H handlers and to iterate all queues when required (e.g., recovery). This
> + * map is protected by submission_state.lock, a global (per-GT) lock. Lockless
> + * lookups are acceptable in paths where the queue’s lifetime is otherwise
> + * pinned and it cannot disappear underneath the operation (e.g., G2H handlers).
> + *
> + * Basic submission flow
> + * ---------------------
> + * Submission is driven by the DRM scheduler vfunc ->run_job(). The flow is:
> + *
> + * 1) Emit the job's ring instructions.
> + * 2) Advance the LRC ring tail:
> + *    - width == 1: simple memory write,
> + *    - width  > 1: append a GuC workqueue (WQ) item.
> + * 3) If the queue is unregistered, issue a register H2G for the context.
> + * 4) Trigger execution via a scheduler enable or context submit command.
> + * 5) Return the job's hardware fence to the DRM scheduler.
> + *
> + * Registration, scheduler enable, and submit commands are issued as host-to-GuC
> + * (H2G) messages over the Command Transport (CT) layer, like all GuC
> + * interactions.
> + *
> + * Completion path
> + * ---------------
> + * When the job's hardware fence signals, the DRM scheduler vfunc ->free_job()
> + * is called; it drops the job's reference, typically freeing it.
> + *
> + * Control-plane messages:
> + * -----------------------
> + * GuC submission scheduler messages form the control plane for queue cleanup,
> + * toggling runnability, and modifying queue properties (e.g., scheduler
> + * priority, timeslice, preemption timeout). Messages are initiated via queue
> + * vfuncs that append a control message to the queue. They are processed on the
> + * same single-threaded DRM scheduler workqueue that runs ->run_job() and
> + * ->free_job().
> + *
> + * Lockless model:
> + * ---------------
> + * ->run_job(), ->free_job(), and the message handlers execute as work items on
> + * a single-threaded DRM scheduler workqueue. Per queue, this provides built-in
> + * mutual exclusion: only one of these items can run at a time. As a result,
> + * these paths are lockless with respect to per-queue state tracking. (Global
> + * or cross-queue data structures still use their own synchronization.)
> + *
> + * Stopping / starting:
> + * --------------------
> + * The submission backend supports two scopes of quiesce control:
> + *
> + *  - Per-queue stop/start:
> + *    The single-threaded DRM scheduler workqueue for a specific queue can be
> + *    stopped and started dynamically. Stopping synchronously quiesces that
> + *    queue's worker (lets any in-flight item finish and prevents new items from
> + *    starting), yielding a stable snapshot while an external operation (e.g.,
> + *    job timeout handling) inspects/updates state and performs any required
> + *    fixups. While stopped, no submission, message, or ->free_job() work runs
> + *    for that queue. When the operation completes, the queue is started; any
> + *    pending items are then processed in order on the same worker. Other queues
> + *    continue to run unaffected.
> + *
> + *  - Global (per-GT) stop/start:
> + *    Implemented on top of the per-queue stop/start primitive: the driver
> + *    stops (or starts) each queue on the GT to obtain a device-wide stable
> + *    snapshot. This is used by coordinated recovery flows (GT reset, PM resume,
> + *    VF post migration recovery). Queues created while the global stop is in
> + *    effect (i.e., future queues) initialize in the stopped state and remain
> + *    stopped until the global start. After recovery fixups are complete, a
> + *    global start iterates queues to start all eligible ones and resumes normal
> + *    submission.
> + *
> + * State machine:
> + * --------------
> + * The submission state machine is the driver's shadow of the GuC-visible queue
> + * state (e.g., registered, runnable, scheduler properties). It tracks the
> + * transitions we intend to make (issued as H2G commands), marking them pending
> + * until acknowledged via G2H or otherwise observed as applied. It also records
> + * the origin of each transition (->run_job(), timeout handler, explicit control
> + * message, etc.).
> + *
> + * Because H2G commands and/or GuC submission state can be lost across GT reset,
> + * PM resume, or VF post migration recovery, this bookkeeping lets recovery
> + * decide which operations to replay, which to elide, and which need fixups,
> + * restoring a consistent queue state without additional per-queue locks.
> + *
> + * Job timeouts:
> + * -------------
> + * To prevent jobs from running indefinitely and violating dma-fence signaling
> + * rules, the DRM scheduler tracks how long each job has been running. If a
> + * threshold is exceeded, it calls ->timeout_job().
> + *
> + * ->timeout_job() stops the queue, samples the LRC context timestamps to
> + * confirm the job actually started and has exceeded the allowed runtime, and
> + * then, if confirmed, signals all pending jobs' fences and initiates queue
> + * teardown. Finally, the queue is started.
> + *
> + * Job timeout handling runs on a per-GT, single-threaded recovery workqueue
> + * that is shared with other recovery paths (e.g., GT reset handling, VF
> + * resume). This guarantees only one recovery action executes at a time.
> + *
> + * Queue teardown:
> + * ---------------
> + * Teardown can be triggered by: (1) userspace closing the queue; (2) a G2H
> + * queue-reset notification; (3) a G2H memory_cat_error for the queue; or (4)
> + * in-flight jobs detected on the queue during GT reset.
> + *
> + * In all cases teardown is driven via the timeout path by setting the queue's
> + * DRM scheduler timeout to zero, forcing an immediate ->timeout_job() pass.
> + *
> + * Reference counting and final queue destruction:
> + * -----------------------------------------------
> + * Jobs reference-count the queue; queues hold a reference to the VM. When a
> + * queue's reference count reaches zero (e.g., all jobs are freed and the
> + * userspace handle is closed), the queue is not destroyed immediately because
> + * the GuC may still reference its state.
> + *
> + * Instead, a control-plane cleanup message is appended to remove GuC-side
> + * references (e.g., disable runnability, deregister). Once the final G2H
> + * confirms that the GuC no longer references the queue, the queue is
> + * eligible for destruction.
> + *
> + * To avoid freeing the queue from within its own DRM scheduler workqueue (which
> + * would risk use-after-free), the actual destruction is deferred to a separate
> + * work item queued on a dedicated destruction workqueue.
> + *
> + * GT resets:
> + * ----------
> + * GT resets are triggered by catastrophic errors (e.g., CT channel failure).
> + * The GuC is reset and all GuC-side submission state is lost. Recovery proceeds
> + * as follows:
> + *
> + * 1) Quiesce:
> + *    - Stop all queues (global submission stop). Per-queue workers finish any
> + *      in-flight item and then stop; newly created queues during the window
> + *      initialize in the stopped state.
> + *    - Abort any waits on CT/G2H to avoid deadlock.
> + *
> + * 2) Sanitize driver shadow state:
> + *    - For each queue, clear GuC-derived bits in the submission state machine
> + *      (e.g., registered/enabled) and mark in-flight H2G transitions as lost.
> + *    - Convert/flush any side effects of lost H2G.
> + *
> + * 3) Decide teardown vs. replay:
> + *    - If a queue's LRC seqno indicates that a job started but did not
> + *      complete, initiate teardown for that queue via the timeout path.
> + *    - If no job started, keep the queue for replay.
> + *
> + * 4) Resume:
> + *    - Start remaining queues; resubmit pending jobs.
> + *    - Queues marked for teardown remain stopped/destroyed.
> + *
> + * The entire sequence runs on the per-GT single-threaded recovery worker,
> + * ensuring only one recovery action executes at a time; a runtime PM reference
> + * is held for the duration.
> + *
> + * PM resume:
> + * ----------
> + * PM resume assumes all GuC state is lost (the device may have been powered
> + * down). It reuses the GT reset recovery path, but executes in the context of
> + * the caller that wakes the device (runtime PM or system resume).
> + *
> + * Suspend entry:
> + *  - Control-plane message work is quiesced; state toggles that require an
> + *    active device are not enqueued while suspended.
> + *  - Per-queue scheduler workers are stopped before the device is allowed to
> + *    suspend.
> + *  - Barring driver bugs, no queues should have in-flight jobs at
> + *    suspend/resume.
> + *
> + * On resume, run the GT reset recovery flow and then start eligible queues.
> + *
> + * Runtime PM and state-change ordering:
> + * -------------------------------------
> + * Runtime/system PM transitions must not race with per-queue submission and
> + * state updates.
> + *
> + * Execution contexts and RPM sources:
> + *  - Scheduler callbacks (->run_job(), ->free_job(), ->timeout_job()):
> + *    executed with an active RPM ref held by the in-flight job.
> + *  - Control-plane message work:
> + *    enqueued from IOCTL paths that already hold an RPM ref; the message path
> + *    itself does not get/put RPM. State toggles are only issued while active.
> + *    During suspend entry, message work is quiesced and no new toggles are
> + *    enqueued until after resume.
> + *  - G2H handlers:
> + *    dispatched with an RPM ref guaranteed by the CT layer.
> + *  - Recovery phases (GT reset/VF post migration recovery):
> + *    explicitly get/put an RPM ref for their duration on the per-GT recovery
> + *    worker.
> + *
> + * Consequence:
> + *  - All submission/state mutations run with an RPM reference. The PM core
> + *    cannot enter suspend while these updates are in progress, and resume is
> + *    complete before updates execute. This prevents PM state changes from
> + *    racing with queue state changes.
> + *
> + * VF post migration recovery:
> + * ---------------------------
> + * VF post migration recovery resembles a GT reset, but GuC submission state is
> + * expected to persist across migration; in-flight H2G commands may be lost

I don't think H2Gs can be lost. GuC is expected to either finish or not 
read them, and after recovery they can all be executed. They only 
require GGTT fixups.

It is our decision that we scrap them, then re-assess and re-issue the 
commands by manipulating the states of the entities which issued them. 
Maybe:

---

expected to persist across migration; GGTT base/offsets may change, 
requiring an update of all references to it, including in CTBs, LRCs, 
and on rings.

For a wider view on that, see `VF restore procedure in PF KMD and VF 
KMD`_. Recovery proceeds as follows:

---

-Tomasz

> , and
> + * GGTT base/offsets may change. Recovery proceeds as follows:
> + *
> + * 1) Quiesce:
> + *    - Stop all queues and abort waits (as with GT reset) to obtain a stable
> + *      snapshot.
> + *    - Queues created while VF post migration recovery is in-flight initialize
> + *      in the stopped state.
> + *
> + * 2) Treat H2G as lost and prepare in-place resubmission (GuC/CT down):
> + *    - Treat in-flight H2G (enable/disable, etc.) as dropped; update shadow
> + *      bits to a safe baseline and tag the ops as "needs replay".
> + *    - Quarantine device-visible submission state: set the GuC-visible LRC ring
> + *      tail equal to the head (and, for WQ-based submission, set the WQ
> + *      descriptor head == tail) so that when the GuC comes up it will not process
> + *      any entries that were built with stale GGTT addresses.
> + *    - Reset the software ring tail to the original value captured at the
> + *      submission of the oldest pending job, so the write pointer sits exactly
> + *      where that job was originally emitted.
> + *
> + * 3) Replay and resubmit once GuC/CT is live:
> + *    - VF post migration recovery invokes ->run_job() for pending jobs;
> + *      ->emit_job() overwrites ring instructions in place, fixes GGTT fields,
> + *      then advances the LRC tail (and WQ descriptor for width > 1). Required
> + *      submission H2G(s) are reissued and fresh WQ entries are written.
> + *    - Queue lost control-plane operations (scheduling-state toggles, cleanup)
> + *      in order via the message path.
> + *    - Start the queues to process the queued control-plane operations and run
> + *      the resubmitted jobs.
> + *
> + * The goal is to preserve both job and queue state; no teardown is performed
> + * in this flow. The sequence runs on the per-GT single-threaded recovery
> + * worker with a held runtime PM reference.
> + *
> + * Waiters during VF post migration recovery
> + * -----------------------------------------
> + * The submission backend frequently uses wait_event_timeout() to wait on
> + * GuC-driven conditions. Across VF migration/recovery two issues arise:
> + * 1) The timeout does not account for migration downtime and may expire
> + *    prematurely, triggering undesired actions (e.g., GT reset, prematurely
> + *    signaling a fence).
> + * 2) Some waits target GuC work that cannot complete until VF recovery
> + *    finishes; these typically sit on the queue-stopping path.
> + *
> + * To handle this, all waiters must atomically test the "GuC down / VF-recovery
> + * in progress" condition (e.g., VF_RESFIX_BLOCKED) both before sleeping and
> + * after wakeup. The flag is coherent with VF migration: vCPUs observe it
> + * immediately on unhalt, and it is cleared only after the GuC/CT is live again.
> + * If set, the waiter must either (a) abort the wait without side effects, or
> + * (b) re-arm the wait with a fresh timeout once the GuC/CT is live. Timeouts
> + * that occur while GuC/CT are down are non-fatal—the VF-recovery path will
> + * rebuild state—and must not trigger recovery or teardown.
> + *
> + * Relation to reclaim:
> + * --------------------
> + * Jobs signal dma-fences, and the MM may wait on those fences during reclaim.
> + * As a consequence, the entire GuC submission backend (DRM scheduler callbacks,
> + * message handling, and all recovery paths) lies on the reclaim path and must
> + * be reclaim-safe.
> + *
> + * Practical implications:
> + * - No memory allocations in these paths (avoid any allocation that could
> + *   recurse into reclaim or sleep).
> + * - The global submission-state lock may be taken from reclaim-tainted contexts
> + *   (timeout/recovery). Any path that acquires it (including queue init/destroy)
> + *   must not allocate or take locks that can recurse into reclaim while holding
> + *   it; keep the critical section to state/xarray updates.
> + */
> +
>   static struct xe_guc *
>   exec_queue_to_guc(struct xe_exec_queue *q)
>   {
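
As an illustration of the "Waiters during VF post migration recovery"
contract above, a waiter would look roughly like this (sketch only;
recovery_in_progress() stands in for the VF_RESFIX_BLOCKED test, and
the wq/done names are assumed):

	static int wait_for_guc(wait_queue_head_t *wq, struct xe_guc *guc,
				bool *done)
	{
		long ret;

		ret = wait_event_timeout(*wq,
					 READ_ONCE(*done) ||
					 recovery_in_progress(guc),
					 HZ);

		/* abort without side effects; re-arm once GuC/CT is live */
		if (recovery_in_progress(guc))
			return -EAGAIN;

		return ret ? 0 : -ETIMEDOUT;
	}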


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 26/36] drm/xe/vf: Start CTs before resfix VF post migration recovery
  2025-09-29 21:49   ` Michal Wajdeczko
@ 2025-09-30  6:26     ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-30  6:26 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Sep 29, 2025 at 11:49:42PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > Before `resfix`, all H2Gs stuck in the CT queue need to be squashed, as
> > they may contain stale or invalid data.
> > 
> > Starting the CTs clears all H2Gs in the queue. Any lost H2Gs are
> > resubmitted by the GuC submission state machine.
> > 
> > v3:
> >  - Don't mess with head / tail values (Michal)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 ++++
> >  drivers/gpu/drm/xe/xe_guc_ct.c      | 59 ++++++++++++++++++++++-------
> >  drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
> >  3 files changed, 54 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 35de8977c6d0..cb3e9f6e83fa 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -1214,6 +1214,11 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
> >  	return 0;
> >  }
> >  
> > +static void vf_post_migration_rearm(struct xe_gt *gt)
> > +{
> > +	xe_guc_ct_restart(&gt->uc.guc.ct);
> > +}
> > +
> >  static void vf_post_migration_kickstart(struct xe_gt *gt)
> >  {
> >  	xe_guc_submit_unpause(&gt->uc.guc);
> > @@ -1265,6 +1270,8 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
> >  	if (err)
> >  		goto fail;
> >  
> > +	vf_post_migration_rearm(gt);
> > +
> >  	err = vf_post_migration_notify_resfix_done(gt);
> >  	if (err && err != -EAGAIN)
> >  		goto fail;
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index fd6e731c0395..25efc1f813ce 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -500,7 +500,7 @@ static void ct_exit_safe_mode(struct xe_guc_ct *ct)
> >  		xe_gt_dbg(ct_to_gt(ct), "GuC CT safe-mode disabled\n");
> >  }
> >  
> > -int xe_guc_ct_enable(struct xe_guc_ct *ct)
> > +static int __xe_guc_ct_start(struct xe_guc_ct *ct, bool needs_register)
> >  {
> >  	struct xe_device *xe = ct_to_xe(ct);
> >  	struct xe_gt *gt = ct_to_gt(ct);
> > @@ -508,21 +508,28 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
> >  
> >  	xe_gt_assert(gt, !xe_guc_ct_enabled(ct));
> >  
> > -	xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
> > -	guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
> > -	guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
> > +	if (needs_register) {
> > +		xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
> > +		guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
> > +		guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
> >  
> > -	err = guc_ct_ctb_h2g_register(ct);
> > -	if (err)
> > -		goto err_out;
> > +		err = guc_ct_ctb_h2g_register(ct);
> > +		if (err)
> > +			goto err_out;
> >  
> > -	err = guc_ct_ctb_g2h_register(ct);
> > -	if (err)
> > -		goto err_out;
> > +		err = guc_ct_ctb_g2h_register(ct);
> > +		if (err)
> > +			goto err_out;
> >  
> > -	err = guc_ct_control_toggle(ct, true);
> > -	if (err)
> > -		goto err_out;
> > +		err = guc_ct_control_toggle(ct, true);
> > +		if (err)
> > +			goto err_out;
> > +	} else {
> > +		ct->ctbs.h2g.info.broken = false;
> > +		ct->ctbs.g2h.info.broken = false;
> 
> If the CTB was broken before migration, shouldn't we leave it as such?
> 
> IMO it should be cleared only by a normal reset that involves CT
> re-registration, not just by our recovery, as the GuC may continue to
> ignore a broken CTB.
> 

That is a really bad situation to be in. If this occurs, we'd suppress
a GT reset, and broken wouldn't get cleared until another side effect
in the driver triggers a GT reset. If we clear broken here - maybe this
happens sooner? Idk, either way is not great.

How about a WARN_ON if broken is set? Again, this isn't something that
should ever really happen, and if it does we likely have a bug in the
KMD, HW, or GuC which we'd need to root cause.
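
e.g. (sketch, using the flags from the hunk above):

	xe_gt_WARN_ON(gt, ct->ctbs.h2g.info.broken ||
			  ct->ctbs.g2h.info.broken);
	ct->ctbs.h2g.info.broken = false;
	ct->ctbs.g2h.info.broken = false;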

> > +		xe_map_memset(xe, &ct->bo->vmap, CTB_DESC_SIZE * 2, 0,
> > +			      CTB_H2G_BUFFER_SIZE);
> 
> now it's better, but maybe we should introduce
> 
> #define CTB_H2G_BUFFER_OFFSET (CTB_DESC_SIZE * 2)
> 

Sure. I think what we really need to do in a follow-up is add a bit of
kernel doc describing the memory layout too (e.g., like we have in the
LRC code).
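
Roughly (sketch; layout assumed from the memset above, with the two
descriptors at the start of the BO):

	/*
	 * CT BO layout (assumed):
	 *
	 * 0x0000:            H2G CTB descriptor
	 * CTB_DESC_SIZE:     G2H CTB descriptor
	 * CTB_DESC_SIZE * 2: H2G buffer (CTB_H2G_BUFFER_SIZE)
	 * then:              G2H buffer
	 */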

Matt

> > +	}
> >  
> >  	guc_ct_change_state(ct, XE_GUC_CT_STATE_ENABLED);
> >  
> > @@ -554,6 +561,32 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
> >  	return err;
> >  }
> >  
> > +/**
> > + * xe_guc_ct_restart() - Restart GuC CT
> > + * @ct: the &xe_guc_ct
> > + *
> > + * Restart GuC CT to an empty state without issuing a CT register MMIO command.
> > + *
> > + * Return: 0 on success, or a negative errno on failure.
> > + */
> > +int xe_guc_ct_restart(struct xe_guc_ct *ct)
> > +{
> > +	return __xe_guc_ct_start(ct, false);
> > +}
> > +
> > +/**
> > + * xe_guc_ct_enable() - Enable GuC CT
> > + * @ct: the &xe_guc_ct
> > + *
> > + * Enable GuC CT to an empty state and issue a CT register MMIO command.
> > + *
> > + * Return: 0 on success, or a negative errno on failure.
> > + */
> > +int xe_guc_ct_enable(struct xe_guc_ct *ct)
> > +{
> > +	return __xe_guc_ct_start(ct, true);
> > +}
> > +
> >  static void stop_g2h_handler(struct xe_guc_ct *ct)
> >  {
> >  	cancel_work_sync(&ct->g2h_worker);
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> > index 0a88f4e447fa..b1cba250c51c 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> > @@ -15,6 +15,7 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct);
> >  int xe_guc_ct_init(struct xe_guc_ct *ct);
> >  int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
> >  int xe_guc_ct_enable(struct xe_guc_ct *ct);
> > +int xe_guc_ct_restart(struct xe_guc_ct *ct);
> >  void xe_guc_ct_disable(struct xe_guc_ct *ct);
> >  void xe_guc_ct_stop(struct xe_guc_ct *ct);
> >  void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 11/36] drm/xe/guc: Document GuC submission backend
  2025-09-30  3:28   ` Lis, Tomasz
@ 2025-09-30  6:30     ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-30  6:30 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-xe

On Tue, Sep 30, 2025 at 05:28:55AM +0200, Lis, Tomasz wrote:
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > Add kernel-doc to xe_guc_submit.c describing the submission path,
> > the per-queue single-threaded model with pause/resume, the driver shadow
> > state machine and lost-H2G replay, job timeout handling, recovery flows
> > (GT reset, PM resume, VF resume), and reclaim constraints.
> > 
> > v2:
> >   - Minor tweaks for clarity
> >   - Add new doc to Xe rst files
> > v3:
> >   - Clarify global vs per-queue stop / start
> >   - Clarify VF resume flow
> >   - Add section for 'Waiters during VF resume'
> >   - Add section for 'Page-faulting queues during VF migration'
> >   - Add section for 'GuC-ID assignment'
> >   - Add section for 'Reference counting and final queue destruction'
> > v4:
> >   - s/VF resume/VF post migration recovery (Tomasz)
> > 
> > Signed-off-by: Matthew Brost<matthew.brost@intel.com>
> > ---
> >   Documentation/gpu/xe/index.rst     |   1 +
> >   drivers/gpu/drm/xe/xe_guc_submit.c | 282 +++++++++++++++++++++++++++++
> >   2 files changed, 283 insertions(+)
> > 
> > diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
> > index 88b22fad880e..692c544b164c 100644
> > --- a/Documentation/gpu/xe/index.rst
> > +++ b/Documentation/gpu/xe/index.rst
> > @@ -28,3 +28,4 @@ DG2, etc is provided to prototype the driver.
> >      xe_device
> >      xe-drm-usage-stats.rst
> >      xe_configfs
> > +   xe_guc_submit
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 70306f902ba5..cd5e506527fe 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -46,6 +46,288 @@
> >   #include "xe_trace.h"
> >   #include "xe_vm.h"
> > +/*
> > + * DOC: Overview
> > + *
> > + * The GuC submission backend is responsible for submitting GPU jobs to the GuC
> > + * firmware, assigning per-queue GuC IDs, tracking submission state via a
> > + * driver-side state machine, handling GuC-to-host (G2H) messages, tracking
> > + * outstanding jobs, managing job timeouts and queue teardown, and providing
> > + * recovery when GuC state is lost. It is built on top of the DRM scheduler
> > + * (drm_sched).
> > + *
> > + * GuC ID assignment:
> > + * ------------------
> > + * Each queue is assigned a unique GuC ID at queue init. The ID is used in all
> > + * H2G/G2H to identify the queue and remains reserved until final destruction,
> > + * when the GuC is known to hold no references to it.
> > + *
> > + * The backend maintains a reverse map GuC-ID -> queue to resolve targets for
> > + * G2H handlers and to iterate all queues when required (e.g., recovery). This
> > + * map is protected by submission_state.lock, a global (per-GT) lock. Lockless
> > + * lookups are acceptable in paths where the queue’s lifetime is otherwise
> > + * pinned and it cannot disappear underneath the operation (e.g., G2H handlers).
> > + *
> > + * Basic submission flow
> > + * ---------------------
> > + * Submission is driven by the DRM scheduler vfunc ->run_job(). The flow is:
> > + *
> > + * 1) Emit the job's ring instructions.
> > + * 2) Advance the LRC ring tail:
> > + *    - width == 1: simple memory write,
> > + *    - width  > 1: append a GuC workqueue (WQ) item.
> > + * 3) If the queue is unregistered, issue a register H2G for the context.
> > + * 4) Trigger execution via a scheduler enable or context submit command.
> > + * 5) Return the job's hardware fence to the DRM scheduler.
> > + *
> > + * Registration, scheduler enable, and submit commands are issued as host-to-GuC
> > + * (H2G) messages over the Command Transport (CT) layer, like all GuC
> > + * interactions.
> > + *
> > + * Completion path
> > + * ---------------
> > + * When the job's hardware fence signals, the DRM scheduler vfunc ->free_job()
> > + * is called; it drops the job's reference, typically freeing it.
> > + *
> > + * Control-plane messages:
> > + * -----------------------
> > + * GuC submission scheduler messages form the control plane for queue cleanup,
> > + * toggling runnability, and modifying queue properties (e.g., scheduler
> > + * priority, timeslice, preemption timeout). Messages are initiated via queue
> > + * vfuncs that append a control message to the queue. They are processed on the
> > + * same single-threaded DRM scheduler workqueue that runs ->run_job() and
> > + * ->free_job().
> > + *
> > + * Lockless model:
> > + * ---------------
> > + * ->run_job(), ->free_job(), and the message handlers execute as work items on
> > + * a single-threaded DRM scheduler workqueue. Per queue, this provides built-in
> > + * mutual exclusion: only one of these items can run at a time. As a result,
> > + * these paths are lockless with respect to per-queue state tracking. (Global
> > + * or cross-queue data structures still use their own synchronization.)
> > + *
> > + * Stopping / starting:
> > + * --------------------
> > + * The submission backend supports two scopes of quiesce control:
> > + *
> > + *  - Per-queue stop/start:
> > + *    The single-threaded DRM scheduler workqueue for a specific queue can be
> > + *    stopped and started dynamically. Stopping synchronously quiesces that
> > + *    queue's worker (lets any in-flight item finish and prevents new items from
> > + *    starting), yielding a stable snapshot while an external operation (e.g.,
> > + *    job timeout handling) inspects/updates state and performs any required
> > + *    fixups. While stopped, no submission, message, or ->free_job() work runs
> > + *    for that queue. When the operation completes, the queue is started; any
> > + *    pending items are then processed in order on the same worker. Other queues
> > + *    continue to run unaffected.
> > + *
> > + *  - Global (per-GT) stop/start:
> > + *    Implemented on top of the per-queue stop/start primitive: the driver
> > + *    stops (or starts) each queue on the GT to obtain a device-wide stable
> > + *    snapshot. This is used by coordinated recovery flows (GT reset, PM resume,
> > + *    VF post migration recovery). Queues created while the global stop is in
> > + *    effect (i.e., future queues) initialize in the stopped state and remain
> > + *    stopped until the global start. After recovery fixups are complete, a
> > + *    global start iterates queues to start all eligible ones and resumes normal
> > + *    submission.
> > + *
> > + * State machine:
> > + * --------------
> > + * The submission state machine is the driver's shadow of the GuC-visible queue
> > + * state (e.g., registered, runnable, scheduler properties). It tracks the
> > + * transitions we intend to make (issued as H2G commands), marking them pending
> > + * until acknowledged via G2H or otherwise observed as applied. It also records
> > + * the origin of each transition (->run_job(), timeout handler, explicit control
> > + * message, etc.).
> > + *
> > + * Because H2G commands and/or GuC submission state can be lost across GT reset,
> > + * PM resume, or VF post migration recovery, this bookkeeping lets recovery
> > + * decide which operations to replay, which to elide, and which need fixups,
> > + * restoring a consistent queue state without additional per-queue locks.
> > + *
> > + * Job timeouts:
> > + * -------------
> > + * To prevent jobs from running indefinitely and violating dma-fence signaling
> > + * rules, the DRM scheduler tracks how long each job has been running. If a
> > + * threshold is exceeded, it calls ->timeout_job().
> > + *
> > + * ->timeout_job() stops the queue, samples the LRC context timestamps to
> > + * confirm the job actually started and has exceeded the allowed runtime, and
> > + * then, if confirmed, signals all pending jobs' fences and initiates queue
> > + * teardown. Finally, the queue is started.
> > + *
> > + * Job timeout handling runs on a per-GT, single-threaded recovery workqueue
> > + * that is shared with other recovery paths (e.g., GT reset handling, VF
> > + * resume). This guarantees only one recovery action executes at a time.
> > + *
> > + * Queue teardown:
> > + * ---------------
> > + * Teardown can be triggered by: (1) userspace closing the queue; (2) a G2H
> > + * queue-reset notification; (3) a G2H memory_cat_error for the queue; or (4)
> > + * in-flight jobs detected on the queue during GT reset.
> > + *
> > + * In all cases teardown is driven via the timeout path by setting the queue's
> > + * DRM scheduler timeout to zero, forcing an immediate ->timeout_job() pass.
> > + *
> > + * Reference counting and final queue destruction:
> > + * -----------------------------------------------
> > + * Jobs reference-count the queue; queues hold a reference to the VM. When a
> > + * queue's reference count reaches zero (e.g., all jobs are freed and the
> > + * userspace handle is closed), the queue is not destroyed immediately because
> > + * the GuC may still reference its state.
> > + *
> > + * Instead, a control-plane cleanup message is appended to remove GuC-side
> > + * references (e.g., disable runnability, deregister). Once the final G2H
> > + * confirms that the GuC no longer references the queue, the queue is
> > + * eligible for destruction.
> > + *
> > + * To avoid freeing the queue from within its own DRM scheduler workqueue (which
> > + * would risk use-after-free), the actual destruction is deferred to a separate
> > + * work item queued on a dedicated destruction workqueue.
> > + *
> > + * GT resets:
> > + * ----------
> > + * GT resets are triggered by catastrophic errors (e.g., CT channel failure).
> > + * The GuC is reset and all GuC-side submission state is lost. Recovery proceeds
> > + * as follows:
> > + *
> > + * 1) Quiesce:
> > + *    - Stop all queues (global submission stop). Per-queue workers finish any
> > + *      in-flight item and then stop; newly created queues during the window
> > + *      initialize in the stopped state.
> > + *    - Abort any waits on CT/G2H to avoid deadlock.
> > + *
> > + * 2) Sanitize driver shadow state:
> > + *    - For each queue, clear GuC-derived bits in the submission state machine
> > + *      (e.g., registered/enabled) and mark in-flight H2G transitions as lost.
> > + *    - Convert/flush any side effects of lost H2G.
> > + *
> > + * 3) Decide teardown vs. replay:
> > + *    - If a queue's LRC seqno indicates that a job started but did not
> > + *      complete, initiate teardown for that queue via the timeout path.
> > + *    - If no job started, keep the queue for replay.
> > + *
> > + * 4) Resume:
> > + *    - Start remaining queues; resubmit pending jobs.
> > + *    - Queues marked for teardown remain stopped/destroyed.
> > + *
> > + * The entire sequence runs on the per-GT single-threaded recovery worker,
> > + * ensuring only one recovery action executes at a time; a runtime PM reference
> > + * is held for the duration.
> > + *
> > + * PM resume:
> > + * ----------
> > + * PM resume assumes all GuC state is lost (the device may have been powered
> > + * down). It reuses the GT reset recovery path, but executes in the context of
> > + * the caller that wakes the device (runtime PM or system resume).
> > + *
> > + * Suspend entry:
> > + *  - Control-plane message work is quiesced; state toggles that require an
> > + *    active device are not enqueued while suspended.
> > + *  - Per-queue scheduler workers are stopped before the device is allowed to
> > + *    suspend.
> > + *  - Barring driver bugs, no queues should have in-flight jobs at
> > + *    suspend/resume.
> > + *
> > + * On resume, run the GT reset recovery flow and then start eligible queues.
> > + *
> > + * Runtime PM and state-change ordering:
> > + * -------------------------------------
> > + * Runtime/system PM transitions must not race with per-queue submission and
> > + * state updates.
> > + *
> > + * Execution contexts and RPM sources:
> > + *  - Scheduler callbacks (->run_job(), ->free_job(), ->timeout_job()):
> > + *    executed with an active RPM ref held by the in-flight job.
> > + *  - Control-plane message work:
> > + *    enqueued from IOCTL paths that already hold an RPM ref; the message path
> > + *    itself does not get/put RPM. State toggles are only issued while active.
> > + *    During suspend entry, message work is quiesced and no new toggles are
> > + *    enqueued until after resume.
> > + *  - G2H handlers:
> > + *    dispatched with an RPM ref guaranteed by the CT layer.
> > + *  - Recovery phases (GT reset/VF post migration recovery):
> > + *    explicitly get/put an RPM ref for their duration on the per-GT recovery
> > + *    worker.
> > + *
> > + * Consequence:
> > + *  - All submission/state mutations run with an RPM reference. The PM core
> > + *    cannot enter suspend while these updates are in progress, and resume is
> > + *    complete before updates execute. This prevents PM state changes from
> > + *    racing with queue state changes.
> > + *
> > + * VF post migration recovery:
> > + * ---------------------------
> > + * VF post migration recovery resembles a GT reset, but GuC submission state is
> > + * expected to persist across migration; in-flight H2G commands may be lost
> 
> I don't think H2Gs can be lost. GuC is expected to either finish or not read
> them, and after recovery they can all be executed. They only require GGTT
> fixups.
> 
> It is our decision that we scrap them, then re-assess and re-issue the
> commands by manipulating the states of the entities which issued them.
> Maybe:
> 
> ---
> 
> expected to persist across migration; GGTT base/offsets may change,
> requiring update of all references to it, including in CTBs, LRCs, and on
> rings.
> 
> For a wider view on that, see `VF restore procedure in PF KMD and VF
> KMD`_. Recovery proceeds as follows:
> 

Sure.

Matt

> ---
> 
> -Tomasz
> 
> > , and
> > + * GGTT base/offsets may change. Recovery proceeds as follows:
> > + *
> > + * 1) Quiesce:
> > + *    - Stop all queues and abort waits (as with GT reset) to obtain a stable
> > + *      snapshot.
> > + *    - Queues created while VF post migration recovery is in-flight initialize
> > + *      in the stopped state.
> > + *
> > + * 2) Treat H2G as lost and prepare in-place resubmission (GuC/CT down):
> > + *    - Treat in-flight H2G (enable/disable, etc.) as dropped; update shadow
> > + *      bits to a safe baseline and tag the ops as "needs replay".
> > + *    - Quarantine device-visible submission state: set the GuC-visible LRC ring
> > + *      tail equal to the head (and, for WQ-based submission, set the WQ
> > + *      descriptor head == tail) so that when the GuC comes up it will not process
> > + *      any entries that were built with stale GGTT addresses.
> > + *    - Reset the software ring tail to the original value captured at the
> > + *      submission of the oldest pending job, so the write pointer sits exactly
> > + *      where that job was originally emitted.
> > + *
> > + * 3) Replay and resubmit once GuC/CT is live:
> > + *    - VF post migration recovery invokes ->run_job() for pending jobs;
> > + *      ->emit_job() overwrites ring instructions in place, fixes GGTT fields,
> > + *      then advances the LRC tail (and WQ descriptor for width > 1). Required
> > + *      submission H2G(s) are reissued and fresh WQ entries are written.
> > + *    - Queue lost control-plane operations (scheduling-state toggles, cleanup)
> > + *      in order via the message path.
> > + *    - Start the queues to process the queued control-plane operations and run
> > + *      the resubmitted jobs.
> > + *
> > + * The goal is to preserve both job and queue state; no teardown is performed
> > + * in this flow. The sequence runs on the per-GT single-threaded recovery
> > + * worker with a held runtime PM reference.
> > + *
> > + * Waiters during VF post migration recovery
> > + * -----------------------------------------
> > + * The submission backend frequently uses wait_event_timeout() to wait on
> > + * GuC-driven conditions. Across VF migration/recovery two issues arise:
> > + * 1) The timeout does not account for migration downtime and may expire
> > + *    prematurely, triggering undesired actions (e.g., GT reset, prematurely
> > + *    signaling a fence).
> > + * 2) Some waits target GuC work that cannot complete until VF recovery
> > + *    finishes; these typically sit on the queue-stopping path.
> > + *
> > + * To handle this, all waiters must atomically test the "GuC down / VF-recovery
> > + * in progress" condition (e.g., VF_RESFIX_BLOCKED) both before sleeping and
> > + * after wakeup. The flag is coherent with VF migration: vCPUs observe it
> > + * immediately on unhalt, and it is cleared only after the GuC/CT is live again.
> > + * If set, the waiter must either (a) abort the wait without side effects, or
> > + * (b) re-arm the wait with a fresh timeout once the GuC/CT is live. Timeouts
> > + * that occur while GuC/CT are down are non-fatal—the VF-recovery path will
> > + * rebuild state—and must not trigger recovery or teardown.
> > + *
> > + * Relation to reclaim:
> > + * --------------------
> > + * Jobs signal dma-fences, and the MM may wait on those fences during reclaim.
> > + * As a consequence, the entire GuC submission backend (DRM scheduler callbacks,
> > + * message handling, and all recovery paths) lies on the reclaim path and must
> > + * be reclaim-safe.
> > + *
> > + * Practical implications:
> > + * - No memory allocations in these paths (avoid any allocation that could
> > + *   recurse into reclaim or sleep).
> > + * - The global submission-state lock may be taken from reclaim-tainted contexts
> > + *   (timeout/recovery). Any path that acquires it (including queue init/destroy)
> > + *   must not allocate or take locks that can recurse into reclaim while holding
> > + *   it; keep the critical section to state/xarray updates.
> > + */
> > +
> >   static struct xe_guc *
> >   exec_queue_to_guc(struct xe_exec_queue *q)
> >   {

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init
  2025-09-30  0:42       ` Lis, Tomasz
@ 2025-09-30 10:25         ` Michal Wajdeczko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-30 10:25 UTC (permalink / raw)
  To: Lis, Tomasz, Matthew Brost; +Cc: intel-xe



On 9/30/2025 2:42 AM, Lis, Tomasz wrote:
> 
> On 9/29/2025 2:15 PM, Matthew Brost wrote:
>> On Mon, Sep 29, 2025 at 09:42:55AM +0200, Michal Wajdeczko wrote:
>>>
>>> On 9/29/2025 4:55 AM, Matthew Brost wrote:
>>>> From: Tomasz Lis <tomasz.lis@intel.com>
>>>>
>>>> Protect access to GGTT config as this is non-static information.
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 96 ++++++++++++++++++-----
>>>>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  3 +
>>>>   drivers/gpu/drm/xe/xe_sriov_vf.c          |  6 ++
>>>>   3 files changed, 84 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>>> index 0461d5513487..016c867e5e2b 100644
>>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>>> @@ -440,18 +440,21 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>>>>         xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>   +    down_write(&config->lock);
>>>> +
>>> still didn't get answer to my earlier question [1]
>>>
>>> [1] https://patchwork.freedesktop.org/patch/676375/?series=154627&rev=2#comment_1240924
>>>
>> Again, this isn't my patch, so I believe Tomasz will need to chime in to
>> provide an answer or conclusion here.
> Now answered.
>>
>>>>       err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
>>>>       if (unlikely(err))
>>>> -        return err;
>>>> +        goto out;
>>>>         err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
>>>>       if (unlikely(err))
>>>> -        return err;
>>>> +        goto out;
>>>>         if (config->ggtt_size && config->ggtt_size != size) {
>>>>           xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
>>>>                   size / SZ_1K, config->ggtt_size / SZ_1K);
>>>> -        return -EREMCHG;
>>>> +        err = -EREMCHG;
>>>> +        goto out;
>>>>       }
>>>>         xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
>>>> @@ -460,8 +463,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>>>>       config->ggtt_shift = start - (s64)config->ggtt_base;
>>>>       config->ggtt_base = start;
>>>>       config->ggtt_size = size;
>>>> +    err = config->ggtt_size ? 0 : -ENODATA;
>>>>   -    return config->ggtt_size ? 0 : -ENODATA;
>>>> +out:
>>>> +    up_write(&config->lock);
>>>> +    return err;
>>>>   }
>>>>     static int vf_get_lmem_info(struct xe_gt *gt)
>>>> @@ -474,22 +480,28 @@ static int vf_get_lmem_info(struct xe_gt *gt)
>>>>         xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>   +    down_write(&config->lock);
>>>> +
>>> also, commit message says "Protect access to GGTT config "
>>> while the patch seems to apply locking to the whole config ...
>>>
>>> what's the rationale to extend this protection?
>>> just unification?
> 
> That's true, the title doesn't match.
> 
> During post-migration recovery we call the whole `xe_gt_sriov_vf_query_config()`, which was the main reason I went for protecting the whole provisioning.
> 
> Since we can query GuC multiple times, narrowing the protection would work as well - it would just be a bit unusual to let two threads race over writing provisioning info.
> 
> But since these values never change, both would write the same, so no visible problem.
> 
> If you want, I can remove the protection of anything other than GGTT. The solution would then be a little confusing maybe, but would work the same.
> 
> So, which way do we go? Fix patch name+comment, or fix locking range?

if we make the VF GGTT data tile-level, as suggested in [1], then it
can be protected with its own tile-level lock

then the GT-level query can stay as-is, just the updates to the tile.ggtt
will be guarded by the lock

[1] https://patchwork.freedesktop.org/patch/677304/?series=154627&rev=3#comment_1242644
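
e.g. a rough sketch (the struct and field names below are made up, assuming
the move from [1] - the GGTT math matches what vf_get_ggtt_info() already
does):

	#include <linux/rwsem.h>
	#include <linux/types.h>

	struct xe_tile_sriov_vf_selfconfig {
		struct rw_semaphore lock;	/* protects the GGTT fields below */
		u64 ggtt_base;
		u64 ggtt_size;
		s64 ggtt_shift;
	};

	/* Only the tile-level update takes the lock; the GT-level GuC
	 * queries themselves stay lock-free. cfg->lock is assumed to have
	 * been init_rwsem()'d during early init. */
	static void vf_update_ggtt_info(struct xe_tile_sriov_vf_selfconfig *cfg,
					u64 start, u64 size)
	{
		down_write(&cfg->lock);
		cfg->ggtt_shift = start - (s64)cfg->ggtt_base;
		cfg->ggtt_base = start;
		cfg->ggtt_size = size;
		up_write(&cfg->lock);
	}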

> 
>>>
>>>>       err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
>>>>       if (unlikely(err))
>>>> -        return err;
>>>> +        goto out;
>>>>         if (config->lmem_size && config->lmem_size != size) {
>>>>           xe_gt_sriov_err(gt, "Unexpected LMEM reassignment: %lluM != %lluM\n",
>>>>                   size / SZ_1M, config->lmem_size / SZ_1M);
>>>> -        return -EREMCHG;
>>>> +        err = -EREMCHG;
>>>> +        goto out;
>>>>       }
>>>>         string_get_size(size, 1, STRING_UNITS_2, size_str, sizeof(size_str));
>>>>       xe_gt_sriov_dbg_verbose(gt, "LMEM %lluM %s\n", size / SZ_1M, size_str);
>>>>         config->lmem_size = size;
>>>> +    err = config->lmem_size ? 0 : -ENODATA;
>>>>   -    return config->lmem_size ? 0 : -ENODATA;
>>>> +out:
>>>> +    up_write(&config->lock);
>>>> +    return err;
>>>>   }
>>>>     static int vf_get_submission_cfg(struct xe_gt *gt)
>>>> @@ -501,23 +513,27 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>>>>         xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>   +    down_write(&config->lock);
>>>> +
>>>>       err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
>>>>       if (unlikely(err))
>>>> -        return err;
>>>> +        goto out;
>>>>         err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_DOORBELLS_KEY, &num_dbs);
>>>>       if (unlikely(err))
>>>> -        return err;
>>>> +        goto out;
>>>>         if (config->num_ctxs && config->num_ctxs != num_ctxs) {
>>>>           xe_gt_sriov_err(gt, "Unexpected CTXs reassignment: %u != %u\n",
>>>>                   num_ctxs, config->num_ctxs);
>>>> -        return -EREMCHG;
>>>> +        err = -EREMCHG;
>>>> +        goto out;
>>>>       }
>>>>       if (config->num_dbs && config->num_dbs != num_dbs) {
>>>>           xe_gt_sriov_err(gt, "Unexpected DBs reassignment: %u != %u\n",
>>>>                   num_dbs, config->num_dbs);
>>>> -        return -EREMCHG;
>>>> +        err = -EREMCHG;
>>>> +        goto out;
>>>>       }
>>>>         xe_gt_sriov_dbg_verbose(gt, "CTXs %u DBs %u\n", num_ctxs, num_dbs);
>>>> @@ -525,7 +541,11 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>>>>       config->num_ctxs = num_ctxs;
>>>>       config->num_dbs = num_dbs;
>>>>   -    return config->num_ctxs ? 0 : -ENODATA;
>>>> +    err = config->num_ctxs ? 0 : -ENODATA;
>>>> +
>>>> +out:
>>>> +    up_write(&config->lock);
>>>> +    return err;
>>>>   }
>>>>     static void vf_cache_gmdid(struct xe_gt *gt)
>>>> @@ -579,11 +599,18 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
>>>>    */
>>>>   u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>>>>   {
>>>> +    struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>>> +    u16 val;
>>>> +
>>>>       xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>       xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>>> -    xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
>>>>   -    return gt->sriov.vf.self_config.num_ctxs;
>>>> +    down_read(&config->lock);
>>>> +    xe_gt_assert(gt, config->num_ctxs);
>>>> +    val = config->num_ctxs;
>>>> +    up_read(&config->lock);
>>>> +
>>>> +    return val;
>>>>   }
>>>>     /**
>>>> @@ -596,11 +623,18 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>>>>    */
>>>>   u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>>>>   {
>>>> +    struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>>> +    u64 val;
>>>> +
>>>>       xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>       xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>>> -    xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
>>>>   -    return gt->sriov.vf.self_config.lmem_size;
>>>> +    down_read(&config->lock);
>>>> +    xe_gt_assert(gt, config->lmem_size);
>>>> +    val = config->lmem_size;
>>>> +    up_read(&config->lock);
>>>> +
>>>> +    return val;
>>>>   }
>>>>     /**
>>>> @@ -613,11 +647,17 @@ u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>>>>    */
>>>>   u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
>>>>   {
>>>> +    struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>>> +    u64 val;
>>>> +
>>>>       xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>       xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>>> -    xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
>>>>   -    return gt->sriov.vf.self_config.ggtt_size;
>>>> +    down_read(&config->lock);
>>>> +    val = config->ggtt_size;
>>>> +    up_read(&config->lock);
>>>> +
>>>> +    return val;
>>>>   }
>>>>     /**
>>>> @@ -630,11 +670,18 @@ u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
>>>>    */
>>>>   u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
>>>>   {
>>>> +    struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>>> +    u64 val;
>>>> +
>>>>       xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>       xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>>>> -    xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
>>>>   -    return gt->sriov.vf.self_config.ggtt_base;
>>>> +    down_read(&config->lock);
>>>> +    xe_gt_assert(gt, config->ggtt_size);
>>>> +    val = config->ggtt_base;
>>>> +    up_read(&config->lock);
>>>> +
>>>> +    return val;
>>>>   }
>>>>     /**
>>>> @@ -648,11 +695,16 @@ u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
>>>>   s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
>>>>   {
>>>>       struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>>>> +    s64 val;
>>>>         xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>       xe_gt_assert(gt, xe_gt_is_main_type(gt));
>>>>   -    return config->ggtt_shift;
>>>> +    down_read(&config->lock);
>>>> +    val = config->ggtt_shift;
>>>> +    up_read(&config->lock);
>>>> +
>>>> +    return val;
>>>>   }
>>>>     static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
>>>> @@ -1044,6 +1096,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>>>>         xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>>>   +    down_read(&config->lock);
>>>>       drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
>>>>              config->ggtt_base,
>>>>              config->ggtt_base + config->ggtt_size - 1);
>>>> @@ -1060,6 +1113,7 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>>>>         drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
>>>>       drm_printf(p, "GuC doorbells:\t%u\n", config->num_dbs);
>>>> +    up_read(&config->lock);
>>>>   }
>>>>     /**
>>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>>> index 298dedf4b009..d95857bd789b 100644
>>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>>> @@ -6,6 +6,7 @@
>>>>   #ifndef _XE_GT_SRIOV_VF_TYPES_H_
>>>>   #define _XE_GT_SRIOV_VF_TYPES_H_
>>>>   +#include <linux/rwsem.h>
>>>>   #include <linux/types.h>
>>>>   #include "xe_uc_fw_types.h"
>>>>   @@ -25,6 +26,8 @@ struct xe_gt_sriov_vf_selfconfig {
>>>>       u16 num_ctxs;
>>>>       /** @num_dbs: assigned number of GuC doorbells IDs. */
>>>>       u16 num_dbs;
>>>> +    /** @lock: lock for protecting access to all selfconfig fields. */
>>>> +    struct rw_semaphore lock;
>>>>   };
>>>>     /**
>>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>>>> index cdd9f8e78b2a..d6e2ed9b9bbc 100644
>>>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>>>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>>>> @@ -197,6 +197,12 @@ static void vf_migration_init_early(struct xe_device *xe)
>>>>    */
>>>>   void xe_sriov_vf_init_early(struct xe_device *xe)
>>>>   {
>>>> +    struct xe_gt *gt;
>>>> +    unsigned int id;
>>>> +
>>>> +    for_each_gt(gt, xe, id)
>>>> +        init_rwsem(&gt->sriov.vf.self_config.lock);
>>> as before, this should be done in
>>>
>>>     xe_gt_sriov_vf_init_early
>>>
>> I picked up this change, but again I think Michal's questions need some
>> answers.
>>
>> Matt
> 
> Sure, that is also a very early call so we can move it there. (As Matt actually did, just not in this patch.)
> 
> -Tomasz
> 
>>
>>>> +
>>>>       vf_migration_init_early(xe);
>>>>   }
>>>>   


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
  2025-09-29 15:17   ` K V P, Satyanarayana
@ 2025-09-30 12:39     ` Matthew Brost
  2025-09-30 13:38       ` Michal Wajdeczko
  0 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-09-30 12:39 UTC (permalink / raw)
  To: K V P, Satyanarayana; +Cc: intel-xe

On Mon, Sep 29, 2025 at 08:47:33PM +0530, K V P, Satyanarayana wrote:
> 
> 
> On 29-09-2025 08:25, Matthew Brost wrote:
> > From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > 
> > Some VF2GUC actions may take longer to process. Increase the default
> > timeout after receiving a BUSY indication to 2 sec to cover all worst-case
> > scenarios.
> > 
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_guc.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > index c016a11b6ab1..f0de1fa61898 100644
> > --- a/drivers/gpu/drm/xe/xe_guc.c
> > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > @@ -1439,7 +1439,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
> >   		BUILD_BUG_ON((GUC_HXG_TYPE_RESPONSE_SUCCESS ^ GUC_HXG_TYPE_RESPONSE_FAILURE) != 1);
> >   		ret = xe_mmio_wait32(mmio, reply_reg, resp_mask, resp_mask,
> > -				     1000000, &header, false);
> > +				     2000000, &header, false);
> >   		if (unlikely(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, header) !=
> >   			     GUC_HXG_ORIGIN_GUC))
> 
> LGTM.
> Acked-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>

This is your patch, so you can't ack it, but anyway:

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init
  2025-09-29  8:13   ` Ville Syrjälä
@ 2025-09-30 13:22     ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30 13:22 UTC (permalink / raw)
  To: Ville Syrjälä, Matthew Brost; +Cc: intel-xe


On 9/29/2025 10:13 AM, Ville Syrjälä wrote:
> On Sun, Sep 28, 2025 at 07:55:08PM -0700, Matthew Brost wrote:
>> From: Tomasz Lis <tomasz.lis@intel.com>
>>
>> Protect access to GGTT config as this is non-static information.
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 96 ++++++++++++++++++-----
>>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  3 +
>>   drivers/gpu/drm/xe/xe_sriov_vf.c          |  6 ++
>>   3 files changed, 84 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>> index 0461d5513487..016c867e5e2b 100644
>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>> @@ -440,18 +440,21 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>>   
>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>   
>> +	down_write(&config->lock);
>> +
>>   	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
>>   	if (unlikely(err))
>> -		return err;
>> +		goto out;
>>   
>>   	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
>>   	if (unlikely(err))
>> -		return err;
>> +		goto out;
>>   
>>   	if (config->ggtt_size && config->ggtt_size != size) {
>>   		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
>>   				size / SZ_1K, config->ggtt_size / SZ_1K);
>> -		return -EREMCHG;
>> +		err = -EREMCHG;
>> +		goto out;
>>   	}
>>   
>>   	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
>> @@ -460,8 +463,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt)
>>   	config->ggtt_shift = start - (s64)config->ggtt_base;
>>   	config->ggtt_base = start;
>>   	config->ggtt_size = size;
>> +	err = config->ggtt_size ? 0 : -ENODATA;
>>   
>> -	return config->ggtt_size ? 0 : -ENODATA;
>> +out:
>> +	up_write(&config->lock);
>> +	return err;
>>   }
>>   
>>   static int vf_get_lmem_info(struct xe_gt *gt)
>> @@ -474,22 +480,28 @@ static int vf_get_lmem_info(struct xe_gt *gt)
>>   
>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>   
>> +	down_write(&config->lock);
>> +
>>   	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_LMEM_SIZE_KEY, &size);
>>   	if (unlikely(err))
>> -		return err;
>> +		goto out;
>>   
>>   	if (config->lmem_size && config->lmem_size != size) {
>>   		xe_gt_sriov_err(gt, "Unexpected LMEM reassignment: %lluM != %lluM\n",
>>   				size / SZ_1M, config->lmem_size / SZ_1M);
>> -		return -EREMCHG;
>> +		err = -EREMCHG;
>> +		goto out;
>>   	}
>>   
>>   	string_get_size(size, 1, STRING_UNITS_2, size_str, sizeof(size_str));
>>   	xe_gt_sriov_dbg_verbose(gt, "LMEM %lluM %s\n", size / SZ_1M, size_str);
>>   
>>   	config->lmem_size = size;
>> +	err = config->lmem_size ? 0 : -ENODATA;
>>   
>> -	return config->lmem_size ? 0 : -ENODATA;
>> +out:
>> +	up_write(&config->lock);
>> +	return err;
>>   }
>>   
>>   static int vf_get_submission_cfg(struct xe_gt *gt)
>> @@ -501,23 +513,27 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>>   
>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>   
>> +	down_write(&config->lock);
>> +
>>   	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_CONTEXTS_KEY, &num_ctxs);
>>   	if (unlikely(err))
>> -		return err;
>> +		goto out;
>>   
>>   	err = guc_action_query_single_klv32(guc, GUC_KLV_VF_CFG_NUM_DOORBELLS_KEY, &num_dbs);
>>   	if (unlikely(err))
>> -		return err;
>> +		goto out;
>>   
>>   	if (config->num_ctxs && config->num_ctxs != num_ctxs) {
>>   		xe_gt_sriov_err(gt, "Unexpected CTXs reassignment: %u != %u\n",
>>   				num_ctxs, config->num_ctxs);
>> -		return -EREMCHG;
>> +		err = -EREMCHG;
>> +		goto out;
>>   	}
>>   	if (config->num_dbs && config->num_dbs != num_dbs) {
>>   		xe_gt_sriov_err(gt, "Unexpected DBs reassignment: %u != %u\n",
>>   				num_dbs, config->num_dbs);
>> -		return -EREMCHG;
>> +		err = -EREMCHG;
>> +		goto out;
>>   	}
>>   
>>   	xe_gt_sriov_dbg_verbose(gt, "CTXs %u DBs %u\n", num_ctxs, num_dbs);
>> @@ -525,7 +541,11 @@ static int vf_get_submission_cfg(struct xe_gt *gt)
>>   	config->num_ctxs = num_ctxs;
>>   	config->num_dbs = num_dbs;
>>   
>> -	return config->num_ctxs ? 0 : -ENODATA;
>> +	err = config->num_ctxs ? 0 : -ENODATA;
>> +
>> +out:
>> +	up_write(&config->lock);
>> +	return err;
>>   }
>>   
>>   static void vf_cache_gmdid(struct xe_gt *gt)
>> @@ -579,11 +599,18 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
>>    */
>>   u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>>   {
>> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>> +	u16 val;
>> +
>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>   	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>> -	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
>>   
>> -	return gt->sriov.vf.self_config.num_ctxs;
>> +	down_read(&config->lock);
>> +	xe_gt_assert(gt, config->num_ctxs);
>> +	val = config->num_ctxs;
>> +	up_read(&config->lock);
>> +
>> +	return val;
>>   }
>>   
>>   /**
>> @@ -596,11 +623,18 @@ u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>>    */
>>   u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
>>   {
>> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>> +	u64 val;
>> +
>>   	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>   	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
>> -	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
>>   
>> -	return gt->sriov.vf.self_config.lmem_size;
>> +	down_read(&config->lock);
>> +	xe_gt_assert(gt, config->lmem_size);
>> +	val = config->lmem_size;
>> +	up_read(&config->lock);
> Why is someone mutating this sort of information at runtime?
> Sounds pretty crazy to me.

Only the GGTT address can be altered, and only when very specific features
are used - VF restore and VF coming out of sleep.

Memory sizes and the number of contexts currently can't be changed (though
to admins of farms of VMs, I'm pretty sure it would sound useful rather
than crazy).

-Tomasz



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
  2025-09-30 12:39     ` Matthew Brost
@ 2025-09-30 13:38       ` Michal Wajdeczko
  2025-09-30 14:39         ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-30 13:38 UTC (permalink / raw)
  To: Matthew Brost, K V P, Satyanarayana; +Cc: intel-xe



On 9/30/2025 2:39 PM, Matthew Brost wrote:
> On Mon, Sep 29, 2025 at 08:47:33PM +0530, K V P, Satyanarayana wrote:
>>
>>
>> On 29-09-2025 08:25, Matthew Brost wrote:
>>> From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>>>
>>> Some VF2GUC actions may take longer to process. Increase the default
>>> timeout after receiving a BUSY indication to 2 sec to cover all worst-case
>>> scenarios.
>>>
>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/xe_guc.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>>> index c016a11b6ab1..f0de1fa61898 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>> @@ -1439,7 +1439,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>>   		BUILD_BUG_ON((GUC_HXG_TYPE_RESPONSE_SUCCESS ^ GUC_HXG_TYPE_RESPONSE_FAILURE) != 1);
>>>   		ret = xe_mmio_wait32(mmio, reply_reg, resp_mask, resp_mask,
>>> -				     1000000, &header, false);
>>> +				     2000000, &header, false);
>>>   		if (unlikely(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, header) !=
>>>   			     GUC_HXG_ORIGIN_GUC))
>>
>> LGTM.
>> Acked-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> 
> This is your patch, so can't ack by but anyways:
> 
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>

but shouldn't we wait until your previous concern [1] is addressed?

[1] https://patchwork.freedesktop.org/patch/675316/?series=154682&rev=1#comment_1240144


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
  2025-09-30 13:38       ` Michal Wajdeczko
@ 2025-09-30 14:39         ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-30 14:39 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: K V P, Satyanarayana, intel-xe

On Tue, Sep 30, 2025 at 03:38:43PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 9/30/2025 2:39 PM, Matthew Brost wrote:
> > On Mon, Sep 29, 2025 at 08:47:33PM +0530, K V P, Satyanarayana wrote:
> >>
> >>
> >> On 29-09-2025 08:25, Matthew Brost wrote:
> >>> From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> >>>
> >>> Some VF2GUC actions may take longer to process. Increase the default
> >>> timeout after receiving a BUSY indication to 2 sec to cover all worst-case
> >>> scenarios.
> >>>
> >>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> >>> ---
> >>>   drivers/gpu/drm/xe/xe_guc.c | 2 +-
> >>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> >>> index c016a11b6ab1..f0de1fa61898 100644
> >>> --- a/drivers/gpu/drm/xe/xe_guc.c
> >>> +++ b/drivers/gpu/drm/xe/xe_guc.c
> >>> @@ -1439,7 +1439,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
> >>>   		BUILD_BUG_ON((GUC_HXG_TYPE_RESPONSE_SUCCESS ^ GUC_HXG_TYPE_RESPONSE_FAILURE) != 1);
> >>>   		ret = xe_mmio_wait32(mmio, reply_reg, resp_mask, resp_mask,
> >>> -				     1000000, &header, false);
> >>> +				     2000000, &header, false);
> >>>   		if (unlikely(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, header) !=
> >>>   			     GUC_HXG_ORIGIN_GUC))
> >>
> >> LGTM.
> >> Acked-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > 
> > This is your patch, so can't ack by but anyways:
> > 
> > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> 
> but shouldn't we wait until your previous concern [1] is addressed ?
> 

Nah. That comment is actually invalid too. I misread the code - I didn't
realize this was in the GUC_HXG_TYPE_NO_RESPONSE_BUSY path. I think this is
fine as is. We have relatively large waits on the GuC all over the driver
because in practice timeouts shouldn't happen unless the GuC has died. If
anything we should make this timeout bigger here.

Matt

> [1] https://patchwork.freedesktop.org/patch/675316/?series=154682&rev=1#comment_1240144
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 13/36] drm/xe/vf: Make VF recovery run on per-GT worker
  2025-09-29  2:55 ` [PATCH v3 13/36] drm/xe/vf: Make VF recovery run on per-GT worker Matthew Brost
@ 2025-09-30 14:47   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30 14:47 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> VF recovery is a per-GT operation, so it makes sense to isolate it to a
> per-GT queue. Scheduling this operation on the same worker as the GT
> reset and TDR not only aligns with this design but also helps avoid race
> conditions, as those operations can also modify the queue state.
>
> v2:
>   - Fix lockdep splat (Adam)
>   - Use xe_sriov_vf_migration_supported helper
> v3:
>   - Drop xe_gt_sriov_ prefix for private functions (Michal)
>   - Drop message in xe_gt_sriov_vf_migration_init_early (Michal)
>   - Logic rework in vf_post_migration_notify_resfix_done (Michal)
>   - Rework init sequence layering (Michal)

One minor remark below, but other than that:

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt.c                |   6 +
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 179 +++++++++++++++-
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |   3 +-
>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |   7 +
>   drivers/gpu/drm/xe/xe_sriov_vf.c          | 246 ----------------------
>   drivers/gpu/drm/xe/xe_sriov_vf.h          |   1 -
>   drivers/gpu/drm/xe/xe_sriov_vf_types.h    |   4 -
>   7 files changed, 182 insertions(+), 264 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 3e0ad7e5b5df..5f9ba4caf837 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -398,6 +398,12 @@ int xe_gt_init_early(struct xe_gt *gt)
>   			return err;
>   	}
>   
> +	if (IS_SRIOV_VF(gt_to_xe(gt))) {
> +		err = xe_gt_sriov_vf_init_early(gt);
> +		if (err)
> +			return err;
> +	}
> +
>   	xe_reg_sr_init(&gt->reg_sr, "GT", gt_to_xe(gt));
>   
>   	err = xe_wa_gt_init(gt);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 71309219a4b7..ae9df9c0876d 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -25,11 +25,15 @@
>   #include "xe_guc.h"
>   #include "xe_guc_hxg_helpers.h"
>   #include "xe_guc_relay.h"
> +#include "xe_guc_submit.h"
> +#include "xe_irq.h"
>   #include "xe_lrc.h"
>   #include "xe_memirq.h"
>   #include "xe_mmio.h"
> +#include "xe_pm.h"
>   #include "xe_sriov.h"
>   #include "xe_sriov_vf.h"
> +#include "xe_tile_sriov_vf.h"
>   #include "xe_uc_fw.h"
>   #include "xe_wopcm.h"
>   
> @@ -308,13 +312,13 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
>   }
>   
>   /**
> - * xe_gt_sriov_vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> + * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
>    * @gt: the &xe_gt struct instance linked to target GuC
>    *
>    * Returns: 0 if the operation completed successfully, or a negative error
>    * code otherwise.
>    */
> -int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
> +static int vf_notify_resfix_done(struct xe_gt *gt)
>   {
>   	struct xe_guc *guc = &gt->uc.guc;
>   	int err;
> @@ -808,7 +812,7 @@ int xe_gt_sriov_vf_connect(struct xe_gt *gt)
>    * xe_gt_sriov_vf_default_lrcs_hwsp_rebase - Update GGTT references in HWSP of default LRCs.
>    * @gt: the &xe_gt struct instance
>    */
> -void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
> +static void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
>   {
>   	struct xe_hw_engine *hwe;
>   	enum xe_hw_engine_id id;
> @@ -817,6 +821,26 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
>   		xe_default_lrc_update_memirq_regs_with_address(hwe);
>   }
>   
> +static void vf_start_migration_recovery(struct xe_gt *gt)
> +{
> +	bool started;
> +
> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> +	spin_lock(&gt->sriov.vf.migration.lock);
> +
> +	if (!gt->sriov.vf.migration.recovery_queued) {
> +		gt->sriov.vf.migration.recovery_queued = true;
> +		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> +
> +		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
> +		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
> +				 "scheduled" : "already in progress");
> +	}
> +
> +	spin_unlock(&gt->sriov.vf.migration.lock);
> +}
> +
>   /**
>    * xe_gt_sriov_vf_migrated_event_handler - Start a VF migration recovery,
>    *   or just mark that a GuC is ready for it.
> @@ -831,15 +855,8 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
>   	xe_gt_assert(gt, IS_SRIOV_VF(xe));
>   	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_inprogress(gt));
>   
> -	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
> -	/*
> -	 * We need to be certain that if all flags were set, at least one
> -	 * thread will notice that and schedule the recovery.
> -	 */
> -	smp_mb__after_atomic();
> -
>   	xe_gt_sriov_info(gt, "ready for recovery after migration\n");
> -	xe_sriov_vf_start_migration_recovery(xe);
> +	vf_start_migration_recovery(gt);
>   }
>   
>   static bool vf_is_negotiated(struct xe_gt *gt, u16 major, u16 minor)
> @@ -1175,6 +1192,146 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
>   		   pf_version->major, pf_version->minor);
>   }
>   
> +static void vf_post_migration_shutdown(struct xe_gt *gt)
> +{
> +	int ret = 0;
> +
> +	spin_lock_irq(&gt->sriov.vf.migration.lock);
> +	gt->sriov.vf.migration.recovery_queued = false;
> +	spin_unlock_irq(&gt->sriov.vf.migration.lock);
> +
> +	xe_guc_submit_pause(&gt->uc.guc);
> +	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
> +
> +	if (ret)
> +		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
> +}
> +
> +static size_t post_migration_scratch_size(struct xe_device *xe)
> +{
> +	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
> +}
> +
> +static int vf_post_migration_fixups(struct xe_gt *gt)
> +{
> +	s64 shift;
> +	void *buf;
> +	int err;
> +
> +	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_ATOMIC);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	err = xe_gt_sriov_vf_query_config(gt);
> +	if (err)
> +		goto out;
> +
> +	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> +	if (shift) {
> +		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> +		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> +		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> +		if (err)
> +			goto out;
> +	}
> +
> +out:
> +	kfree(buf);
> +	return err;
> +}
> +
> +static void vf_post_migration_kickstart(struct xe_gt *gt)
> +{
> +	/*
> +	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
> +	 * must be working at this point, since the recovery has started,
> +	 * but the rest was not enabled using the procedure from the spec.
> +	 */
> +	xe_irq_resume(gt_to_xe(gt));
> +
> +	xe_guc_submit_reset_unblock(&gt->uc.guc);
> +	xe_guc_submit_unpause(&gt->uc.guc);
> +}
> +
> +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> +{
> +	bool skip_resfix = false;
> +
> +	spin_lock_irq(&gt->sriov.vf.migration.lock);
> +	if (gt->sriov.vf.migration.recovery_queued) {
> +		skip_resfix = true;
> +		xe_gt_sriov_dbg(gt, "another recovery imminent, skipped some notifications\n");

Now that the recovery is per-GT, this message concerns only one RESFIX_DONE
notification.

(though the message will disappear anyway in the future, with double
migration support)

-Tomasz

> +	} else {
> +		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false);
> +	}
> +	spin_unlock_irq(&gt->sriov.vf.migration.lock);
> +
> +	if (skip_resfix)
> +		return -EAGAIN;
> +
> +	return vf_notify_resfix_done(gt);
> +}
> +
> +static void vf_post_migration_recovery(struct xe_gt *gt)
> +{
> +	struct xe_device *xe = gt_to_xe(gt);
> +	int err;
> +
> +	xe_gt_sriov_dbg(gt, "migration recovery in progress\n");
> +
> +	xe_pm_runtime_get(xe);
> +	vf_post_migration_shutdown(gt);
> +
> +	if (!xe_sriov_vf_migration_supported(xe)) {
> +		xe_gt_sriov_err(gt, "migration is not supported\n");
> +		err = -ENOTRECOVERABLE;
> +		goto fail;
> +	}
> +
> +	err = vf_post_migration_fixups(gt);
> +	if (err)
> +		goto fail;
> +
> +	vf_post_migration_kickstart(gt);
> +	err = vf_post_migration_notify_resfix_done(gt);
> +	if (err && err != -EAGAIN)
> +		goto fail;
> +
> +	xe_pm_runtime_put(xe);
> +	xe_gt_sriov_notice(gt, "migration recovery ended\n");
> +	return;
> +fail:
> +	xe_pm_runtime_put(xe);
> +	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
> +	xe_device_declare_wedged(xe);
> +}
> +
> +static void migration_worker_func(struct work_struct *w)
> +{
> +	struct xe_gt *gt = container_of(w, struct xe_gt,
> +					sriov.vf.migration.worker);
> +
> +	vf_post_migration_recovery(gt);
> +}
> +
> +/**
> + * xe_gt_sriov_vf_init_early() - GT VF init early
> + * @gt: the &xe_gt
> + *
> + * Return 0 on success, errno on failure
> + */
> +int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
> +{
> +	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
> +		return 0;
> +
> +	init_rwsem(&gt->sriov.vf.self_config.lock);
> +	spin_lock_init(&gt->sriov.vf.migration.lock);
> +	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> +
> +	return 0;
> +}
> +
>   /**
>    * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
>    * @gt: the &xe_gt
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index bb5f8eace19b..0b0f2a30e67c 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -21,10 +21,9 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
>   int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
>   int xe_gt_sriov_vf_connect(struct xe_gt *gt);
>   int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
> -void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
> -int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
>   void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
>   
> +int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
>   bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
>   
>   u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 7b10b8e1e10e..53680a2f188a 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -8,6 +8,7 @@
>   
>   #include <linux/rwsem.h>
>   #include <linux/types.h>
> +#include <linux/workqueue.h>
>   #include "xe_uc_fw_types.h"
>   
>   /**
> @@ -53,6 +54,12 @@ struct xe_gt_sriov_vf_runtime {
>    * xe_gt_sriov_vf_migration - VF migration data.
>    */
>   struct xe_gt_sriov_vf_migration {
> +	/** @worker: VF migration recovery worker */
> +	struct work_struct worker;
> +	/** @lock: Protects recovery_queued */
> +	spinlock_t lock;
> +	/** @recovery_queued: VF post migration recovery is queued */
> +	bool recovery_queued;
>   	/** @recovery_inprogress: VF post migration recovery in progress */
>   	bool recovery_inprogress;
>   };
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index da064a1e7419..911d5720917b 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -6,21 +6,12 @@
>   #include <drm/drm_debugfs.h>
>   #include <drm/drm_managed.h>
>   
> -#include "xe_assert.h"
> -#include "xe_device.h"
>   #include "xe_gt.h"
> -#include "xe_gt_sriov_printk.h"
>   #include "xe_gt_sriov_vf.h"
>   #include "xe_guc.h"
> -#include "xe_guc_submit.h"
> -#include "xe_irq.h"
> -#include "xe_lrc.h"
> -#include "xe_pm.h"
> -#include "xe_sriov.h"
>   #include "xe_sriov_printk.h"
>   #include "xe_sriov_vf.h"
>   #include "xe_sriov_vf_ccs.h"
> -#include "xe_tile_sriov_vf.h"
>   
>   /**
>    * DOC: VF restore procedure in PF KMD and VF KMD
> @@ -158,8 +149,6 @@ static void vf_disable_migration(struct xe_device *xe, const char *fmt, ...)
>   	xe->sriov.vf.migration.enabled = false;
>   }
>   
> -static void migration_worker_func(struct work_struct *w);
> -
>   static void vf_migration_init_early(struct xe_device *xe)
>   {
>   	/*
> @@ -184,8 +173,6 @@ static void vf_migration_init_early(struct xe_device *xe)
>   						    guc_version.major, guc_version.minor);
>   	}
>   
> -	INIT_WORK(&xe->sriov.vf.migration.worker, migration_worker_func);
> -
>   	xe->sriov.vf.migration.enabled = true;
>   	xe_sriov_dbg(xe, "migration support enabled\n");
>   }
> @@ -196,242 +183,9 @@ static void vf_migration_init_early(struct xe_device *xe)
>    */
>   void xe_sriov_vf_init_early(struct xe_device *xe)
>   {
> -	struct xe_gt *gt;
> -	unsigned int id;
> -
> -	for_each_gt(gt, xe, id)
> -		init_rwsem(&gt->sriov.vf.self_config.lock);
> -
>   	vf_migration_init_early(xe);
>   }
>   
> -/**
> - * vf_post_migration_shutdown - Stop the driver activities after VF migration.
> - * @xe: the &xe_device struct instance
> - *
> - * After this VM is migrated and assigned to a new VF, it is running on a new
> - * hardware, and therefore many hardware-dependent states and related structures
> - * require fixups. Without fixups, the hardware cannot do any work, and therefore
> - * all GPU pipelines are stalled.
> - * Stop some of kernel activities to make the fixup process faster.
> - */
> -static void vf_post_migration_shutdown(struct xe_device *xe)
> -{
> -	struct xe_gt *gt;
> -	unsigned int id;
> -	int ret = 0;
> -
> -	for_each_gt(gt, xe, id) {
> -		xe_guc_submit_pause(&gt->uc.guc);
> -		ret |= xe_guc_submit_reset_block(&gt->uc.guc);
> -	}
> -
> -	if (ret)
> -		drm_info(&xe->drm, "migration recovery encountered ongoing reset\n");
> -}
> -
> -/**
> - * vf_post_migration_kickstart - Re-start the driver activities under new hardware.
> - * @xe: the &xe_device struct instance
> - *
> - * After we have finished with all post-migration fixups, restart the driver
> - * activities to continue feeding the GPU with workloads.
> - */
> -static void vf_post_migration_kickstart(struct xe_device *xe)
> -{
> -	struct xe_gt *gt;
> -	unsigned int id;
> -
> -	/*
> -	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
> -	 * must be working at this point, since the recovery did started,
> -	 * but the rest was not enabled using the procedure from spec.
> -	 */
> -	xe_irq_resume(xe);
> -
> -	for_each_gt(gt, xe, id) {
> -		xe_guc_submit_reset_unblock(&gt->uc.guc);
> -		xe_guc_submit_unpause(&gt->uc.guc);
> -	}
> -}
> -
> -static bool gt_vf_post_migration_needed(struct xe_gt *gt)
> -{
> -	return test_bit(gt->info.id, &gt_to_xe(gt)->sriov.vf.migration.gt_flags);
> -}
> -
> -/*
> - * Notify GuCs marked in flags about resource fixups apply finished.
> - * @xe: the &xe_device struct instance
> - * @gt_flags: flags marking to which GTs the notification shall be sent
> - */
> -static int vf_post_migration_notify_resfix_done(struct xe_device *xe, unsigned long gt_flags)
> -{
> -	struct xe_gt *gt;
> -	unsigned int id;
> -	int err = 0;
> -
> -	for_each_gt(gt, xe, id) {
> -		if (!test_bit(id, &gt_flags))
> -			continue;
> -		/* skip asking GuC for RESFIX exit if new recovery request arrived */
> -		if (gt_vf_post_migration_needed(gt))
> -			continue;
> -		err = xe_gt_sriov_vf_notify_resfix_done(gt);
> -		if (err)
> -			break;
> -		clear_bit(id, &gt_flags);
> -	}
> -
> -	if (gt_flags && !err)
> -		drm_dbg(&xe->drm, "another recovery imminent, skipped some notifications\n");
> -	return err;
> -}
> -
> -static int vf_get_next_migrated_gt_id(struct xe_device *xe)
> -{
> -	struct xe_gt *gt;
> -	unsigned int id;
> -
> -	for_each_gt(gt, xe, id) {
> -		if (test_and_clear_bit(id, &xe->sriov.vf.migration.gt_flags))
> -			return id;
> -	}
> -	return -1;
> -}
> -
> -static size_t post_migration_scratch_size(struct xe_device *xe)
> -{
> -	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
> -}
> -
> -/**
> - * Perform post-migration fixups on a single GT.
> - *
> - * After migration, GuC needs to be re-queried for VF configuration to check
> - * if it matches previous provisioning. Most of VF provisioning shall be the
> - * same, except GGTT range, since GGTT is not virtualized per-VF. If GGTT
> - * range has changed, we have to perform fixups - shift all GGTT references
> - * used anywhere within the driver. After the fixups in this function succeed,
> - * it is allowed to ask the GuC bound to this GT to continue normal operation.
> - *
> - * Returns: 0 if the operation completed successfully, or a negative error
> - * code otherwise.
> - */
> -static int gt_vf_post_migration_fixups(struct xe_gt *gt)
> -{
> -	s64 shift;
> -	void *buf;
> -	int err;
> -
> -	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_KERNEL);
> -	if (!buf)
> -		return -ENOMEM;
> -
> -	err = xe_gt_sriov_vf_query_config(gt);
> -	if (err)
> -		goto out;
> -
> -	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> -	if (shift) {
> -		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> -		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> -		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> -		if (err)
> -			goto out;
> -	}
> -
> -out:
> -	kfree(buf);
> -	return err;
> -}
> -
> -static void vf_post_migration_recovery(struct xe_device *xe)
> -{
> -	unsigned long fixed_gts = 0;
> -	int id, err;
> -
> -	drm_dbg(&xe->drm, "migration recovery in progress\n");
> -	xe_pm_runtime_get(xe);
> -	vf_post_migration_shutdown(xe);
> -
> -	if (!xe_sriov_vf_migration_supported(xe)) {
> -		xe_sriov_err(xe, "migration is not supported\n");
> -		err = -ENOTRECOVERABLE;
> -		goto fail;
> -	}
> -
> -	while (id = vf_get_next_migrated_gt_id(xe), id >= 0) {
> -		struct xe_gt *gt = xe_device_get_gt(xe, id);
> -
> -		err = gt_vf_post_migration_fixups(gt);
> -		if (err)
> -			goto fail;
> -
> -		set_bit(id, &fixed_gts);
> -	}
> -
> -	vf_post_migration_kickstart(xe);
> -	err = vf_post_migration_notify_resfix_done(xe, fixed_gts);
> -	if (err)
> -		goto fail;
> -
> -	xe_pm_runtime_put(xe);
> -	drm_notice(&xe->drm, "migration recovery ended\n");
> -	return;
> -fail:
> -	xe_pm_runtime_put(xe);
> -	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
> -	xe_device_declare_wedged(xe);
> -}
> -
> -static void migration_worker_func(struct work_struct *w)
> -{
> -	struct xe_device *xe = container_of(w, struct xe_device,
> -					    sriov.vf.migration.worker);
> -
> -	vf_post_migration_recovery(xe);
> -}
> -
> -/*
> - * Check if post-restore recovery is coming on any of GTs.
> - * @xe: the &xe_device struct instance
> - *
> - * Return: True if migration recovery worker will soon be running. Any worker currently
> - * executing does not affect the result.
> - */
> -static bool vf_ready_to_recovery_on_any_gts(struct xe_device *xe)
> -{
> -	struct xe_gt *gt;
> -	unsigned int id;
> -
> -	for_each_gt(gt, xe, id) {
> -		if (test_bit(id, &xe->sriov.vf.migration.gt_flags))
> -			return true;
> -	}
> -	return false;
> -}
> -
> -/**
> - * xe_sriov_vf_start_migration_recovery - Start VF migration recovery.
> - * @xe: the &xe_device to start recovery on
> - *
> - * This function shall be called only by VF.
> - */
> -void xe_sriov_vf_start_migration_recovery(struct xe_device *xe)
> -{
> -	bool started;
> -
> -	xe_assert(xe, IS_SRIOV_VF(xe));
> -
> -	if (!vf_ready_to_recovery_on_any_gts(xe))
> -		return;
> -
> -	started = queue_work(xe->sriov.wq, &xe->sriov.vf.migration.worker);
> -	drm_info(&xe->drm, "VF migration recovery %s\n", started ?
> -		 "scheduled" : "already in progress");
> -}
> -
>   /**
>    * xe_sriov_vf_init_late() - SR-IOV VF late initialization functions.
>    * @xe: the &xe_device to initialize
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.h b/drivers/gpu/drm/xe/xe_sriov_vf.h
> index 9e752105ec2a..4df95266b261 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.h
> @@ -13,7 +13,6 @@ struct xe_device;
>   
>   void xe_sriov_vf_init_early(struct xe_device *xe);
>   int xe_sriov_vf_init_late(struct xe_device *xe);
> -void xe_sriov_vf_start_migration_recovery(struct xe_device *xe);
>   bool xe_sriov_vf_migration_supported(struct xe_device *xe);
>   void xe_sriov_vf_debugfs_register(struct xe_device *xe, struct dentry *root);
>   
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> index 426cc5841958..6a0fd0f5463e 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> @@ -33,10 +33,6 @@ struct xe_device_vf {
>   
>   	/** @migration: VF Migration state data */
>   	struct {
> -		/** @migration.worker: VF migration recovery worker */
> -		struct work_struct worker;
> -		/** @migration.gt_flags: Per-GT request flags for VF migration recovery */
> -		unsigned long gt_flags;
>   		/**
>   		 * @migration.enabled: flag indicating if migration support
>   		 * was enabled or not due to missing prerequisites

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 15/36] drm/xe/vf: Remove memory allocations from VF post migration recovery
  2025-09-29  2:55 ` [PATCH v3 15/36] drm/xe/vf: Remove memory allocations from VF post migration recovery Matthew Brost
@ 2025-09-30 15:00   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30 15:00 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> VF post migration recovery is the path of dma-fence signaling / reclaim,
> avoid memory allocations in this path.
>
> v3:
>   - s/lrc_wa_bb/scratch (Tomasz)

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 23 +++++++++++++----------
>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  2 ++
>   2 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index ae9df9c0876d..6f15619efe01 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -1214,17 +1214,13 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
>   
>   static int vf_post_migration_fixups(struct xe_gt *gt)
>   {
> +	void *buf = gt->sriov.vf.migration.scratch;
>   	s64 shift;
> -	void *buf;
>   	int err;
>   
> -	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_ATOMIC);
> -	if (!buf)
> -		return -ENOMEM;
> -
>   	err = xe_gt_sriov_vf_query_config(gt);
>   	if (err)
> -		goto out;
> +		return err;
>   
>   	shift = xe_gt_sriov_vf_ggtt_shift(gt);
>   	if (shift) {
> @@ -1232,12 +1228,10 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
>   		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
>   		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
>   		if (err)
> -			goto out;
> +			return err;
>   	}
>   
> -out:
> -	kfree(buf);
> -	return err;
> +	return 0;
>   }
>   
>   static void vf_post_migration_kickstart(struct xe_gt *gt)
> @@ -1322,9 +1316,18 @@ static void migration_worker_func(struct work_struct *w)
>    */
>   int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>   {
> +	void *buf;
> +
>   	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
>   		return 0;
>   
> +	buf = drmm_kmalloc(&gt_to_xe(gt)->drm,
> +			   post_migration_scratch_size(gt_to_xe(gt)),
> +			   GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	gt->sriov.vf.migration.scratch = buf;
>   	init_rwsem(&gt->sriov.vf.self_config.lock);
>   	spin_lock_init(&gt->sriov.vf.migration.lock);
>   	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 53680a2f188a..a63b6004b0b7 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -58,6 +58,8 @@ struct xe_gt_sriov_vf_migration {
>   	struct work_struct worker;
>   	/** @lock: Protects recovery_queued */
>   	spinlock_t lock;
> +	/** @scratch: Scratch memory for VF recovery */
> +	void *scratch;
>   	/** @recovery_queued: VF post migration recovery is queued */
>   	bool recovery_queued;
>   	/** @recovery_inprogress: VF post migration recovery in progress */

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 03/36] Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery"
  2025-09-29  2:55 ` [PATCH v3 03/36] Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery" Matthew Brost
@ 2025-09-30 15:22   ` Michal Wajdeczko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-30 15:22 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> This reverts commit ba180a362128cb71d16c3f0ce6645448011d2607.
> 
> Due to a change in the VF migration recovery design, this code
> is no longer needed.
> 
> v3:
>  - Add commit message (Michal / Lucas)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>

> ---
>  drivers/gpu/drm/xe/abi/guc_actions_abi.h |  8 ----
>  drivers/gpu/drm/xe/xe_guc_submit.c       | 54 ------------------------
>  2 files changed, 62 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_abi.h
> index 31090c69dfbe..47756e4674a1 100644
> --- a/drivers/gpu/drm/xe/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_actions_abi.h
> @@ -196,14 +196,6 @@ enum xe_guc_register_context_multi_lrc_param_offsets {
>  	XE_GUC_REGISTER_CONTEXT_MULTI_LRC_MSG_MIN_LEN = 11,
>  };
>  
> -enum xe_guc_context_wq_item_offsets {
> -	XE_GUC_CONTEXT_WQ_HEADER_DATA_0_TYPE_LEN = 0,
> -	XE_GUC_CONTEXT_WQ_EL_INFO_DATA_1_CTX_DESC_LOW,
> -	XE_GUC_CONTEXT_WQ_EL_INFO_DATA_2_GUCCTX_RINGTAIL_FREEZEPOCS,
> -	XE_GUC_CONTEXT_WQ_EL_INFO_DATA_3_WI_FENCE_ID,
> -	XE_GUC_CONTEXT_WQ_EL_CHILD_LIST_DATA_4_RINGTAIL,
> -};
> -
>  enum xe_guc_report_status {
>  	XE_GUC_REPORT_STATUS_UNKNOWN = 0x0,
>  	XE_GUC_REPORT_STATUS_ACKED = 0x1,
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 53024eb5670b..3ac0950f55be 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -735,18 +735,12 @@ static void wq_item_append(struct xe_exec_queue *q)
>  	if (wq_wait_for_space(q, wqi_size))
>  		return;
>  
> -	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_HEADER_DATA_0_TYPE_LEN);
>  	wqi[i++] = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_MULTI_LRC) |
>  		FIELD_PREP(WQ_LEN_MASK, len_dw);
> -	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_EL_INFO_DATA_1_CTX_DESC_LOW);
>  	wqi[i++] = xe_lrc_descriptor(q->lrc[0]);
> -	xe_gt_assert(guc_to_gt(guc), i ==
> -		     XE_GUC_CONTEXT_WQ_EL_INFO_DATA_2_GUCCTX_RINGTAIL_FREEZEPOCS);
>  	wqi[i++] = FIELD_PREP(WQ_GUC_ID_MASK, q->guc->id) |
>  		FIELD_PREP(WQ_RING_TAIL_MASK, q->lrc[0]->ring.tail / sizeof(u64));
> -	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_EL_INFO_DATA_3_WI_FENCE_ID);
>  	wqi[i++] = 0;
> -	xe_gt_assert(guc_to_gt(guc), i == XE_GUC_CONTEXT_WQ_EL_CHILD_LIST_DATA_4_RINGTAIL);
>  	for (j = 1; j < q->width; ++j) {
>  		struct xe_lrc *lrc = q->lrc[j];
>  
> @@ -767,50 +761,6 @@ static void wq_item_append(struct xe_exec_queue *q)
>  	parallel_write(xe, map, wq_desc.tail, q->guc->wqi_tail);
>  }
>  
> -static int wq_items_rebase(struct xe_exec_queue *q)
> -{
> -	struct xe_guc *guc = exec_queue_to_guc(q);
> -	struct xe_device *xe = guc_to_xe(guc);
> -	struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
> -	int i = q->guc->wqi_head;
> -
> -	/* the ring starts after a header struct */
> -	iosys_map_incr(&map, offsetof(struct guc_submit_parallel_scratch, wq[0]));
> -
> -	while ((i % WQ_SIZE) != (q->guc->wqi_tail % WQ_SIZE)) {
> -		u32 len_dw, type, val;
> -
> -		if (drm_WARN_ON_ONCE(&xe->drm, i < 0 || i > 2 * WQ_SIZE))
> -			break;
> -
> -		val = xe_map_rd_ring_u32(xe, &map, i / sizeof(u32) +
> -					 XE_GUC_CONTEXT_WQ_HEADER_DATA_0_TYPE_LEN,
> -					 WQ_SIZE / sizeof(u32));
> -		len_dw = FIELD_GET(WQ_LEN_MASK, val);
> -		type = FIELD_GET(WQ_TYPE_MASK, val);
> -
> -		if (drm_WARN_ON_ONCE(&xe->drm, len_dw >= WQ_SIZE / sizeof(u32)))
> -			break;
> -
> -		if (type == WQ_TYPE_MULTI_LRC) {
> -			val = xe_lrc_descriptor(q->lrc[0]);
> -			xe_map_wr_ring_u32(xe, &map, i / sizeof(u32) +
> -					   XE_GUC_CONTEXT_WQ_EL_INFO_DATA_1_CTX_DESC_LOW,
> -					   WQ_SIZE / sizeof(u32), val);
> -		} else if (drm_WARN_ON_ONCE(&xe->drm, type != WQ_TYPE_NOOP)) {
> -			break;
> -		}
> -
> -		i += (len_dw + 1) * sizeof(u32);
> -	}
> -
> -	if ((i % WQ_SIZE) != (q->guc->wqi_tail % WQ_SIZE)) {
> -		xe_gt_err(q->gt, "Exec queue fixups incomplete - wqi parse failed\n");
> -		return -EBADMSG;
> -	}
> -	return 0;
> -}
> -
>  #define RESUME_PENDING	~0x0ull
>  static void submit_exec_queue(struct xe_exec_queue *q)
>  {
> @@ -2669,10 +2619,6 @@ int xe_guc_contexts_hwsp_rebase(struct xe_guc *guc, void *scratch)
>  		err = xe_exec_queue_contexts_hwsp_rebase(q, scratch);
>  		if (err)
>  			break;
> -		if (xe_exec_queue_is_parallel(q))
> -			err = wq_items_rebase(q);
> -		if (err)
> -			break;
>  	}
>  	mutex_unlock(&guc->submission_state.lock);
>  


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 04/36] Revert "drm/xe/vf: Post migration, repopulate ring area for pending request"
  2025-09-29  2:55 ` [PATCH v3 04/36] Revert "drm/xe/vf: Post migration, repopulate ring area for pending request" Matthew Brost
@ 2025-09-30 15:24   ` Michal Wajdeczko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-30 15:24 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> This reverts commit a0dda25d24e636df5c30a9370464b7cebc709faf.
> 
> Due to a change in the VF migration recovery design, this code
> is no longer needed.
> 
> v3:
>  - Add commit message (Michal / Lucas)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_exec_queue.c | 24 ------------------------
>  drivers/gpu/drm/xe/xe_exec_queue.h |  3 +--
>  drivers/gpu/drm/xe/xe_guc_submit.c | 24 ------------------------
>  drivers/gpu/drm/xe/xe_guc_submit.h |  2 --
>  drivers/gpu/drm/xe/xe_sriov_vf.c   |  1 -
>  5 files changed, 1 insertion(+), 53 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 37b2b93b73d6..6bfaca424ca3 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -1123,27 +1123,3 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
>  
>  	return err;
>  }
> -
> -/**
> - * xe_exec_queue_jobs_ring_restore - Re-emit ring commands of requests pending on given queue.
> - * @q: the &xe_exec_queue struct instance
> - */
> -void xe_exec_queue_jobs_ring_restore(struct xe_exec_queue *q)
> -{
> -	struct xe_gpu_scheduler *sched = &q->guc->sched;
> -	struct xe_sched_job *job;
> -
> -	/*
> -	 * This routine is used within VF migration recovery. This means
> -	 * using the lock here introduces a restriction: we cannot wait
> -	 * for any GFX HW response while the lock is taken.
> -	 */
> -	spin_lock(&sched->base.job_list_lock);
> -	list_for_each_entry(job, &sched->base.pending_list, drm.list) {
> -		if (xe_sched_job_is_error(job))
> -			continue;
> -
> -		q->ring_ops->emit_job(job);
> -	}
> -	spin_unlock(&sched->base.job_list_lock);
> -}
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
> index 15ec852e7f7e..8821ceb838d0 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.h
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.h
> @@ -92,7 +92,6 @@ void xe_exec_queue_update_run_ticks(struct xe_exec_queue *q);
>  
>  int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch);
>  
> -void xe_exec_queue_jobs_ring_restore(struct xe_exec_queue *q);
> -
>  struct xe_lrc *xe_exec_queue_lrc(struct xe_exec_queue *q);
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 3ac0950f55be..16f78376f196 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -845,30 +845,6 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
>  	return fence;
>  }
>  
> -/**
> - * xe_guc_jobs_ring_rebase - Re-emit ring commands of requests pending
> - * on all queues under a guc.
> - * @guc: the &xe_guc struct instance
> - */
> -void xe_guc_jobs_ring_rebase(struct xe_guc *guc)
> -{
> -	struct xe_exec_queue *q;
> -	unsigned long index;
> -
> -	/*
> -	 * This routine is used within VF migration recovery. This means
> -	 * using the lock here introduces a restriction: we cannot wait
> -	 * for any GFX HW response while the lock is taken.
> -	 */
> -	mutex_lock(&guc->submission_state.lock);
> -	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
> -		if (exec_queue_killed_or_banned_or_wedged(q))
> -			continue;
> -		xe_exec_queue_jobs_ring_restore(q);
> -	}
> -	mutex_unlock(&guc->submission_state.lock);
> -}
> -
>  static void guc_exec_queue_free_job(struct drm_sched_job *drm_job)
>  {
>  	struct xe_sched_job *job = to_xe_sched_job(drm_job);
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> index 78c3f07e31a0..5b4a0a6fd818 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> @@ -36,8 +36,6 @@ int xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32 *msg,
>  int xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc, u32 *msg, u32 len);
>  int xe_guc_error_capture_handler(struct xe_guc *guc, u32 *msg, u32 len);
>  
> -void xe_guc_jobs_ring_rebase(struct xe_guc *guc);
> -
>  struct xe_guc_submit_exec_queue_snapshot *
>  xe_guc_exec_queue_snapshot_capture(struct xe_exec_queue *q);
>  void
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index d6e2ed9b9bbc..0581b881b628 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -340,7 +340,6 @@ static int gt_vf_post_migration_fixups(struct xe_gt *gt)
>  		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
>  		if (err)
>  			goto out;
> -		xe_guc_jobs_ring_rebase(&gt->uc.guc);
>  		xe_guc_ct_fixup_messages_with_ggtt(&gt->uc.guc.ct, shift);
>  	}
>  


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 05/36] Revert "drm/xe/vf: Fixup CTB send buffer messages after migration"
  2025-09-29  2:55 ` [PATCH v3 05/36] Revert "drm/xe/vf: Fixup CTB send buffer messages after migration" Matthew Brost
@ 2025-09-30 15:27   ` Michal Wajdeczko
  0 siblings, 0 replies; 83+ messages in thread
From: Michal Wajdeczko @ 2025-09-30 15:27 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> This reverts commit cef88d1265cac7d415606af73ba58926fd3cd8b7.
> 
> Due to a change in the VF migration recovery design, this code
> is no longer needed.
> 
> v3:
>  - Add commit message (Michal / Lucas)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_guc_ct.c   | 183 -------------------------------
>  drivers/gpu/drm/xe/xe_guc_ct.h   |   2 -
>  drivers/gpu/drm/xe/xe_map.h      |  18 ---
>  drivers/gpu/drm/xe/xe_sriov_vf.c |   2 -
>  4 files changed, 205 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 18f6327bf552..47079ab9922c 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -25,7 +25,6 @@
>  #include "xe_gt_printk.h"
>  #include "xe_gt_sriov_pf_control.h"
>  #include "xe_gt_sriov_pf_monitor.h"
> -#include "xe_gt_sriov_printk.h"
>  #include "xe_guc.h"
>  #include "xe_guc_log.h"
>  #include "xe_guc_relay.h"
> @@ -93,8 +92,6 @@ struct g2h_fence {
>  	bool done;
>  };
>  
> -#define make_u64(hi, lo) ((u64)((u64)(u32)(hi) << 32 | (u32)(lo)))
> -
>  static void g2h_fence_init(struct g2h_fence *g2h_fence, u32 *response_buffer)
>  {
>  	memset(g2h_fence, 0, sizeof(*g2h_fence));
> @@ -1793,186 +1790,6 @@ static void g2h_worker_func(struct work_struct *w)
>  	receive_g2h(ct);
>  }
>  
> -static void xe_fixup_u64_in_cmds(struct xe_device *xe, struct iosys_map *cmds,
> -				 u32 size, u32 idx, s64 shift)
> -{
> -	u32 hi, lo;
> -	u64 offset;
> -
> -	lo = xe_map_rd_ring_u32(xe, cmds, idx, size);
> -	hi = xe_map_rd_ring_u32(xe, cmds, idx + 1, size);
> -	offset = make_u64(hi, lo);
> -	offset += shift;
> -	lo = lower_32_bits(offset);
> -	hi = upper_32_bits(offset);
> -	xe_map_wr_ring_u32(xe, cmds, idx, size, lo);
> -	xe_map_wr_ring_u32(xe, cmds, idx + 1, size, hi);
> -}
> -
> -/*
> - * Shift any GGTT addresses within a single message left within CTB from
> - * before post-migration recovery.
> - * @ct: pointer to CT struct of the target GuC
> - * @cmds: iomap buffer containing CT messages
> - * @head: start of the target message within the buffer
> - * @len: length of the target message
> - * @size: size of the commands buffer
> - * @shift: the address shift to be added to each GGTT reference
> - * Return: true if the message was fixed or needed no fixups, false on failure
> - */
> -static bool ct_fixup_ggtt_in_message(struct xe_guc_ct *ct,
> -				     struct iosys_map *cmds, u32 head,
> -				     u32 len, u32 size, s64 shift)
> -{
> -	struct xe_gt *gt = ct_to_gt(ct);
> -	struct xe_device *xe = ct_to_xe(ct);
> -	u32 msg[GUC_HXG_MSG_MIN_LEN];
> -	u32 action, i, n;
> -
> -	xe_gt_assert(gt, len >= GUC_HXG_MSG_MIN_LEN);
> -
> -	msg[0] = xe_map_rd_ring_u32(xe, cmds, head, size);
> -	action = FIELD_GET(GUC_HXG_REQUEST_MSG_0_ACTION, msg[0]);
> -
> -	xe_gt_sriov_dbg_verbose(gt, "fixing H2G %#x\n", action);
> -
> -	switch (action) {
> -	case XE_GUC_ACTION_REGISTER_CONTEXT:
> -		if (len != XE_GUC_REGISTER_CONTEXT_MSG_LEN)
> -			goto err_len;
> -		xe_fixup_u64_in_cmds(xe, cmds, size, head +
> -				     XE_GUC_REGISTER_CONTEXT_DATA_5_WQ_DESC_ADDR_LOWER,
> -				     shift);
> -		xe_fixup_u64_in_cmds(xe, cmds, size, head +
> -				     XE_GUC_REGISTER_CONTEXT_DATA_7_WQ_BUF_BASE_LOWER,
> -				     shift);
> -		xe_fixup_u64_in_cmds(xe, cmds, size, head +
> -				     XE_GUC_REGISTER_CONTEXT_DATA_10_HW_LRC_ADDR, shift);
> -		break;
> -	case XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC:
> -		if (len < XE_GUC_REGISTER_CONTEXT_MULTI_LRC_MSG_MIN_LEN)
> -			goto err_len;
> -		n = xe_map_rd_ring_u32(xe, cmds, head +
> -				       XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_10_NUM_CTXS, size);
> -		if (len != XE_GUC_REGISTER_CONTEXT_MULTI_LRC_MSG_MIN_LEN + 2 * n)
> -			goto err_len;
> -		xe_fixup_u64_in_cmds(xe, cmds, size, head +
> -				     XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_5_WQ_DESC_ADDR_LOWER,
> -				     shift);
> -		xe_fixup_u64_in_cmds(xe, cmds, size, head +
> -				     XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_7_WQ_BUF_BASE_LOWER,
> -				     shift);
> -		for (i = 0; i < n; i++)
> -			xe_fixup_u64_in_cmds(xe, cmds, size, head +
> -					     XE_GUC_REGISTER_CONTEXT_MULTI_LRC_DATA_11_HW_LRC_ADDR
> -					     + 2 * i, shift);
> -		break;
> -	default:
> -		break;
> -	}
> -	return true;
> -
> -err_len:
> -	xe_gt_err(gt, "Skipped G2G %#x message fixups, unexpected length (%u)\n", action, len);
> -	return false;
> -}
> -
> -/*
> - * Apply fixups to the next outgoing CT message within given CTB
> - * @ct: the &xe_guc_ct struct instance representing the target GuC
> - * @h2g: the &guc_ctb struct instance of the target buffer
> - * @shift: shift to be added to all GGTT addresses within the CTB
> - * @mhead: pointer to an integer storing message start position; the
> - *   position is changed to next message before this function return
> - * @avail: size of the area available for parsing, that is length
> - *   of all remaining messages stored within the CTB
> - * Return: size of the area available for parsing after one message
> - *   has been parsed, that is length remaining from the updated mhead
> - */
> -static int ct_fixup_ggtt_in_buffer(struct xe_guc_ct *ct, struct guc_ctb *h2g,
> -				   s64 shift, u32 *mhead, s32 avail)
> -{
> -	struct xe_gt *gt = ct_to_gt(ct);
> -	struct xe_device *xe = ct_to_xe(ct);
> -	u32 msg[GUC_HXG_MSG_MIN_LEN];
> -	u32 size = h2g->info.size;
> -	u32 head = *mhead;
> -	u32 len;
> -
> -	xe_gt_assert(gt, avail >= (s32)GUC_CTB_MSG_MIN_LEN);
> -
> -	/* Read header */
> -	msg[0] = xe_map_rd_ring_u32(xe, &h2g->cmds, head, size);
> -	len = FIELD_GET(GUC_CTB_MSG_0_NUM_DWORDS, msg[0]) + GUC_CTB_MSG_MIN_LEN;
> -
> -	if (unlikely(len > (u32)avail)) {
> -		xe_gt_err(gt, "H2G channel broken on read, avail=%d, len=%d, fixups skipped\n",
> -			  avail, len);
> -		return 0;
> -	}
> -
> -	head = (head + GUC_CTB_MSG_MIN_LEN) % size;
> -	if (!ct_fixup_ggtt_in_message(ct, &h2g->cmds, head, msg_len_to_hxg_len(len), size, shift))
> -		return 0;
> -	*mhead = (head + msg_len_to_hxg_len(len)) % size;
> -
> -	return avail - len;
> -}
> -
> -/**
> - * xe_guc_ct_fixup_messages_with_ggtt - Fixup any pending H2G CTB messages
> - * @ct: pointer to CT struct of the target GuC
> - * @ggtt_shift: shift to be added to all GGTT addresses within the CTB
> - *
> - * Messages in GuC to Host CTB are owned by GuC and any fixups in them
> - * are made by GuC. But content of the Host to GuC CTB is owned by the
> - * KMD, so fixups to GGTT references in any pending messages need to be
> - * applied here.
> - * This function updates GGTT offsets in payloads of pending H2G CTB
> - * messages (messages which were not consumed by GuC before the VF got
> - * paused).
> - */
> -void xe_guc_ct_fixup_messages_with_ggtt(struct xe_guc_ct *ct, s64 ggtt_shift)
> -{
> -	struct guc_ctb *h2g = &ct->ctbs.h2g;
> -	struct xe_guc *guc = ct_to_guc(ct);
> -	struct xe_gt *gt = guc_to_gt(guc);
> -	u32 head, tail, size;
> -	s32 avail;
> -
> -	if (unlikely(h2g->info.broken))
> -		return;
> -
> -	h2g->info.head = desc_read(ct_to_xe(ct), h2g, head);
> -	head = h2g->info.head;
> -	tail = READ_ONCE(h2g->info.tail);
> -	size = h2g->info.size;
> -
> -	if (unlikely(head > size))
> -		goto corrupted;
> -
> -	if (unlikely(tail >= size))
> -		goto corrupted;
> -
> -	avail = tail - head;
> -
> -	/* beware of buffer wrap case */
> -	if (unlikely(avail < 0))
> -		avail += size;
> -	xe_gt_dbg(gt, "available %d (%u:%u:%u)\n", avail, head, tail, size);
> -	xe_gt_assert(gt, avail >= 0);
> -
> -	while (avail > 0)
> -		avail = ct_fixup_ggtt_in_buffer(ct, h2g, ggtt_shift, &head, avail);
> -
> -	return;
> -
> -corrupted:
> -	xe_gt_err(gt, "Corrupted H2G descriptor head=%u tail=%u size=%u, fixups not applied\n",
> -		  head, tail, size);
> -	h2g->info.broken = true;
> -}
> -
>  static struct xe_guc_ct_snapshot *guc_ct_snapshot_alloc(struct xe_guc_ct *ct, bool atomic,
>  							bool want_ctb)
>  {
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index cf41210ab30a..d6c81325a76c 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -24,8 +24,6 @@ void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot, struct drm_pr
>  void xe_guc_ct_snapshot_free(struct xe_guc_ct_snapshot *snapshot);
>  void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool want_ctb);
>  
> -void xe_guc_ct_fixup_messages_with_ggtt(struct xe_guc_ct *ct, s64 ggtt_shift);
> -
>  static inline bool xe_guc_ct_initialized(struct xe_guc_ct *ct)
>  {
>  	return ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED;
> diff --git a/drivers/gpu/drm/xe/xe_map.h b/drivers/gpu/drm/xe/xe_map.h
> index 8d67f6ba2d95..f62e0c8b67ab 100644
> --- a/drivers/gpu/drm/xe/xe_map.h
> +++ b/drivers/gpu/drm/xe/xe_map.h
> @@ -78,24 +78,6 @@ static inline void xe_map_write32(struct xe_device *xe, struct iosys_map *map,
>  	iosys_map_wr(map__, offset__, type__, val__);			\
>  })
>  
> -#define xe_map_rd_array(xe__, map__, index__, type__) \
> -	xe_map_rd(xe__, map__, (index__) * sizeof(type__), type__)
> -
> -#define xe_map_wr_array(xe__, map__, index__, type__, val__) \
> -	xe_map_wr(xe__, map__, (index__) * sizeof(type__), type__, val__)
> -
> -#define xe_map_rd_array_u32(xe__, map__, index__) \
> -	xe_map_rd_array(xe__, map__, index__, u32)
> -
> -#define xe_map_wr_array_u32(xe__, map__, index__, val__) \
> -	xe_map_wr_array(xe__, map__, index__, u32, val__)
> -
> -#define xe_map_rd_ring_u32(xe__, map__, index__, size__) \
> -	xe_map_rd_array_u32(xe__, map__, (index__) % (size__))
> -
> -#define xe_map_wr_ring_u32(xe__, map__, index__, size__, val__) \
> -	xe_map_wr_array_u32(xe__, map__, (index__) % (size__), val__)
> -
>  #define xe_map_rd_field(xe__, map__, struct_offset__, struct_type__, field__) ({	\
>  	struct xe_device *__xe = xe__;					\
>  	xe_device_assert_mem_access(__xe);				\
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index 0581b881b628..da064a1e7419 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -12,7 +12,6 @@
>  #include "xe_gt_sriov_printk.h"
>  #include "xe_gt_sriov_vf.h"
>  #include "xe_guc.h"
> -#include "xe_guc_ct.h"
>  #include "xe_guc_submit.h"
>  #include "xe_irq.h"
>  #include "xe_lrc.h"
> @@ -340,7 +339,6 @@ static int gt_vf_post_migration_fixups(struct xe_gt *gt)
>  		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
>  		if (err)
>  			goto out;
> -		xe_guc_ct_fixup_messages_with_ggtt(&gt->uc.guc.ct, shift);
>  	}
>  
>  out:


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 17/36] drm/xe/vf: Teardown VF post migration worker on driver unload
  2025-09-29  2:55 ` [PATCH v3 17/36] drm/xe/vf: Teardown VF post migration worker on driver unload Matthew Brost
@ 2025-09-30 16:24   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-09-30 16:24 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Be cautious and ensure the VF post-migration worker is not running
> during driver unload.
>
> v3:
>   - More teardown later in driver init, use devm (Tomasz)

There is no other teardown, so you probably meant "Move".


There is no real need to check `xe_sriov_vf_migration_supported()`
within `xe_gt_sriov_vf_init()`, at least as long as the teardown is just
setting a flag. So this is ok (though you can add the condition if you
prefer, to avoid confusion during future modifications); a sketch of
what that would look like follows below.
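
For illustration, the optional guard would just be an early return at
the top of the new function (a minimal sketch based on the code in this
patch, not a request to change it):

	int xe_gt_sriov_vf_init(struct xe_gt *gt)
	{
		/* Skip registering teardown when migration is not supported */
		if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
			return 0;

		return devm_add_action_or_reset(gt_to_xe(gt)->drm.dev,
						vf_migration_fini, gt);
	}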

Either way, this is ok:

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

>
> Signed-off-by: Matthew Brost<matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt.c                |  6 +++++
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 31 ++++++++++++++++++++++-
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  1 +
>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  4 ++-
>   4 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 5f9ba4caf837..82be38c99205 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -663,6 +663,12 @@ int xe_gt_init(struct xe_gt *gt)
>   	if (err)
>   		return err;
>   
> +	if (IS_SRIOV_VF(gt_to_xe(gt))) {
> +		err = xe_gt_sriov_vf_init(gt);
> +		if (err)
> +			return err;
> +	}
> +
>   	return 0;
>   }
>   
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index ad1d63b5b8d1..cc5af19c1911 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -811,7 +811,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
>   
>   	spin_lock(&gt->sriov.vf.migration.lock);
>   
> -	if (!gt->sriov.vf.migration.recovery_queued) {
> +	if (!gt->sriov.vf.migration.recovery_queued ||
> +	if (!gt->sriov.vf.migration.recovery_queued &&
> +	    !gt->sriov.vf.migration.recovery_teardown) {
>   		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
>   
> @@ -1283,6 +1284,17 @@ static void migration_worker_func(struct work_struct *w)
>   	vf_post_migration_recovery(gt);
>   }
>   
> +static void vf_migration_fini(void *arg)
> +{
> +	struct xe_gt *gt = arg;
> +
> +	spin_lock_irq(&gt->sriov.vf.migration.lock);
> +	gt->sriov.vf.migration.recovery_teardown = true;
> +	spin_unlock_irq(&gt->sriov.vf.migration.lock);
> +
> +	cancel_work_sync(&gt->sriov.vf.migration.worker);
> +}
> +
>   /**
>    * xe_gt_sriov_vf_init_early() - GT VF init early
>    * @gt: the &xe_gt
> @@ -1314,6 +1326,23 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>   	return 0;
>   }
>   
> +/**
> + * xe_gt_sriov_vf_init() - GT VF init
> + * @gt: the &xe_gt
> + *
> + * Return 0 on success, errno on failure
> + */
> +int xe_gt_sriov_vf_init(struct xe_gt *gt)
> +{
> +	/*
> +	 * We want to tear down the VF post-migration early during driver
> +	 * unload; therefore, we add this finalization action later during
> +	 * driver load.
> +	 */
> +	return devm_add_action_or_reset(gt_to_xe(gt)->drm.dev,
> +					vf_migration_fini, gt);
> +}
> +
>   /**
>    * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
>    * @gt: the &xe_gt
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index ff3a0ce608cd..71e1d566da81 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -24,6 +24,7 @@ int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
>   void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
>   
>   int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
> +int xe_gt_sriov_vf_init(struct xe_gt *gt);
>   bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
>   
>   u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 6cbf8291a5ab..e135018cba1e 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -59,10 +59,12 @@ struct xe_gt_sriov_vf_runtime {
>   struct xe_gt_sriov_vf_migration {
>   	/** @migration: VF migration recovery worker */
>   	struct work_struct worker;
> -	/** @lock: Protects recovery_queued */
> +	/** @lock: Protects recovery_queued, teardown */
>   	spinlock_t lock;
>   	/** @scratch: Scratch memory for VF recovery */
>   	void *scratch;
> +	/** @recovery_teardown: VF post migration recovery is being torn down */
> +	bool recovery_teardown;
>   	/** @recovery_queued: VF post migration recovery is queued */
>   	bool recovery_queued;
>   	/** @recovery_inprogress: VF post migration recovery in progress */

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation
  2025-09-30  2:06   ` Lis, Tomasz
@ 2025-09-30 22:53     ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-09-30 22:53 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-xe

On Tue, Sep 30, 2025 at 04:06:55AM +0200, Lis, Tomasz wrote:
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > kmalloc can fail, so the returned value must have a NULL check.
> > 
> > Fixes: 168b5867318b ("drm/xe/vf: Refresh utilization buffer during migration recovery")
> > Signed-off-by: Matthew Brost<matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_lrc.c | 10 ++++++++--
> >   1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> > index 47e9df775072..e1bc102a6cae 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc.c
> > +++ b/drivers/gpu/drm/xe/xe_lrc.c
> > @@ -1303,8 +1303,11 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
> >   	u32 *buf = NULL;
> >   	int ret;
> > -	if (lrc->bo->vmap.is_iomem)
> > +	if (lrc->bo->vmap.is_iomem) {
> >   		buf = kmalloc(LRC_WA_BB_SIZE, GFP_KERNEL);
> > +		if (!buf)
> > +			return -ENOMEM;
> > +	}
> >   	ret = xe_lrc_setup_wa_bb_with_scratch(lrc, hwe, buf);
> 
> xe_lrc_setup_wa_bb_with_scratch()->setup_bo() handles the ENOMEM return, so there was no bug.
> 
> > @@ -1347,8 +1350,11 @@ setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
> >   	if (xe_gt_WARN_ON(lrc->gt, !state.funcs))
> >   		return 0;
> > -	if (lrc->bo->vmap.is_iomem)
> > +	if (lrc->bo->vmap.is_iomem) {
> >   		state.buffer = kmalloc(state.max_size, GFP_KERNEL);
> > +		if (!state.buffer)
> > +			return -ENOMEM;
> > +	}
> >   	ret = setup_bo(&state);
> 
> setup_bo() does another check with an ENOMEM return, so no bug here either.
> Also, with how setup_bo() exits, it ignores the lack of allocation in case it
> won't be used anyway. -Tomasz
> 

Ok, I see this now, but stylistically / from a layering perspective this
is simply not how it is done in Linux - the pattern is basically always
kmalloc + an immediate NULL check.

I can drop the Fixes tag as it doesn't require a backport to stable, but
for style and readability we should merge this patch or a version of it.
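
For reference, the usual shape is (a minimal sketch of the pattern, as
already used in this patch):

	buf = kmalloc(LRC_WA_BB_SIZE, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

rather than passing a possibly-NULL pointer down and relying on a
callee to notice.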

Matt

> >   	if (ret) {

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 20/36] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration
  2025-09-29  2:55 ` [PATCH v3 20/36] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration Matthew Brost
@ 2025-10-01 13:45   ` Lis, Tomasz
  2025-10-01 13:56     ` Lis, Tomasz
  0 siblings, 1 reply; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-01 13:45 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Blocking in work queues on a hardware action that may never occur —
> especially when it depends on a software fixup also scheduled on the
> awork queue — is a recipe for deadlock. This situation arises with
> the preempt rebind worker and VF post-migration recovery. To prevent
> potential deadlocks, avoid indefinite blocking in the preempt rebind
> worker for VFs that support migration.

Some would say the timeout value is a magic number here, but I don't 
have anything better to propose.
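
If we ever wanted to avoid the bare literal, a named constant would do
(a sketch only; the name is hypothetical):

	/* Bounded wait so VF post-migration fixups can still run */
	#define XE_VF_PREEMPT_FENCE_WAIT	(HZ / 5)

but HZ / 5 inline is fine with me.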

And we are under no obligation to match each tracepoint _enter() with
_exit(), so that's ok as well.

So, all good:

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>


> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++++++++++++++++++-
>   1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 80b7f13ecd80..b527ee2a5da5 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -35,6 +35,7 @@
>   #include "xe_pt.h"
>   #include "xe_pxp.h"
>   #include "xe_res_cursor.h"
> +#include "xe_sriov_vf.h"
>   #include "xe_svm.h"
>   #include "xe_sync.h"
>   #include "xe_tile.h"
> @@ -111,12 +112,25 @@ static int alloc_preempt_fences(struct xe_vm *vm, struct list_head *list,
>   static int wait_for_existing_preempt_fences(struct xe_vm *vm)
>   {
>   	struct xe_exec_queue *q;
> +	bool vf_migration = IS_SRIOV_VF(vm->xe) &&
> +		xe_sriov_vf_migration_supported(vm->xe);
>   
>   	xe_vm_assert_held(vm);
>   
>   	list_for_each_entry(q, &vm->preempt.exec_queues, lr.link) {
>   		if (q->lr.pfence) {
> -			long timeout = dma_fence_wait(q->lr.pfence, false);
> +			long timeout;
> +
> +			if (vf_migration)
> +				timeout = dma_fence_wait_timeout(q->lr.pfence,
> +								 false, HZ / 5);
> +			else
> +				timeout = dma_fence_wait(q->lr.pfence, false);
> +
> +			if (!timeout) {
> +				xe_assert(vm->xe, vf_migration);
> +				return -EAGAIN;
> +			}
>   
>   			/* Only -ETIME on fence indicates VM needs to be killed */
>   			if (timeout < 0 || q->lr.pfence->error == -ETIME)
> @@ -541,6 +555,19 @@ static void preempt_rebind_work_func(struct work_struct *w)
>   out_unlock_outer:
>   	if (err == -EAGAIN) {
>   		trace_xe_vm_rebind_worker_retry(vm);
> +
> +		/*
> +		 * We can't block in workers on a VF which supports migration
> +		 * given this can block the VF post-migration workers from
> +		 * getting scheduled.
> +		 */
> +		if (IS_SRIOV_VF(vm->xe) &&
> +		    xe_sriov_vf_migration_supported(vm->xe)) {
> +			up_write(&vm->lock);
> +			xe_vm_queue_rebind_worker(vm);
> +			return;
> +		}
> +
>   		goto retry;
>   	}
>   

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 24/36] drm/xe/vf: Reset TLB invalidations during VF post migration recovery
  2025-09-29  2:55 ` [PATCH v3 24/36] drm/xe/vf: Reset TLB invalidations during " Matthew Brost
@ 2025-10-01 13:53   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-01 13:53 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> TLB invalidation requests can be lost during VF post-migration
> recovery. Since the VF has migrated, these invalidations are no longer
> needed.
>
> Reset the TLB invalidation frontend, which will signal all pending
> fences.
>
> v3:
>   - Move TLB invalidation reset after pausing submission (Tomasz)
>   - Adjust commit message (Michal)

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

-Tomasz

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 37ef1c42bacb..c9d94620d197 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -35,6 +35,7 @@
>   #include "xe_sriov.h"
>   #include "xe_sriov_vf.h"
>   #include "xe_tile_sriov_vf.h"
> +#include "xe_tlb_inval.h"
>   #include "xe_uc_fw.h"
>   #include "xe_wopcm.h"
>   
> @@ -1188,6 +1189,7 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
>   
>   	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
>   	xe_guc_submit_pause(&gt->uc.guc);
> +	xe_tlb_inval_reset(&gt->tlb_inval);
>   }
>   
>   static size_t post_migration_scratch_size(struct xe_device *xe)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 20/36] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration
  2025-10-01 13:45   ` Lis, Tomasz
@ 2025-10-01 13:56     ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-01 13:56 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 10/1/2025 3:45 PM, Lis, Tomasz wrote:
>
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
>> Blocking in work queues on a hardware action that may never occur —
>> especially when it depends on a software fixup also scheduled on the
>> awork queue — is a recipe for deadlock. This situation arises with

Forgot about the typo - awork.

-Tomasz

>> the preempt rebind worker and VF post-migration recovery. To prevent
>> potential deadlocks, avoid indefinite blocking in the preempt rebind
>> worker for VFs that support migration.
>
> Some would say the timeout value is a magic number here, but I don't 
> have anything better to propose.
>
> And we do not have obligation to match each tracepoint _enter() with 
> _exit(), that's ok as well.
>
> So, all good:
>
> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
>
>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_vm.c | 29 ++++++++++++++++++++++++++++-
>>   1 file changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 80b7f13ecd80..b527ee2a5da5 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -35,6 +35,7 @@
>>   #include "xe_pt.h"
>>   #include "xe_pxp.h"
>>   #include "xe_res_cursor.h"
>> +#include "xe_sriov_vf.h"
>>   #include "xe_svm.h"
>>   #include "xe_sync.h"
>>   #include "xe_tile.h"
>> @@ -111,12 +112,25 @@ static int wait_for_existing_preempt_fences(struct xe_vm *vm, struct list_head *list,
>>   static int wait_for_existing_preempt_fences(struct xe_vm *vm)
>>   {
>>       struct xe_exec_queue *q;
>> +    bool vf_migration = IS_SRIOV_VF(vm->xe) &&
>> +        xe_sriov_vf_migration_supported(vm->xe);
>>         xe_vm_assert_held(vm);
>>         list_for_each_entry(q, &vm->preempt.exec_queues, lr.link) {
>>           if (q->lr.pfence) {
>> -            long timeout = dma_fence_wait(q->lr.pfence, false);
>> +            long timeout;
>> +
>> +            if (vf_migration)
>> +                timeout = dma_fence_wait_timeout(q->lr.pfence,
>> +                                 false, HZ / 5);
>> +            else
>> +                timeout = dma_fence_wait(q->lr.pfence, false);
>> +
>> +            if (!timeout) {
>> +                xe_assert(vm->xe, vf_migration);
>> +                return -EAGAIN;
>> +            }
>>                 /* Only -ETIME on fence indicates VM needs to be killed */
>>               if (timeout < 0 || q->lr.pfence->error == -ETIME)
>> @@ -541,6 +555,19 @@ static void preempt_rebind_work_func(struct work_struct *w)
>>   out_unlock_outer:
>>       if (err == -EAGAIN) {
>>           trace_xe_vm_rebind_worker_retry(vm);
>> +
>> +        /*
>> +         * We can't block in workers on a VF which supports migration
>> +         * given this can block the VF post-migration workers from
>> +         * getting scheduled.
>> +         */
>> +        if (IS_SRIOV_VF(vm->xe) &&
>> +            xe_sriov_vf_migration_supported(vm->xe)) {
>> +            up_write(&vm->lock);
>> +            xe_vm_queue_rebind_worker(vm);
>> +            return;
>> +        }
>> +
>>           goto retry;
>>       }

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 27/36] drm/xe/vf: Abort VF post migration recovery on failure
  2025-09-29  2:55 ` [PATCH v3 27/36] drm/xe/vf: Abort VF post migration recovery on failure Matthew Brost
@ 2025-10-01 14:06   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-01 14:06 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> If VF post-migration recovery fails, the device is wedged. However,
> submission queues still need to be enabled for proper cleanup. In such
> cases, call into the GuC submission backend to restart all queues that
> were previously paused.
>
> v3:
>   - s/Avort/Abort (Tomasz)

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

-Tomasz

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 10 ++++++++++
>   drivers/gpu/drm/xe/xe_guc_submit.c  | 20 ++++++++++++++++++++
>   drivers/gpu/drm/xe/xe_guc_submit.h  |  1 +
>   3 files changed, 31 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index cb3e9f6e83fa..9f33561b91c6 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -1224,6 +1224,15 @@ static void vf_post_migration_kickstart(struct xe_gt *gt)
>   	xe_guc_submit_unpause(&gt->uc.guc);
>   }
>   
> +static void vf_post_migration_abort(struct xe_gt *gt)
> +{
> +	spin_lock_irq(&gt->sriov.vf.migration.lock);
> +	WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false);
> +	spin_unlock_irq(&gt->sriov.vf.migration.lock);
> +
> +	xe_guc_submit_pause_abort(&gt->uc.guc);
> +}
> +
>   static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
>   {
>   	bool skip_resfix = false;
> @@ -1282,6 +1291,7 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
>   	xe_gt_sriov_notice(gt, "migration recovery ended\n");
>   	return;
>   fail:
> +	vf_post_migration_abort(gt);
>   	xe_pm_runtime_put(xe);
>   	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
>   	xe_device_declare_wedged(xe);
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 9320fe9fbb29..99ea9b3507cd 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -2359,6 +2359,26 @@ void xe_guc_submit_unpause(struct xe_guc *guc)
>   	wake_up_all(&guc->ct.wq);
>   }
>   
> +/**
> + * xe_guc_submit_pause_abort - Abort all paused submission tasks on given GuC.
> + * @guc: the &xe_guc struct instance whose scheduler is to be aborted
> + */
> +void xe_guc_submit_pause_abort(struct xe_guc *guc)
> +{
> +	struct xe_exec_queue *q;
> +	unsigned long index;
> +
> +	mutex_lock(&guc->submission_state.lock);
> +	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
> +		struct xe_gpu_scheduler *sched = &q->guc->sched;
> +
> +		xe_sched_submission_start(sched);
> +		if (exec_queue_killed_or_banned_or_wedged(q))
> +			xe_guc_exec_queue_trigger_cleanup(q);
> +	}
> +	mutex_unlock(&guc->submission_state.lock);
> +}
> +
>   static struct xe_exec_queue *
>   g2h_exec_queue_lookup(struct xe_guc *guc, u32 guc_id)
>   {
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> index f535fe3895e5..fe82c317048e 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> @@ -22,6 +22,7 @@ void xe_guc_submit_stop(struct xe_guc *guc);
>   int xe_guc_submit_start(struct xe_guc *guc);
>   void xe_guc_submit_pause(struct xe_guc *guc);
>   void xe_guc_submit_unpause(struct xe_guc *guc);
> +void xe_guc_submit_pause_abort(struct xe_guc *guc);
>   void xe_guc_submit_wedge(struct xe_guc *guc);
>   
>   int xe_guc_read_stopped(struct xe_guc *guc);

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v3 28/36] drm/xe/vf: Replay GuC submission state on pause / unpause
  2025-09-29  2:55 ` [PATCH v3 28/36] drm/xe/vf: Replay GuC submission state on pause / unpause Matthew Brost
@ 2025-10-01 14:37   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-01 14:37 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Fixup GuC submission pause / unpause functions to properly replay any
> possible state lost during VF post migration recovery.
>
> v3:
>   - Add helpers for revert / replay (Tomasz)
>   - Add comment around WQ NOPs (Tomasz)

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

-Tomasz

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gpu_scheduler.c        |  14 ++
>   drivers/gpu/drm/xe/xe_gpu_scheduler.h        |   2 +
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c          |   1 +
>   drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |  15 ++
>   drivers/gpu/drm/xe/xe_guc_submit.c           | 242 +++++++++++++++++--
>   drivers/gpu/drm/xe/xe_guc_submit.h           |   1 +
>   drivers/gpu/drm/xe/xe_sched_job_types.h      |   4 +
>   7 files changed, 264 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> index 455ccaf17314..af300adc7e1a 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> @@ -135,3 +135,17 @@ void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
>   	list_add_tail(&msg->link, &sched->msgs);
>   	xe_sched_process_msg_queue(sched);
>   }
> +
> +/**
> + * xe_sched_add_msg_head() - Xe GPU scheduler add message to head of list
> + * @sched: Xe GPU scheduler
> + * @msg: Message to add
> + */
> +void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
> +			   struct xe_sched_msg *msg)
> +{
> +	lockdep_assert_held(&sched->base.job_list_lock);
> +
> +	list_add(&msg->link, &sched->msgs);
> +	xe_sched_process_msg_queue(sched);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> index e548b2aed95a..010003a6103a 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> @@ -29,6 +29,8 @@ void xe_sched_add_msg(struct xe_gpu_scheduler *sched,
>   		      struct xe_sched_msg *msg);
>   void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
>   			     struct xe_sched_msg *msg);
> +void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
> +			   struct xe_sched_msg *msg);
>   
>   static inline void xe_sched_msg_lock(struct xe_gpu_scheduler *sched)
>   {
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 9f33561b91c6..0d94867dce8e 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -1217,6 +1217,7 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
>   static void vf_post_migration_rearm(struct xe_gt *gt)
>   {
>   	xe_guc_ct_restart(&gt->uc.guc.ct);
> +	xe_guc_submit_unpause_prepare(&gt->uc.guc);
>   }
>   
>   static void vf_post_migration_kickstart(struct xe_gt *gt)
> diff --git a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
> index c30c0e3ccbbb..a3b034e4b205 100644
> --- a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
> +++ b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
> @@ -51,6 +51,21 @@ struct xe_guc_exec_queue {
>   	wait_queue_head_t suspend_wait;
>   	/** @suspend_pending: a suspend of the exec_queue is pending */
>   	bool suspend_pending;
> +	/**
> +	 * @needs_cleanup: Needs a cleanup message during VF post migration
> +	 * recovery.
> +	 */
> +	bool needs_cleanup;
> +	/**
> +	 * @needs_suspend: Needs a suspend message during VF post migration
> +	 * recovery.
> +	 */
> +	bool needs_suspend;
> +	/**
> +	 * @needs_resume: Needs a resume message during VF post migration
> +	 * recovery.
> +	 */
> +	bool needs_resume;
>   };
>   
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 99ea9b3507cd..497a736c23c3 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -424,6 +424,11 @@ static void set_exec_queue_destroyed(struct xe_exec_queue *q)
>   	atomic_or(EXEC_QUEUE_STATE_DESTROYED, &q->guc->state);
>   }
>   
> +static void clear_exec_queue_destroyed(struct xe_exec_queue *q)
> +{
> +	atomic_and(~EXEC_QUEUE_STATE_DESTROYED, &q->guc->state);
> +}
> +
>   static bool exec_queue_banned(struct xe_exec_queue *q)
>   {
>   	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_BANNED;
> @@ -504,7 +509,12 @@ static void set_exec_queue_extra_ref(struct xe_exec_queue *q)
>   	atomic_or(EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
>   }
>   
> -static bool __maybe_unused exec_queue_pending_resume(struct xe_exec_queue *q)
> +static void clear_exec_queue_extra_ref(struct xe_exec_queue *q)
> +{
> +	atomic_and(~EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
> +}
> +
> +static bool exec_queue_pending_resume(struct xe_exec_queue *q)
>   {
>   	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
>   }
> @@ -519,7 +529,7 @@ static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
>   	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
>   }
>   
> -static bool __maybe_unused exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
> +static bool exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
>   {
>   	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
>   }
> @@ -1079,7 +1089,7 @@ static void wq_item_append(struct xe_exec_queue *q)
>   }
>   
>   #define RESUME_PENDING	~0x0ull
> -static void submit_exec_queue(struct xe_exec_queue *q)
> +static void submit_exec_queue(struct xe_exec_queue *q, struct xe_sched_job *job)
>   {
>   	struct xe_guc *guc = exec_queue_to_guc(q);
>   	struct xe_lrc *lrc = q->lrc[0];
> @@ -1091,10 +1101,13 @@ static void submit_exec_queue(struct xe_exec_queue *q)
>   
>   	xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q));
>   
> -	if (xe_exec_queue_is_parallel(q))
> -		wq_item_append(q);
> -	else
> -		xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
> +	if (!job->skip_emit || job->last_replay) {
> +		if (xe_exec_queue_is_parallel(q))
> +			wq_item_append(q);
> +		else
> +			xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
> +		job->last_replay = false;
> +	}
>   
>   	if (exec_queue_suspended(q) && !xe_exec_queue_is_parallel(q))
>   		return;
> @@ -1147,8 +1160,10 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
>   	if (!killed_or_banned_or_wedged && !xe_sched_job_is_error(job)) {
>   		if (!exec_queue_registered(q))
>   			register_exec_queue(q, GUC_CONTEXT_NORMAL);
> -		q->ring_ops->emit_job(job);
> -		submit_exec_queue(q);
> +		if (!job->skip_emit)
> +			q->ring_ops->emit_job(job);
> +		submit_exec_queue(q, job);
> +		job->skip_emit = false;
>   	}
>   
>   	/*
> @@ -1865,6 +1880,7 @@ static void __guc_exec_queue_process_msg_resume(struct xe_sched_msg *msg)
>   #define RESUME		4
>   #define OPCODE_MASK	0xf
>   #define MSG_LOCKED	BIT(8)
> +#define MSG_HEAD	BIT(9)
>   
>   static void guc_exec_queue_process_msg(struct xe_sched_msg *msg)
>   {
> @@ -1989,12 +2005,24 @@ static void guc_exec_queue_add_msg(struct xe_exec_queue *q, struct xe_sched_msg
>   	msg->private_data = q;
>   
>   	trace_xe_sched_msg_add(msg);
> -	if (opcode & MSG_LOCKED)
> +	if (opcode & MSG_HEAD)
> +		xe_sched_add_msg_head(&q->guc->sched, msg);
> +	else if (opcode & MSG_LOCKED)
>   		xe_sched_add_msg_locked(&q->guc->sched, msg);
>   	else
>   		xe_sched_add_msg(&q->guc->sched, msg);
>   }
>   
> +static void guc_exec_queue_try_add_msg_head(struct xe_exec_queue *q,
> +					    struct xe_sched_msg *msg,
> +					    u32 opcode)
> +{
> +	if (!list_empty(&msg->link))
> +		return;
> +
> +	guc_exec_queue_add_msg(q, msg, opcode | MSG_LOCKED | MSG_HEAD);
> +}
> +
>   static bool guc_exec_queue_try_add_msg(struct xe_exec_queue *q,
>   				       struct xe_sched_msg *msg,
>   				       u32 opcode)
> @@ -2278,6 +2306,105 @@ void xe_guc_submit_stop(struct xe_guc *guc)
>   
>   }
>   
> +static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
> +{
> +	bool pending_enable, pending_disable, pending_resume;
> +
> +	pending_enable = exec_queue_pending_enable(q);
> +	pending_resume = exec_queue_pending_resume(q);
> +
> +	if (pending_enable && pending_resume)
> +		q->guc->needs_resume = true;
> +
> +	if (pending_enable && !pending_resume &&
> +	    !exec_queue_pending_tdr_exit(q)) {
> +		clear_exec_queue_registered(q);
> +		if (xe_exec_queue_is_lr(q))
> +			xe_exec_queue_put(q);
> +	}
> +
> +	if (pending_enable) {
> +		clear_exec_queue_enabled(q);
> +		clear_exec_queue_pending_resume(q);
> +		clear_exec_queue_pending_tdr_exit(q);
> +		clear_exec_queue_pending_enable(q);
> +	}
> +
> +	if (exec_queue_destroyed(q) && exec_queue_registered(q)) {
> +		clear_exec_queue_destroyed(q);
> +		if (exec_queue_extra_ref(q))
> +			xe_exec_queue_put(q);
> +		else
> +			q->guc->needs_cleanup = true;
> +		clear_exec_queue_extra_ref(q);
> +	}
> +
> +	pending_disable = exec_queue_pending_disable(q);
> +
> +	if (pending_disable && exec_queue_suspended(q)) {
> +		clear_exec_queue_suspended(q);
> +		q->guc->needs_suspend = true;
> +	}
> +
> +	if (pending_disable) {
> +		if (!pending_enable)
> +			set_exec_queue_enabled(q);
> +		clear_exec_queue_pending_disable(q);
> +		clear_exec_queue_check_timeout(q);
> +	}
> +
> +	q->guc->resume_time = 0;
> +}
> +
> +/*
> + * This function is quite complex but the only real way to ensure no state is
> + * lost during VF resume flows. It scans the queue state, makes adjustments
> + * as needed, and queues jobs / messages which are replayed upon unpause.
> + */
> +static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
> +{
> +	struct xe_gpu_scheduler *sched = &q->guc->sched;
> +	struct xe_sched_job *job;
> +	int i;
> +
> +	lockdep_assert_held(&guc->submission_state.lock);
> +
> +	/* Stop scheduling + flush any DRM scheduler operations */
> +	xe_sched_submission_stop(sched);
> +	if (xe_exec_queue_is_lr(q))
> +		cancel_work_sync(&q->guc->lr_tdr);
> +	else
> +		cancel_delayed_work_sync(&sched->base.work_tdr);
> +
> +	guc_exec_queue_revert_pending_state_change(q);
> +
> +	if (xe_exec_queue_is_parallel(q)) {
> +		struct xe_device *xe = guc_to_xe(guc);
> +		struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
> +
> +		/*
> +		 * NOP existing WQ commands that may contain stale GGTT
> +		 * addresses. These will be replayed upon unpause. The hardware
> +		 * seems to get confused if the WQ head/tail pointers are
> +		 * adjusted.
> +		 */
> +		for (i = 0; i < WQ_SIZE / sizeof(u32); ++i)
> +			parallel_write(xe, map, wq[i],
> +				       FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
> +				       FIELD_PREP(WQ_LEN_MASK, 0));
> +	}
> +
> +	job = xe_sched_first_pending_job(sched);
> +	if (job) {
> +		/*
> +		 * Adjust software tail so jobs submitted overwrite previous
> +		 * position in ring buffer with new GGTT addresses.
> +		 */
> +		for (i = 0; i < q->width; ++i)
> +			q->lrc[i]->ring.tail = job->ptrs[i].head;
> +	}
> +}
> +
>   /**
>    * xe_guc_submit_pause - Stop further runs of submission tasks on given GuC.
>    * @guc: the &xe_guc struct instance whose scheduler is to be disabled
> @@ -2287,8 +2414,12 @@ void xe_guc_submit_pause(struct xe_guc *guc)
>   	struct xe_exec_queue *q;
>   	unsigned long index;
>   
> +	xe_gt_assert(guc_to_gt(guc), vf_recovery(guc));
> +
> +	mutex_lock(&guc->submission_state.lock);
>   	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> -		xe_sched_submission_stop_async(&q->guc->sched);
> +		guc_exec_queue_pause(guc, q);
> +	mutex_unlock(&guc->submission_state.lock);
>   }
>   
>   static void guc_exec_queue_start(struct xe_exec_queue *q)
> @@ -2337,11 +2468,92 @@ int xe_guc_submit_start(struct xe_guc *guc)
>   	return 0;
>   }
>   
> -static void guc_exec_queue_unpause(struct xe_exec_queue *q)
> +static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
> +					   struct xe_exec_queue *q)
>   {
>   	struct xe_gpu_scheduler *sched = &q->guc->sched;
> +	struct drm_sched_job *s_job;
> +	struct xe_sched_job *job = NULL;
> +
> +	list_for_each_entry(s_job, &sched->base.pending_list, list) {
> +		job = to_xe_sched_job(s_job);
> +
> +		q->ring_ops->emit_job(job);
> +		job->skip_emit = true;
> +	}
>   
> +	if (job)
> +		job->last_replay = true;
> +}
> +
> +/**
> + * xe_guc_submit_unpause_prepare - Prepare unpause submission tasks on given GuC.
> + * @guc: the &xe_guc struct instance whose scheduler is to be prepared for unpause
> + */
> +void xe_guc_submit_unpause_prepare(struct xe_guc *guc)
> +{
> +	struct xe_exec_queue *q;
> +	unsigned long index;
> +
> +	xe_gt_assert(guc_to_gt(guc), vf_recovery(guc));
> +
> +	mutex_lock(&guc->submission_state.lock);
> +	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> +		guc_exec_queue_unpause_prepare(guc, q);
> +	mutex_unlock(&guc->submission_state.lock);
> +}
> +
> +static void guc_exec_queue_replay_pending_state_change(struct xe_exec_queue *q)
> +{
> +	struct xe_gpu_scheduler *sched = &q->guc->sched;
> +	struct xe_sched_msg *msg;
> +
> +	if (q->guc->needs_cleanup) {
> +		msg = q->guc->static_msgs + STATIC_MSG_CLEANUP;
> +
> +		guc_exec_queue_add_msg(q, msg, CLEANUP);
> +		q->guc->needs_cleanup = false;
> +	}
> +
> +	if (q->guc->needs_suspend) {
> +		msg = q->guc->static_msgs + STATIC_MSG_SUSPEND;
> +
> +		xe_sched_msg_lock(sched);
> +		guc_exec_queue_try_add_msg_head(q, msg, SUSPEND);
> +		xe_sched_msg_unlock(sched);
> +
> +		q->guc->needs_suspend = false;
> +	}
> +
> +	/*
> +	 * The resume must be in the message queue before the suspend as it is
> +	 * not possible for a resume to be issued if a suspend pending is, but
> +	 * the inverse is possible.
> +	 */
> +	if (q->guc->needs_resume) {
> +		msg = q->guc->static_msgs + STATIC_MSG_RESUME;
> +
> +		xe_sched_msg_lock(sched);
> +		guc_exec_queue_try_add_msg_head(q, msg, RESUME);
> +		xe_sched_msg_unlock(sched);
> +
> +		q->guc->needs_resume = false;
> +	}
> +}
> +
> +static void guc_exec_queue_unpause(struct xe_guc *guc, struct xe_exec_queue *q)
> +{
> +	struct xe_gpu_scheduler *sched = &q->guc->sched;
> +	bool needs_tdr = exec_queue_killed_or_banned_or_wedged(q);
> +
> +	lockdep_assert_held(&guc->submission_state.lock);
> +
> +	xe_sched_resubmit_jobs(sched);
> +	guc_exec_queue_replay_pending_state_change(q);
>   	xe_sched_submission_start(sched);
> +	if (needs_tdr)
> +		xe_guc_exec_queue_trigger_cleanup(q);
> +	xe_sched_submission_resume_tdr(sched);
>   }
>   
>   /**
> @@ -2353,10 +2565,10 @@ void xe_guc_submit_unpause(struct xe_guc *guc)
>   	struct xe_exec_queue *q;
>   	unsigned long index;
>   
> +	mutex_lock(&guc->submission_state.lock);
>   	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> -		guc_exec_queue_unpause(q);
> -
> -	wake_up_all(&guc->ct.wq);
> +		guc_exec_queue_unpause(guc, q);
> +	mutex_unlock(&guc->submission_state.lock);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> index fe82c317048e..b49a2748ec46 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> @@ -22,6 +22,7 @@ void xe_guc_submit_stop(struct xe_guc *guc);
>   int xe_guc_submit_start(struct xe_guc *guc);
>   void xe_guc_submit_pause(struct xe_guc *guc);
>   void xe_guc_submit_unpause(struct xe_guc *guc);
> +void xe_guc_submit_unpause_prepare(struct xe_guc *guc);
>   void xe_guc_submit_pause_abort(struct xe_guc *guc);
>   void xe_guc_submit_wedge(struct xe_guc *guc);
>   
> diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
> index 7ce58765a34a..13e7a12b03ad 100644
> --- a/drivers/gpu/drm/xe/xe_sched_job_types.h
> +++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
> @@ -63,6 +63,10 @@ struct xe_sched_job {
>   	bool ring_ops_flush_tlb;
>   	/** @ggtt: mapped in ggtt. */
>   	bool ggtt;
> +	/** @skip_emit: skip emitting the job */
> +	bool skip_emit;
> +	/** @last_replay: last job being replayed */
> +	bool last_replay;
>   	/** @ptrs: per instance pointers. */
>   	struct xe_job_ptrs ptrs[];
>   };
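
For clarity, a sketch of how these two flags could be consumed on
resubmission (illustrative only: sketch_resubmit_one() and its call site
are my assumptions, not code from this patch):

static void sketch_resubmit_one(struct xe_sched_job *job,
				struct xe_exec_queue *q)
{
	if (job->skip_emit) {
		/* Ring contents were already rewritten in unpause prepare */
		job->skip_emit = false;
	} else {
		/* Normal path: emit the ring commands for this job */
		q->ring_ops->emit_job(job);
	}

	/*
	 * last_replay marks the final replayed job; its consumer lives
	 * elsewhere in the series, so it is left untouched here.
	 */
}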


* Re: [PATCH v3 29/36] drm/xe: Move queue init before LRC creation
  2025-09-29  2:55 ` [PATCH v3 29/36] drm/xe: Move queue init before LRC creation Matthew Brost
@ 2025-10-02  0:44   ` Lis, Tomasz
  2025-10-02  7:36     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-02  0:44 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> A queue must be in the submission backend's tracking state before the
> LRC is created to avoid a race condition where the LRC's GGTT addresses
> are not properly fixed up during VF post-migration recovery.
>
> Move the queue initialization—which adds the queue to the submission
> backend's tracking state—before LRC creation.
>
> v2:
>   - Wait on VF GGTT fixes before creating LRC (testing)
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_exec_queue.c        | 43 +++++++++++++++++------
>   drivers/gpu/drm/xe/xe_execlist.c          |  2 +-
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 39 +++++++++++++++++++-
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 ++
>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  5 +++
>   drivers/gpu/drm/xe/xe_guc_submit.c        |  2 +-
>   drivers/gpu/drm/xe/xe_lrc.h               | 10 ++++++
>   7 files changed, 90 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 81f707d2c388..3db8e64d9d13 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -15,6 +15,7 @@
>   #include "xe_dep_scheduler.h"
>   #include "xe_device.h"
>   #include "xe_gt.h"
> +#include "xe_gt_sriov_vf.h"
>   #include "xe_hw_engine_class_sysfs.h"
>   #include "xe_hw_engine_group.h"
>   #include "xe_hw_fence.h"
> @@ -179,17 +180,32 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q)
>   			flags |= XE_LRC_CREATE_RUNALONE;
>   	}
>   
> +	err = q->ops->init(q);
> +	if (err)
> +		return err;
> +
> +	/*
> +	 * This must occur after q->ops->init to avoid race conditions during VF
> +	 * post-migration recovery, as the fixups for the LRC GGTT addresses
> +	 * depend on the queue being present in the backend tracking structure.
> +	 *
> +	 * In addition to above, we must wait on inflight GGTT changes to
> +	 * avoid writing out stale values here.
> +	 */
> +	xe_gt_sriov_vf_wait_valid_ggtt(q->gt);

So to avoid locks, we rely on the VF knowing it got migrated from the 
first moment after vCPU starts.

On `qemu`, we do have it this way - when vCPU starts the 'MIGRATED' 
memirq is already filled.

But what about other VM managers? What about future support of platforms 
without memirq?

I don't think the availability of information that we've got migrated 
from the first vCPU cycle is guaranteed by any specification.


>   	for (i = 0; i < q->width; ++i) {
> -		q->lrc[i] = xe_lrc_create(q->hwe, q->vm, SZ_16K, q->msix_vec, flags);
> -		if (IS_ERR(q->lrc[i])) {
> -			err = PTR_ERR(q->lrc[i]);
> +		struct xe_lrc *lrc;
> +
> +		lrc = xe_lrc_create(q->hwe, q->vm, xe_lrc_ring_size(),
> +				    q->msix_vec, flags);

If migration happened at this place, it is still possible to create a 
context with wrong GGTT references in the one LRC which was already 
filled but not integrated into the queue yet.

I don't think we can avoid races without a lock.

> +		if (IS_ERR(lrc)) {
> +			err = PTR_ERR(lrc);
>   			goto err_lrc;
>   		}
> -	}
>   
> -	err = q->ops->init(q);
> -	if (err)
> -		goto err_lrc;
> +		/* Pairs with READ_ONCE to xe_exec_queue_contexts_hwsp_rebase */
> +		WRITE_ONCE(q->lrc[i], lrc);
> +	}
>   
>   	return 0;
>   
> @@ -1095,9 +1111,16 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
>   	int err = 0;
>   
>   	for (i = 0; i < q->width; ++i) {
> -		xe_lrc_update_memirq_regs_with_address(q->lrc[i], q->hwe, scratch);
> -		xe_lrc_update_hwctx_regs_with_address(q->lrc[i]);
> -		err = xe_lrc_setup_wa_bb_with_scratch(q->lrc[i], q->hwe, scratch);
> +		struct xe_lrc *lrc;
> +
> +		/* Pairs with WRITE_ONCE in __xe_exec_queue_init  */
> +		lrc = READ_ONCE(q->lrc[i]);
> +		if (!lrc)
> +			continue;
> +
> +		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
> +		xe_lrc_update_hwctx_regs_with_address(lrc);
> +		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
>   		if (err)
>   			break;
>   	}
> diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
> index f83d421ac9d3..769d05517f93 100644
> --- a/drivers/gpu/drm/xe/xe_execlist.c
> +++ b/drivers/gpu/drm/xe/xe_execlist.c
> @@ -339,7 +339,7 @@ static int execlist_exec_queue_init(struct xe_exec_queue *q)
>   	const struct drm_sched_init_args args = {
>   		.ops = &drm_sched_ops,
>   		.num_rqs = 1,
> -		.credit_limit = q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES,
> +		.credit_limit = xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES,
>   		.hang_limit = XE_SCHED_HANG_LIMIT,
>   		.timeout = XE_SCHED_JOB_TIMEOUT,
>   		.name = q->hwe->name,
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 0d94867dce8e..42f9fd43b436 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -482,6 +482,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
>   				 shift, config->ggtt_base);
>   		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
>   	}
> +
> +	WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
> +	smp_wmb();	/* Ensure above write visible before wake */
> +	wake_up_all(&gt->sriov.vf.migration.wq);
> +
>   out:
>   	up_write(config->lock);
>   	return err;
> @@ -820,7 +825,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
>   	    !gt->sriov.vf.migration.recovery_teardown) {
>   		gt->sriov.vf.migration.recovery_queued = true;
>   		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> -		smp_wmb();	/* Ensure above write visable before wake */
> +		WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, true);
> +		smp_wmb();	/* Ensure above writes visable before wake */
typo in patch "Wakeup in GuC backend on VF post migration recovery"
>   
>   		wake_up_all(&gt->uc.guc.ct.wq);
>   
> @@ -1344,6 +1350,7 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>   		&tile->primary_gt->sriov.vf.self_config.__lock;
>   	spin_lock_init(&gt->sriov.vf.migration.lock);
>   	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> +	init_waitqueue_head(&gt->sriov.vf.migration.wq);
>   
>   	return 0;
>   }
> @@ -1387,3 +1394,33 @@ bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt)
>   	return (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
>   		READ_ONCE(gt->sriov.vf.migration.recovery_inprogress));
>   }
> +
> +static bool vf_valid_ggtt(struct xe_gt *gt)
> +{
> +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
> +
> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> +	if (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
> +	    READ_ONCE(gt->sriov.vf.migration.ggtt_need_fixes))
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
> + * @gt: the &xe_gt
> + */
> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
> +{
> +	int ret;
> +
> +	if (!IS_SRIOV_VF(gt_to_xe(gt)))
> +		return;
> +
> +	ret = wait_event_interruptible_timeout(gt->sriov.vf.migration.wq,
> +					       vf_valid_ggtt(gt),
> +					       HZ * 5);
> +	XE_WARN_ON(!ret);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index 71e1d566da81..20cc0c4c32e3 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -40,4 +40,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p);
>   void xe_gt_sriov_vf_print_runtime(struct xe_gt *gt, struct drm_printer *p);
>   void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
>   
> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt);
> +
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index e135018cba1e..3c3e415199d1 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -8,6 +8,7 @@
>   
>   #include <linux/rwsem.h>
>   #include <linux/types.h>
> +#include <linux/wait.h>
>   #include <linux/workqueue.h>
>   #include "xe_uc_fw_types.h"
>   
> @@ -61,6 +62,8 @@ struct xe_gt_sriov_vf_migration {
>   	struct work_struct worker;
>   	/** @lock: Protects recovery_queued, teardown */
>   	spinlock_t lock;
> +	/** @wq: wait queue for migration fixes */
> +	wait_queue_head_t wq;
>   	/** @scratch: Scratch memory for VF recovery */
>   	void *scratch;
>   	/** @recovery_teardown: VF post migration recovery is being torn down */
> @@ -69,6 +72,8 @@ struct xe_gt_sriov_vf_migration {
>   	bool recovery_queued;
>   	/** @recovery_inprogress: VF post migration recovery in progress */
>   	bool recovery_inprogress;
> +	/** @ggtt_need_fixes: VF GGTT needs fixes */
> +	bool ggtt_need_fixes;
>   };
>   
>   /**
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 497a736c23c3..7fe3fb07e35e 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1943,7 +1943,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
>   	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
>   		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
>   	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
> -			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
> +			    NULL, xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES, 64,
>   			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
>   			    q->name, gt_to_xe(q->gt)->drm.dev);
>   	if (err)
> diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
> index 188565465779..5fb6c74bdab5 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.h
> +++ b/drivers/gpu/drm/xe/xe_lrc.h
> @@ -74,6 +74,16 @@ static inline void xe_lrc_put(struct xe_lrc *lrc)
>   	kref_put(&lrc->refcount, xe_lrc_destroy);
>   }
>   
> +/**
> + * xe_lrc_ring_size() - Xe LRC ring size
> + *
> + * Return: Size of LRC size

Size of LRC ring buffer
-Tomasz

> + */
> +static inline size_t xe_lrc_ring_size(void)
> +{
> +	return SZ_16K;
> +}
> +
>   size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class);
>   u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc);
>   u32 xe_lrc_regs_offset(struct xe_lrc *lrc);


* Re: [PATCH v3 30/36] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
  2025-09-29  2:55 ` [PATCH v3 30/36] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery Matthew Brost
@ 2025-10-02  1:02   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-02  1:02 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> Helpful to manually verify the GuC state machine can correctly replay
> the state during a VF post-migration recovery. All replay paths have
> been manually verified as triggered and working during testing.

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

-Tomasz

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_guc_submit.c | 23 ++++++++++++++++++++---
>   1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 7fe3fb07e35e..bc717403740c 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -2306,21 +2306,27 @@ void xe_guc_submit_stop(struct xe_guc *guc)
>   
>   }
>   
> -static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
> +static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> +						       struct xe_exec_queue *q)
>   {
>   	bool pending_enable, pending_disable, pending_resume;
>   
>   	pending_enable = exec_queue_pending_enable(q);
>   	pending_resume = exec_queue_pending_resume(q);
>   
> -	if (pending_enable && pending_resume)
> +	if (pending_enable && pending_resume) {
>   		q->guc->needs_resume = true;
> +		xe_gt_dbg(guc_to_gt(guc), "Replay RESUME - guc_id=%d",
> +			  q->guc->id);
> +	}
>   
>   	if (pending_enable && !pending_resume &&
>   	    !exec_queue_pending_tdr_exit(q)) {
>   		clear_exec_queue_registered(q);
>   		if (xe_exec_queue_is_lr(q))
>   			xe_exec_queue_put(q);
> +		xe_gt_dbg(guc_to_gt(guc), "Replay REGISTER - guc_id=%d",
> +			  q->guc->id);
>   	}
>   
>   	if (pending_enable) {
> @@ -2328,6 +2334,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
>   		clear_exec_queue_pending_resume(q);
>   		clear_exec_queue_pending_tdr_exit(q);
>   		clear_exec_queue_pending_enable(q);
> +		xe_gt_dbg(guc_to_gt(guc), "Replay ENABLE - guc_id=%d",
> +			  q->guc->id);
>   	}
>   
>   	if (exec_queue_destroyed(q) && exec_queue_registered(q)) {
> @@ -2337,6 +2345,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
>   		else
>   			q->guc->needs_cleanup = true;
>   		clear_exec_queue_extra_ref(q);
> +		xe_gt_dbg(guc_to_gt(guc), "Replay CLEANUP - guc_id=%d",
> +			  q->guc->id);
>   	}
>   
>   	pending_disable = exec_queue_pending_disable(q);
> @@ -2344,6 +2354,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
>   	if (pending_disable && exec_queue_suspended(q)) {
>   		clear_exec_queue_suspended(q);
>   		q->guc->needs_suspend = true;
> +		xe_gt_dbg(guc_to_gt(guc), "Replay SUSPEND - guc_id=%d",
> +			  q->guc->id);
>   	}
>   
>   	if (pending_disable) {
> @@ -2351,6 +2363,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
>   			set_exec_queue_enabled(q);
>   		clear_exec_queue_pending_disable(q);
>   		clear_exec_queue_check_timeout(q);
> +		xe_gt_dbg(guc_to_gt(guc), "Replay DISABLE - guc_id=%d",
> +			  q->guc->id);
>   	}
>   
>   	q->guc->resume_time = 0;
> @@ -2376,7 +2390,7 @@ static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
>   	else
>   		cancel_delayed_work_sync(&sched->base.work_tdr);
>   
> -	guc_exec_queue_revert_pending_state_change(q);
> +	guc_exec_queue_revert_pending_state_change(guc, q);
>   
>   	if (xe_exec_queue_is_parallel(q)) {
>   		struct xe_device *xe = guc_to_xe(guc);
> @@ -2478,6 +2492,9 @@ static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
>   	list_for_each_entry(s_job, &sched->base.pending_list, list) {
>   		job = to_xe_sched_job(s_job);
>   
> +		xe_gt_dbg(guc_to_gt(guc), "Replay JOB - guc_id=%d, seqno=%d",
> +			  q->guc->id, xe_sched_job_seqno(job));
> +
>   		q->ring_ops->emit_job(job);
>   		job->skip_emit = true;
>   	}


* Re: [PATCH v3 31/36] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause
  2025-09-29  2:55 ` [PATCH v3 31/36] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause Matthew Brost
@ 2025-10-02  1:09   ` Lis, Tomasz
  2025-10-02  6:12     ` Matthew Brost
  0 siblings, 1 reply; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-02  1:09 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 9/29/2025 4:55 AM, Matthew Brost wrote:
> A race condition exists where a paused VF's H2G request can be processed
> and subsequently rejected. This rejection results in a FAST_REQ failure
> being delivered to the KMD, which then terminates the CT via a dead
> worker and triggers a GT reset—an undesirable outcome.
>
> This workaround mitigates the issue by checking if a VF post-migration
> recovery is in progress and aborting these adverse actions accordingly.
> The GuC firmware will address this bug in an upcoming release. Once that
> version is available and VF migration depends on it, this workaround can
> be safely removed.

Shouldn't this be tagged with the corresponding GuC issue reference?

-Tomasz

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_guc_ct.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 25efc1f813ce..89ee68828f07 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -1394,6 +1394,10 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>   
>   		fast_req_report(ct, fence);
>   
> +		/* FIXME: W/A race in the GuC, will get in firmware soon */
> +		if (xe_gt_recovery_inprogress(gt))
> +			return 0;
> +
>   		CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
>   
>   		return -EPROTO;


* Re: [PATCH v3 32/36] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
  2025-09-29  2:55 ` [PATCH v3 32/36] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups Matthew Brost
@ 2025-10-02  1:25   ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-02  1:25 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 9/29/2025 4:55 AM, Matthew Brost wrote:
> From: Satyanarayana K V P<satyanarayana.k.v.p@intel.com>
>
> The migrate VM builds the CCS metadata save/restore batch buffer (BB) in
> advance and retains it so the GuC can submit it directly when saving a
> VM’s state.

I had to read the "migrate VM" part multiple times to understand it.
Also, maybe my idea of English is wrong, but 'retains' does not convey
'updates' to me. Maybe:

---
A VF driver with VM migration capability builds the CCS metadata save/restore batch buffer (BB) in
advance and keeps its content up to date so the GuC can submit it directly when saving a
VM’s state.

---
-Tomasz

> When a VM migrates between VFs, the GGTT base can change. Any GGTT-based
> addresses embedded in the BB would then have to be parsed and patched.
>
> Use PPGTT addresses in the BB (including for TLB invalidation) so the BB
> remains GGTT-agnostic and requires no address fixups during migration.
>
> Signed-off-by: Satyanarayana K V P<satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko<michal.wajdeczko@intel.com>
> Cc: Matthew Brost<matthew.brost@intel.com>
> Reviewed-by: Matthew Brost<matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_migrate.c | 28 ++++++++++++++++++++--------
>   1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 1d667fa36cf3..ad03afb5145f 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -980,15 +980,27 @@ struct xe_lrc *xe_migrate_lrc(struct xe_migrate *migrate)
>   	return migrate->q->lrc[0];
>   }
>   
> -static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *dw, int i,
> -				 u32 flags)
> +static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
>   {
> -	struct xe_lrc *lrc = xe_exec_queue_lrc(q);
> +	/*
> +	 * The migrate VM is self-referential so it can modify its own PTEs (see
> +	 * pte_update_size() or emit_pte() functions). We reserve NUM_KERNEL_PDE
> +	 * entries for kernel operations (copies, clears, CCS migrate), and
> +	 * suballocate the rest to user operations (binds/unbinds). With
> +	 * NUM_KERNEL_PDE = 15, NUM_KERNEL_PDE - 1 is already used for PTE updates,
> +	 * so assign NUM_KERNEL_PDE - 2 for TLB invalidation.
> +	 */
> +	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> +}
> +
> +static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> +{
> +	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> +
>   	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
>   		  MI_FLUSH_IMM_DW | flags;
> -	dw[i++] = lower_32_bits(xe_lrc_start_seqno_ggtt_addr(lrc)) |
> -		  MI_FLUSH_DW_USE_GTT;
> -	dw[i++] = upper_32_bits(xe_lrc_start_seqno_ggtt_addr(lrc));
> +	dw[i++] = lower_32_bits(addr);
> +	dw[i++] = upper_32_bits(addr);
>   	dw[i++] = MI_NOOP;
>   	dw[i++] = MI_NOOP;
>   
> @@ -1101,11 +1113,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>   
>   		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
>   
> -		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> +		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>   		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
>   						  src_L0_ofs, dst_is_pltt,
>   						  src_L0, ccs_ofs, true);
> -		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> +		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>   
>   		size -= src_L0;
>   	}
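
For concreteness: with NUM_KERNEL_PDE = 15 as stated in the comment, and
assuming XE_PAGE_SIZE is SZ_4K, the reserved invalidation address works
out to (15 - 2) * 4096 = 53248 = 0xd000, a fixed PPGTT offset that stays
valid no matter where the VF's GGTT base lands after migration.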



* Re: [PATCH v3 31/36] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause
  2025-10-02  1:09   ` Lis, Tomasz
@ 2025-10-02  6:12     ` Matthew Brost
  0 siblings, 0 replies; 83+ messages in thread
From: Matthew Brost @ 2025-10-02  6:12 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-xe

On Thu, Oct 02, 2025 at 03:09:45AM +0200, Lis, Tomasz wrote:
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > A race condition exists where a paused VF's H2G request can be processed
> > and subsequently rejected. This rejection results in a FAST_REQ failure
> > being delivered to the KMD, which then terminates the CT via a dead
> > worker and triggers a GT reset—an undesirable outcome.
> > 
> > This workaround mitigates the issue by checking if a VF post-migration
> > recovery is in progress and aborting these adverse actions accordingly.
> > The GuC firmware will address this bug in an upcoming release. Once that
> > version is available and VF migration depends on it, this workaround can
> > be safely removed.
> 
> Shouldn't this be tagged with the corresponding GuC issue reference?
> 

I think that is an Intel private link.

Matt

> -Tomasz
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_guc_ct.c | 4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 25efc1f813ce..89ee68828f07 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -1394,6 +1394,10 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
> >   		fast_req_report(ct, fence);
> > +		/* FIXME: W/A race in the GuC, will get in firmware soon */
> > +		if (xe_gt_recovery_inprogress(gt))
> > +			return 0;
> > +
> >   		CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
> >   		return -EPROTO;


* Re: [PATCH v3 29/36] drm/xe: Move queue init before LRC creation
  2025-10-02  0:44   ` Lis, Tomasz
@ 2025-10-02  7:36     ` Matthew Brost
  2025-10-02 14:54       ` Lis, Tomasz
  0 siblings, 1 reply; 83+ messages in thread
From: Matthew Brost @ 2025-10-02  7:36 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-xe

On Thu, Oct 02, 2025 at 02:44:47AM +0200, Lis, Tomasz wrote:
> 
> On 9/29/2025 4:55 AM, Matthew Brost wrote:
> > A queue must be in the submission backend's tracking state before the
> > LRC is created to avoid a race condition where the LRC's GGTT addresses
> > are not properly fixed up during VF post-migration recovery.
> > 
> > Move the queue initialization—which adds the queue to the submission
> > backend's tracking state—before LRC creation.
> > 
> > v2:
> >   - Wait on VF GGTT fixes before creating LRC (testing)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_exec_queue.c        | 43 +++++++++++++++++------
> >   drivers/gpu/drm/xe/xe_execlist.c          |  2 +-
> >   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 39 +++++++++++++++++++-
> >   drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 ++
> >   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  5 +++
> >   drivers/gpu/drm/xe/xe_guc_submit.c        |  2 +-
> >   drivers/gpu/drm/xe/xe_lrc.h               | 10 ++++++
> >   7 files changed, 90 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > index 81f707d2c388..3db8e64d9d13 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > @@ -15,6 +15,7 @@
> >   #include "xe_dep_scheduler.h"
> >   #include "xe_device.h"
> >   #include "xe_gt.h"
> > +#include "xe_gt_sriov_vf.h"
> >   #include "xe_hw_engine_class_sysfs.h"
> >   #include "xe_hw_engine_group.h"
> >   #include "xe_hw_fence.h"
> > @@ -179,17 +180,32 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q)
> >   			flags |= XE_LRC_CREATE_RUNALONE;
> >   	}
> > +	err = q->ops->init(q);
> > +	if (err)
> > +		return err;
> > +
> > +	/*
> > +	 * This must occur after q->ops->init to avoid race conditions during VF
> > +	 * post-migration recovery, as the fixups for the LRC GGTT addresses
> > +	 * depend on the queue being present in the backend tracking structure.
> > +	 *
> > +	 * In addition to above, we must wait on inflight GGTT changes to
> > +	 * avoid writing out stale values here.
> > +	 */
> > +	xe_gt_sriov_vf_wait_valid_ggtt(q->gt);
> 
> So to avoid locks, we rely on the VF knowing it got migrated from the first
> moment after vCPU starts.
> 
> On `qemu`, we do have it this way - when vCPU starts the 'MIGRATED' memirq
> is already filled.
> 
> But what about other VM managers? What about future support of platforms
> without memirq?
> 
> I don't think the availability of information that we've got migrated from
> the first vCPU cycle is guaranteed by any specification.
> 

It is guaranteed by the design of Xe.

> 
> >   	for (i = 0; i < q->width; ++i) {
> > -		q->lrc[i] = xe_lrc_create(q->hwe, q->vm, SZ_16K, q->msix_vec, flags);
> > -		if (IS_ERR(q->lrc[i])) {
> > -			err = PTR_ERR(q->lrc[i]);
> > +		struct xe_lrc *lrc;
> > +
> > +		lrc = xe_lrc_create(q->hwe, q->vm, xe_lrc_ring_size(),
> > +				    q->msix_vec, flags);
> 
> If migration happened at this place, it is still possible to create a
> context with wrong GGTT references in the one LRC which was already filled
> but not integrated into the queue yet.
> 
> I don't think we can avoid races without a lock.
> 

There might be a small race here, let me think about this. I will say
this change fixes xe_exec_threads --r threads-many-queues, though.
Locking is definitely not the way to solve this - reclaim rules are in
play here, which makes locking difficult, and convoluted cross-layer
locks will always get nacked by myself and others.
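
To illustrate the constraint (hypothetical sketch, not code from this
series): LRC creation allocates memory, so any lock held around it must
never be taken from a reclaim path. The usual way to catch a violation
early is to prime lockdep with the fs_reclaim annotations:

#include <linux/mutex.h>
#include <linux/sched/mm.h>	/* fs_reclaim_acquire/release */

/*
 * Hypothetical: teach lockdep that allocations can happen under a
 * would-be migration lock. If the same lock were ever taken from a
 * reclaim path, lockdep would splat instead of leaving the deadlock
 * to be hit in the field.
 */
static void sketch_prime_reclaim_dependency(struct mutex *migration_lock)
{
	mutex_lock(migration_lock);
	fs_reclaim_acquire(GFP_KERNEL);
	fs_reclaim_release(GFP_KERNEL);
	mutex_unlock(migration_lock);
}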

Matt

> > +		if (IS_ERR(lrc)) {
> > +			err = PTR_ERR(lrc);
> >   			goto err_lrc;
> >   		}
> > -	}
> > -	err = q->ops->init(q);
> > -	if (err)
> > -		goto err_lrc;
> > +		/* Pairs with READ_ONCE to xe_exec_queue_contexts_hwsp_rebase */
> > +		WRITE_ONCE(q->lrc[i], lrc);
> > +	}
> >   	return 0;
> > @@ -1095,9 +1111,16 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
> >   	int err = 0;
> >   	for (i = 0; i < q->width; ++i) {
> > -		xe_lrc_update_memirq_regs_with_address(q->lrc[i], q->hwe, scratch);
> > -		xe_lrc_update_hwctx_regs_with_address(q->lrc[i]);
> > -		err = xe_lrc_setup_wa_bb_with_scratch(q->lrc[i], q->hwe, scratch);
> > +		struct xe_lrc *lrc;
> > +
> > +		/* Pairs with WRITE_ONCE in __xe_exec_queue_init  */
> > +		lrc = READ_ONCE(q->lrc[i]);
> > +		if (!lrc)
> > +			continue;
> > +
> > +		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
> > +		xe_lrc_update_hwctx_regs_with_address(lrc);
> > +		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
> >   		if (err)
> >   			break;
> >   	}
> > diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
> > index f83d421ac9d3..769d05517f93 100644
> > --- a/drivers/gpu/drm/xe/xe_execlist.c
> > +++ b/drivers/gpu/drm/xe/xe_execlist.c
> > @@ -339,7 +339,7 @@ static int execlist_exec_queue_init(struct xe_exec_queue *q)
> >   	const struct drm_sched_init_args args = {
> >   		.ops = &drm_sched_ops,
> >   		.num_rqs = 1,
> > -		.credit_limit = q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES,
> > +		.credit_limit = xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES,
> >   		.hang_limit = XE_SCHED_HANG_LIMIT,
> >   		.timeout = XE_SCHED_JOB_TIMEOUT,
> >   		.name = q->hwe->name,
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 0d94867dce8e..42f9fd43b436 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -482,6 +482,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
> >   				 shift, config->ggtt_base);
> >   		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> >   	}
> > +
> > +	WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
> > +	smp_wmb();	/* Ensure above write visible before wake */
> > +	wake_up_all(&gt->sriov.vf.migration.wq);
> > +
> >   out:
> >   	up_write(config->lock);
> >   	return err;
> > @@ -820,7 +825,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
> >   	    !gt->sriov.vf.migration.recovery_teardown) {
> >   		gt->sriov.vf.migration.recovery_queued = true;
> >   		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> > -		smp_wmb();	/* Ensure above write visable before wake */
> > +		WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, true);
> > +		smp_wmb();	/* Ensure above writes visable before wake */
> typo in patch "Wakeup in GuC backend on VF post migration recovery"
> >   		wake_up_all(&gt->uc.guc.ct.wq);
> > @@ -1344,6 +1350,7 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
> >   		&tile->primary_gt->sriov.vf.self_config.__lock;
> >   	spin_lock_init(&gt->sriov.vf.migration.lock);
> >   	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> > +	init_waitqueue_head(&gt->sriov.vf.migration.wq);
> >   	return 0;
> >   }
> > @@ -1387,3 +1394,33 @@ bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt)
> >   	return (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
> >   		READ_ONCE(gt->sriov.vf.migration.recovery_inprogress));
> >   }
> > +
> > +static bool vf_valid_ggtt(struct xe_gt *gt)
> > +{
> > +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
> > +
> > +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > +
> > +	if (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
> > +	    READ_ONCE(gt->sriov.vf.migration.ggtt_need_fixes))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +/**
> > + * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
> > + * @gt: the &xe_gt
> > + */
> > +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
> > +{
> > +	int ret;
> > +
> > +	if (!IS_SRIOV_VF(gt_to_xe(gt)))
> > +		return;
> > +
> > +	ret = wait_event_interruptible_timeout(gt->sriov.vf.migration.wq,
> > +					       vf_valid_ggtt(gt),
> > +					       HZ * 5);
> > +	XE_WARN_ON(!ret);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > index 71e1d566da81..20cc0c4c32e3 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > @@ -40,4 +40,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p);
> >   void xe_gt_sriov_vf_print_runtime(struct xe_gt *gt, struct drm_printer *p);
> >   void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
> > +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt);
> > +
> >   #endif
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index e135018cba1e..3c3e415199d1 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -8,6 +8,7 @@
> >   #include <linux/rwsem.h>
> >   #include <linux/types.h>
> > +#include <linux/wait.h>
> >   #include <linux/workqueue.h>
> >   #include "xe_uc_fw_types.h"
> > @@ -61,6 +62,8 @@ struct xe_gt_sriov_vf_migration {
> >   	struct work_struct worker;
> >   	/** @lock: Protects recovery_queued, teardown */
> >   	spinlock_t lock;
> > +	/** @wq: wait queue for migration fixes */
> > +	wait_queue_head_t wq;
> >   	/** @scratch: Scratch memory for VF recovery */
> >   	void *scratch;
> >   	/** @recovery_teardown: VF post migration recovery is being torn down */
> > @@ -69,6 +72,8 @@ struct xe_gt_sriov_vf_migration {
> >   	bool recovery_queued;
> >   	/** @recovery_inprogress: VF post migration recovery in progress */
> >   	bool recovery_inprogress;
> > +	/** @ggtt_need_fixes: VF GGTT needs fixes */
> > +	bool ggtt_need_fixes;
> >   };
> >   /**
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 497a736c23c3..7fe3fb07e35e 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1943,7 +1943,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
> >   	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
> >   		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
> >   	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
> > -			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
> > +			    NULL, xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES, 64,
> >   			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
> >   			    q->name, gt_to_xe(q->gt)->drm.dev);
> >   	if (err)
> > diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
> > index 188565465779..5fb6c74bdab5 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc.h
> > +++ b/drivers/gpu/drm/xe/xe_lrc.h
> > @@ -74,6 +74,16 @@ static inline void xe_lrc_put(struct xe_lrc *lrc)
> >   	kref_put(&lrc->refcount, xe_lrc_destroy);
> >   }
> > +/**
> > + * xe_lrc_ring_size() - Xe LRC ring size
> > + *
> > + * Return: Size of LRC size
> 
> Size of LRC ring buffer
> -Tomasz
> 
> > + */
> > +static inline size_t xe_lrc_ring_size(void)
> > +{
> > +	return SZ_16K;
> > +}
> > +
> >   size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class);
> >   u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc);
> >   u32 xe_lrc_regs_offset(struct xe_lrc *lrc);


* Re: [PATCH v3 29/36] drm/xe: Move queue init before LRC creation
  2025-10-02  7:36     ` Matthew Brost
@ 2025-10-02 14:54       ` Lis, Tomasz
  0 siblings, 0 replies; 83+ messages in thread
From: Lis, Tomasz @ 2025-10-02 14:54 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe


On 10/2/2025 9:36 AM, Matthew Brost wrote:
> On Thu, Oct 02, 2025 at 02:44:47AM +0200, Lis, Tomasz wrote:
>> On 9/29/2025 4:55 AM, Matthew Brost wrote:
>>> A queue must be in the submission backend's tracking state before the
>>> LRC is created to avoid a race condition where the LRC's GGTT addresses
>>> are not properly fixed up during VF post-migration recovery.
>>>
>>> Move the queue initialization—which adds the queue to the submission
>>> backend's tracking state—before LRC creation.
>>>
>>> v2:
>>>    - Wait on VF GGTT fixes before creating LRC (testing)
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/xe/xe_exec_queue.c        | 43 +++++++++++++++++------
>>>    drivers/gpu/drm/xe/xe_execlist.c          |  2 +-
>>>    drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 39 +++++++++++++++++++-
>>>    drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 ++
>>>    drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  5 +++
>>>    drivers/gpu/drm/xe/xe_guc_submit.c        |  2 +-
>>>    drivers/gpu/drm/xe/xe_lrc.h               | 10 ++++++
>>>    7 files changed, 90 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>>> index 81f707d2c388..3db8e64d9d13 100644
>>> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
>>> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>>> @@ -15,6 +15,7 @@
>>>    #include "xe_dep_scheduler.h"
>>>    #include "xe_device.h"
>>>    #include "xe_gt.h"
>>> +#include "xe_gt_sriov_vf.h"
>>>    #include "xe_hw_engine_class_sysfs.h"
>>>    #include "xe_hw_engine_group.h"
>>>    #include "xe_hw_fence.h"
>>> @@ -179,17 +180,32 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q)
>>>    			flags |= XE_LRC_CREATE_RUNALONE;
>>>    	}
>>> +	err = q->ops->init(q);
>>> +	if (err)
>>> +		return err;
>>> +
>>> +	/*
>>> +	 * This must occur after q->ops->init to avoid race conditions during VF
>>> +	 * post-migration recovery, as the fixups for the LRC GGTT addresses
>>> +	 * depend on the queue being present in the backend tracking structure.
>>> +	 *
>>> +	 * In addition to above, we must wait on inflight GGTT changes to
>>> +	 * avoid writing out stale values here.
>>> +	 */
>>> +	xe_gt_sriov_vf_wait_valid_ggtt(q->gt);
>> So to avoid locks, we rely on the VF knowing it got migrated from the first
>> moment after vCPU starts.
>>
>> On `qemu`, we do have it this way - when vCPU starts the 'MIGRATED' memirq
>> is already filled.
>>
>> But what about other VM managers? What about future support of platforms
>> without memirq?
>>
>> I don't think the availability of information that we've got migrated from
>> the first vCPU cycle is guaranteed by any specification.
>>
> It is guaranteed by the design of Xe.

You mean by the PF part? Because generally, Xe cannot make guarantees
about how the VMM works.

When state is restored, one of the chunks is the GuC state. After the
GuC state is restored, the GuC is expected to send the MIGRATED irq to
the VM. It sends the interrupt around the same time it answers to the
PF that the state restore completed. The vCPU is not started at that
point. When it finally starts, on qemu we see the interrupt set from
the start. And this should always be the case for memory-based IRQs,
because the interrupt data is stored in a VRAM buffer. However, in
general, this depends on how qemu implements the interrupt model. I
wonder if there are situations where that would become a problem.
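
Put differently, a sketch of the assumption (reusing the helper from the
patch; the standalone wrapper is only my illustration):

/*
 * A memory-backed MIGRATED bit is observable from the very first check
 * after the vCPU resumes - no interrupt delivery is needed for that.
 */
static bool sketch_migrated_before_first_check(struct xe_gt *gt)
{
	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;

	return xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc);
}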


>
>>>    	for (i = 0; i < q->width; ++i) {
>>> -		q->lrc[i] = xe_lrc_create(q->hwe, q->vm, SZ_16K, q->msix_vec, flags);
>>> -		if (IS_ERR(q->lrc[i])) {
>>> -			err = PTR_ERR(q->lrc[i]);
>>> +		struct xe_lrc *lrc;
>>> +
>>> +		lrc = xe_lrc_create(q->hwe, q->vm, xe_lrc_ring_size(),
>>> +				    q->msix_vec, flags);
>> If migration happened at this place, it is still possible to create a
>> context with wrong GGTT references in the one LRC which was already filled
>> but not integrated into the queue yet.
>>
>> I don't think we can avoid races without a lock.
>>
> There might be a small race here, let me think about this. I will say
> this change fixes xe_exec_threads --r threads-many-queues, though.
> Locking is definitely not the way to solve this - reclaim rules are in
> play here, which makes locking difficult, and convoluted cross-layer
> locks will always get nacked by myself and others.

Ok, if you can find a lockless solution again, that would be beneficial.

-Tomasz

>
> Matt
>
>>> +		if (IS_ERR(lrc)) {
>>> +			err = PTR_ERR(lrc);
>>>    			goto err_lrc;
>>>    		}
>>> -	}
>>> -	err = q->ops->init(q);
>>> -	if (err)
>>> -		goto err_lrc;
>>> +		/* Pairs with READ_ONCE to xe_exec_queue_contexts_hwsp_rebase */
>>> +		WRITE_ONCE(q->lrc[i], lrc);
>>> +	}
>>>    	return 0;
>>> @@ -1095,9 +1111,16 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
>>>    	int err = 0;
>>>    	for (i = 0; i < q->width; ++i) {
>>> -		xe_lrc_update_memirq_regs_with_address(q->lrc[i], q->hwe, scratch);
>>> -		xe_lrc_update_hwctx_regs_with_address(q->lrc[i]);
>>> -		err = xe_lrc_setup_wa_bb_with_scratch(q->lrc[i], q->hwe, scratch);
>>> +		struct xe_lrc *lrc;
>>> +
>>> +		/* Pairs with WRITE_ONCE in __xe_exec_queue_init  */
>>> +		lrc = READ_ONCE(q->lrc[i]);
>>> +		if (!lrc)
>>> +			continue;
>>> +
>>> +		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
>>> +		xe_lrc_update_hwctx_regs_with_address(lrc);
>>> +		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
>>>    		if (err)
>>>    			break;
>>>    	}
>>> diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
>>> index f83d421ac9d3..769d05517f93 100644
>>> --- a/drivers/gpu/drm/xe/xe_execlist.c
>>> +++ b/drivers/gpu/drm/xe/xe_execlist.c
>>> @@ -339,7 +339,7 @@ static int execlist_exec_queue_init(struct xe_exec_queue *q)
>>>    	const struct drm_sched_init_args args = {
>>>    		.ops = &drm_sched_ops,
>>>    		.num_rqs = 1,
>>> -		.credit_limit = q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES,
>>> +		.credit_limit = xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES,
>>>    		.hang_limit = XE_SCHED_HANG_LIMIT,
>>>    		.timeout = XE_SCHED_JOB_TIMEOUT,
>>>    		.name = q->hwe->name,
>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> index 0d94867dce8e..42f9fd43b436 100644
>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> @@ -482,6 +482,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
>>>    				 shift, config->ggtt_base);
>>>    		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
>>>    	}
>>> +
>>> +	WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
>>> +	smp_wmb();	/* Ensure above write visible before wake */
>>> +	wake_up_all(&gt->sriov.vf.migration.wq);
>>> +
>>>    out:
>>>    	up_write(config->lock);
>>>    	return err;
>>> @@ -820,7 +825,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
>>>    	    !gt->sriov.vf.migration.recovery_teardown) {
>>>    		gt->sriov.vf.migration.recovery_queued = true;
>>>    		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
>>> -		smp_wmb();	/* Ensure above write visable before wake */
>>> +		WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, true);
>>> +		smp_wmb();	/* Ensure above writes visable before wake */
>> typo in patch "Wakeup in GuC backend on VF post migration recovery"
>>>    		wake_up_all(&gt->uc.guc.ct.wq);
>>> @@ -1344,6 +1350,7 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>>>    		&tile->primary_gt->sriov.vf.self_config.__lock;
>>>    	spin_lock_init(&gt->sriov.vf.migration.lock);
>>>    	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
>>> +	init_waitqueue_head(&gt->sriov.vf.migration.wq);
>>>    	return 0;
>>>    }
>>> @@ -1387,3 +1394,33 @@ bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt)
>>>    	return (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
>>>    		READ_ONCE(gt->sriov.vf.migration.recovery_inprogress));
>>>    }
>>> +
>>> +static bool vf_valid_ggtt(struct xe_gt *gt)
>>> +{
>>> +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
>>> +
>>> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>>> +
>>> +	if (xe_memirq_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
>>> +	    READ_ONCE(gt->sriov.vf.migration.ggtt_need_fixes))
>>> +		return false;
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +/**
>>> + * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
>>> + * @gt: the &xe_gt
>>> + */
>>> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
>>> +{
>>> +	int ret;
>>> +
>>> +	if (!IS_SRIOV_VF(gt_to_xe(gt)))
>>> +		return;
>>> +
>>> +	ret = wait_event_interruptible_timeout(gt->sriov.vf.migration.wq,
>>> +					       vf_valid_ggtt(gt),
>>> +					       HZ * 5);
>>> +	XE_WARN_ON(!ret);
>>> +}
>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>>> index 71e1d566da81..20cc0c4c32e3 100644
>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>>> @@ -40,4 +40,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p);
>>>    void xe_gt_sriov_vf_print_runtime(struct xe_gt *gt, struct drm_printer *p);
>>>    void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
>>> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt);
>>> +
>>>    #endif
>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>> index e135018cba1e..3c3e415199d1 100644
>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
>>> @@ -8,6 +8,7 @@
>>>    #include <linux/rwsem.h>
>>>    #include <linux/types.h>
>>> +#include <linux/wait.h>
>>>    #include <linux/workqueue.h>
>>>    #include "xe_uc_fw_types.h"
>>> @@ -61,6 +62,8 @@ struct xe_gt_sriov_vf_migration {
>>>    	struct work_struct worker;
>>>    	/** @lock: Protects recovery_queued, teardown */
>>>    	spinlock_t lock;
>>> +	/** @wq: wait queue for migration fixes */
>>> +	wait_queue_head_t wq;
>>>    	/** @scratch: Scratch memory for VF recovery */
>>>    	void *scratch;
>>>    	/** @recovery_teardown: VF post migration recovery is being torn down */
>>> @@ -69,6 +72,8 @@ struct xe_gt_sriov_vf_migration {
>>>    	bool recovery_queued;
>>>    	/** @recovery_inprogress: VF post migration recovery in progress */
>>>    	bool recovery_inprogress;
>>> +	/** @ggtt_need_fixes: VF GGTT needs fixes */
>>> +	bool ggtt_need_fixes;
>>>    };
>>>    /**
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> index 497a736c23c3..7fe3fb07e35e 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> @@ -1943,7 +1943,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
>>>    	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
>>>    		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
>>>    	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
>>> -			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
>>> +			    NULL, xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES, 64,
>>>    			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
>>>    			    q->name, gt_to_xe(q->gt)->drm.dev);
>>>    	if (err)
>>> diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
>>> index 188565465779..5fb6c74bdab5 100644
>>> --- a/drivers/gpu/drm/xe/xe_lrc.h
>>> +++ b/drivers/gpu/drm/xe/xe_lrc.h
>>> @@ -74,6 +74,16 @@ static inline void xe_lrc_put(struct xe_lrc *lrc)
>>>    	kref_put(&lrc->refcount, xe_lrc_destroy);
>>>    }
>>> +/**
>>> + * xe_lrc_ring_size() - Xe LRC ring size
>>> + *
>>> + * Return: Size of LRC size
>> Size of LRC ring buffer
>> -Tomasz
>>
>>> + */
>>> +static inline size_t xe_lrc_ring_size(void)
>>> +{
>>> +	return SZ_16K;
>>> +}
>>> +
>>>    size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class);
>>>    u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc);
>>>    u32 xe_lrc_regs_offset(struct xe_lrc *lrc);


end of thread [~2025-10-02 14:54 UTC]

Thread overview: 83+ messages
2025-09-29  2:55 [PATCH v3 00/36] VF migration redesign Matthew Brost
2025-09-29  2:55 ` [PATCH v3 01/36] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
2025-09-30  2:06   ` Lis, Tomasz
2025-09-30 22:53     ` Matthew Brost
2025-09-29  2:55 ` [PATCH v3 02/36] drm/xe/vf: Lock querying GGTT config during driver init Matthew Brost
2025-09-29  7:42   ` Michal Wajdeczko
2025-09-29 12:15     ` Matthew Brost
2025-09-30  0:42       ` Lis, Tomasz
2025-09-30 10:25         ` Michal Wajdeczko
2025-09-29  8:13   ` Ville Syrjälä
2025-09-30 13:22     ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 03/36] Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery" Matthew Brost
2025-09-30 15:22   ` Michal Wajdeczko
2025-09-29  2:55 ` [PATCH v3 04/36] Revert "drm/xe/vf: Post migration, repopulate ring area for pending request" Matthew Brost
2025-09-30 15:24   ` Michal Wajdeczko
2025-09-29  2:55 ` [PATCH v3 05/36] Revert "drm/xe/vf: Fixup CTB send buffer messages after migration" Matthew Brost
2025-09-30 15:27   ` Michal Wajdeczko
2025-09-29  2:55 ` [PATCH v3 06/36] drm/xe: Save off position in ring in which a job was programmed Matthew Brost
2025-09-29  2:55 ` [PATCH v3 07/36] drm/xe/guc: Track pending-enable source in submission state Matthew Brost
2025-09-29  2:55 ` [PATCH v3 08/36] drm/xe: Track LR jobs in DRM scheduler pending list Matthew Brost
2025-09-29  2:55 ` [PATCH v3 09/36] drm/xe: Don't change LRC ring head on job resubmission Matthew Brost
2025-09-30  2:38   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 10/36] drm/xe: Make LRC W/A scratch buffer usage consistent Matthew Brost
2025-09-29  2:55 ` [PATCH v3 11/36] drm/xe/guc: Document GuC submission backend Matthew Brost
2025-09-30  3:28   ` Lis, Tomasz
2025-09-30  6:30     ` Matthew Brost
2025-09-29  2:55 ` [PATCH v3 12/36] drm/xe/vf: Add xe_gt_recovery_inprogress helper Matthew Brost
2025-09-29  8:04   ` Michal Wajdeczko
2025-09-29  8:52     ` Matthew Brost
2025-09-29  2:55 ` [PATCH v3 13/36] drm/xe/vf: Make VF recovery run on per-GT worker Matthew Brost
2025-09-30 14:47   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 14/36] drm/xe/vf: Abort H2G sends during VF post-migration recovery Matthew Brost
2025-09-29  8:17   ` Michal Wajdeczko
2025-09-29  2:55 ` [PATCH v3 15/36] drm/xe/vf: Remove memory allocations from VF post migration recovery Matthew Brost
2025-09-30 15:00   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 16/36] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
2025-09-29  8:44   ` Michal Wajdeczko
2025-09-29 12:31     ` Matthew Brost
2025-09-29  2:55 ` [PATCH v3 17/36] drm/xe/vf: Teardown VF post migration worker on driver unload Matthew Brost
2025-09-30 16:24   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 18/36] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery Matthew Brost
2025-09-29  9:17   ` Michal Wajdeczko
2025-09-29 12:50     ` Matthew Brost
2025-09-29  2:55 ` [PATCH v3 19/36] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
2025-09-29  2:55 ` [PATCH v3 20/36] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration Matthew Brost
2025-10-01 13:45   ` Lis, Tomasz
2025-10-01 13:56     ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 21/36] drm/xe/vf: Extra debug on GGTT shift Matthew Brost
2025-09-29  2:55 ` [PATCH v3 22/36] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
2025-09-29  2:55 ` [PATCH v3 23/36] drm/xe/vf: Flush and stop CTs in VF post migration recovery Matthew Brost
2025-09-29 21:31   ` Michal Wajdeczko
2025-09-29  2:55 ` [PATCH v3 24/36] drm/xe/vf: Reset TLB invalidations during " Matthew Brost
2025-10-01 13:53   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 25/36] drm/xe/vf: Kickstart after resfix in " Matthew Brost
2025-09-29  2:55 ` [PATCH v3 26/36] drm/xe/vf: Start CTs before resfix " Matthew Brost
2025-09-29 21:49   ` Michal Wajdeczko
2025-09-30  6:26     ` Matthew Brost
2025-09-29  2:55 ` [PATCH v3 27/36] drm/xe/vf: Abort VF post migration recovery on failure Matthew Brost
2025-10-01 14:06   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 28/36] drm/xe/vf: Replay GuC submission state on pause / unpause Matthew Brost
2025-10-01 14:37   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 29/36] drm/xe: Move queue init before LRC creation Matthew Brost
2025-10-02  0:44   ` Lis, Tomasz
2025-10-02  7:36     ` Matthew Brost
2025-10-02 14:54       ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 30/36] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery Matthew Brost
2025-10-02  1:02   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 31/36] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause Matthew Brost
2025-10-02  1:09   ` Lis, Tomasz
2025-10-02  6:12     ` Matthew Brost
2025-09-29  2:55 ` [PATCH v3 32/36] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups Matthew Brost
2025-10-02  1:25   ` Lis, Tomasz
2025-09-29  2:55 ` [PATCH v3 33/36] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF Matthew Brost
2025-09-29  2:55 ` [PATCH v3 34/36] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL Matthew Brost
2025-09-29  2:55 ` [PATCH v3 35/36] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses Matthew Brost
2025-09-29  2:55 ` [PATCH v3 36/36] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Matthew Brost
2025-09-29 15:17   ` K V P, Satyanarayana
2025-09-30 12:39     ` Matthew Brost
2025-09-30 13:38       ` Michal Wajdeczko
2025-09-30 14:39         ` Matthew Brost
2025-09-29  3:06 ` ✗ CI.checkpatch: warning for VF migration redesign (rev3) Patchwork
2025-09-29  3:08 ` ✓ CI.KUnit: success " Patchwork
2025-09-29  6:28 ` ✗ Xe.CI.Full: failure " Patchwork
