[PATCH v6 00/30] VF migration redesign

Intel-XE Archive on lore.kernel.org

* [PATCH v6 00/30] VF migration redesign
@ 2025-10-06 11:10 Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 01/30] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
                   ` (34 more replies)
  0 siblings, 35 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Rather than modifying buffers in place using GGTT addresses during VF
migration, this approach relies on the submission backend's stop/start
mechanism to issue fixups. The patch titled "Document GuC Submission
Backend" provides a detailed explanation of the design.

Testing was performed using an out-of-tree PF/VFIO driver, with VF
migration triggered manually while IGT test cases were running.

IGT test cases:

- A new series [1] that exercises active contexts, job resubmission, and
  compressed memory.

- A new test [2] that creates and destroys a queue on each submission.

- xe_exec_threads basic sections, which test context registration loss,
  schedule enable loss, and job resubmission.

- xe_exec_threads balancer sections, which follow the same flows as the 
  basic sections but include a work queue (GGTT address shift).

- xe_exec_threads compute mode user pointer invalidation sections, which
  exercise the same flow as the basic sections, plus replaying
  suspend/resume flows.

All code paths in "Replay GuC submission state on pause/unpause" that
replay state have been manually verified using the debug messages added
in "Add debug prints for GuC replaying state during VF recovery".

v2:
 - Fix lockdep splat
 - Fix checkpatch
 - Fix PTL issue with LRC W/A buffer
 - Fix race creating / destroying queues across migration exposed by [2]
 - Include a version of Satya's patches in [3] which enable CCS save /
   restore across VF migration /w GGTT shift
v3:
 - Address feedback
 - Fix preempt fence mode deadlock /w work queues + VF recovery (Testing)
 - Add NULL checks to scratch LRC allocation
v4:
 - Fix CI failure
 - Remove config lock
v5:
 - Fix CI failures related to lockdep
 - Address various comments
v6:
 - Rebase for CI

Matt

Matthew Brost (28):
  drm/xe: Add NULL checks to scratch LRC allocation
  drm/xe: Save off position in ring in which a job was programmed
  drm/xe/guc: Track pending-enable source in submission state
  drm/xe: Track LR jobs in DRM scheduler pending list
  drm/xe: Don't change LRC ring head on job resubmission
  drm/xe: Make LRC W/A scratch buffer usage consistent
  drm/xe/vf: Add xe_gt_recovery_pending helper
  drm/xe/vf: Make VF recovery run on per-GT worker
  drm/xe/vf: Abort H2G sends during VF post-migration recovery
  drm/xe/vf: Remove memory allocations from VF post migration recovery
  drm/xe/vf: Close multi-GT GGTT shift race
  drm/xe/vf: Teardown VF post migration worker on driver unload
  drm/xe/vf: Don't allow GT reset to be queued during VF post migration
    recovery
  drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs
    supporting migration
  drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  drm/xe/vf: Flush and stop CTs in VF post migration recovery
  drm/xe/vf: Reset TLB invalidations during VF post migration recovery
  drm/xe/vf: Kickstart after resfix in VF post migration recovery
  drm/xe/vf: Start CTs before resfix VF post migration recovery
  drm/xe/vf: Abort VF post migration recovery on failure
  drm/xe/vf: Replay GuC submission state on pause / unpause
  drm/xe: Move queue init before LRC creation
  drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
  drm/xe/vf: Workaround for race condition in GuC firmware during VF
    pause
  drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
  drm/xe/vf: Rebase CCS save/restore BB GGTT addresses

Satyanarayana K V P (2):
  drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
  drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC

 drivers/gpu/drm/xe/xe_device_types.h         |   5 +
 drivers/gpu/drm/xe/xe_exec.c                 |  12 +-
 drivers/gpu/drm/xe/xe_exec_queue.c           |  64 +--
 drivers/gpu/drm/xe/xe_exec_queue.h           |   2 -
 drivers/gpu/drm/xe/xe_exec_queue_types.h     |   3 +
 drivers/gpu/drm/xe/xe_execlist.c             |   2 +-
 drivers/gpu/drm/xe/xe_gpu_scheduler.c        |  14 +
 drivers/gpu/drm/xe/xe_gpu_scheduler.h        |   2 +
 drivers/gpu/drm/xe/xe_gt.c                   |  28 +-
 drivers/gpu/drm/xe/xe_gt.h                   |  15 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c          | 458 +++++++++++++----
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h          |  13 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h    |  33 +-
 drivers/gpu/drm/xe/xe_guc.c                  |   4 +-
 drivers/gpu/drm/xe/xe_guc_ct.c               | 121 +++--
 drivers/gpu/drm/xe/xe_guc_ct.h               |  11 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |  15 +
 drivers/gpu/drm/xe/xe_guc_submit.c           | 486 +++++++++++++++----
 drivers/gpu/drm/xe/xe_guc_submit.h           |   5 +-
 drivers/gpu/drm/xe/xe_lrc.c                  |  15 +-
 drivers/gpu/drm/xe/xe_lrc.h                  |  10 +
 drivers/gpu/drm/xe/xe_memirq.c               |  48 +-
 drivers/gpu/drm/xe/xe_memirq.h               |   2 +
 drivers/gpu/drm/xe/xe_migrate.c              |  28 +-
 drivers/gpu/drm/xe/xe_pci.c                  |   6 +-
 drivers/gpu/drm/xe/xe_pci_types.h            |   1 +
 drivers/gpu/drm/xe/xe_preempt_fence.c        |  11 +
 drivers/gpu/drm/xe/xe_ring_ops.c             |  23 +-
 drivers/gpu/drm/xe/xe_sched_job_types.h      |   9 +
 drivers/gpu/drm/xe/xe_sriov_vf.c             | 240 ---------
 drivers/gpu/drm/xe/xe_sriov_vf.h             |   1 -
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.c         |  28 ++
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.h         |   1 +
 drivers/gpu/drm/xe/xe_sriov_vf_types.h       |   4 -
 drivers/gpu/drm/xe/xe_tile.c                 |   2 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.c        |  30 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.h        |   2 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h  |  23 +
 drivers/gpu/drm/xe/xe_vm.c                   |  26 +-
 drivers/gpu/drm/xe/xe_vram.c                 |   6 +-
 40 files changed, 1250 insertions(+), 559 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h

-- 
2.34.1



* [PATCH v6 01/30] drm/xe: Add NULL checks to scratch LRC allocation
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 21:51   ` Lis, Tomasz
  2025-10-06 11:10 ` [PATCH v6 02/30] drm/xe: Save off position in ring in which a job was programmed Matthew Brost
                   ` (33 subsequent siblings)
  34 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

kmalloc can fail, so the returned value must be checked for NULL. The
check should come immediately after the kmalloc call for clarity.
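
The pattern described above can be sketched in userspace C. This is only
an illustration, not the driver code: malloc stands in for kmalloc, and
struct demo_lrc, demo_fill() and demo_setup() are hypothetical names
that do not exist in xe. The allocator checks the pointer right where it
is allocated, so the consumer can assert instead of returning -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the LRC object; not the real struct xe_lrc. */
struct demo_lrc {
	bool is_iomem;	/* plays the role of lrc->bo->vmap.is_iomem */
};

/* Hypothetical consumer: the caller guarantees buf when is_iomem is set. */
static int demo_fill(struct demo_lrc *lrc, unsigned char *buf, size_t size)
{
	if (lrc->is_iomem) {
		assert(buf);	/* allocation failure was handled by the caller */
		memset(buf, 0, size);
	}
	return 0;
}

/* Allocate the scratch buffer only when needed and check it on the spot. */
static int demo_setup(struct demo_lrc *lrc, size_t size)
{
	unsigned char *buf = NULL;
	int ret;

	if (lrc->is_iomem) {
		buf = malloc(size);
		if (!buf)
			return -ENOMEM;
	}

	ret = demo_fill(lrc, buf, size);
	free(buf);	/* free(NULL) is a no-op */
	return ret;
}
```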

v5:
 - Assert state->buffer in setup_bo if buffer is iomem (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_lrc.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index af09f70f6e78..2c6eae2de1f2 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1214,8 +1214,7 @@ static int setup_bo(struct bo_setup_state *state)
 	ssize_t remain;
 
 	if (state->lrc->bo->vmap.is_iomem) {
-		if (!state->buffer)
-			return -ENOMEM;
+		xe_gt_assert(state->hwe->gt, state->buffer);
 		state->ptr = state->buffer;
 	} else {
 		state->ptr = state->lrc->bo->vmap.vaddr + state->offset;
@@ -1303,8 +1302,11 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
 	u32 *buf = NULL;
 	int ret;
 
-	if (lrc->bo->vmap.is_iomem)
+	if (lrc->bo->vmap.is_iomem) {
 		buf = kmalloc(LRC_WA_BB_SIZE, GFP_KERNEL);
+		if (!buf)
+			return -ENOMEM;
+	}
 
 	ret = xe_lrc_setup_wa_bb_with_scratch(lrc, hwe, buf);
 
@@ -1347,8 +1349,11 @@ setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
 	if (xe_gt_WARN_ON(lrc->gt, !state.funcs))
 		return 0;
 
-	if (lrc->bo->vmap.is_iomem)
+	if (lrc->bo->vmap.is_iomem) {
 		state.buffer = kmalloc(state.max_size, GFP_KERNEL);
+		if (!state.buffer)
+			return -ENOMEM;
+	}
 
 	ret = setup_bo(&state);
 	if (ret) {
-- 
2.34.1



* [PATCH v6 02/30] drm/xe: Save off position in ring in which a job was programmed
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 01/30] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 03/30] drm/xe/guc: Track pending-enable source in submission state Matthew Brost
                   ` (32 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

VF post-migration recovery needs to modify the ring with updated GGTT
addresses for pending jobs. To facilitate this, save the position in
the ring at which each job was programmed.
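
A minimal sketch of the idea, under simplifying assumptions: the ring is
a small array indexed in dwords, each job is a single dword holding an
address, and all demo_* names are hypothetical. The driver's ring and
job layout are more involved; the point is only that recording the tail
at emit time lets a job be rewritten in place later.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RING_DW 16u	/* hypothetical ring size in dwords (power of two) */

struct demo_ring {
	uint32_t dw[DEMO_RING_DW];
	uint32_t tail;		/* software tail, in dwords */
};

struct demo_job {
	uint32_t head;		/* ring tail at the time the job was emitted */
	uint32_t addr;		/* stand-in for a GGTT address used by the job */
};

/* Emit a job and remember where in the ring it starts. */
static void demo_emit(struct demo_ring *ring, struct demo_job *job)
{
	job->head = ring->tail;
	ring->dw[ring->tail] = job->addr;
	ring->tail = (ring->tail + 1) & (DEMO_RING_DW - 1);
}

/* Rewrite a pending job in place after its address has shifted. */
static void demo_reemit(struct demo_ring *ring, struct demo_job *job,
			uint32_t shift)
{
	job->addr += shift;
	ring->dw[job->head] = job->addr;
}
```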

v4:
 - s/VF resume/VF post-migration recovery (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_ring_ops.c        | 23 +++++++++++++++++++----
 drivers/gpu/drm/xe/xe_sched_job_types.h |  5 +++++
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index d71837773d6c..ac0c6dcffe15 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -245,12 +245,14 @@ static int emit_copy_timestamp(struct xe_lrc *lrc, u32 *dw, int i)
 
 /* for engines that don't require any special HW handling (no EUs, no aux inval, etc) */
 static void __emit_job_gen12_simple(struct xe_sched_job *job, struct xe_lrc *lrc,
-				    u64 batch_addr, u32 seqno)
+				    u64 batch_addr, u32 *head, u32 seqno)
 {
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 	u32 ppgtt_flag = get_ppgtt_flag(job);
 	struct xe_gt *gt = job->q->gt;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	if (job->ring_ops_flush_tlb) {
@@ -296,7 +298,7 @@ static bool has_aux_ccs(struct xe_device *xe)
 }
 
 static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
-				   u64 batch_addr, u32 seqno)
+				   u64 batch_addr, u32 *head, u32 seqno)
 {
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 	u32 ppgtt_flag = get_ppgtt_flag(job);
@@ -304,6 +306,8 @@ static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
 	struct xe_device *xe = gt_to_xe(gt);
 	bool decode = job->q->class == XE_ENGINE_CLASS_VIDEO_DECODE;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	dw[i++] = preparser_disable(true);
@@ -346,7 +350,8 @@ static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
 
 static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 					    struct xe_lrc *lrc,
-					    u64 batch_addr, u32 seqno)
+					    u64 batch_addr, u32 *head,
+					    u32 seqno)
 {
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 	u32 ppgtt_flag = get_ppgtt_flag(job);
@@ -355,6 +360,8 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 	bool lacks_render = !(gt->info.engine_mask & XE_HW_ENGINE_RCS_MASK);
 	u32 mask_flags = 0;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	dw[i++] = preparser_disable(true);
@@ -396,11 +403,14 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 }
 
 static void emit_migration_job_gen12(struct xe_sched_job *job,
-				     struct xe_lrc *lrc, u32 seqno)
+				     struct xe_lrc *lrc, u32 *head,
+				     u32 seqno)
 {
 	u32 saddr = xe_lrc_start_seqno_ggtt_addr(lrc);
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 
+	*head = lrc->ring.tail;
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	i = emit_store_imm_ggtt(saddr, seqno, dw, i);
@@ -434,6 +444,7 @@ static void emit_job_gen12_gsc(struct xe_sched_job *job)
 
 	__emit_job_gen12_simple(job, job->q->lrc[0],
 				job->ptrs[0].batch_addr,
+				&job->ptrs[0].head,
 				xe_sched_job_lrc_seqno(job));
 }
 
@@ -443,6 +454,7 @@ static void emit_job_gen12_copy(struct xe_sched_job *job)
 
 	if (xe_sched_job_is_migration(job->q)) {
 		emit_migration_job_gen12(job, job->q->lrc[0],
+					 &job->ptrs[0].head,
 					 xe_sched_job_lrc_seqno(job));
 		return;
 	}
@@ -450,6 +462,7 @@ static void emit_job_gen12_copy(struct xe_sched_job *job)
 	for (i = 0; i < job->q->width; ++i)
 		__emit_job_gen12_simple(job, job->q->lrc[i],
 					job->ptrs[i].batch_addr,
+					&job->ptrs[i].head,
 					xe_sched_job_lrc_seqno(job));
 }
 
@@ -461,6 +474,7 @@ static void emit_job_gen12_video(struct xe_sched_job *job)
 	for (i = 0; i < job->q->width; ++i)
 		__emit_job_gen12_video(job, job->q->lrc[i],
 				       job->ptrs[i].batch_addr,
+				       &job->ptrs[i].head,
 				       xe_sched_job_lrc_seqno(job));
 }
 
@@ -471,6 +485,7 @@ static void emit_job_gen12_render_compute(struct xe_sched_job *job)
 	for (i = 0; i < job->q->width; ++i)
 		__emit_job_gen12_render_compute(job, job->q->lrc[i],
 						job->ptrs[i].batch_addr,
+						&job->ptrs[i].head,
 						xe_sched_job_lrc_seqno(job));
 }
 
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index dbf260dded8d..7ce58765a34a 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -24,6 +24,11 @@ struct xe_job_ptrs {
 	struct dma_fence_chain *chain_fence;
 	/** @batch_addr: Batch buffer address. */
 	u64 batch_addr;
+	/**
+	 * @head: The tail pointer of the LRC (so head pointer of job) when the
+	 * job was submitted
+	 */
+	u32 head;
 };
 
 /**
-- 
2.34.1



* [PATCH v6 03/30] drm/xe/guc: Track pending-enable source in submission state
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 01/30] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 02/30] drm/xe: Save off position in ring in which a job was programmed Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 04/30] drm/xe: Track LR jobs in DRM scheduler pending list Matthew Brost
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Add explicit tracking in the GuC submission state to record the source
of a pending enable (TDR vs. queue resume path vs. submission).
Disambiguating the origin lets the GuC submission state machine apply
the correct recovery/replay behavior.

This helps VF restore: when the device comes back, the state machine knows
whether the pending enable stems from timeout recovery, from a queue resume
sequence, or from submission, and can gate sequencing and fixups accordingly.
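
As a rough illustration of how such flags can disambiguate the origin,
the sketch below uses C11 atomics in place of the kernel's atomic_t.
The bit values, enum demo_source and demo_pending_source() are
hypothetical; the series itself only adds the two state bits and their
set/clear helpers, and how later patches consume them is not shown here.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_PENDING_ENABLE	(1u << 0)
#define DEMO_PENDING_RESUME	(1u << 1)
#define DEMO_PENDING_TDR_EXIT	(1u << 2)

/* Hypothetical classification of where a pending enable came from. */
enum demo_source {
	DEMO_SRC_NONE,
	DEMO_SRC_SUBMIT,
	DEMO_SRC_RESUME,
	DEMO_SRC_TDR,
};

static void demo_set(atomic_uint *state, unsigned int bits)
{
	atomic_fetch_or(state, bits);
}

static void demo_clear(atomic_uint *state, unsigned int bits)
{
	atomic_fetch_and(state, ~bits);
}

/* With no extra source bit set, the enable is attributed to submission. */
static enum demo_source demo_pending_source(atomic_uint *state)
{
	unsigned int s = atomic_load(state);

	if (!(s & DEMO_PENDING_ENABLE))
		return DEMO_SRC_NONE;
	if (s & DEMO_PENDING_TDR_EXIT)
		return DEMO_SRC_TDR;
	if (s & DEMO_PENDING_RESUME)
		return DEMO_SRC_RESUME;
	return DEMO_SRC_SUBMIT;
}
```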

v4:
 - Clarify commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 36 ++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 16f78376f196..13746f32b231 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -69,6 +69,8 @@ exec_queue_to_guc(struct xe_exec_queue *q)
 #define EXEC_QUEUE_STATE_BANNED			(1 << 9)
 #define EXEC_QUEUE_STATE_CHECK_TIMEOUT		(1 << 10)
 #define EXEC_QUEUE_STATE_EXTRA_REF		(1 << 11)
+#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 12)
+#define EXEC_QUEUE_STATE_PENDING_TDR_EXIT	(1 << 13)
 
 static bool exec_queue_registered(struct xe_exec_queue *q)
 {
@@ -220,6 +222,36 @@ static void set_exec_queue_extra_ref(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
 }
 
+static bool __maybe_unused exec_queue_pending_resume(struct xe_exec_queue *q)
+{
+	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
+}
+
+static void set_exec_queue_pending_resume(struct xe_exec_queue *q)
+{
+	atomic_or(EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
+}
+
+static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
+}
+
+static bool __maybe_unused exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+{
+	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
+}
+
+static void set_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+{
+	atomic_or(EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
+}
+
+static void clear_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
+}
+
 static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
 {
 	return (atomic_read(&q->guc->state) &
@@ -1334,6 +1366,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	return DRM_GPU_SCHED_STAT_RESET;
 
 sched_enable:
+	set_exec_queue_pending_tdr_exit(q);
 	enable_scheduling(q);
 rearm:
 	/*
@@ -1493,6 +1526,7 @@ static void __guc_exec_queue_process_msg_resume(struct xe_sched_msg *msg)
 		clear_exec_queue_suspended(q);
 		if (!exec_queue_enabled(q)) {
 			q->guc->resume_time = RESUME_PENDING;
+			set_exec_queue_pending_resume(q);
 			enable_scheduling(q);
 		}
 	} else {
@@ -2065,6 +2099,8 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
 		xe_gt_assert(guc_to_gt(guc), exec_queue_pending_enable(q));
 
 		q->guc->resume_time = ktime_get();
+		clear_exec_queue_pending_resume(q);
+		clear_exec_queue_pending_tdr_exit(q);
 		clear_exec_queue_pending_enable(q);
 		smp_wmb();
 		wake_up_all(&guc->ct.wq);
-- 
2.34.1



* [PATCH v6 04/30] drm/xe: Track LR jobs in DRM scheduler pending list
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (2 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 03/30] drm/xe/guc: Track pending-enable source in submission state Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 05/30] drm/xe: Don't change LRC ring head on job resubmission Matthew Brost
                   ` (30 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

VF migration requires jobs to remain pending so they can be replayed
after the VF comes back. Previously, LR job fences were intentionally
signaled immediately after submission to avoid the risk of exporting
them, as these fences do not naturally signal in a timely manner and
could break dma-fence contracts. A side effect of this approach was that
LR jobs were never added to the DRM scheduler’s pending list, preventing
them from being tracked for later resubmission.

We now avoid signaling LR job fences and ensure they are never exported;
Xe already guards against exporting these internal fences. With that
guarantee in place, we can safely track LR jobs in the scheduler’s
pending list so they are eligible for resubmission during VF
post-migration recovery (and similar recovery paths).

An added benefit is that LR queues now gain the DRM scheduler’s built-in
flow control over ring usage rather than rejecting new jobs in the exec
IOCTL if the ring is full.

v2:
 - Ensure DRM scheduler TDR doesn't run for LR jobs
 - Stack variable for killed_or_banned_or_wedged
v4:
 - Clarify commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_exec.c       | 12 ++-------
 drivers/gpu/drm/xe/xe_exec_queue.c | 19 -------------
 drivers/gpu/drm/xe/xe_exec_queue.h |  2 --
 drivers/gpu/drm/xe/xe_guc_submit.c | 43 ++++++++++++++++++++----------
 4 files changed, 31 insertions(+), 45 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
index 83897950f0da..0dc27476832b 100644
--- a/drivers/gpu/drm/xe/xe_exec.c
+++ b/drivers/gpu/drm/xe/xe_exec.c
@@ -124,7 +124,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	struct xe_validation_ctx ctx;
 	struct xe_sched_job *job;
 	struct xe_vm *vm;
-	bool write_locked, skip_retry = false;
+	bool write_locked;
 	int err = 0;
 	struct xe_hw_engine_group *group;
 	enum xe_hw_engine_group_execution_mode mode, previous_mode;
@@ -266,12 +266,6 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto err_exec;
 	}
 
-	if (xe_exec_queue_is_lr(q) && xe_exec_queue_ring_full(q)) {
-		err = -EWOULDBLOCK;	/* Aliased to -EAGAIN */
-		skip_retry = true;
-		goto err_exec;
-	}
-
 	if (xe_exec_queue_uses_pxp(q)) {
 		err = xe_vm_validate_protected(q->vm);
 		if (err)
@@ -328,8 +322,6 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		xe_sched_job_init_user_fence(job, &syncs[i]);
 	}
 
-	if (xe_exec_queue_is_lr(q))
-		q->ring_ops->emit_job(job);
 	if (!xe_vm_in_lr_mode(vm))
 		xe_exec_queue_last_fence_set(q, vm, &job->drm.s_fence->finished);
 	xe_sched_job_push(job);
@@ -355,7 +347,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		xe_validation_ctx_fini(&ctx);
 err_unlock_list:
 	up_read(&vm->lock);
-	if (err == -EAGAIN && !skip_retry)
+	if (err == -EAGAIN)
 		goto retry;
 err_hw_exec_mode:
 	if (mode == EXEC_MODE_DMA_FENCE)
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index df82463b19f6..7621089a47fe 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -850,25 +850,6 @@ bool xe_exec_queue_is_lr(struct xe_exec_queue *q)
 		!(q->flags & EXEC_QUEUE_FLAG_VM);
 }
 
-static s32 xe_exec_queue_num_job_inflight(struct xe_exec_queue *q)
-{
-	return q->lrc[0]->fence_ctx.next_seqno - xe_lrc_seqno(q->lrc[0]) - 1;
-}
-
-/**
- * xe_exec_queue_ring_full() - Whether an exec_queue's ring is full
- * @q: The exec_queue
- *
- * Return: True if the exec_queue's ring is full, false otherwise.
- */
-bool xe_exec_queue_ring_full(struct xe_exec_queue *q)
-{
-	struct xe_lrc *lrc = q->lrc[0];
-	s32 max_job = lrc->ring.size / MAX_JOB_SIZE_BYTES;
-
-	return xe_exec_queue_num_job_inflight(q) >= max_job;
-}
-
 /**
  * xe_exec_queue_is_idle() - Whether an exec_queue is idle.
  * @q: The exec_queue
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
index 8821ceb838d0..a4dfbe858bda 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -64,8 +64,6 @@ static inline bool xe_exec_queue_uses_pxp(struct xe_exec_queue *q)
 
 bool xe_exec_queue_is_lr(struct xe_exec_queue *q);
 
-bool xe_exec_queue_ring_full(struct xe_exec_queue *q);
-
 bool xe_exec_queue_is_idle(struct xe_exec_queue *q);
 
 void xe_exec_queue_kill(struct xe_exec_queue *q);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 13746f32b231..3a534d93505f 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -851,30 +851,31 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	struct xe_sched_job *job = to_xe_sched_job(drm_job);
 	struct xe_exec_queue *q = job->q;
 	struct xe_guc *guc = exec_queue_to_guc(q);
-	struct dma_fence *fence = NULL;
-	bool lr = xe_exec_queue_is_lr(q);
+	bool lr = xe_exec_queue_is_lr(q), killed_or_banned_or_wedged =
+		exec_queue_killed_or_banned_or_wedged(q);
 
 	xe_gt_assert(guc_to_gt(guc), !(exec_queue_destroyed(q) || exec_queue_pending_disable(q)) ||
 		     exec_queue_banned(q) || exec_queue_suspended(q));
 
 	trace_xe_sched_job_run(job);
 
-	if (!exec_queue_killed_or_banned_or_wedged(q) && !xe_sched_job_is_error(job)) {
+	if (!killed_or_banned_or_wedged && !xe_sched_job_is_error(job)) {
 		if (!exec_queue_registered(q))
 			register_exec_queue(q, GUC_CONTEXT_NORMAL);
-		if (!lr)	/* LR jobs are emitted in the exec IOCTL */
-			q->ring_ops->emit_job(job);
+		q->ring_ops->emit_job(job);
 		submit_exec_queue(q);
 	}
 
-	if (lr) {
-		xe_sched_job_set_error(job, -EOPNOTSUPP);
-		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
-	} else {
-		fence = job->fence;
-	}
+	/*
+	 * We don't care about job-fence ordering in LR VMs because these fences
+	 * are never exported; they are used solely to keep jobs on the pending
+	 * list. Once a queue enters an error state, there's no need to track
+	 * them.
+	 */
+	if (killed_or_banned_or_wedged && lr)
+		xe_sched_job_set_error(job, -ECANCELED);
 
-	return fence;
+	return job->fence;
 }
 
 static void guc_exec_queue_free_job(struct drm_sched_job *drm_job)
@@ -916,7 +917,8 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
 		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
 		xe_sched_submission_start(sched);
 		xe_gt_reset_async(q->gt);
-		xe_sched_tdr_queue_imm(sched);
+		if (!xe_exec_queue_is_lr(q))
+			xe_sched_tdr_queue_imm(sched);
 		return;
 	}
 
@@ -1008,6 +1010,7 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	struct xe_exec_queue *q = ge->q;
 	struct xe_guc *guc = exec_queue_to_guc(q);
 	struct xe_gpu_scheduler *sched = &ge->sched;
+	struct xe_sched_job *job;
 	bool wedged = false;
 
 	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
@@ -1058,7 +1061,16 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	if (!exec_queue_killed(q) && !xe_lrc_ring_is_idle(q->lrc[0]))
 		xe_devcoredump(q, NULL, "LR job cleanup, guc_id=%d", q->guc->id);
 
+	xe_hw_fence_irq_stop(q->fence_irq);
+
 	xe_sched_submission_start(sched);
+
+	spin_lock(&sched->base.job_list_lock);
+	list_for_each_entry(job, &sched->base.pending_list, drm.list)
+		xe_sched_job_set_error(job, -ECANCELED);
+	spin_unlock(&sched->base.job_list_lock);
+
+	xe_hw_fence_irq_start(q->fence_irq);
 }
 
 #define ADJUST_FIVE_PERCENT(__t)	mul_u64_u32_div(__t, 105, 100)
@@ -1129,7 +1141,8 @@ static void enable_scheduling(struct xe_exec_queue *q)
 		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
 		set_exec_queue_banned(q);
 		xe_gt_reset_async(q->gt);
-		xe_sched_tdr_queue_imm(&q->guc->sched);
+		if (!xe_exec_queue_is_lr(q))
+			xe_sched_tdr_queue_imm(&q->guc->sched);
 	}
 }
 
@@ -1187,6 +1200,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	int i = 0;
 	bool wedged = false, skip_timeout_check;
 
+	xe_gt_assert(guc_to_gt(guc), !xe_exec_queue_is_lr(q));
+
 	/*
 	 * TDR has fired before free job worker. Common if exec queue
 	 * immediately closed after last fence signaled. Add back to pending
-- 
2.34.1



* [PATCH v6 05/30] drm/xe: Don't change LRC ring head on job resubmission
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (3 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 04/30] drm/xe: Track LR jobs in DRM scheduler pending list Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 06/30] drm/xe: Make LRC W/A scratch buffer usage consistent Matthew Brost
                   ` (29 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Now that we save the job's head during submission, it's no longer
necessary to adjust the LRC ring head during resubmission. Instead, a
software-based adjustment of the tail will overwrite the old jobs in
place. For some odd reason, adjusting the LRC ring head didn't work on
parallel queues, which was causing issues in our CI.
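
The tail adjustment can be pictured with the simplified model below. It
is a sketch under stated assumptions, not the driver logic: offsets are
plain integers, wraparound is ignored, and struct demo_lrc and both
helpers are hypothetical names. The software tail is rewound to the
first pending job's saved head, while the tail stored in the context is
set to the current head so the ring looks empty until jobs are written
out again.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of one LRC ring; wraparound is ignored. */
struct demo_lrc {
	uint32_t sw_tail;	/* where the driver writes the next job */
	uint32_t hw_head;	/* how far the hardware has consumed */
	uint32_t hw_tail;	/* tail value stored in the context image */
};

/* Rewind before resubmission: jobs will be rewritten at their old offsets. */
static void demo_prepare_resubmit(struct demo_lrc *lrc,
				  uint32_t first_pending_head)
{
	lrc->sw_tail = first_pending_head;
	lrc->hw_tail = lrc->hw_head;	/* ring appears empty for now */
}

/* Rewrite one job; the visible tail only ever increases from here on. */
static void demo_resubmit_one(struct demo_lrc *lrc, uint32_t job_size)
{
	lrc->sw_tail += job_size;
	lrc->hw_tail = lrc->sw_tail;
}
```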

v5:
 - Add comment in guc_exec_queue_start explaining why the function works
   (Auld)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 3a534d93505f..d123bdb63369 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2008,11 +2008,25 @@ static void guc_exec_queue_start(struct xe_exec_queue *q)
 	struct xe_gpu_scheduler *sched = &q->guc->sched;
 
 	if (!exec_queue_killed_or_banned_or_wedged(q)) {
+		struct xe_sched_job *job = xe_sched_first_pending_job(sched);
 		int i;
 
 		trace_xe_exec_queue_resubmit(q);
-		for (i = 0; i < q->width; ++i)
-			xe_lrc_set_ring_head(q->lrc[i], q->lrc[i]->ring.tail);
+		if (job) {
+			for (i = 0; i < q->width; ++i) {
+				/*
+				 * The GuC context is unregistered at this point
+				 * time, adjusting software ring tail ensures
+				 * jobs are rewritten in original placement,
+				 * adjusting LRC tail ensures the newly loaded
+				 * GuC / contexts only view the LRC tail
+				 * increasing as jobs are written out.
+				 */
+				q->lrc[i]->ring.tail = job->ptrs[i].head;
+				xe_lrc_set_ring_tail(q->lrc[i],
+						     xe_lrc_ring_head(q->lrc[i]));
+			}
+		}
 		xe_sched_resubmit_jobs(sched);
 	}
 
-- 
2.34.1



* [PATCH v6 06/30] drm/xe: Make LRC W/A scratch buffer usage consistent
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (4 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 05/30] drm/xe: Don't change LRC ring head on job resubmission Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 07/30] drm/xe/vf: Add xe_gt_recovery_pending helper Matthew Brost
                   ` (28 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

The LRC W/A currently checks for LRC being iomem in some places, while
in others it checks if the scratch buffer is non-NULL. This
inconsistency causes issues with the VF post-migration recovery code,
which blindly passes in a scratch buffer.

This patch standardizes the check by consistently verifying whether the
LRC is iomem to determine if the scratch buffer should be used.
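
To illustrate why the predicate matters, here is a small userspace
sketch with hypothetical names (struct demo_state, demo_finish). A
caller may pass a scratch buffer even when the mapping is not iomem, so
keying the copy-back on the buffer pointer would overwrite data that was
already written directly; keying it on is_iomem does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the BO setup state. */
struct demo_state {
	bool is_iomem;
	unsigned char *scratch;	/* may be non-NULL even when !is_iomem */
	unsigned char *mapped;	/* stand-in for the BO mapping */
	size_t size;
};

/* Returns true if the scratch contents were copied into the mapping. */
static bool demo_finish(struct demo_state *st)
{
	if (!st->is_iomem)
		return false;	/* data was written directly to the mapping */

	memcpy(st->mapped, st->scratch, st->size);
	return true;
}
```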

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 drivers/gpu/drm/xe/xe_lrc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 2c6eae2de1f2..b5083c99dd50 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1247,7 +1247,7 @@ static int setup_bo(struct bo_setup_state *state)
 
 static void finish_bo(struct bo_setup_state *state)
 {
-	if (!state->buffer)
+	if (!state->lrc->bo->vmap.is_iomem)
 		return;
 
 	xe_map_memcpy_to(gt_to_xe(state->lrc->gt), &state->lrc->bo->vmap,
-- 
2.34.1



* [PATCH v6 07/30] drm/xe/vf: Add xe_gt_recovery_pending helper
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (5 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 06/30] drm/xe: Make LRC W/A scratch buffer usage consistent Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 13:10   ` Michal Wajdeczko
  2025-10-06 11:10 ` [PATCH v6 08/30] drm/xe/vf: Make VF recovery run on per-GT worker Matthew Brost
                   ` (27 subsequent siblings)
  34 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Add xe_gt_recovery_pending helper.

This helper serves as the single point for determining whether a GT
recovery is currently in progress. Expected callers include the GuC CT
layer and the GuC submission layer. The state is atomically visible from
the moment the vCPUs are unhalted until VF recovery completes.
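
As a rough illustration of the check described above, here is a minimal
userspace model. It is only a sketch: struct model_gt and the model_*
functions are hypothetical stand-ins for the driver state, not the xe
API. It shows the two sources of "pending": the still-uncleared SW_INT_0
status byte (early detection) and the in-progress flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for per-GT state; not the driver's real layout. */
struct model_gt {
	bool is_vf;			/* models IS_SRIOV_VF() */
	bool uses_memirq;		/* models xe_device_uses_memirq() */
	uint8_t sw_int_0_status;	/* models the memirq status byte */
	bool recovery_inprogress;	/* set when recovery is queued */
};

/* VF-level check: a raised, uncleared interrupt or the flag. */
static bool model_vf_recovery_pending(const struct model_gt *gt)
{
	/* Early detection until recovery starts. */
	if (gt->uses_memirq && gt->sw_int_0_status)
		return true;

	return gt->recovery_inprogress;
}

/* GT-level wrapper: never pending on a non-VF device. */
static bool model_gt_recovery_pending(const struct model_gt *gt)
{
	return gt->is_vf && model_vf_recovery_pending(gt);
}
```

In this model the status byte may only be cleared after the in-progress
flag is set; otherwise there would be a window in which neither source
reports the recovery.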

v3:
 - Add GT layer xe_gt_recovery_inprogress (Michal)
 - Don't blow up in memirq not enabled (CI)
 - Add __memirq_received with clear argument (Michal)
 - xe_memirq_sw_int_0_irq_pending rename (Michal)
 - Use offset in xe_memirq_sw_int_0_irq_pending (Michal)
v4:
 - Refactor xe_gt_recovery_inprogress logic around memirq (Michal)
v5:
 - s/inprogress/pending (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.h                | 13 ++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 27 +++++++++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 10 +++++
 drivers/gpu/drm/xe/xe_memirq.c            | 48 +++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_memirq.h            |  2 +
 6 files changed, 98 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index 41880979f4de..5df2ffe3ff83 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -12,6 +12,7 @@
 
 #include "xe_device.h"
 #include "xe_device_types.h"
+#include "xe_gt_sriov_vf.h"
 #include "xe_hw_engine.h"
 
 #define for_each_hw_engine(hwe__, gt__, id__) \
@@ -124,4 +125,16 @@ static inline bool xe_gt_is_usm_hwe(struct xe_gt *gt, struct xe_hw_engine *hwe)
 		hwe->instance == gt->usm.reserved_bcs_instance;
 }
 
+/**
+ * xe_gt_recovery_pending() - GT recovery pending
+ * @gt: the &xe_gt
+ *
+ * Return: True if GT recovery is pending, False otherwise
+ */
+static inline bool xe_gt_recovery_pending(struct xe_gt *gt)
+{
+	return IS_SRIOV_VF(gt_to_xe(gt)) &&
+		xe_gt_sriov_vf_recovery_pending(gt);
+}
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 0461d5513487..86131ee481dc 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -26,6 +26,7 @@
 #include "xe_guc_hxg_helpers.h"
 #include "xe_guc_relay.h"
 #include "xe_lrc.h"
+#include "xe_memirq.h"
 #include "xe_mmio.h"
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
@@ -776,6 +777,7 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
 	struct xe_device *xe = gt_to_xe(gt);
 
 	xe_gt_assert(gt, IS_SRIOV_VF(xe));
+	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_pending(gt));
 
 	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
 	/*
@@ -1118,3 +1120,28 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 	drm_printf(p, "\thandshake:\t%u.%u\n",
 		   pf_version->major, pf_version->minor);
 }
+
+/**
+ * xe_gt_sriov_vf_recovery_pending() - VF post migration recovery pending
+ * @gt: the &xe_gt
+ *
+ * This function's return value must be immediately visible upon vCPU unhalt and
+ * be persistent until RESFIX_DONE is issued. This guarantee is only coded for
+ * platforms which support memirq; if non-memirq platforms support VF migration
+ * this function will need to be updated.
+ *
+ * Return: True if VF post migration recovery is pending, False otherwise
+ */
+bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt)
+{
+	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
+
+	/* early detection until recovery starts */
+	if (xe_device_uses_memirq(gt_to_xe(gt)) &&
+	    xe_memirq_guc_sw_int_0_irq_pending(memirq, &gt->uc.guc))
+		return true;
+
+	return READ_ONCE(gt->sriov.vf.migration.recovery_inprogress);
+}
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 0af1dc769fe0..b91ae857e983 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -25,6 +25,8 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
 int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
 
+bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
+
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
 u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
 u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 298dedf4b009..1dfef60ec044 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -46,6 +46,14 @@ struct xe_gt_sriov_vf_runtime {
 	} *regs;
 };
 
+/**
+ * xe_gt_sriov_vf_migration - VF migration data.
+ */
+struct xe_gt_sriov_vf_migration {
+	/** @recovery_inprogress: VF post migration recovery in progress */
+	bool recovery_inprogress;
+};
+
 /**
  * struct xe_gt_sriov_vf - GT level VF virtualization data.
  */
@@ -58,6 +66,8 @@ struct xe_gt_sriov_vf {
 	struct xe_gt_sriov_vf_selfconfig self_config;
 	/** @runtime: runtime data retrieved from the PF. */
 	struct xe_gt_sriov_vf_runtime runtime;
+	/** @migration: migration data for the VF. */
+	struct xe_gt_sriov_vf_migration migration;
 };
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
index 49c45ec3e83c..56acfdd77266 100644
--- a/drivers/gpu/drm/xe/xe_memirq.c
+++ b/drivers/gpu/drm/xe/xe_memirq.c
@@ -398,8 +398,9 @@ void xe_memirq_postinstall(struct xe_memirq *memirq)
 		memirq_set_enable(memirq, true);
 }
 
-static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
-			    u16 offset, const char *name)
+static bool __memirq_received(struct xe_memirq *memirq,
+			      struct iosys_map *vector, u16 offset,
+			      const char *name, bool clear)
 {
 	u8 value;
 
@@ -409,12 +410,26 @@ static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
 			memirq_err_ratelimited(memirq,
 					       "Unexpected memirq value %#x from %s at %u\n",
 					       value, name, offset);
-		iosys_map_wr(vector, offset, u8, 0x00);
+		if (clear)
+			iosys_map_wr(vector, offset, u8, 0x00);
 	}
 
 	return value;
 }
 
+static bool memirq_received_noclear(struct xe_memirq *memirq,
+				    struct iosys_map *vector,
+				    u16 offset, const char *name)
+{
+	return __memirq_received(memirq, vector, offset, name, false);
+}
+
+static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
+			    u16 offset, const char *name)
+{
+	return __memirq_received(memirq, vector, offset, name, true);
+}
+
 static void memirq_dispatch_engine(struct xe_memirq *memirq, struct iosys_map *status,
 				   struct xe_hw_engine *hwe)
 {
@@ -434,8 +449,16 @@ static void memirq_dispatch_guc(struct xe_memirq *memirq, struct iosys_map *stat
 	if (memirq_received(memirq, status, ilog2(GUC_INTR_GUC2HOST), name))
 		xe_guc_irq_handler(guc, GUC_INTR_GUC2HOST);
 
-	if (memirq_received(memirq, status, ilog2(GUC_INTR_SW_INT_0), name))
+	/*
+	 * We must wait to perform the clear operation until after
+	 * xe_gt_sriov_vf_start_migration_recovery() runs, to avoid race
+	 * conditions where xe_gt_sriov_vf_recovery_pending() returns false.
+	 */
+	if (memirq_received_noclear(memirq, status, ilog2(GUC_INTR_SW_INT_0),
+				    name)) {
 		xe_guc_irq_handler(guc, GUC_INTR_SW_INT_0);
+		iosys_map_wr(status, ilog2(GUC_INTR_SW_INT_0), u8, 0x00);
+	}
 }
 
 /**
@@ -460,6 +483,23 @@ void xe_memirq_hwe_handler(struct xe_memirq *memirq, struct xe_hw_engine *hwe)
 	}
 }
 
+/**
+ * xe_memirq_guc_sw_int_0_irq_pending() - SW_INT_0 IRQ is pending
+ * @memirq: the &xe_memirq
+ * @guc: the &xe_guc to check for IRQ
+ *
+ * Return: True if SW_INT_0 IRQ is pending on @guc, False otherwise
+ */
+bool xe_memirq_guc_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc)
+{
+	struct xe_gt *gt = guc_to_gt(guc);
+	u32 offset = xe_gt_is_media_type(gt) ? ilog2(INTR_MGUC) : ilog2(INTR_GUC);
+	struct iosys_map map = IOSYS_MAP_INIT_OFFSET(&memirq->status, offset * SZ_16);
+
+	return memirq_received_noclear(memirq, &map, ilog2(GUC_INTR_SW_INT_0),
+				       guc_name(guc));
+}
+
 /**
  * xe_memirq_handler - The `Memory Based Interrupts`_ Handler.
  * @memirq: the &xe_memirq
diff --git a/drivers/gpu/drm/xe/xe_memirq.h b/drivers/gpu/drm/xe/xe_memirq.h
index 06130650e9d6..e25d2234ab87 100644
--- a/drivers/gpu/drm/xe/xe_memirq.h
+++ b/drivers/gpu/drm/xe/xe_memirq.h
@@ -25,4 +25,6 @@ void xe_memirq_handler(struct xe_memirq *memirq);
 
 int xe_memirq_init_guc(struct xe_memirq *memirq, struct xe_guc *guc);
 
+bool xe_memirq_guc_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc);
+
 #endif
-- 
2.34.1



* [PATCH v6 08/30] drm/xe/vf: Make VF recovery run on per-GT worker
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (6 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 07/30] drm/xe/vf: Add xe_gt_recovery_pending helper Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 09/30] drm/xe/vf: Abort H2G sends during VF post-migration recovery Matthew Brost
                   ` (26 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

VF recovery is a per-GT operation, so it makes sense to isolate it to a
per-GT queue. Scheduling this operation on the same worker as the GT
reset and TDR not only aligns with this design but also helps avoid race
conditions, as those operations can also modify the queue state.
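
The interaction between the migration event and the per-GT worker can be
sketched with a small single-threaded model. This is an illustration
only: struct model_migration and the model_* functions are hypothetical
names, and the spinlock that protects recovery_queued in the patch is
deliberately left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the per-GT migration state. */
struct model_migration {
	bool recovery_queued;		/* worker queued, not yet started */
	bool recovery_inprogress;	/* visible to recovery-pending checks */
	int work_items;			/* how many times work was queued */
};

/* Migration event: queue the worker at most once per pending request. */
static void model_start_recovery(struct model_migration *m)
{
	if (!m->recovery_queued) {
		m->recovery_queued = true;
		m->recovery_inprogress = true;
		m->work_items++;
	}
}

/* Worker entry: from here on a new event must queue a fresh run. */
static void model_worker_begin(struct model_migration *m)
{
	m->recovery_queued = false;
}

/*
 * Worker exit: returns 0 when RESFIX_DONE may be sent, or -EAGAIN when
 * another migration arrived meanwhile and the notification is skipped.
 */
static int model_worker_finish(struct model_migration *m)
{
	if (m->recovery_queued)
		return -EAGAIN;

	m->recovery_inprogress = false;
	return 0;
}
```

In this model a second migration that lands while the worker is running
keeps recovery_inprogress set and defers the notification to the next
run of the worker.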

v2:
 - Fix lockdep splat (Adam)
 - Use xe_sriov_vf_migration_supported helper
v3:
 - Drop xe_gt_sriov_ prefix for private functions (Michal)
 - Drop message in xe_gt_sriov_vf_migration_init_early (Michal)
 - Logic rework in vf_post_migration_notify_resfix_done (Michal)
 - Rework init sequence layering (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.c                |   6 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 178 +++++++++++++++-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |   3 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |   7 +
 drivers/gpu/drm/xe/xe_sriov_vf.c          | 240 ----------------------
 drivers/gpu/drm/xe/xe_sriov_vf.h          |   1 -
 drivers/gpu/drm/xe/xe_sriov_vf_types.h    |   4 -
 7 files changed, 181 insertions(+), 258 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index b77572a19548..b11f57273b8b 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -388,6 +388,12 @@ int xe_gt_init_early(struct xe_gt *gt)
 			return err;
 	}
 
+	if (IS_SRIOV_VF(gt_to_xe(gt))) {
+		err = xe_gt_sriov_vf_init_early(gt);
+		if (err)
+			return err;
+	}
+
 	xe_reg_sr_init(&gt->reg_sr, "GT", gt_to_xe(gt));
 
 	err = xe_wa_gt_init(gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 86131ee481dc..b3cee182087c 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -25,11 +25,15 @@
 #include "xe_guc.h"
 #include "xe_guc_hxg_helpers.h"
 #include "xe_guc_relay.h"
+#include "xe_guc_submit.h"
+#include "xe_irq.h"
 #include "xe_lrc.h"
 #include "xe_memirq.h"
 #include "xe_mmio.h"
+#include "xe_pm.h"
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
+#include "xe_tile_sriov_vf.h"
 #include "xe_uc_fw.h"
 #include "xe_wopcm.h"
 
@@ -308,13 +312,13 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
 }
 
 /**
- * xe_gt_sriov_vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
+ * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
  * @gt: the &xe_gt struct instance linked to target GuC
  *
  * Returns: 0 if the operation completed successfully, or a negative error
  * code otherwise.
  */
-int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
+static int vf_notify_resfix_done(struct xe_gt *gt)
 {
 	struct xe_guc *guc = &gt->uc.guc;
 	int err;
@@ -756,7 +760,7 @@ int xe_gt_sriov_vf_connect(struct xe_gt *gt)
  * xe_gt_sriov_vf_default_lrcs_hwsp_rebase - Update GGTT references in HWSP of default LRCs.
  * @gt: the &xe_gt struct instance
  */
-void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
+static void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
 {
 	struct xe_hw_engine *hwe;
 	enum xe_hw_engine_id id;
@@ -765,6 +769,26 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
 		xe_default_lrc_update_memirq_regs_with_address(hwe);
 }
 
+static void vf_start_migration_recovery(struct xe_gt *gt)
+{
+	bool started;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
+
+	spin_lock(&gt->sriov.vf.migration.lock);
+
+	if (!gt->sriov.vf.migration.recovery_queued) {
+		gt->sriov.vf.migration.recovery_queued = true;
+		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
+
+		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
+		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
+				 "scheduled" : "already in progress");
+	}
+
+	spin_unlock(&gt->sriov.vf.migration.lock);
+}
+
 /**
  * xe_gt_sriov_vf_migrated_event_handler - Start a VF migration recovery,
  *   or just mark that a GuC is ready for it.
@@ -779,15 +803,8 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
 	xe_gt_assert(gt, IS_SRIOV_VF(xe));
 	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_pending(gt));
 
-	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
-	/*
-	 * We need to be certain that if all flags were set, at least one
-	 * thread will notice that and schedule the recovery.
-	 */
-	smp_mb__after_atomic();
-
 	xe_gt_sriov_info(gt, "ready for recovery after migration\n");
-	xe_sriov_vf_start_migration_recovery(xe);
+	vf_start_migration_recovery(gt);
 }
 
 static bool vf_is_negotiated(struct xe_gt *gt, u16 major, u16 minor)
@@ -1121,6 +1138,145 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 		   pf_version->major, pf_version->minor);
 }
 
+static void vf_post_migration_shutdown(struct xe_gt *gt)
+{
+	int ret = 0;
+
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	gt->sriov.vf.migration.recovery_queued = false;
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	xe_guc_submit_pause(&gt->uc.guc);
+	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
+
+	if (ret)
+		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
+}
+
+static size_t post_migration_scratch_size(struct xe_device *xe)
+{
+	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
+}
+
+static int vf_post_migration_fixups(struct xe_gt *gt)
+{
+	s64 shift;
+	void *buf;
+	int err;
+
+	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_ATOMIC);
+	if (!buf)
+		return -ENOMEM;
+
+	err = xe_gt_sriov_vf_query_config(gt);
+	if (err)
+		goto out;
+
+	shift = xe_gt_sriov_vf_ggtt_shift(gt);
+	if (shift) {
+		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
+		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
+		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
+		if (err)
+			goto out;
+	}
+
+out:
+	kfree(buf);
+	return err;
+}
+
+static void vf_post_migration_kickstart(struct xe_gt *gt)
+{
+	/*
+	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
+	 * must be working at this point, since the recovery has started,
+	 * but the rest was not enabled using the procedure from spec.
+	 */
+	xe_irq_resume(gt_to_xe(gt));
+
+	xe_guc_submit_reset_unblock(&gt->uc.guc);
+	xe_guc_submit_unpause(&gt->uc.guc);
+}
+
+static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
+{
+	bool skip_resfix = false;
+
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	if (gt->sriov.vf.migration.recovery_queued) {
+		skip_resfix = true;
+		xe_gt_sriov_dbg(gt, "another recovery imminent, resfix skipped\n");
+	} else {
+		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false);
+	}
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	if (skip_resfix)
+		return -EAGAIN;
+
+	return vf_notify_resfix_done(gt);
+}
+
+static void vf_post_migration_recovery(struct xe_gt *gt)
+{
+	struct xe_device *xe = gt_to_xe(gt);
+	int err;
+
+	xe_gt_sriov_dbg(gt, "migration recovery in progress\n");
+
+	xe_pm_runtime_get(xe);
+	vf_post_migration_shutdown(gt);
+
+	if (!xe_sriov_vf_migration_supported(xe)) {
+		xe_gt_sriov_err(gt, "migration is not supported\n");
+		err = -ENOTRECOVERABLE;
+		goto fail;
+	}
+
+	err = vf_post_migration_fixups(gt);
+	if (err)
+		goto fail;
+
+	vf_post_migration_kickstart(gt);
+	err = vf_post_migration_notify_resfix_done(gt);
+	if (err && err != -EAGAIN)
+		goto fail;
+
+	xe_pm_runtime_put(xe);
+	xe_gt_sriov_notice(gt, "migration recovery ended\n");
+	return;
+fail:
+	xe_pm_runtime_put(xe);
+	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
+	xe_device_declare_wedged(xe);
+}
+
+static void migration_worker_func(struct work_struct *w)
+{
+	struct xe_gt *gt = container_of(w, struct xe_gt,
+					sriov.vf.migration.worker);
+
+	vf_post_migration_recovery(gt);
+}
+
+/**
+ * xe_gt_sriov_vf_init_early() - GT VF init early
+ * @gt: the &xe_gt
+ *
+ * Return: 0 on success, errno on failure
+ */
+int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
+{
+	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
+		return 0;
+
+	spin_lock_init(&gt->sriov.vf.migration.lock);
+	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
+
+	return 0;
+}
+
 /**
  * xe_gt_sriov_vf_recovery_pending() - VF post migration recovery pending
  * @gt: the &xe_gt
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index b91ae857e983..0adebf8aa419 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -21,10 +21,9 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
 int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
 int xe_gt_sriov_vf_connect(struct xe_gt *gt);
 int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
-void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
-int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
 
+int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
 bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 1dfef60ec044..b2c8e8c89c30 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -7,6 +7,7 @@
 #define _XE_GT_SRIOV_VF_TYPES_H_
 
 #include <linux/types.h>
+#include <linux/workqueue.h>
 #include "xe_uc_fw_types.h"
 
 /**
@@ -50,6 +51,12 @@ struct xe_gt_sriov_vf_runtime {
  * xe_gt_sriov_vf_migration - VF migration data.
  */
 struct xe_gt_sriov_vf_migration {
+	/** @worker: VF migration recovery worker */
+	struct work_struct worker;
+	/** @lock: Protects recovery_queued */
+	spinlock_t lock;
+	/** @recovery_queued: VF post migration recovery in queued */
+	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
 	bool recovery_inprogress;
 };
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index c1830ec8f0fd..911d5720917b 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -6,21 +6,12 @@
 #include <drm/drm_debugfs.h>
 #include <drm/drm_managed.h>
 
-#include "xe_assert.h"
-#include "xe_device.h"
 #include "xe_gt.h"
-#include "xe_gt_sriov_printk.h"
 #include "xe_gt_sriov_vf.h"
 #include "xe_guc.h"
-#include "xe_guc_submit.h"
-#include "xe_irq.h"
-#include "xe_lrc.h"
-#include "xe_pm.h"
-#include "xe_sriov.h"
 #include "xe_sriov_printk.h"
 #include "xe_sriov_vf.h"
 #include "xe_sriov_vf_ccs.h"
-#include "xe_tile_sriov_vf.h"
 
 /**
  * DOC: VF restore procedure in PF KMD and VF KMD
@@ -158,8 +149,6 @@ static void vf_disable_migration(struct xe_device *xe, const char *fmt, ...)
 	xe->sriov.vf.migration.enabled = false;
 }
 
-static void migration_worker_func(struct work_struct *w);
-
 static void vf_migration_init_early(struct xe_device *xe)
 {
 	/*
@@ -184,8 +173,6 @@ static void vf_migration_init_early(struct xe_device *xe)
 						    guc_version.major, guc_version.minor);
 	}
 
-	INIT_WORK(&xe->sriov.vf.migration.worker, migration_worker_func);
-
 	xe->sriov.vf.migration.enabled = true;
 	xe_sriov_dbg(xe, "migration support enabled\n");
 }
@@ -199,233 +186,6 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
 	vf_migration_init_early(xe);
 }
 
-/**
- * vf_post_migration_shutdown - Stop the driver activities after VF migration.
- * @xe: the &xe_device struct instance
- *
- * After this VM is migrated and assigned to a new VF, it is running on a new
- * hardware, and therefore many hardware-dependent states and related structures
- * require fixups. Without fixups, the hardware cannot do any work, and therefore
- * all GPU pipelines are stalled.
- * Stop some of kernel activities to make the fixup process faster.
- */
-static void vf_post_migration_shutdown(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-	int ret = 0;
-
-	for_each_gt(gt, xe, id) {
-		xe_guc_submit_pause(&gt->uc.guc);
-		ret |= xe_guc_submit_reset_block(&gt->uc.guc);
-	}
-
-	if (ret)
-		drm_info(&xe->drm, "migration recovery encountered ongoing reset\n");
-}
-
-/**
- * vf_post_migration_kickstart - Re-start the driver activities under new hardware.
- * @xe: the &xe_device struct instance
- *
- * After we have finished with all post-migration fixups, restart the driver
- * activities to continue feeding the GPU with workloads.
- */
-static void vf_post_migration_kickstart(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-
-	/*
-	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
-	 * must be working at this point, since the recovery did started,
-	 * but the rest was not enabled using the procedure from spec.
-	 */
-	xe_irq_resume(xe);
-
-	for_each_gt(gt, xe, id) {
-		xe_guc_submit_reset_unblock(&gt->uc.guc);
-		xe_guc_submit_unpause(&gt->uc.guc);
-	}
-}
-
-static bool gt_vf_post_migration_needed(struct xe_gt *gt)
-{
-	return test_bit(gt->info.id, &gt_to_xe(gt)->sriov.vf.migration.gt_flags);
-}
-
-/*
- * Notify GuCs marked in flags about resource fixups apply finished.
- * @xe: the &xe_device struct instance
- * @gt_flags: flags marking to which GTs the notification shall be sent
- */
-static int vf_post_migration_notify_resfix_done(struct xe_device *xe, unsigned long gt_flags)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-	int err = 0;
-
-	for_each_gt(gt, xe, id) {
-		if (!test_bit(id, &gt_flags))
-			continue;
-		/* skip asking GuC for RESFIX exit if new recovery request arrived */
-		if (gt_vf_post_migration_needed(gt))
-			continue;
-		err = xe_gt_sriov_vf_notify_resfix_done(gt);
-		if (err)
-			break;
-		clear_bit(id, &gt_flags);
-	}
-
-	if (gt_flags && !err)
-		drm_dbg(&xe->drm, "another recovery imminent, skipped some notifications\n");
-	return err;
-}
-
-static int vf_get_next_migrated_gt_id(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-
-	for_each_gt(gt, xe, id) {
-		if (test_and_clear_bit(id, &xe->sriov.vf.migration.gt_flags))
-			return id;
-	}
-	return -1;
-}
-
-static size_t post_migration_scratch_size(struct xe_device *xe)
-{
-	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
-}
-
-/**
- * Perform post-migration fixups on a single GT.
- *
- * After migration, GuC needs to be re-queried for VF configuration to check
- * if it matches previous provisioning. Most of VF provisioning shall be the
- * same, except GGTT range, since GGTT is not virtualized per-VF. If GGTT
- * range has changed, we have to perform fixups - shift all GGTT references
- * used anywhere within the driver. After the fixups in this function succeed,
- * it is allowed to ask the GuC bound to this GT to continue normal operation.
- *
- * Returns: 0 if the operation completed successfully, or a negative error
- * code otherwise.
- */
-static int gt_vf_post_migration_fixups(struct xe_gt *gt)
-{
-	s64 shift;
-	void *buf;
-	int err;
-
-	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_KERNEL);
-	if (!buf)
-		return -ENOMEM;
-
-	err = xe_gt_sriov_vf_query_config(gt);
-	if (err)
-		goto out;
-
-	shift = xe_gt_sriov_vf_ggtt_shift(gt);
-	if (shift) {
-		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
-		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
-		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
-		if (err)
-			goto out;
-	}
-
-out:
-	kfree(buf);
-	return err;
-}
-
-static void vf_post_migration_recovery(struct xe_device *xe)
-{
-	unsigned long fixed_gts = 0;
-	int id, err;
-
-	drm_dbg(&xe->drm, "migration recovery in progress\n");
-	xe_pm_runtime_get(xe);
-	vf_post_migration_shutdown(xe);
-
-	if (!xe_sriov_vf_migration_supported(xe)) {
-		xe_sriov_err(xe, "migration is not supported\n");
-		err = -ENOTRECOVERABLE;
-		goto fail;
-	}
-
-	while (id = vf_get_next_migrated_gt_id(xe), id >= 0) {
-		struct xe_gt *gt = xe_device_get_gt(xe, id);
-
-		err = gt_vf_post_migration_fixups(gt);
-		if (err)
-			goto fail;
-
-		set_bit(id, &fixed_gts);
-	}
-
-	vf_post_migration_kickstart(xe);
-	err = vf_post_migration_notify_resfix_done(xe, fixed_gts);
-	if (err)
-		goto fail;
-
-	xe_pm_runtime_put(xe);
-	drm_notice(&xe->drm, "migration recovery ended\n");
-	return;
-fail:
-	xe_pm_runtime_put(xe);
-	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
-	xe_device_declare_wedged(xe);
-}
-
-static void migration_worker_func(struct work_struct *w)
-{
-	struct xe_device *xe = container_of(w, struct xe_device,
-					    sriov.vf.migration.worker);
-
-	vf_post_migration_recovery(xe);
-}
-
-/*
- * Check if post-restore recovery is coming on any of GTs.
- * @xe: the &xe_device struct instance
- *
- * Return: True if migration recovery worker will soon be running. Any worker currently
- * executing does not affect the result.
- */
-static bool vf_ready_to_recovery_on_any_gts(struct xe_device *xe)
-{
-	struct xe_gt *gt;
-	unsigned int id;
-
-	for_each_gt(gt, xe, id) {
-		if (test_bit(id, &xe->sriov.vf.migration.gt_flags))
-			return true;
-	}
-	return false;
-}
-
-/**
- * xe_sriov_vf_start_migration_recovery - Start VF migration recovery.
- * @xe: the &xe_device to start recovery on
- *
- * This function shall be called only by VF.
- */
-void xe_sriov_vf_start_migration_recovery(struct xe_device *xe)
-{
-	bool started;
-
-	xe_assert(xe, IS_SRIOV_VF(xe));
-
-	if (!vf_ready_to_recovery_on_any_gts(xe))
-		return;
-
-	started = queue_work(xe->sriov.wq, &xe->sriov.vf.migration.worker);
-	drm_info(&xe->drm, "VF migration recovery %s\n", started ?
-		 "scheduled" : "already in progress");
-}
-
 /**
  * xe_sriov_vf_init_late() - SR-IOV VF late initialization functions.
  * @xe: the &xe_device to initialize
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.h b/drivers/gpu/drm/xe/xe_sriov_vf.h
index 9e752105ec2a..4df95266b261 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.h
@@ -13,7 +13,6 @@ struct xe_device;
 
 void xe_sriov_vf_init_early(struct xe_device *xe);
 int xe_sriov_vf_init_late(struct xe_device *xe);
-void xe_sriov_vf_start_migration_recovery(struct xe_device *xe);
 bool xe_sriov_vf_migration_supported(struct xe_device *xe);
 void xe_sriov_vf_debugfs_register(struct xe_device *xe, struct dentry *root);
 
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
index 426cc5841958..6a0fd0f5463e 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
@@ -33,10 +33,6 @@ struct xe_device_vf {
 
 	/** @migration: VF Migration state data */
 	struct {
-		/** @migration.worker: VF migration recovery worker */
-		struct work_struct worker;
-		/** @migration.gt_flags: Per-GT request flags for VF migration recovery */
-		unsigned long gt_flags;
 		/**
 		 * @migration.enabled: flag indicating if migration support
 		 * was enabled or not due to missing prerequisites
-- 
2.34.1



* [PATCH v6 09/30] drm/xe/vf: Abort H2G sends during VF post-migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (7 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 08/30] drm/xe/vf: Make VF recovery run on per-GT worker Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 10/30] drm/xe/vf: Remove memory allocations from VF post migration recovery Matthew Brost
                   ` (25 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

While VF post-migration recovery is in progress, abort H2G sends with
-ECANCELED. These messages are treated as lost, and TLB invalidation
errors are suppressed. During this phase, the H2G channel is down, and
VF recovery requires the CT lock to proceed.
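
The gate described above amounts to one extra condition in the send
path. A minimal sketch follows, assuming a hypothetical model_ct_send()
and enum model_ct_state rather than the real CT structures:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reduction of the CT channel state. */
enum model_ct_state {
	MODEL_CT_ENABLED,
	MODEL_CT_STOPPED,
};

/*
 * Models the early-out in the locked send path: a stopped channel or a
 * pending VF recovery cancels the send before anything is written.
 */
static int model_ct_send(enum model_ct_state state, bool recovery_pending)
{
	if (state == MODEL_CT_STOPPED || recovery_pending)
		return -ECANCELED;

	/* The real path would write the H2G message here. */
	return 0;
}
```

A caller in this model would treat -ECANCELED as a lost message to be
replayed after recovery, not as a hard failure.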

v3:
 - Use xe_gt_recovery_inprogress (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 47079ab9922c..9f0090ae64a6 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -851,7 +851,7 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
 				u32 len, u32 g2h_len, u32 num_g2h,
 				struct g2h_fence *g2h_fence)
 {
-	struct xe_gt *gt __maybe_unused = ct_to_gt(ct);
+	struct xe_gt *gt = ct_to_gt(ct);
 	u16 seqno;
 	int ret;
 
@@ -872,7 +872,7 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
 		goto out;
 	}
 
-	if (ct->state == XE_GUC_CT_STATE_STOPPED) {
+	if (ct->state == XE_GUC_CT_STATE_STOPPED || xe_gt_recovery_pending(gt)) {
 		ret = -ECANCELED;
 		goto out;
 	}
-- 
2.34.1



* [PATCH v6 10/30] drm/xe/vf: Remove memory allocations from VF post migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (8 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 09/30] drm/xe/vf: Abort H2G sends during VF post-migration recovery Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 11/30] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

VF post-migration recovery is in the path of dma-fence signaling /
reclaim, so avoid memory allocations in this path.
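
The pattern is to allocate the scratch buffer once at init time and only
reuse it during recovery. A userspace sketch, with hypothetical model_*
names and plain malloc() standing in for the managed allocation used by
the patch:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the preallocated scratch memory. */
struct model_vf_migration {
	void *scratch;
	size_t scratch_size;
};

/* Init time: the only place that may allocate and fail with -ENOMEM. */
static int model_init_early(struct model_vf_migration *m, size_t size)
{
	m->scratch = malloc(size);
	if (!m->scratch)
		return -ENOMEM;

	m->scratch_size = size;
	return 0;
}

/* Recovery time: reuses the buffer, so no allocation failure is possible. */
static int model_fixups(struct model_vf_migration *m)
{
	memset(m->scratch, 0, m->scratch_size);
	return 0;
}

/* Teardown for the model; the patch relies on managed cleanup instead. */
static void model_fini(struct model_vf_migration *m)
{
	free(m->scratch);
	m->scratch = NULL;
	m->scratch_size = 0;
}
```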

v3:
 - s/lrc_wa_bb/scratch (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 23 +++++++++++++----------
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  2 ++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index b3cee182087c..55a1ebbbf47f 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1160,17 +1160,13 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
 
 static int vf_post_migration_fixups(struct xe_gt *gt)
 {
+	void *buf = gt->sriov.vf.migration.scratch;
 	s64 shift;
-	void *buf;
 	int err;
 
-	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_ATOMIC);
-	if (!buf)
-		return -ENOMEM;
-
 	err = xe_gt_sriov_vf_query_config(gt);
 	if (err)
-		goto out;
+		return err;
 
 	shift = xe_gt_sriov_vf_ggtt_shift(gt);
 	if (shift) {
@@ -1178,12 +1174,10 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
 		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
 		if (err)
-			goto out;
+			return err;
 	}
 
-out:
-	kfree(buf);
-	return err;
+	return 0;
 }
 
 static void vf_post_migration_kickstart(struct xe_gt *gt)
@@ -1268,9 +1262,18 @@ static void migration_worker_func(struct work_struct *w)
  */
 int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 {
+	void *buf;
+
 	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
 		return 0;
 
+	buf = drmm_kmalloc(&gt_to_xe(gt)->drm,
+			   post_migration_scratch_size(gt_to_xe(gt)),
+			   GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	gt->sriov.vf.migration.scratch = buf;
 	spin_lock_init(&gt->sriov.vf.migration.lock);
 	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
 
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index b2c8e8c89c30..e753646debc4 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -55,6 +55,8 @@ struct xe_gt_sriov_vf_migration {
 	struct work_struct worker;
 	/** @lock: Protects recovery_queued */
 	spinlock_t lock;
+	/** @scratch: Scratch memory for VF recovery */
+	void *scratch;
 	/** @recovery_queued: VF post migration recovery in queued */
 	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
-- 
2.34.1

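The idea in patch 10 above (allocate once at init, never in the recovery path) can be sketched in userspace C. This is only an illustration under stated assumptions: `struct migration_state`, `migration_init`, `migration_fixups` and `migration_fini` are hypothetical names, and `malloc` stands in for the managed kernel allocation used by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-GT migration state. */
struct migration_state {
	void *scratch;		/* allocated once, at init time */
	size_t scratch_size;
};

/* Init path: an allocation failure can be reported and handled here. */
int migration_init(struct migration_state *m, size_t size)
{
	m->scratch = malloc(size);
	if (!m->scratch)
		return -ENOMEM;
	m->scratch_size = size;
	return 0;
}

/* Recovery path: only touches the preallocated buffer, never allocates. */
int migration_fixups(struct migration_state *m, const void *src, size_t len)
{
	assert(m->scratch);
	if (len > m->scratch_size)
		return -EINVAL;
	memcpy(m->scratch, src, len);
	return 0;
}

/* Teardown: release the buffer (the patch relies on managed cleanup instead). */
void migration_fini(struct migration_state *m)
{
	free(m->scratch);
	m->scratch = NULL;
	m->scratch_size = 0;
}
```

With this split, the only place that can fail with -ENOMEM is the init path, which is the property the commit message asks for.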


* [PATCH v6 11/30] drm/xe/vf: Close multi-GT GGTT shift race
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (9 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 10/30] drm/xe/vf: Remove memory allocations from VF post migration recovery Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 14:27   ` Michal Wajdeczko
  2025-10-06 11:10 ` [PATCH v6 12/30] drm/xe/vf: Teardown VF post migration worker on driver unload Matthew Brost
                   ` (23 subsequent siblings)
  34 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

As multi-GT VF post-migration recovery can run in parallel on different
workqueues, but both GTs point to the same GGTT, only one GT needs to
shift the GGTT. However, both GTs need to know when this step has
completed. To coordinate this, perform the GGTT shift under the GGTT
lock. With shift being done under the lock, storing the shift value
becomes unnecessary.

v3:
 - Update commit message (Tomasz)
v4:
 - Move GGTT values to tile state (Michal)
 - Use GGTT lock (Michal)
v5:
 - Only take GGTT lock during recovery (CI)
 - Drop goto in vf_get_submission_cfg (Michal)
 - Add kernel doc around recovery in xe_gt_sriov_vf_query_config (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h        |   3 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c         | 153 +++++++-------------
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h         |   5 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h   |   7 +-
 drivers/gpu/drm/xe/xe_guc.c                 |   2 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.c       |  30 +++-
 drivers/gpu/drm/xe/xe_tile_sriov_vf.h       |   2 +-
 drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h |  23 +++
 drivers/gpu/drm/xe/xe_vram.c                |   6 +-
 9 files changed, 112 insertions(+), 119 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 1d2718b70a5c..c66523bf4bf0 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -27,6 +27,7 @@
 #include "xe_sriov_vf_ccs_types.h"
 #include "xe_step_types.h"
 #include "xe_survivability_mode_types.h"
+#include "xe_tile_sriov_vf_types.h"
 #include "xe_validation.h"
 
 #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
@@ -193,6 +194,8 @@ struct xe_tile {
 		struct {
 			/** @sriov.vf.ggtt_balloon: GGTT regions excluded from use. */
 			struct xe_ggtt_node *ggtt_balloon[2];
+			/** @sriov.vf.self_config: VF configuration data */
+			struct xe_tile_sriov_vf_selfconfig self_config;
 		} vf;
 	} sriov;
 
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 55a1ebbbf47f..d227c8a3ec81 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -436,42 +436,65 @@ u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt)
 	return value;
 }
 
-static int vf_get_ggtt_info(struct xe_gt *gt)
+static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
 {
-	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	struct xe_tile_sriov_vf_selfconfig *config =
+		&gt_to_tile(gt)->sriov.vf.self_config;
+	struct xe_ggtt *ggtt = gt_to_tile(gt)->mem.ggtt;
 	struct xe_guc *guc = &gt->uc.guc;
 	u64 start, size;
+	s64 shift;
 	int err;
 
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
+	/*
+	 * We only take the GGTT lock when potentially shifting GGTTs to
+	 * make this step visible to all GTs which share a GGTT. Also the GGTT
+	 * lock is not initialized during xe_gt_init_early when this function
+	 * can also be called.
+	 */
+	if (recovery)
+		mutex_lock(&ggtt->lock);
+
 	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	if (config->ggtt_size && config->ggtt_size != size) {
 		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
 				size / SZ_1K, config->ggtt_size / SZ_1K);
-		return -EREMCHG;
+		err = -EREMCHG;
+		goto out;
 	}
 
 	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
 				start, start + size - 1, size / SZ_1K);
 
-	config->ggtt_shift = start - (s64)config->ggtt_base;
+	shift = start - (s64)config->ggtt_base;
 	config->ggtt_base = start;
 	config->ggtt_size = size;
+	err = config->ggtt_size ? 0 : -ENODATA;
 
-	return config->ggtt_size ? 0 : -ENODATA;
+	if (!err && shift && recovery) {
+		xe_gt_sriov_info(gt, "Shifting GGTT base by %lld to 0x%016llx\n",
+				 shift, config->ggtt_base);
+		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
+	}
+out:
+	if (recovery)
+		mutex_unlock(&ggtt->lock);
+	return err;
 }
 
 static int vf_get_lmem_info(struct xe_gt *gt)
 {
-	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	struct xe_tile_sriov_vf_selfconfig *config =
+		&gt_to_tile(gt)->sriov.vf.self_config;
 	struct xe_guc *guc = &gt->uc.guc;
 	char size_str[10];
 	u64 size;
@@ -544,17 +567,20 @@ static void vf_cache_gmdid(struct xe_gt *gt)
 /**
  * xe_gt_sriov_vf_query_config - Query SR-IOV config data over MMIO.
  * @gt: the &xe_gt
+ * @recovery: VF post migration recovery path
  *
- * This function is for VF use only.
+ * This function is for VF use only. If recovery is set, the GGTT shift will be
+ * performed under GGTT lock making this step visible to all GTs which share a
+ * GGTT.
  *
  * Return: 0 on success or a negative error code on failure.
  */
-int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
+int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery)
 {
 	struct xe_device *xe = gt_to_xe(gt);
 	int err;
 
-	err = vf_get_ggtt_info(gt);
+	err = vf_get_ggtt_info(gt, recovery);
 	if (unlikely(err))
 		return err;
 
@@ -584,80 +610,16 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
  */
 u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
 {
-	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
-	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
-
-	return gt->sriov.vf.self_config.num_ctxs;
-}
-
-/**
- * xe_gt_sriov_vf_lmem - VF LMEM configuration.
- * @gt: the &xe_gt
- *
- * This function is for VF use only.
- *
- * Return: size of the LMEM assigned to VF.
- */
-u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
-{
-	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
-	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
-
-	return gt->sriov.vf.self_config.lmem_size;
-}
-
-/**
- * xe_gt_sriov_vf_ggtt - VF GGTT configuration.
- * @gt: the &xe_gt
- *
- * This function is for VF use only.
- *
- * Return: size of the GGTT assigned to VF.
- */
-u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
-{
-	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
-	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
-
-	return gt->sriov.vf.self_config.ggtt_size;
-}
+	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	u16 val;
 
-/**
- * xe_gt_sriov_vf_ggtt_base - VF GGTT base offset.
- * @gt: the &xe_gt
- *
- * This function is for VF use only.
- *
- * Return: base offset of the GGTT assigned to VF.
- */
-u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
-{
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
-	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
-
-	return gt->sriov.vf.self_config.ggtt_base;
-}
 
-/**
- * xe_gt_sriov_vf_ggtt_shift - Return shift in GGTT range due to VF migration
- * @gt: the &xe_gt struct instance
- *
- * This function is for VF use only.
- *
- * Return: The shift value; could be negative
- */
-s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
-{
-	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
+	xe_gt_assert(gt, config->num_ctxs);
+	val = config->num_ctxs;
 
-	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
-	xe_gt_assert(gt, xe_gt_is_main_type(gt));
-
-	return config->ggtt_shift;
+	return val;
 }
 
 static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
@@ -1057,6 +1019,8 @@ void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val)
  */
 void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
 {
+	struct xe_tile_sriov_vf_selfconfig *tconfig =
+		&gt_to_tile(gt)->sriov.vf.self_config;
 	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
 	struct xe_device *xe = gt_to_xe(gt);
 	char buf[10];
@@ -1064,17 +1028,15 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
 	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
 
 	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
-		   config->ggtt_base,
-		   config->ggtt_base + config->ggtt_size - 1);
-
-	string_get_size(config->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
-	drm_printf(p, "GGTT size:\t%llu (%s)\n", config->ggtt_size, buf);
+		   tconfig->ggtt_base,
+		   tconfig->ggtt_base + tconfig->ggtt_size - 1);
 
-	drm_printf(p, "GGTT shift on last restore:\t%lld\n", config->ggtt_shift);
+	string_get_size(tconfig->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
+	drm_printf(p, "GGTT size:\t%llu (%s)\n", tconfig->ggtt_size, buf);
 
 	if (IS_DGFX(xe) && xe_gt_is_main_type(gt)) {
-		string_get_size(config->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
-		drm_printf(p, "LMEM size:\t%llu (%s)\n", config->lmem_size, buf);
+		string_get_size(tconfig->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
+		drm_printf(p, "LMEM size:\t%llu (%s)\n", tconfig->lmem_size, buf);
 	}
 
 	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
@@ -1161,21 +1123,16 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
 static int vf_post_migration_fixups(struct xe_gt *gt)
 {
 	void *buf = gt->sriov.vf.migration.scratch;
-	s64 shift;
 	int err;
 
-	err = xe_gt_sriov_vf_query_config(gt);
+	err = xe_gt_sriov_vf_query_config(gt, true);
 	if (err)
 		return err;
 
-	shift = xe_gt_sriov_vf_ggtt_shift(gt);
-	if (shift) {
-		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
-		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
-		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
-		if (err)
-			return err;
-	}
+	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
+	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
+	if (err)
+		return err;
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 0adebf8aa419..47ed8d513571 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -18,7 +18,7 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt);
 void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
 				 struct xe_uc_fw_version *wanted,
 				 struct xe_uc_fw_version *found);
-int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
+int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery);
 int xe_gt_sriov_vf_connect(struct xe_gt *gt);
 int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
@@ -29,9 +29,6 @@ bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
 u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
 u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
-u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt);
-u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt);
-s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_read32(struct xe_gt *gt, struct xe_reg reg);
 void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index e753646debc4..1796d4caf62f 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -6,6 +6,7 @@
 #ifndef _XE_GT_SRIOV_VF_TYPES_H_
 #define _XE_GT_SRIOV_VF_TYPES_H_
 
+#include <linux/rwsem.h>
 #include <linux/types.h>
 #include <linux/workqueue.h>
 #include "xe_uc_fw_types.h"
@@ -14,12 +15,6 @@
  * struct xe_gt_sriov_vf_selfconfig - VF configuration data.
  */
 struct xe_gt_sriov_vf_selfconfig {
-	/** @ggtt_base: assigned base offset of the GGTT region. */
-	u64 ggtt_base;
-	/** @ggtt_size: assigned size of the GGTT region. */
-	u64 ggtt_size;
-	/** @ggtt_shift: difference in ggtt_base on last migration */
-	s64 ggtt_shift;
 	/** @lmem_size: assigned size of the LMEM. */
 	u64 lmem_size;
 	/** @num_ctxs: assigned number of GuC submission context IDs. */
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index d5adbbb013ec..c016a11b6ab1 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -713,7 +713,7 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
 	if (err)
 		return err;
 
-	err = xe_gt_sriov_vf_query_config(gt);
+	err = xe_gt_sriov_vf_query_config(gt, false);
 	if (err)
 		return err;
 
diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
index f221dbed16f0..074981e2ef07 100644
--- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
@@ -9,7 +9,6 @@
 
 #include "xe_assert.h"
 #include "xe_ggtt.h"
-#include "xe_gt_sriov_vf.h"
 #include "xe_sriov.h"
 #include "xe_sriov_printk.h"
 #include "xe_tile_sriov_vf.h"
@@ -40,10 +39,10 @@ static int vf_init_ggtt_balloons(struct xe_tile *tile)
  *
  * Return: 0 on success or a negative error code on failure.
  */
-int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
+static int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
 {
-	u64 ggtt_base = xe_gt_sriov_vf_ggtt_base(tile->primary_gt);
-	u64 ggtt_size = xe_gt_sriov_vf_ggtt(tile->primary_gt);
+	u64 ggtt_base = tile->sriov.vf.self_config.ggtt_base;
+	u64 ggtt_size = tile->sriov.vf.self_config.ggtt_size;
 	struct xe_device *xe = tile_to_xe(tile);
 	u64 wopcm = xe_wopcm_size(xe);
 	u64 start, end;
@@ -244,11 +243,30 @@ void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift)
 {
 	struct xe_ggtt *ggtt = tile->mem.ggtt;
 
-	mutex_lock(&ggtt->lock);
+	lockdep_assert_held(&ggtt->lock);
 
 	xe_tile_sriov_vf_deballoon_ggtt_locked(tile);
 	xe_ggtt_shift_nodes_locked(ggtt, shift);
 	xe_tile_sriov_vf_balloon_ggtt_locked(tile);
+}
 
-	mutex_unlock(&ggtt->lock);
+/**
+ * xe_tile_sriov_vf_lmem - VF LMEM configuration.
+ * @tile: the &xe_tile
+ *
+ * This function is for VF use only.
+ *
+ * Return: size of the LMEM assigned to VF.
+ */
+u64 xe_tile_sriov_vf_lmem(struct xe_tile *tile)
+{
+	struct xe_tile_sriov_vf_selfconfig *config = &tile->sriov.vf.self_config;
+	u64 val;
+
+	xe_tile_assert(tile, IS_SRIOV_VF(tile_to_xe(tile)));
+
+	xe_tile_assert(tile, config->lmem_size);
+	val = config->lmem_size;
+
+	return val;
 }
diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
index 93eb043171e8..54e7f2a5c4e4 100644
--- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
@@ -11,8 +11,8 @@
 struct xe_tile;
 
 int xe_tile_sriov_vf_prepare_ggtt(struct xe_tile *tile);
-int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile);
 void xe_tile_sriov_vf_deballoon_ggtt_locked(struct xe_tile *tile);
 void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift);
+u64 xe_tile_sriov_vf_lmem(struct xe_tile *tile);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
new file mode 100644
index 000000000000..140717f81d8f
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_TILE_SRIOV_VF_TYPES_H_
+#define _XE_TILE_SRIOV_VF_TYPES_H_
+
+#include <linux/mutex.h>
+
+/**
+ * struct xe_tile_sriov_vf_selfconfig - VF configuration data.
+ */
+struct xe_tile_sriov_vf_selfconfig {
+	/** @ggtt_base: assigned base offset of the GGTT region. */
+	u64 ggtt_base;
+	/** @ggtt_size: assigned size of the GGTT region. */
+	u64 ggtt_size;
+	/** @lmem_size: assigned size of the LMEM. */
+	u64 lmem_size;
+};
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_vram.c b/drivers/gpu/drm/xe/xe_vram.c
index 7adfccf68e4c..70bcbb188867 100644
--- a/drivers/gpu/drm/xe/xe_vram.c
+++ b/drivers/gpu/drm/xe/xe_vram.c
@@ -17,10 +17,10 @@
 #include "xe_device.h"
 #include "xe_force_wake.h"
 #include "xe_gt_mcr.h"
-#include "xe_gt_sriov_vf.h"
 #include "xe_mmio.h"
 #include "xe_module.h"
 #include "xe_sriov.h"
+#include "xe_tile_sriov_vf.h"
 #include "xe_ttm_vram_mgr.h"
 #include "xe_vram.h"
 #include "xe_vram_types.h"
@@ -238,9 +238,9 @@ static int tile_vram_size(struct xe_tile *tile, u64 *vram_size,
 		offset = 0;
 		for_each_tile(t, xe, id)
 			for_each_if(t->id < tile->id)
-				offset += xe_gt_sriov_vf_lmem(t->primary_gt);
+				offset += xe_tile_sriov_vf_lmem(t);
 
-		*tile_size = xe_gt_sriov_vf_lmem(gt);
+		*tile_size = xe_tile_sriov_vf_lmem(tile);
 		*vram_size = *tile_size;
 		*tile_offset = offset;
 
-- 
2.34.1

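The coordination described in patch 11 above can be sketched in userspace C: every GT computes the shift against shared per-tile state while holding one lock, so only the first GT to arrive sees a non-zero shift and later GTs see the already updated base. The names `struct tile_state`, `tile_state_init` and `recover_ggtt` are hypothetical, and a pthread mutex stands in for the GGTT lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical per-tile state shared by every GT on the tile. */
struct tile_state {
	pthread_mutex_t lock;	/* stands in for the GGTT lock */
	uint64_t ggtt_base;	/* last base known to the driver */
	int shifts_applied;	/* how many times nodes were moved */
};

void tile_state_init(struct tile_state *t, uint64_t base)
{
	pthread_mutex_init(&t->lock, NULL);
	t->ggtt_base = base;
	t->shifts_applied = 0;
}

/*
 * Called by each GT's recovery worker with the base reported after
 * migration. The shift is computed and applied under the lock, so it
 * does not need to be stored for other GTs to read later.
 */
int64_t recover_ggtt(struct tile_state *t, uint64_t new_base)
{
	int64_t shift;

	assert(t);
	pthread_mutex_lock(&t->lock);
	shift = (int64_t)(new_base - t->ggtt_base);
	t->ggtt_base = new_base;
	if (shift)
		t->shifts_applied++;	/* node fixup would happen here */
	pthread_mutex_unlock(&t->lock);

	return shift;
}
```

If two workers call `recover_ggtt()` with the same new base, whichever takes the lock second computes a shift of zero and skips the fixup.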


* [PATCH v6 12/30] drm/xe/vf: Teardown VF post migration worker on driver unload
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (10 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 11/30] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 13/30] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery Matthew Brost
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Be cautious and ensure the VF post-migration worker is not running
during driver unload.

v3:
 - Move teardown later in driver init, use devm (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.c                |  6 ++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 34 ++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  1 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  4 ++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index b11f57273b8b..2d032eb3bd6d 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -653,6 +653,12 @@ int xe_gt_init(struct xe_gt *gt)
 	if (err)
 		return err;
 
+	if (IS_SRIOV_VF(gt_to_xe(gt))) {
+		err = xe_gt_sriov_vf_init(gt);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index d227c8a3ec81..8a36f479df1b 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -739,7 +739,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
 
 	spin_lock(&gt->sriov.vf.migration.lock);
 
-	if (!gt->sriov.vf.migration.recovery_queued) {
+	if (!gt->sriov.vf.migration.recovery_queued &&
+	    !gt->sriov.vf.migration.recovery_teardown) {
 		gt->sriov.vf.migration.recovery_queued = true;
 		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
 
@@ -1211,6 +1212,17 @@ static void migration_worker_func(struct work_struct *w)
 	vf_post_migration_recovery(gt);
 }
 
+static void vf_migration_fini(void *arg)
+{
+	struct xe_gt *gt = arg;
+
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	gt->sriov.vf.migration.recovery_teardown = true;
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	cancel_work_sync(&gt->sriov.vf.migration.worker);
+}
+
 /**
  * xe_gt_sriov_vf_init_early() - GT VF init early
  * @gt: the &xe_gt
@@ -1237,6 +1249,26 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 	return 0;
 }
 
+/**
+ * xe_gt_sriov_vf_init() - GT VF init
+ * @gt: the &xe_gt
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+int xe_gt_sriov_vf_init(struct xe_gt *gt)
+{
+	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
+		return 0;
+
+	/*
+	 * We want to tear down the VF post-migration worker early during
+	 * driver unload; therefore, we add this finalization action later
+	 * during driver load.
+	 */
+	return devm_add_action_or_reset(gt_to_xe(gt)->drm.dev,
+					vf_migration_fini, gt);
+}
+
 /**
  * xe_gt_sriov_vf_recovery_pending() - VF post migration recovery pending
  * @gt: the &xe_gt
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 47ed8d513571..8c9679414565 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -24,6 +24,7 @@ int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
 void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
 
 int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
+int xe_gt_sriov_vf_init(struct xe_gt *gt);
 bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 1796d4caf62f..c1bd6fdd9ab1 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -48,10 +48,12 @@ struct xe_gt_sriov_vf_runtime {
 struct xe_gt_sriov_vf_migration {
 	/** @migration: VF migration recovery worker */
 	struct work_struct worker;
-	/** @lock: Protects recovery_queued */
+	/** @lock: Protects recovery_queued, recovery_teardown */
 	spinlock_t lock;
 	/** @scratch: Scratch memory for VF recovery */
 	void *scratch;
+	/** @recovery_teardown: VF post migration recovery is being torn down */
+	bool recovery_teardown;
 	/** @recovery_queued: VF post migration recovery in queued */
 	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
-- 
2.34.1

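The teardown gating in patch 12 above can be sketched in userspace C. The sketch assumes a recovery request is accepted only when it is neither already queued nor being torn down; all names (`struct migration_ctl`, `migration_try_queue`, and so on) are hypothetical, and a pthread mutex stands in for the spinlock. Cancelling the worker itself is not modeled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the migration bookkeeping in the patch. */
struct migration_ctl {
	pthread_mutex_t lock;	/* protects both flags below */
	bool recovery_queued;
	bool recovery_teardown;
};

void migration_ctl_init(struct migration_ctl *m)
{
	pthread_mutex_init(&m->lock, NULL);
	m->recovery_queued = false;
	m->recovery_teardown = false;
}

/* Returns true when the caller should queue the recovery worker. */
bool migration_try_queue(struct migration_ctl *m)
{
	bool queue = false;

	assert(m);
	pthread_mutex_lock(&m->lock);
	if (!m->recovery_queued && !m->recovery_teardown) {
		m->recovery_queued = true;
		queue = true;
	}
	pthread_mutex_unlock(&m->lock);

	return queue;
}

/* Worker entry: clear the queued flag so a later migration can queue again. */
void migration_worker_begin(struct migration_ctl *m)
{
	pthread_mutex_lock(&m->lock);
	m->recovery_queued = false;
	pthread_mutex_unlock(&m->lock);
}

/* Unload: set the flag first; a real driver would then cancel the worker. */
void migration_teardown(struct migration_ctl *m)
{
	pthread_mutex_lock(&m->lock);
	m->recovery_teardown = true;
	pthread_mutex_unlock(&m->lock);
}
```

Setting the teardown flag before cancelling the work means no new request can slip in between the cancel and the end of unload.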


* [PATCH v6 13/30] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (11 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 12/30] drm/xe/vf: Teardown VF post migration worker on driver unload Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

With well-behaved software, a GT reset should never occur, nor should it
happen during VF post-migration recovery. If it does, trigger a warning
but suppress the GT reset, as VF post-migration recovery is expected to
bring the VF back to a working state.

v3:
 - Better commit message (Tomasz)
v5:
 - Use xe_gt_WARN_ON (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.c          |  9 -------
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 -----
 drivers/gpu/drm/xe/xe_guc_submit.c  | 42 ++++-------------------------
 drivers/gpu/drm/xe/xe_guc_submit.h  |  3 ---
 4 files changed, 5 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 2d032eb3bd6d..cf484a2da35e 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -805,11 +805,6 @@ static int do_gt_restart(struct xe_gt *gt)
 	return 0;
 }
 
-static int gt_wait_reset_unblock(struct xe_gt *gt)
-{
-	return xe_guc_wait_reset_unblock(&gt->uc.guc);
-}
-
 static int gt_reset(struct xe_gt *gt)
 {
 	unsigned int fw_ref;
@@ -824,10 +819,6 @@ static int gt_reset(struct xe_gt *gt)
 
 	xe_gt_info(gt, "reset started\n");
 
-	err = gt_wait_reset_unblock(gt);
-	if (!err)
-		xe_gt_warn(gt, "reset block failed to get lifted");
-
 	xe_pm_runtime_get(gt_to_xe(gt));
 
 	if (xe_fault_inject_gt_reset()) {
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 8a36f479df1b..7057260175f3 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1103,17 +1103,11 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 
 static void vf_post_migration_shutdown(struct xe_gt *gt)
 {
-	int ret = 0;
-
 	spin_lock_irq(&gt->sriov.vf.migration.lock);
 	gt->sriov.vf.migration.recovery_queued = false;
 	spin_unlock_irq(&gt->sriov.vf.migration.lock);
 
 	xe_guc_submit_pause(&gt->uc.guc);
-	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
-
-	if (ret)
-		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
 }
 
 static size_t post_migration_scratch_size(struct xe_device *xe)
@@ -1147,7 +1141,6 @@ static void vf_post_migration_kickstart(struct xe_gt *gt)
 	 */
 	xe_irq_resume(gt_to_xe(gt));
 
-	xe_guc_submit_reset_unblock(&gt->uc.guc);
 	xe_guc_submit_unpause(&gt->uc.guc);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index d123bdb63369..59371b7cc8a4 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -27,6 +27,7 @@
 #include "xe_gt.h"
 #include "xe_gt_clock.h"
 #include "xe_gt_printk.h"
+#include "xe_gt_sriov_vf.h"
 #include "xe_guc.h"
 #include "xe_guc_capture.h"
 #include "xe_guc_ct.h"
@@ -1900,47 +1901,14 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 	}
 }
 
-/**
- * xe_guc_submit_reset_block - Disallow reset calls on given GuC.
- * @guc: the &xe_guc struct instance
- */
-int xe_guc_submit_reset_block(struct xe_guc *guc)
-{
-	return atomic_fetch_or(1, &guc->submission_state.reset_blocked);
-}
-
-/**
- * xe_guc_submit_reset_unblock - Allow back reset calls on given GuC.
- * @guc: the &xe_guc struct instance
- */
-void xe_guc_submit_reset_unblock(struct xe_guc *guc)
-{
-	atomic_set_release(&guc->submission_state.reset_blocked, 0);
-	wake_up_all(&guc->ct.wq);
-}
-
-static int guc_submit_reset_is_blocked(struct xe_guc *guc)
-{
-	return atomic_read_acquire(&guc->submission_state.reset_blocked);
-}
-
-/* Maximum time of blocking reset */
-#define RESET_BLOCK_PERIOD_MAX (HZ * 5)
-
-/**
- * xe_guc_wait_reset_unblock - Wait until reset blocking flag is lifted, or timeout.
- * @guc: the &xe_guc struct instance
- */
-int xe_guc_wait_reset_unblock(struct xe_guc *guc)
-{
-	return wait_event_timeout(guc->ct.wq,
-				  !guc_submit_reset_is_blocked(guc), RESET_BLOCK_PERIOD_MAX);
-}
-
 int xe_guc_submit_reset_prepare(struct xe_guc *guc)
 {
 	int ret;
 
+	if (xe_gt_WARN_ON(guc_to_gt(guc),
+			  xe_gt_sriov_vf_recovery_pending(guc_to_gt(guc))))
+		return 0;
+
 	if (!guc->submission_state.initialized)
 		return 0;
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index 5b4a0a6fd818..f535fe3895e5 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -22,9 +22,6 @@ void xe_guc_submit_stop(struct xe_guc *guc);
 int xe_guc_submit_start(struct xe_guc *guc);
 void xe_guc_submit_pause(struct xe_guc *guc);
 void xe_guc_submit_unpause(struct xe_guc *guc);
-int xe_guc_submit_reset_block(struct xe_guc *guc);
-void xe_guc_submit_reset_unblock(struct xe_guc *guc);
-int xe_guc_wait_reset_unblock(struct xe_guc *guc);
 void xe_guc_submit_wedge(struct xe_guc *guc);
 
 int xe_guc_read_stopped(struct xe_guc *guc);
-- 
2.34.1

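The behaviour described in patch 13 above (warn, but do not reset, while recovery is pending) can be sketched as a simple guard. This is a simplified userspace model with hypothetical names (`struct gt_state`, `try_gt_reset`); it does not reproduce the return-value semantics of xe_guc_submit_reset_prepare().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified view of the state the guard looks at. */
struct gt_state {
	bool recovery_pending;	/* VF post-migration recovery queued or running */
	int resets_done;	/* resets that were actually carried out */
	int resets_suppressed;	/* resets refused because of recovery */
};

/* Returns true if the reset went ahead, false if it was suppressed. */
bool try_gt_reset(struct gt_state *gt)
{
	assert(gt);
	if (gt->recovery_pending) {
		/* Unexpected with well-behaved software: warn and back off. */
		fprintf(stderr, "WARN: GT reset requested during VF recovery\n");
		gt->resets_suppressed++;
		return false;
	}

	gt->resets_done++;
	return true;
}
```

The suppressed case relies on recovery itself bringing the VF back to a working state, as the commit message states.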


* [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (12 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 13/30] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 14:35   ` Michal Wajdeczko
  2025-10-06 22:27   ` Lis, Tomasz
  2025-10-06 11:10 ` [PATCH v6 15/30] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration Matthew Brost
                   ` (20 subsequent siblings)
  34 siblings, 2 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

If VF post-migration recovery is in progress, the recovery flow will
rebuild all GuC submission state. In this case, exit all waiters to
ensure that submission queue scheduling can also be paused. Avoid taking
any adverse actions after aborting the wait.

As part of waking up the GuC backend, suspend_wait can now return
-EAGAIN, indicating the wait should be retried. If the caller is
running in a work item, that work item needs to be requeued to avoid a
deadlock in which it blocks the VF migration recovery work item.

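The -EAGAIN contract described above can be sketched in userspace C. This is only an illustration: `struct queue_state`, `suspend_wait_once` and `preempt_work_once` are hypothetical names, and the actual requeue call is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified state consulted by the wait. */
struct queue_state {
	bool recovery_in_progress;	/* VF recovery has woken the waiters */
	bool suspended;			/* the suspend has completed */
};

/* Returns 0 once suspended, -EAGAIN if recovery interrupted the wait. */
int suspend_wait_once(const struct queue_state *q)
{
	assert(q);
	if (q->recovery_in_progress)
		return -EAGAIN;
	return q->suspended ? 0 : -ETIME;
}

/*
 * One pass of a work item. Returns true if the work item must be
 * requeued instead of blocking, so the recovery work item can run.
 */
bool preempt_work_once(const struct queue_state *q, int *result)
{
	int err = suspend_wait_once(q);

	if (err == -EAGAIN)
		return true;	/* do not take any adverse action, just retry */

	*result = err;
	return false;
}
```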
v3:
 - Don't block in preempt fence work queue as this can interfere with VF
   post-migration work queue scheduling leading to deadlock (Testing)
 - Use xe_gt_recovery_inprogress (Michal)
v5:
 - Use static function for vf_recovery (Michal)
 - Add helper to wake CT waiters (Michal)
 - Move some code to following patch (Michal)
 - Adjust commit message to explain suspend_wait returning -EAGAIN (Michal)
 - Add kernel doc to suspend_wait around returning -EAGAIN

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_exec_queue_types.h |  3 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c      |  4 ++
 drivers/gpu/drm/xe/xe_guc_ct.h           |  9 +++
 drivers/gpu/drm/xe/xe_guc_submit.c       | 82 ++++++++++++++++++------
 drivers/gpu/drm/xe/xe_preempt_fence.c    | 11 ++++
 5 files changed, 88 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
index 27b76cf9da89..282505fa1377 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
@@ -207,6 +207,9 @@ struct xe_exec_queue_ops {
 	 * call after suspend. In dma-fencing path thus must return within a
 	 * reasonable amount of time. -ETIME return shall indicate an error
 	 * waiting for suspend resulting in associated VM getting killed.
+	 * -EAGAIN return indicates the wait should be tried again; if the wait
+	 * is within a work item, the work item should be requeued as a deadlock
+	 * avoidance mechanism.
 	 */
 	int (*suspend_wait)(struct xe_exec_queue *q);
 	/**
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 7057260175f3..7f703336d692 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -23,6 +23,7 @@
 #include "xe_gt_sriov_vf.h"
 #include "xe_gt_sriov_vf_types.h"
 #include "xe_guc.h"
+#include "xe_guc_ct.h"
 #include "xe_guc_hxg_helpers.h"
 #include "xe_guc_relay.h"
 #include "xe_guc_submit.h"
@@ -743,6 +744,9 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
 	    !gt->sriov.vf.migration.recovery_teardown) {
 		gt->sriov.vf.migration.recovery_queued = true;
 		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
+		smp_wmb();	/* Ensure above write visible before wake */
+
+		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
 
 		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
 		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index d6c81325a76c..ca0ec938edac 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -72,4 +72,13 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
 
 long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
 
+/**
+ * xe_guc_ct_wake_waiters() - GuC CT wake up waiters
+ * @ct: GuC CT object
+ */
+static inline void xe_guc_ct_wake_waiters(struct xe_guc_ct *ct)
+{
+	wake_up_all(&ct->wq);
+}
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 59371b7cc8a4..b2ca4911efe9 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -27,7 +27,6 @@
 #include "xe_gt.h"
 #include "xe_gt_clock.h"
 #include "xe_gt_printk.h"
-#include "xe_gt_sriov_vf.h"
 #include "xe_guc.h"
 #include "xe_guc_capture.h"
 #include "xe_guc_ct.h"
@@ -702,6 +701,11 @@ static u32 wq_space_until_wrap(struct xe_exec_queue *q)
 	return (WQ_SIZE - q->guc->wqi_tail);
 }
 
+static bool vf_recovery(struct xe_guc *guc)
+{
+	return xe_gt_recovery_pending(guc_to_gt(guc));
+}
+
 static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
 {
 	struct xe_guc *guc = exec_queue_to_guc(q);
@@ -711,7 +715,7 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
 
 #define AVAILABLE_SPACE \
 	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
-	if (wqi_size > AVAILABLE_SPACE) {
+	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
 try_again:
 		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
 		if (wqi_size > AVAILABLE_SPACE) {
@@ -910,9 +914,10 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
 	ret = wait_event_timeout(guc->ct.wq,
 				 (!exec_queue_pending_enable(q) &&
 				  !exec_queue_pending_disable(q)) ||
-					 xe_guc_read_stopped(guc),
+					 xe_guc_read_stopped(guc) ||
+					 vf_recovery(guc),
 				 HZ * 5);
-	if (!ret) {
+	if (!ret && !vf_recovery(guc)) {
 		struct xe_gpu_scheduler *sched = &q->guc->sched;
 
 		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
@@ -1015,6 +1020,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	bool wedged = false;
 
 	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
+
+	if (vf_recovery(guc))
+		return;
+
 	trace_xe_exec_queue_lr_cleanup(q);
 
 	if (!exec_queue_killed(q))
@@ -1047,7 +1056,11 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 		 */
 		ret = wait_event_timeout(guc->ct.wq,
 					 !exec_queue_pending_disable(q) ||
-					 xe_guc_read_stopped(guc), HZ * 5);
+					 xe_guc_read_stopped(guc) ||
+					 vf_recovery(guc), HZ * 5);
+		if (vf_recovery(guc))
+			return;
+
 		if (!ret) {
 			xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
 				   q->guc->id);
@@ -1137,8 +1150,9 @@ static void enable_scheduling(struct xe_exec_queue *q)
 
 	ret = wait_event_timeout(guc->ct.wq,
 				 !exec_queue_pending_enable(q) ||
-				 xe_guc_read_stopped(guc), HZ * 5);
-	if (!ret || xe_guc_read_stopped(guc)) {
+				 xe_guc_read_stopped(guc) ||
+				 vf_recovery(guc), HZ * 5);
+	if ((!ret && !vf_recovery(guc)) || xe_guc_read_stopped(guc)) {
 		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
 		set_exec_queue_banned(q);
 		xe_gt_reset_async(q->gt);
@@ -1209,7 +1223,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * list so job can be freed and kick scheduler ensuring free job is not
 	 * lost.
 	 */
-	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
+	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags) ||
+	    vf_recovery(guc))
 		return DRM_GPU_SCHED_STAT_NO_HANG;
 
 	/* Kill the run_job entry point */
@@ -1261,7 +1276,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 			ret = wait_event_timeout(guc->ct.wq,
 						 (!exec_queue_pending_enable(q) &&
 						  !exec_queue_pending_disable(q)) ||
-						 xe_guc_read_stopped(guc), HZ * 5);
+						 xe_guc_read_stopped(guc) ||
+						 vf_recovery(guc), HZ * 5);
+			if (vf_recovery(guc))
+				goto handle_vf_resume;
 			if (!ret || xe_guc_read_stopped(guc))
 				goto trigger_reset;
 
@@ -1286,7 +1304,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		smp_rmb();
 		ret = wait_event_timeout(guc->ct.wq,
 					 !exec_queue_pending_disable(q) ||
-					 xe_guc_read_stopped(guc), HZ * 5);
+					 xe_guc_read_stopped(guc) ||
+					 vf_recovery(guc), HZ * 5);
+		if (vf_recovery(guc))
+			goto handle_vf_resume;
 		if (!ret || xe_guc_read_stopped(guc)) {
 trigger_reset:
 			if (!ret)
@@ -1391,6 +1412,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * some thought, do this in a follow up.
 	 */
 	xe_sched_submission_start(sched);
+handle_vf_resume:
 	return DRM_GPU_SCHED_STAT_NO_HANG;
 }
 
@@ -1487,11 +1509,17 @@ static void __guc_exec_queue_process_msg_set_sched_props(struct xe_sched_msg *ms
 
 static void __suspend_fence_signal(struct xe_exec_queue *q)
 {
+	struct xe_guc *guc = exec_queue_to_guc(q);
+	struct xe_device *xe = guc_to_xe(guc);
+
 	if (!q->guc->suspend_pending)
 		return;
 
 	WRITE_ONCE(q->guc->suspend_pending, false);
-	wake_up(&q->guc->suspend_wait);
+	if (IS_SRIOV_VF(xe))
+		wake_up_all(&guc->ct.wq);
+	else
+		wake_up(&q->guc->suspend_wait);
 }
 
 static void suspend_fence_signal(struct xe_exec_queue *q)
@@ -1512,8 +1540,9 @@ static void __guc_exec_queue_process_msg_suspend(struct xe_sched_msg *msg)
 
 	if (guc_exec_queue_allowed_to_change_state(q) && !exec_queue_suspended(q) &&
 	    exec_queue_enabled(q)) {
-		wait_event(guc->ct.wq, (q->guc->resume_time != RESUME_PENDING ||
-			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q));
+		wait_event(guc->ct.wq, vf_recovery(guc) ||
+			   ((q->guc->resume_time != RESUME_PENDING ||
+			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q)));
 
 		if (!xe_guc_read_stopped(guc)) {
 			s64 since_resume_ms =
@@ -1640,7 +1669,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
 
 	q->entity = &ge->entity;
 
-	if (xe_guc_read_stopped(guc))
+	if (xe_guc_read_stopped(guc) || vf_recovery(guc))
 		xe_sched_stop(sched);
 
 	mutex_unlock(&guc->submission_state.lock);
@@ -1786,6 +1815,7 @@ static int guc_exec_queue_suspend(struct xe_exec_queue *q)
 static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
 {
 	struct xe_guc *guc = exec_queue_to_guc(q);
+	struct xe_device *xe = guc_to_xe(guc);
 	int ret;
 
 	/*
@@ -1793,11 +1823,22 @@ static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
 	 * suspend_pending upon kill but to be paranoid but races in which
 	 * suspend_pending is set after kill also check kill here.
 	 */
-	ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
-					       !READ_ONCE(q->guc->suspend_pending) ||
-					       exec_queue_killed(q) ||
-					       xe_guc_read_stopped(guc),
-					       HZ * 5);
+	if (IS_SRIOV_VF(xe))
+		ret = wait_event_interruptible_timeout(guc->ct.wq,
+						       !READ_ONCE(q->guc->suspend_pending) ||
+						       exec_queue_killed(q) ||
+						       xe_guc_read_stopped(guc) ||
+						       vf_recovery(guc),
+						       HZ * 5);
+	else
+		ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
+						       !READ_ONCE(q->guc->suspend_pending) ||
+						       exec_queue_killed(q) ||
+						       xe_guc_read_stopped(guc),
+						       HZ * 5);
+
+	if (vf_recovery(guc) && !xe_device_wedged((guc_to_xe(guc))))
+		return -EAGAIN;
 
 	if (!ret) {
 		xe_gt_warn(guc_to_gt(guc),
@@ -1905,8 +1946,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
 {
 	int ret;
 
-	if (xe_gt_WARN_ON(guc_to_gt(guc),
-			  xe_gt_sriov_vf_recovery_pending(guc_to_gt(guc))))
+	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
 		return 0;
 
 	if (!guc->submission_state.initialized)
diff --git a/drivers/gpu/drm/xe/xe_preempt_fence.c b/drivers/gpu/drm/xe/xe_preempt_fence.c
index 83fbeea5aa20..7f587ca3947d 100644
--- a/drivers/gpu/drm/xe/xe_preempt_fence.c
+++ b/drivers/gpu/drm/xe/xe_preempt_fence.c
@@ -8,6 +8,8 @@
 #include <linux/slab.h>
 
 #include "xe_exec_queue.h"
+#include "xe_gt_printk.h"
+#include "xe_guc_exec_queue_types.h"
 #include "xe_vm.h"
 
 static void preempt_fence_work_func(struct work_struct *w)
@@ -22,6 +24,15 @@ static void preempt_fence_work_func(struct work_struct *w)
 	} else if (!q->ops->reset_status(q)) {
 		int err = q->ops->suspend_wait(q);
 
+		if (err == -EAGAIN) {
+			xe_gt_dbg(q->gt, "PREEMPT FENCE RETRY guc_id=%d",
+				  q->guc->id);
+			queue_work(q->vm->xe->preempt_fence_wq,
+				   &pfence->preempt_work);
+			dma_fence_end_signalling(cookie);
+			return;
+		}
+
 		if (err)
 			dma_fence_set_error(&pfence->base, err);
 	} else {
-- 
2.34.1



* [PATCH v6 15/30] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (13 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Blocking in a work queue on a hardware action that may never occur,
especially when that action depends on a software fixup also scheduled
on a work queue, is a recipe for deadlock. This situation arises with
the preempt rebind worker and VF post-migration recovery. To prevent
potential deadlocks, avoid indefinite blocking in the preempt rebind
worker for VFs that support migration.
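
The bounded wait described above can be sketched in userspace as follows.
This is an illustration only: the toy_* names, the tick values, and the
fence model are assumptions and do not correspond to the dma-fence API.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_HZ			1000L		/* hypothetical ticks per second */
#define TOY_MAX_TIMEOUT		LONG_MAX	/* stands in for an unbounded wait */

/* Hypothetical fence that signals after a fixed number of ticks. */
struct toy_fence {
	long signals_after;
};

/* Returns the remaining ticks (> 0) if signaled in time, 0 on timeout. */
static long toy_fence_wait_timeout(const struct toy_fence *f, long timeout)
{
	if (f->signals_after >= timeout)
		return 0;
	return timeout - f->signals_after;
}

/* Waits on each fence; a timeout becomes -EAGAIN so the caller can requeue. */
static int toy_wait_for_fences(const struct toy_fence *fences, int count,
			       bool vf_migration)
{
	long wait_time = vf_migration ? TOY_HZ / 5 : TOY_MAX_TIMEOUT;

	assert(count >= 0);
	for (int i = 0; i < count; i++)
		if (!toy_fence_wait_timeout(&fences[i], wait_time))
			return -EAGAIN;
	return 0;
}
```

With vf_migration set, a slow fence yields -EAGAIN after a short wait, so
the worker can return and be requeued rather than occupying the work queue.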

v4:
 - Use dma_fence_wait_timeout (CI)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 4e914928e0a9..faca626702b8 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -35,6 +35,7 @@
 #include "xe_pt.h"
 #include "xe_pxp.h"
 #include "xe_res_cursor.h"
+#include "xe_sriov_vf.h"
 #include "xe_svm.h"
 #include "xe_sync.h"
 #include "xe_tile.h"
@@ -111,12 +112,22 @@ static int alloc_preempt_fences(struct xe_vm *vm, struct list_head *list,
 static int wait_for_existing_preempt_fences(struct xe_vm *vm)
 {
 	struct xe_exec_queue *q;
+	bool vf_migration = IS_SRIOV_VF(vm->xe) &&
+		xe_sriov_vf_migration_supported(vm->xe);
+	signed long wait_time = vf_migration ? HZ / 5 : MAX_SCHEDULE_TIMEOUT;
 
 	xe_vm_assert_held(vm);
 
 	list_for_each_entry(q, &vm->preempt.exec_queues, lr.link) {
 		if (q->lr.pfence) {
-			long timeout = dma_fence_wait(q->lr.pfence, false);
+			long timeout;
+
+			timeout = dma_fence_wait_timeout(q->lr.pfence, false,
+							 wait_time);
+			if (!timeout) {
+				xe_assert(vm->xe, vf_migration);
+				return -EAGAIN;
+			}
 
 			/* Only -ETIME on fence indicates VM needs to be killed */
 			if (timeout < 0 || q->lr.pfence->error == -ETIME)
@@ -541,6 +552,19 @@ static void preempt_rebind_work_func(struct work_struct *w)
 out_unlock_outer:
 	if (err == -EAGAIN) {
 		trace_xe_vm_rebind_worker_retry(vm);
+
+		/*
+		 * We can't block in workers on a VF which supports migration
+		 * given this can block the VF post-migration workers from
+		 * getting scheduled.
+		 */
+		if (IS_SRIOV_VF(vm->xe) &&
+		    xe_sriov_vf_migration_supported(vm->xe)) {
+			up_write(&vm->lock);
+			xe_vm_queue_rebind_worker(vm);
+			return;
+		}
+
 		goto retry;
 	}
 
-- 
2.34.1



* [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (14 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 15/30] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 14:51   ` Michal Wajdeczko
  2025-10-06 22:21   ` Lis, Tomasz
  2025-10-06 11:10 ` [PATCH v6 17/30] drm/xe/vf: Flush and stop CTs in VF post migration recovery Matthew Brost
                   ` (18 subsequent siblings)
  34 siblings, 2 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

The only case where the GuC submission backend cannot reason 100%
correctly is when a GuC context is registered during VF post-migration
recovery. In this scenario, it's possible that the GuC context register
H2G is processed, but the immediately following schedule-enable H2G gets
lost.

A double register is harmless when using `GUC_HXG_TYPE_EVENT`, as GuC
simply drops the duplicate H2G. To keep things simple, use
`GUC_HXG_TYPE_EVENT` for all context registrations on VFs.
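
The message-type selection described above can be sketched as a small
decision function. The enums below are hypothetical stand-ins, not the GuC
ABI values; only the order of the checks follows the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real values live in the GuC ABI headers. */
enum toy_hxg_type {
	TOY_TYPE_REQUEST,
	TOY_TYPE_EVENT,
	TOY_TYPE_FAST_REQUEST,
};

enum toy_action {
	TOY_REGISTER_CONTEXT,
	TOY_REGISTER_CONTEXT_MULTI_LRC,
	TOY_OTHER_ACTION,
};

/* Context registration may be resent on a VF that supports migration. */
static bool toy_action_can_safely_fail(bool vf_migration, enum toy_action action)
{
	return vf_migration &&
		(action == TOY_REGISTER_CONTEXT ||
		 action == TOY_REGISTER_CONTEXT_MULTI_LRC);
}

/* Same order of checks as h2g_write() after this patch. */
static enum toy_hxg_type toy_pick_type(bool want_response, bool vf_migration,
				       enum toy_action action)
{
	assert(action <= TOY_OTHER_ACTION);
	if (want_response)
		return TOY_TYPE_REQUEST;
	if (toy_action_can_safely_fail(vf_migration, action))
		return TOY_TYPE_EVENT;
	return TOY_TYPE_FAST_REQUEST;
}
```

Note that a requested response still takes priority, so only fire-and-forget
registrations are downgraded to events.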

v5:
 - Check for xe_sriov_vf_migration_supported (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 9f0090ae64a6..3ac654cebc79 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -32,6 +32,7 @@
 #include "xe_guc_tlb_inval.h"
 #include "xe_map.h"
 #include "xe_pm.h"
+#include "xe_sriov_vf.h"
 #include "xe_trace_guc.h"
 
 static void receive_g2h(struct xe_guc_ct *ct);
@@ -736,6 +737,26 @@ static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
 	return seqno;
 }
 
+#define MAKE_ACTION(type, __action)				\
+({								\
+	FIELD_PREP(GUC_HXG_MSG_0_TYPE, type) |			\
+	FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |			\
+		   GUC_HXG_EVENT_MSG_0_DATA0, __action);	\
+})
+
+static bool vf_action_can_safely_fail(struct xe_device *xe, u32 action)
+{
+	/*
+	 * If we are VF resuming, we can't exactly track if a context
+	 * registration has been completed in the GuC state machine. It is
+	 * harmless to resend as it will just fail silently if
+	 * GUC_HXG_TYPE_EVENT is used.
+	 */
+	return IS_SRIOV_VF(xe) && xe_sriov_vf_migration_supported(xe) &&
+		(action == XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC ||
+		 action == XE_GUC_ACTION_REGISTER_CONTEXT);
+}
+
 #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one DW HxG header */
 
 static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
@@ -807,18 +828,14 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
 		FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len) |
 		FIELD_PREP(GUC_CTB_MSG_0_FENCE, ct_fence_value);
 	if (want_response) {
-		cmd[1] =
-			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
-			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
-				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
+		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_REQUEST, action[0]);
+	} else if (vf_action_can_safely_fail(xe, action[0])) {
+		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_EVENT, action[0]);
 	} else {
 		fast_req_track(ct, ct_fence_value,
 			       FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
 
-		cmd[1] =
-			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_FAST_REQUEST) |
-			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
-				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
+		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_FAST_REQUEST, action[0]);
 	}
 
 	/* H2G header in cmd[1] replaces action[0] so: */
-- 
2.34.1



* [PATCH v6 17/30] drm/xe/vf: Flush and stop CTs in VF post migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (15 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 18/30] drm/xe/vf: Reset TLB invalidations during " Matthew Brost
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Flushing CTs (i.e., processing all pending G2H messages) gives VF
post-migration recovery an accurate view of which H2G messages the GuC
has processed, enabling the GuC submission state machine to correctly
rebuild all state.

Also, stop all CT traffic, as the CT is not live during VF
post-migration recovery.
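
Why the flush has to come before the stop can be shown with a toy model.
The toy_* names are hypothetical, and the assumption that a stopped CT
ignores pending messages belongs to the model, not to the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical CT model: counters instead of real ring buffers. */
struct toy_ct {
	int pending_g2h;	/* messages the GuC has written but we have not read */
	int processed_g2h;	/* messages already handled by the host */
	bool stopped;
};

/* Drains all pending G2H messages unless the CT is already stopped. */
static void toy_receive_g2h(struct toy_ct *ct)
{
	assert(ct);
	if (ct->stopped)
		return;
	ct->processed_g2h += ct->pending_g2h;
	ct->pending_g2h = 0;
}

static void toy_ct_stop(struct toy_ct *ct)
{
	ct->stopped = true;
}

/* Flush first so recovery sees every reply the GuC already produced. */
static void toy_ct_flush_and_stop(struct toy_ct *ct)
{
	toy_receive_g2h(ct);
	toy_ct_stop(ct);
}
```

In this model, stopping first would leave the pending replies unread, and
the submission state machine would then replay work that had already been
acknowledged.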

v3:
 - xe_guc_ct_flush_and_stop rename (Michal)
 - Drop extra GuC CT WQ wake up (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  1 +
 drivers/gpu/drm/xe/xe_guc_ct.c      | 10 ++++++++++
 drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
 3 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 7f703336d692..768ab33d2486 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1111,6 +1111,7 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
 	gt->sriov.vf.migration.recovery_queued = false;
 	spin_unlock_irq(&gt->sriov.vf.migration.lock);
 
+	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
 	xe_guc_submit_pause(&gt->uc.guc);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 3ac654cebc79..f67575b1ed79 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -574,6 +574,16 @@ void xe_guc_ct_disable(struct xe_guc_ct *ct)
 	stop_g2h_handler(ct);
 }
 
+/**
+ * xe_guc_ct_flush_and_stop - Flush and stop all processing of G2H / H2G
+ * @ct: the &xe_guc_ct
+ */
+void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct)
+{
+	receive_g2h(ct);
+	xe_guc_ct_stop(ct);
+}
+
 /**
  * xe_guc_ct_stop - Set GuC to stopped state
  * @ct: the &xe_guc_ct
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index ca0ec938edac..02eaa452b400 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -17,6 +17,7 @@ int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
 int xe_guc_ct_enable(struct xe_guc_ct *ct);
 void xe_guc_ct_disable(struct xe_guc_ct *ct);
 void xe_guc_ct_stop(struct xe_guc_ct *ct);
+void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);
 void xe_guc_ct_fast_path(struct xe_guc_ct *ct);
 
 struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct);
-- 
2.34.1



* [PATCH v6 18/30] drm/xe/vf: Reset TLB invalidations during VF post migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (16 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 17/30] drm/xe/vf: Flush and stop CTs in VF post migration recovery Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 19/30] drm/xe/vf: Kickstart after resfix in " Matthew Brost
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

TLB invalidation requests can be lost during VF post-migration
recovery. Since the VF has migrated, these invalidations are no longer
needed.

Reset the TLB invalidation frontend, which will signal all pending
fences.

v3:
 - Move TLB invalidation reset after pausing submission (Tomasz)
 - Adjust commit message (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 768ab33d2486..36eedfc3c5eb 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -35,6 +35,7 @@
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
 #include "xe_tile_sriov_vf.h"
+#include "xe_tlb_inval.h"
 #include "xe_uc_fw.h"
 #include "xe_wopcm.h"
 
@@ -1113,6 +1114,7 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
 
 	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
 	xe_guc_submit_pause(&gt->uc.guc);
+	xe_tlb_inval_reset(&gt->tlb_inval);
 }
 
 static size_t post_migration_scratch_size(struct xe_device *xe)
-- 
2.34.1



* [PATCH v6 19/30] drm/xe/vf: Kickstart after resfix in VF post migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (17 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 18/30] drm/xe/vf: Reset TLB invalidations during " Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 20/30] drm/xe/vf: Start CTs before resfix " Matthew Brost
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

GuC needs to be live for the GuC submission state machine to resubmit
anything lost during VF post-migration recovery.  Therefore, move the
kickstart step after `resfix` to ensure proper resubmission.
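
The resulting step order can be sketched with a trace-recording model. The
toy_* functions are hypothetical placeholders for the recovery steps, and
the special -EAGAIN handling of the real code is left out for brevity.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical GT model that only records the order of recovery steps. */
struct toy_gt {
	char trace[64];
};

static void toy_log(struct toy_gt *gt, const char *step)
{
	assert(strlen(gt->trace) + strlen(step) < sizeof(gt->trace));
	strcat(gt->trace, step);
}

static int toy_fixups(struct toy_gt *gt)
{
	toy_log(gt, "fixups;");
	return 0;
}

/* Interrupts are resumed just before RESFIX_DONE is sent. */
static int toy_notify_resfix_done(struct toy_gt *gt)
{
	toy_log(gt, "irq_resume;resfix_done;");
	return 0;
}

/* Unpausing submission happens last, once the GuC is live again. */
static void toy_kickstart(struct toy_gt *gt)
{
	toy_log(gt, "unpause;");
}

static int toy_recovery(struct toy_gt *gt)
{
	int err = toy_fixups(gt);

	if (err)
		return err;
	err = toy_notify_resfix_done(gt);
	if (err)
		return err;
	toy_kickstart(gt);
	return 0;
}
```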

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 36eedfc3c5eb..2a988eb3e904 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1141,13 +1141,6 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 
 static void vf_post_migration_kickstart(struct xe_gt *gt)
 {
-	/*
-	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
-	 * must be working at this point, since the recovery did started,
-	 * but the rest was not enabled using the procedure from spec.
-	 */
-	xe_irq_resume(gt_to_xe(gt));
-
 	xe_guc_submit_unpause(&gt->uc.guc);
 }
 
@@ -1167,6 +1160,13 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
 	if (skip_resfix)
 		return -EAGAIN;
 
+	/*
+	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
+	 * must be working at this point, since the recovery has started,
+	 * but the rest was not enabled using the procedure from spec.
+	 */
+	xe_irq_resume(gt_to_xe(gt));
+
 	return vf_notify_resfix_done(gt);
 }
 
@@ -1190,11 +1190,12 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	if (err)
 		goto fail;
 
-	vf_post_migration_kickstart(gt);
 	err = vf_post_migration_notify_resfix_done(gt);
 	if (err && err != -EAGAIN)
 		goto fail;
 
+	vf_post_migration_kickstart(gt);
+
 	xe_pm_runtime_put(xe);
 	xe_gt_sriov_notice(gt, "migration recovery ended\n");
 	return;
-- 
2.34.1



* [PATCH v6 20/30] drm/xe/vf: Start CTs before resfix VF post migration recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (18 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 19/30] drm/xe/vf: Kickstart after resfix in " Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 21:50   ` Lis, Tomasz
  2025-10-06 11:10 ` [PATCH v6 21/30] drm/xe/vf: Abort VF post migration recovery on failure Matthew Brost
                   ` (14 subsequent siblings)
  34 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Before RESFIX_DONE, all H2G messages stuck in the CT queue need to be
squashed, as they may carry actions that contain invalid GGTT references
or are unnecessary after the HW change.

Starting the CTs clears all H2Gs in the queue. Any lost H2Gs are
resubmitted by the GuC submission state machine.
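
The difference between a full enable and a restart can be modeled on a
plain byte buffer. The layout constants mirror the sizes visible in this
patch, except that the 2 KiB descriptor size is an assumption; all toy_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_DESC_SIZE		2048	/* assumed aligned descriptor size */
#define TOY_H2G_BUFFER_OFFSET	(TOY_DESC_SIZE * 2)
#define TOY_H2G_BUFFER_SIZE	4096
#define TOY_G2H_BUFFER_SIZE	(128 * 1024)
#define TOY_BO_SIZE \
	(TOY_H2G_BUFFER_OFFSET + TOY_H2G_BUFFER_SIZE + TOY_G2H_BUFFER_SIZE)

/* Hypothetical CT model: one backing buffer plus a few state flags. */
struct toy_ct {
	unsigned char bo[TOY_BO_SIZE];
	bool h2g_broken;
	bool g2h_broken;
	int register_calls;	/* counts simulated CT registrations */
};

static void toy_ct_start(struct toy_ct *ct, bool needs_register)
{
	assert(ct);
	if (needs_register) {
		/* Full enable: wipe everything and register the buffers. */
		memset(ct->bo, 0, sizeof(ct->bo));
		ct->register_calls++;
	} else {
		/* Restart: keep descriptors and G2H, drop stale H2G only. */
		ct->h2g_broken = false;
		ct->g2h_broken = false;
		memset(ct->bo + TOY_H2G_BUFFER_OFFSET, 0, TOY_H2G_BUFFER_SIZE);
	}
}
```

Calling toy_ct_start(ct, false) corresponds to the restart path: the queued
H2G payload is discarded while the descriptors and the G2H buffer stay
untouched, and no registration is issued.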

v3:
 - Don't mess with head / tail values (Michal)
v4:
 - Don't mess with broke (Michal)
 - Add CTB_H2G_BUFFER_OFFSET (Michal)
v5:
 - Adjust commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 +++
 drivers/gpu/drm/xe/xe_guc_ct.c      | 70 +++++++++++++++++++++--------
 drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
 3 files changed, 60 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 2a988eb3e904..6052c7302cc6 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1139,6 +1139,11 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 	return 0;
 }
 
+static void vf_post_migration_rearm(struct xe_gt *gt)
+{
+	xe_guc_ct_restart(&gt->uc.guc.ct);
+}
+
 static void vf_post_migration_kickstart(struct xe_gt *gt)
 {
 	xe_guc_submit_unpause(&gt->uc.guc);
@@ -1190,6 +1195,8 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	if (err)
 		goto fail;
 
+	vf_post_migration_rearm(gt);
+
 	err = vf_post_migration_notify_resfix_done(gt);
 	if (err && err != -EAGAIN)
 		goto fail;
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index f67575b1ed79..c0d261abf735 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -167,6 +167,7 @@ ct_to_xe(struct xe_guc_ct *ct)
  */
 
 #define CTB_DESC_SIZE		ALIGN(sizeof(struct guc_ct_buffer_desc), SZ_2K)
+#define CTB_H2G_BUFFER_OFFSET	(CTB_DESC_SIZE * 2)
 #define CTB_H2G_BUFFER_SIZE	(SZ_4K)
 #define CTB_G2H_BUFFER_SIZE	(SZ_128K)
 #define G2H_ROOM_BUFFER_SIZE	(CTB_G2H_BUFFER_SIZE / 2)
@@ -190,7 +191,7 @@ long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct)
 
 static size_t guc_ct_size(void)
 {
-	return 2 * CTB_DESC_SIZE + CTB_H2G_BUFFER_SIZE +
+	return CTB_H2G_BUFFER_OFFSET + CTB_H2G_BUFFER_SIZE +
 		CTB_G2H_BUFFER_SIZE;
 }
 
@@ -331,7 +332,7 @@ static void guc_ct_ctb_h2g_init(struct xe_device *xe, struct guc_ctb *h2g,
 	h2g->desc = *map;
 	xe_map_memset(xe, &h2g->desc, 0, 0, sizeof(struct guc_ct_buffer_desc));
 
-	h2g->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_DESC_SIZE * 2);
+	h2g->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_H2G_BUFFER_OFFSET);
 }
 
 static void guc_ct_ctb_g2h_init(struct xe_device *xe, struct guc_ctb *g2h,
@@ -349,7 +350,7 @@ static void guc_ct_ctb_g2h_init(struct xe_device *xe, struct guc_ctb *g2h,
 	g2h->desc = IOSYS_MAP_INIT_OFFSET(map, CTB_DESC_SIZE);
 	xe_map_memset(xe, &g2h->desc, 0, 0, sizeof(struct guc_ct_buffer_desc));
 
-	g2h->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_DESC_SIZE * 2 +
+	g2h->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_H2G_BUFFER_OFFSET +
 					    CTB_H2G_BUFFER_SIZE);
 }
 
@@ -360,7 +361,7 @@ static int guc_ct_ctb_h2g_register(struct xe_guc_ct *ct)
 	int err;
 
 	desc_addr = xe_bo_ggtt_addr(ct->bo);
-	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_DESC_SIZE * 2;
+	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_H2G_BUFFER_OFFSET;
 	size = ct->ctbs.h2g.info.size * sizeof(u32);
 
 	err = xe_guc_self_cfg64(guc,
@@ -387,7 +388,7 @@ static int guc_ct_ctb_g2h_register(struct xe_guc_ct *ct)
 	int err;
 
 	desc_addr = xe_bo_ggtt_addr(ct->bo) + CTB_DESC_SIZE;
-	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_DESC_SIZE * 2 +
+	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_H2G_BUFFER_OFFSET +
 		CTB_H2G_BUFFER_SIZE;
 	size = ct->ctbs.g2h.info.size * sizeof(u32);
 
@@ -501,7 +502,7 @@ static void ct_exit_safe_mode(struct xe_guc_ct *ct)
 		xe_gt_dbg(ct_to_gt(ct), "GuC CT safe-mode disabled\n");
 }
 
-int xe_guc_ct_enable(struct xe_guc_ct *ct)
+static int __xe_guc_ct_start(struct xe_guc_ct *ct, bool needs_register)
 {
 	struct xe_device *xe = ct_to_xe(ct);
 	struct xe_gt *gt = ct_to_gt(ct);
@@ -509,21 +510,28 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
 
 	xe_gt_assert(gt, !xe_guc_ct_enabled(ct));
 
-	xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
-	guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
-	guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
+	if (needs_register) {
+		xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
+		guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
+		guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
 
-	err = guc_ct_ctb_h2g_register(ct);
-	if (err)
-		goto err_out;
+		err = guc_ct_ctb_h2g_register(ct);
+		if (err)
+			goto err_out;
 
-	err = guc_ct_ctb_g2h_register(ct);
-	if (err)
-		goto err_out;
+		err = guc_ct_ctb_g2h_register(ct);
+		if (err)
+			goto err_out;
 
-	err = guc_ct_control_toggle(ct, true);
-	if (err)
-		goto err_out;
+		err = guc_ct_control_toggle(ct, true);
+		if (err)
+			goto err_out;
+	} else {
+		ct->ctbs.h2g.info.broken = false;
+		ct->ctbs.g2h.info.broken = false;
+		xe_map_memset(xe, &ct->bo->vmap, CTB_H2G_BUFFER_OFFSET, 0,
+			      CTB_H2G_BUFFER_SIZE);
+	}
 
 	guc_ct_change_state(ct, XE_GUC_CT_STATE_ENABLED);
 
@@ -555,6 +563,32 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
 	return err;
 }
 
+/**
+ * xe_guc_ct_restart() - Restart GuC CT
+ * @ct: the &xe_guc_ct
+ *
+ * Restart GuC CT to an empty state without issuing a CT register MMIO command.
+ *
+ * Return: 0 on success, or a negative errno on failure.
+ */
+int xe_guc_ct_restart(struct xe_guc_ct *ct)
+{
+	return __xe_guc_ct_start(ct, false);
+}
+
+/**
+ * xe_guc_ct_enable() - Enable GuC CT
+ * @ct: the &xe_guc_ct
+ *
+ * Enable GuC CT to an empty state and issue a CT register MMIO command.
+ *
+ * Return: 0 on success, or a negative errno on failure.
+ */
+int xe_guc_ct_enable(struct xe_guc_ct *ct)
+{
+	return __xe_guc_ct_start(ct, true);
+}
+
 static void stop_g2h_handler(struct xe_guc_ct *ct)
 {
 	cancel_work_sync(&ct->g2h_worker);
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index 02eaa452b400..10d05193e51c 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -15,6 +15,7 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct);
 int xe_guc_ct_init(struct xe_guc_ct *ct);
 int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
 int xe_guc_ct_enable(struct xe_guc_ct *ct);
+int xe_guc_ct_restart(struct xe_guc_ct *ct);
 void xe_guc_ct_disable(struct xe_guc_ct *ct);
 void xe_guc_ct_stop(struct xe_guc_ct *ct);
 void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);
-- 
2.34.1



* [PATCH v6 21/30] drm/xe/vf: Abort VF post migration recovery on failure
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (19 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 20/30] drm/xe/vf: Start CTs before resfix " Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 22/30] drm/xe/vf: Replay GuC submission state on pause / unpause Matthew Brost
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

If VF post-migration recovery fails, the device is wedged. However,
submission queues still need to be enabled for proper cleanup. In such
cases, call into the GuC submission backend to restart all queues that
were previously paused.
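
The abort path can be sketched as a loop over paused queues. The structure
and the toy_* names are hypothetical; the sketch only shows that every
queue is restarted and that cleanup is triggered for queues already marked
as dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical exec queue model with just the flags the abort path needs. */
struct toy_exec_queue {
	bool sched_running;		/* scheduler accepts submissions */
	bool killed_or_banned;		/* queue should be torn down */
	bool cleanup_triggered;
};

/* Restart every paused queue so teardown of a wedged device can proceed. */
static void toy_submit_pause_abort(struct toy_exec_queue *queues, size_t count)
{
	assert(queues || count == 0);
	for (size_t i = 0; i < count; i++) {
		queues[i].sched_running = true;
		if (queues[i].killed_or_banned)
			queues[i].cleanup_triggered = true;
	}
}
```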

v3:
 - s/Avort/Abort (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 10 ++++++++++
 drivers/gpu/drm/xe/xe_guc_submit.c  | 20 ++++++++++++++++++++
 drivers/gpu/drm/xe/xe_guc_submit.h  |  1 +
 3 files changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 6052c7302cc6..c7c929bd4212 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1149,6 +1149,15 @@ static void vf_post_migration_kickstart(struct xe_gt *gt)
 	xe_guc_submit_unpause(&gt->uc.guc);
 }
 
+static void vf_post_migration_abort(struct xe_gt *gt)
+{
+	spin_lock_irq(&gt->sriov.vf.migration.lock);
+	WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false);
+	spin_unlock_irq(&gt->sriov.vf.migration.lock);
+
+	xe_guc_submit_pause_abort(&gt->uc.guc);
+}
+
 static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
 {
 	bool skip_resfix = false;
@@ -1207,6 +1216,7 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	xe_gt_sriov_notice(gt, "migration recovery ended\n");
 	return;
 fail:
+	vf_post_migration_abort(gt);
 	xe_pm_runtime_put(xe);
 	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
 	xe_device_declare_wedged(xe);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index b2ca4911efe9..e1e197ec45eb 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2087,6 +2087,26 @@ void xe_guc_submit_unpause(struct xe_guc *guc)
 	wake_up_all(&guc->ct.wq);
 }
 
+/**
+ * xe_guc_submit_pause_abort - Abort all paused submission tasks on given GuC.
+ * @guc: the &xe_guc struct instance whose scheduler is to be aborted
+ */
+void xe_guc_submit_pause_abort(struct xe_guc *guc)
+{
+	struct xe_exec_queue *q;
+	unsigned long index;
+
+	mutex_lock(&guc->submission_state.lock);
+	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+		struct xe_gpu_scheduler *sched = &q->guc->sched;
+
+		xe_sched_submission_start(sched);
+		if (exec_queue_killed_or_banned_or_wedged(q))
+			xe_guc_exec_queue_trigger_cleanup(q);
+	}
+	mutex_unlock(&guc->submission_state.lock);
+}
+
 static struct xe_exec_queue *
 g2h_exec_queue_lookup(struct xe_guc *guc, u32 guc_id)
 {
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index f535fe3895e5..fe82c317048e 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -22,6 +22,7 @@ void xe_guc_submit_stop(struct xe_guc *guc);
 int xe_guc_submit_start(struct xe_guc *guc);
 void xe_guc_submit_pause(struct xe_guc *guc);
 void xe_guc_submit_unpause(struct xe_guc *guc);
+void xe_guc_submit_pause_abort(struct xe_guc *guc);
 void xe_guc_submit_wedge(struct xe_guc *guc);
 
 int xe_guc_read_stopped(struct xe_guc *guc);
-- 
2.34.1



* [PATCH v6 22/30] drm/xe/vf: Replay GuC submission state on pause / unpause
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (20 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 21/30] drm/xe/vf: Abort VF post migration recovery on failure Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 23/30] drm/xe: Move queue init before LRC creation Matthew Brost
                   ` (12 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Fix up the GuC submission pause / unpause functions so that they replay
any state that may have been lost during VF post-migration recovery.
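
The job replay bookkeeping (the skip_emit and last_replay flags on a job) can be sketched in userspace as follows. The model_* names and the counters are hypothetical and exist only for illustration; the model assumes that "emit" means writing a job's commands into the ring and that "tail update" stands for publishing the ring tail (or appending a work queue item). Pending jobs are re-emitted once in the prepare step, and when they are run again only the last replayed job publishes the tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two per-job replay flags. */
struct model_job {
	bool skip_emit;		/* commands were already re-emitted */
	bool last_replay;	/* last job of the replayed batch */
};

/* Hypothetical counters standing in for ring activity. */
struct model_ring {
	unsigned int emitted;		/* emit operations performed */
	unsigned int tail_updates;	/* times the tail was published */
};

/* Prepare step: re-emit every pending job and mark it as already emitted. */
static void model_unpause_prepare(struct model_ring *ring,
				  struct model_job *jobs, size_t count)
{
	assert(ring && (jobs || !count));

	for (size_t i = 0; i < count; ++i) {
		ring->emitted++;
		jobs[i].skip_emit = true;
	}

	if (count)
		jobs[count - 1].last_replay = true;
}

/* Run step: skip the emit for replayed jobs, publish the tail once. */
static void model_run_job(struct model_ring *ring, struct model_job *job)
{
	assert(ring && job);

	if (!job->skip_emit)
		ring->emitted++;

	if (!job->skip_emit || job->last_replay) {
		ring->tail_updates++;
		job->last_replay = false;
	}

	job->skip_emit = false;
}
```

With three pending jobs, the prepare step performs three emits, the three subsequent runs perform no further emits and a single tail update, and a job submitted afterwards takes the normal path again.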

v3:
 - Add helpers for revert / replay (Tomasz)
 - Add comment around WQ NOPs (Tomasz)
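
The replay helper relies on head insertion to order the replayed messages: SUSPEND is added to the head first and RESUME is added to the head afterwards, so RESUME ends up in front of SUSPEND. A minimal sketch of that ordering, assuming a simple array-backed queue; the model_* names are hypothetical, and the model omits the check that skips a message which is already queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum model_op { MODEL_CLEANUP = 1, MODEL_SUSPEND, MODEL_RESUME };

#define MODEL_MAX_MSGS 8

/* Hypothetical array-backed message queue; index 0 is processed first. */
struct model_msgq {
	enum model_op ops[MODEL_MAX_MSGS];
	size_t count;
};

static void model_add_tail(struct model_msgq *mq, enum model_op op)
{
	assert(mq->count < MODEL_MAX_MSGS);
	mq->ops[mq->count++] = op;
}

static void model_add_head(struct model_msgq *mq, enum model_op op)
{
	assert(mq->count < MODEL_MAX_MSGS);
	/* Shift existing entries back by one slot to free index 0. */
	memmove(&mq->ops[1], &mq->ops[0], mq->count * sizeof(mq->ops[0]));
	mq->ops[0] = op;
	mq->count++;
}

/* Replay in the same order as the helper: cleanup, suspend, resume. */
static void model_replay(struct model_msgq *mq, bool needs_cleanup,
			 bool needs_suspend, bool needs_resume)
{
	if (needs_cleanup)
		model_add_tail(mq, MODEL_CLEANUP);
	if (needs_suspend)
		model_add_head(mq, MODEL_SUSPEND);
	if (needs_resume)
		model_add_head(mq, MODEL_RESUME);
}
```

When all three flags are set, the resulting processing order in the model is RESUME, SUSPEND, CLEANUP.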

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gpu_scheduler.c        |  14 ++
 drivers/gpu/drm/xe/xe_gpu_scheduler.h        |   2 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c          |   1 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |  15 ++
 drivers/gpu/drm/xe/xe_guc_submit.c           | 242 +++++++++++++++++--
 drivers/gpu/drm/xe/xe_guc_submit.h           |   1 +
 drivers/gpu/drm/xe/xe_sched_job_types.h      |   4 +
 7 files changed, 264 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
index 455ccaf17314..af300adc7e1a 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
@@ -135,3 +135,17 @@ void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
 	list_add_tail(&msg->link, &sched->msgs);
 	xe_sched_process_msg_queue(sched);
 }
+
+/**
+ * xe_sched_add_msg_head() - Xe GPU scheduler add message to head of list
+ * @sched: Xe GPU scheduler
+ * @msg: Message to add
+ */
+void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
+			   struct xe_sched_msg *msg)
+{
+	lockdep_assert_held(&sched->base.job_list_lock);
+
+	list_add(&msg->link, &sched->msgs);
+	xe_sched_process_msg_queue(sched);
+}
diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
index e548b2aed95a..010003a6103a 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
@@ -29,6 +29,8 @@ void xe_sched_add_msg(struct xe_gpu_scheduler *sched,
 		      struct xe_sched_msg *msg);
 void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
 			     struct xe_sched_msg *msg);
+void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
+			   struct xe_sched_msg *msg);
 
 static inline void xe_sched_msg_lock(struct xe_gpu_scheduler *sched)
 {
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index c7c929bd4212..8074ffb924ce 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1142,6 +1142,7 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 static void vf_post_migration_rearm(struct xe_gt *gt)
 {
 	xe_guc_ct_restart(&gt->uc.guc.ct);
+	xe_guc_submit_unpause_prepare(&gt->uc.guc);
 }
 
 static void vf_post_migration_kickstart(struct xe_gt *gt)
diff --git a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
index c30c0e3ccbbb..a3b034e4b205 100644
--- a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
@@ -51,6 +51,21 @@ struct xe_guc_exec_queue {
 	wait_queue_head_t suspend_wait;
 	/** @suspend_pending: a suspend of the exec_queue is pending */
 	bool suspend_pending;
+	/**
+	 * @needs_cleanup: Needs a cleanup message during VF post migration
+	 * recovery.
+	 */
+	bool needs_cleanup;
+	/**
+	 * @needs_suspend: Needs a suspend message during VF post migration
+	 * recovery.
+	 */
+	bool needs_suspend;
+	/**
+	 * @needs_resume: Needs a resume message during VF post migration
+	 * recovery.
+	 */
+	bool needs_resume;
 };
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index e1e197ec45eb..9dbdb0b54c8b 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -142,6 +142,11 @@ static void set_exec_queue_destroyed(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_DESTROYED, &q->guc->state);
 }
 
+static void clear_exec_queue_destroyed(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_DESTROYED, &q->guc->state);
+}
+
 static bool exec_queue_banned(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_BANNED;
@@ -222,7 +227,12 @@ static void set_exec_queue_extra_ref(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
 }
 
-static bool __maybe_unused exec_queue_pending_resume(struct xe_exec_queue *q)
+static void clear_exec_queue_extra_ref(struct xe_exec_queue *q)
+{
+	atomic_and(~EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
+}
+
+static bool exec_queue_pending_resume(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
 }
@@ -237,7 +247,7 @@ static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
 	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
 }
 
-static bool __maybe_unused exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
+static bool exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
 }
@@ -799,7 +809,7 @@ static void wq_item_append(struct xe_exec_queue *q)
 }
 
 #define RESUME_PENDING	~0x0ull
-static void submit_exec_queue(struct xe_exec_queue *q)
+static void submit_exec_queue(struct xe_exec_queue *q, struct xe_sched_job *job)
 {
 	struct xe_guc *guc = exec_queue_to_guc(q);
 	struct xe_lrc *lrc = q->lrc[0];
@@ -811,10 +821,13 @@ static void submit_exec_queue(struct xe_exec_queue *q)
 
 	xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q));
 
-	if (xe_exec_queue_is_parallel(q))
-		wq_item_append(q);
-	else
-		xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
+	if (!job->skip_emit || job->last_replay) {
+		if (xe_exec_queue_is_parallel(q))
+			wq_item_append(q);
+		else
+			xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
+		job->last_replay = false;
+	}
 
 	if (exec_queue_suspended(q) && !xe_exec_queue_is_parallel(q))
 		return;
@@ -867,8 +880,10 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	if (!killed_or_banned_or_wedged && !xe_sched_job_is_error(job)) {
 		if (!exec_queue_registered(q))
 			register_exec_queue(q, GUC_CONTEXT_NORMAL);
-		q->ring_ops->emit_job(job);
-		submit_exec_queue(q);
+		if (!job->skip_emit)
+			q->ring_ops->emit_job(job);
+		submit_exec_queue(q, job);
+		job->skip_emit = false;
 	}
 
 	/*
@@ -1585,6 +1600,7 @@ static void __guc_exec_queue_process_msg_resume(struct xe_sched_msg *msg)
 #define RESUME		4
 #define OPCODE_MASK	0xf
 #define MSG_LOCKED	BIT(8)
+#define MSG_HEAD	BIT(9)
 
 static void guc_exec_queue_process_msg(struct xe_sched_msg *msg)
 {
@@ -1709,12 +1725,24 @@ static void guc_exec_queue_add_msg(struct xe_exec_queue *q, struct xe_sched_msg
 	msg->private_data = q;
 
 	trace_xe_sched_msg_add(msg);
-	if (opcode & MSG_LOCKED)
+	if (opcode & MSG_HEAD)
+		xe_sched_add_msg_head(&q->guc->sched, msg);
+	else if (opcode & MSG_LOCKED)
 		xe_sched_add_msg_locked(&q->guc->sched, msg);
 	else
 		xe_sched_add_msg(&q->guc->sched, msg);
 }
 
+static void guc_exec_queue_try_add_msg_head(struct xe_exec_queue *q,
+					    struct xe_sched_msg *msg,
+					    u32 opcode)
+{
+	if (!list_empty(&msg->link))
+		return;
+
+	guc_exec_queue_add_msg(q, msg, opcode | MSG_LOCKED | MSG_HEAD);
+}
+
 static bool guc_exec_queue_try_add_msg(struct xe_exec_queue *q,
 				       struct xe_sched_msg *msg,
 				       u32 opcode)
@@ -1998,6 +2026,105 @@ void xe_guc_submit_stop(struct xe_guc *guc)
 
 }
 
+static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
+{
+	bool pending_enable, pending_disable, pending_resume;
+
+	pending_enable = exec_queue_pending_enable(q);
+	pending_resume = exec_queue_pending_resume(q);
+
+	if (pending_enable && pending_resume)
+		q->guc->needs_resume = true;
+
+	if (pending_enable && !pending_resume &&
+	    !exec_queue_pending_tdr_exit(q)) {
+		clear_exec_queue_registered(q);
+		if (xe_exec_queue_is_lr(q))
+			xe_exec_queue_put(q);
+	}
+
+	if (pending_enable) {
+		clear_exec_queue_enabled(q);
+		clear_exec_queue_pending_resume(q);
+		clear_exec_queue_pending_tdr_exit(q);
+		clear_exec_queue_pending_enable(q);
+	}
+
+	if (exec_queue_destroyed(q) && exec_queue_registered(q)) {
+		clear_exec_queue_destroyed(q);
+		if (exec_queue_extra_ref(q))
+			xe_exec_queue_put(q);
+		else
+			q->guc->needs_cleanup = true;
+		clear_exec_queue_extra_ref(q);
+	}
+
+	pending_disable = exec_queue_pending_disable(q);
+
+	if (pending_disable && exec_queue_suspended(q)) {
+		clear_exec_queue_suspended(q);
+		q->guc->needs_suspend = true;
+	}
+
+	if (pending_disable) {
+		if (!pending_enable)
+			set_exec_queue_enabled(q);
+		clear_exec_queue_pending_disable(q);
+		clear_exec_queue_check_timeout(q);
+	}
+
+	q->guc->resume_time = 0;
+}
+
+/*
+ * This function is quite complex, but it is the only real way to ensure no
+ * state is lost during VF resume flows. It scans the queue state, makes
+ * adjustments as needed, and queues jobs / messages to be replayed on unpause.
+ */
+static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
+{
+	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	struct xe_sched_job *job;
+	int i;
+
+	lockdep_assert_held(&guc->submission_state.lock);
+
+	/* Stop scheduling + flush any DRM scheduler operations */
+	xe_sched_submission_stop(sched);
+	if (xe_exec_queue_is_lr(q))
+		cancel_work_sync(&q->guc->lr_tdr);
+	else
+		cancel_delayed_work_sync(&sched->base.work_tdr);
+
+	guc_exec_queue_revert_pending_state_change(q);
+
+	if (xe_exec_queue_is_parallel(q)) {
+		struct xe_device *xe = guc_to_xe(guc);
+		struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
+
+		/*
+		 * NOP existing WQ commands that may contain stale GGTT
+		 * addresses. These will be replayed upon unpause. The hardware
+		 * seems to get confused if the WQ head/tail pointers are
+		 * adjusted.
+		 */
+		for (i = 0; i < WQ_SIZE / sizeof(u32); ++i)
+			parallel_write(xe, map, wq[i],
+				       FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
+				       FIELD_PREP(WQ_LEN_MASK, 0));
+	}
+
+	job = xe_sched_first_pending_job(sched);
+	if (job) {
+		/*
+		 * Adjust software tail so jobs submitted overwrite previous
+		 * position in ring buffer with new GGTT addresses.
+		 */
+		for (i = 0; i < q->width; ++i)
+			q->lrc[i]->ring.tail = job->ptrs[i].head;
+	}
+}
+
 /**
  * xe_guc_submit_pause - Stop further runs of submission tasks on given GuC.
  * @guc: the &xe_guc struct instance whose scheduler is to be disabled
@@ -2007,8 +2134,12 @@ void xe_guc_submit_pause(struct xe_guc *guc)
 	struct xe_exec_queue *q;
 	unsigned long index;
 
+	xe_gt_assert(guc_to_gt(guc), vf_recovery(guc));
+
+	mutex_lock(&guc->submission_state.lock);
 	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
-		xe_sched_submission_stop_async(&q->guc->sched);
+		guc_exec_queue_pause(guc, q);
+	mutex_unlock(&guc->submission_state.lock);
 }
 
 static void guc_exec_queue_start(struct xe_exec_queue *q)
@@ -2065,11 +2196,92 @@ int xe_guc_submit_start(struct xe_guc *guc)
 	return 0;
 }
 
-static void guc_exec_queue_unpause(struct xe_exec_queue *q)
+static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
+					   struct xe_exec_queue *q)
 {
 	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	struct drm_sched_job *s_job;
+	struct xe_sched_job *job = NULL;
+
+	list_for_each_entry(s_job, &sched->base.pending_list, list) {
+		job = to_xe_sched_job(s_job);
+
+		q->ring_ops->emit_job(job);
+		job->skip_emit = true;
+	}
 
+	if (job)
+		job->last_replay = true;
+}
+
+/**
+ * xe_guc_submit_unpause_prepare - Prepare unpause submission tasks on given GuC.
+ * @guc: the &xe_guc struct instance whose scheduler is to be prepared for unpause
+ */
+void xe_guc_submit_unpause_prepare(struct xe_guc *guc)
+{
+	struct xe_exec_queue *q;
+	unsigned long index;
+
+	xe_gt_assert(guc_to_gt(guc), vf_recovery(guc));
+
+	mutex_lock(&guc->submission_state.lock);
+	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
+		guc_exec_queue_unpause_prepare(guc, q);
+	mutex_unlock(&guc->submission_state.lock);
+}
+
+static void guc_exec_queue_replay_pending_state_change(struct xe_exec_queue *q)
+{
+	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	struct xe_sched_msg *msg;
+
+	if (q->guc->needs_cleanup) {
+		msg = q->guc->static_msgs + STATIC_MSG_CLEANUP;
+
+		guc_exec_queue_add_msg(q, msg, CLEANUP);
+		q->guc->needs_cleanup = false;
+	}
+
+	if (q->guc->needs_suspend) {
+		msg = q->guc->static_msgs + STATIC_MSG_SUSPEND;
+
+		xe_sched_msg_lock(sched);
+		guc_exec_queue_try_add_msg_head(q, msg, SUSPEND);
+		xe_sched_msg_unlock(sched);
+
+		q->guc->needs_suspend = false;
+	}
+
+	/*
+	 * The resume must be in the message queue before the suspend as it is
+	 * not possible for a resume to be issued if a suspend is pending, but
+	 * the inverse is possible.
+	 */
+	if (q->guc->needs_resume) {
+		msg = q->guc->static_msgs + STATIC_MSG_RESUME;
+
+		xe_sched_msg_lock(sched);
+		guc_exec_queue_try_add_msg_head(q, msg, RESUME);
+		xe_sched_msg_unlock(sched);
+
+		q->guc->needs_resume = false;
+	}
+}
+
+static void guc_exec_queue_unpause(struct xe_guc *guc, struct xe_exec_queue *q)
+{
+	struct xe_gpu_scheduler *sched = &q->guc->sched;
+	bool needs_tdr = exec_queue_killed_or_banned_or_wedged(q);
+
+	lockdep_assert_held(&guc->submission_state.lock);
+
+	xe_sched_resubmit_jobs(sched);
+	guc_exec_queue_replay_pending_state_change(q);
 	xe_sched_submission_start(sched);
+	if (needs_tdr)
+		xe_guc_exec_queue_trigger_cleanup(q);
+	xe_sched_submission_resume_tdr(sched);
 }
 
 /**
@@ -2081,10 +2293,10 @@ void xe_guc_submit_unpause(struct xe_guc *guc)
 	struct xe_exec_queue *q;
 	unsigned long index;
 
+	mutex_lock(&guc->submission_state.lock);
 	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
-		guc_exec_queue_unpause(q);
-
-	wake_up_all(&guc->ct.wq);
+		guc_exec_queue_unpause(guc, q);
+	mutex_unlock(&guc->submission_state.lock);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
index fe82c317048e..b49a2748ec46 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit.h
@@ -22,6 +22,7 @@ void xe_guc_submit_stop(struct xe_guc *guc);
 int xe_guc_submit_start(struct xe_guc *guc);
 void xe_guc_submit_pause(struct xe_guc *guc);
 void xe_guc_submit_unpause(struct xe_guc *guc);
+void xe_guc_submit_unpause_prepare(struct xe_guc *guc);
 void xe_guc_submit_pause_abort(struct xe_guc *guc);
 void xe_guc_submit_wedge(struct xe_guc *guc);
 
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index 7ce58765a34a..13e7a12b03ad 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -63,6 +63,10 @@ struct xe_sched_job {
 	bool ring_ops_flush_tlb;
 	/** @ggtt: mapped in ggtt. */
 	bool ggtt;
+	/** @skip_emit: skip emitting the job */
+	bool skip_emit;
+	/** @last_replay: last job being replayed */
+	bool last_replay;
 	/** @ptrs: per instance pointers. */
 	struct xe_job_ptrs ptrs[];
 };
-- 
2.34.1



* [PATCH v6 23/30] drm/xe: Move queue init before LRC creation
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (21 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 22/30] drm/xe/vf: Replay GuC submission state on pause / unpause Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 15:22   ` Michal Wajdeczko
  2025-10-06 21:33   ` Lis, Tomasz
  2025-10-06 11:10 ` [PATCH v6 24/30] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery Matthew Brost
                   ` (11 subsequent siblings)
  34 siblings, 2 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

A queue must be in the submission backend's tracking state before the
LRC is created to avoid a race condition where the LRC's GGTT addresses
are not properly fixed up during VF post-migration recovery.

Move the queue initialization—which adds the queue to the submission
backend's tracking state—before LRC creation.
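
The resulting ordering can be sketched in userspace with C11 atomics. This is an illustration only: the model_* names are hypothetical, and relaxed atomic loads and stores are used here as a rough userspace analogue of READ_ONCE / WRITE_ONCE. The queue becomes visible to the fixup walker before any LRC exists, each LRC pointer is published once it has been created, and the walker skips slots that are still NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_WIDTH 2

/* Hypothetical LRC holding a single GGTT-relative address. */
struct model_lrc {
	unsigned long ggtt_addr;
};

/* Hypothetical queue that is tracked before its LRCs are created. */
struct model_queue {
	_Atomic(struct model_lrc *) lrc[MODEL_WIDTH];
};

/* Queue init: all slots start out empty. */
static void model_queue_init(struct model_queue *q)
{
	for (size_t i = 0; i < MODEL_WIDTH; ++i)
		atomic_init(&q->lrc[i], NULL);
}

/* Creation side: publish a fully created LRC into its slot. */
static void model_publish_lrc(struct model_queue *q, size_t i,
			      struct model_lrc *lrc)
{
	assert(i < MODEL_WIDTH);
	atomic_store_explicit(&q->lrc[i], lrc, memory_order_relaxed);
}

/*
 * Recovery side: apply a GGTT shift to every published LRC and skip
 * slots that have not been created yet. Returns the number fixed up.
 */
static size_t model_rebase(struct model_queue *q, long shift)
{
	size_t fixed = 0;

	for (size_t i = 0; i < MODEL_WIDTH; ++i) {
		struct model_lrc *lrc =
			atomic_load_explicit(&q->lrc[i], memory_order_relaxed);

		if (!lrc)
			continue;

		lrc->ggtt_addr += (unsigned long)shift;
		++fixed;
	}

	return fixed;
}
```

The model does not cover the wait for in-flight GGTT fixes; it only shows why the walker must tolerate a queue whose LRC slots are still empty.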

v2:
 - Wait on VF GGTT fixes before creating LRC (testing)
v5:
 - Adjust comment in code (Tomasz)
 - Reduce race window

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_exec_queue.c        | 45 ++++++++++++++++++-----
 drivers/gpu/drm/xe/xe_execlist.c          |  2 +-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 39 +++++++++++++++++++-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  5 +++
 drivers/gpu/drm/xe/xe_guc_submit.c        |  2 +-
 drivers/gpu/drm/xe/xe_lrc.h               | 10 +++++
 7 files changed, 92 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 7621089a47fe..90cbc95f8e2e 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -15,6 +15,7 @@
 #include "xe_dep_scheduler.h"
 #include "xe_device.h"
 #include "xe_gt.h"
+#include "xe_gt_sriov_vf.h"
 #include "xe_hw_engine_class_sysfs.h"
 #include "xe_hw_engine_group.h"
 #include "xe_hw_fence.h"
@@ -205,17 +206,34 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q, u32 exec_queue_flags)
 	if (!(exec_queue_flags & EXEC_QUEUE_FLAG_KERNEL))
 		flags |= XE_LRC_CREATE_USER_CTX;
 
+	err = q->ops->init(q);
+	if (err)
+		return err;
+
+	/*
+	 * This must occur after q->ops->init to avoid race conditions during VF
+	 * post-migration recovery, as the fixups for the LRC GGTT addresses
+	 * depend on the queue being present in the backend tracking structure.
+	 *
+	 * In addition to above, we must wait on inflight GGTT changes to avoid
+	 * writing out stale values here. Such wait provides a solid solution
+	 * (without a race) only if the function can detect migration instantly
+	 * from the moment vCPU resumes execution.
+	 */
 	for (i = 0; i < q->width; ++i) {
-		q->lrc[i] = xe_lrc_create(q->hwe, q->vm, SZ_16K, q->msix_vec, flags);
-		if (IS_ERR(q->lrc[i])) {
-			err = PTR_ERR(q->lrc[i]);
+		struct xe_lrc *lrc;
+
+		xe_gt_sriov_vf_wait_valid_ggtt(q->gt);
+		lrc = xe_lrc_create(q->hwe, q->vm, xe_lrc_ring_size(),
+				    q->msix_vec, flags);
+		if (IS_ERR(lrc)) {
+			err = PTR_ERR(lrc);
 			goto err_lrc;
 		}
-	}
 
-	err = q->ops->init(q);
-	if (err)
-		goto err_lrc;
+		/* Pairs with READ_ONCE in xe_exec_queue_contexts_hwsp_rebase */
+		WRITE_ONCE(q->lrc[i], lrc);
+	}
 
 	return 0;
 
@@ -1121,9 +1139,16 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
 	int err = 0;
 
 	for (i = 0; i < q->width; ++i) {
-		xe_lrc_update_memirq_regs_with_address(q->lrc[i], q->hwe, scratch);
-		xe_lrc_update_hwctx_regs_with_address(q->lrc[i]);
-		err = xe_lrc_setup_wa_bb_with_scratch(q->lrc[i], q->hwe, scratch);
+		struct xe_lrc *lrc;
+
+		/* Pairs with WRITE_ONCE in __xe_exec_queue_init */
+		lrc = READ_ONCE(q->lrc[i]);
+		if (!lrc)
+			continue;
+
+		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
+		xe_lrc_update_hwctx_regs_with_address(lrc);
+		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
 		if (err)
 			break;
 	}
diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
index f83d421ac9d3..769d05517f93 100644
--- a/drivers/gpu/drm/xe/xe_execlist.c
+++ b/drivers/gpu/drm/xe/xe_execlist.c
@@ -339,7 +339,7 @@ static int execlist_exec_queue_init(struct xe_exec_queue *q)
 	const struct drm_sched_init_args args = {
 		.ops = &drm_sched_ops,
 		.num_rqs = 1,
-		.credit_limit = q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES,
+		.credit_limit = xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES,
 		.hang_limit = XE_SCHED_HANG_LIMIT,
 		.timeout = XE_SCHED_JOB_TIMEOUT,
 		.name = q->hwe->name,
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 8074ffb924ce..bf1806e90370 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -487,6 +487,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
 				 shift, config->ggtt_base);
 		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
 	}
+
+	WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
+	smp_wmb();	/* Ensure above write visible before wake */
+	wake_up_all(&gt->sriov.vf.migration.wq);
+
 out:
 	if (recovery)
 		mutex_unlock(&ggtt->lock);
@@ -745,7 +750,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
 	    !gt->sriov.vf.migration.recovery_teardown) {
 		gt->sriov.vf.migration.recovery_queued = true;
 		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
-		smp_wmb();	/* Ensure above write visable before wake */
+		WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, true);
+		smp_wmb();	/* Ensure above writes visible before wake */
 
 		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
 
@@ -1264,6 +1270,7 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
 	gt->sriov.vf.migration.scratch = buf;
 	spin_lock_init(&gt->sriov.vf.migration.lock);
 	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
+	init_waitqueue_head(&gt->sriov.vf.migration.wq);
 
 	return 0;
 }
@@ -1312,3 +1319,33 @@ bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt)
 
 	return READ_ONCE(gt->sriov.vf.migration.recovery_inprogress);
 }
+
+static bool vf_valid_ggtt(struct xe_gt *gt)
+{
+	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
+
+	if (xe_memirq_guc_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
+	    READ_ONCE(gt->sriov.vf.migration.ggtt_need_fixes))
+		return false;
+
+	return true;
+}
+
+/**
+ * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
+ * @gt: the &xe_gt
+ */
+void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
+{
+	int ret;
+
+	if (!IS_SRIOV_VF(gt_to_xe(gt)))
+		return;
+
+	ret = wait_event_interruptible_timeout(gt->sriov.vf.migration.wq,
+					       vf_valid_ggtt(gt),
+					       HZ * 5);
+	XE_WARN_ON(!ret);
+}
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 8c9679414565..63102029d624 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -38,4 +38,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p);
 void xe_gt_sriov_vf_print_runtime(struct xe_gt *gt, struct drm_printer *p);
 void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
 
+void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index c1bd6fdd9ab1..f0bc45a782a4 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -8,6 +8,7 @@
 
 #include <linux/rwsem.h>
 #include <linux/types.h>
+#include <linux/wait.h>
 #include <linux/workqueue.h>
 #include "xe_uc_fw_types.h"
 
@@ -50,6 +51,8 @@ struct xe_gt_sriov_vf_migration {
 	struct work_struct worker;
 	/** @lock: Protects recovery_queued, teardown */
 	spinlock_t lock;
+	/** @wq: wait queue for migration fixes */
+	wait_queue_head_t wq;
 	/** @scratch: Scratch memory for VF recovery */
 	void *scratch;
 	/** @recovery_teardown: VF post migration recovery is being torn down */
@@ -58,6 +61,8 @@ struct xe_gt_sriov_vf_migration {
 	bool recovery_queued;
 	/** @recovery_inprogress: VF post migration recovery in progress */
 	bool recovery_inprogress;
+	/** @ggtt_need_fixes: VF GGTT needs fixes */
+	bool ggtt_need_fixes;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 9dbdb0b54c8b..48d5133e76a6 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1663,7 +1663,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
 	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
 		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
 	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
-			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
+			    NULL, xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES, 64,
 			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
 			    q->name, gt_to_xe(q->gt)->drm.dev);
 	if (err)
diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
index 21a3daab0154..c4a33b135101 100644
--- a/drivers/gpu/drm/xe/xe_lrc.h
+++ b/drivers/gpu/drm/xe/xe_lrc.h
@@ -76,6 +76,16 @@ static inline void xe_lrc_put(struct xe_lrc *lrc)
 	kref_put(&lrc->refcount, xe_lrc_destroy);
 }
 
+/**
+ * xe_lrc_ring_size() - Xe LRC ring size
+ *
+ * Return: Size of the LRC ring in bytes
+ */
+static inline size_t xe_lrc_ring_size(void)
+{
+	return SZ_16K;
+}
+
 size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class);
 u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc);
 u32 xe_lrc_regs_offset(struct xe_lrc *lrc);
-- 
2.34.1



* [PATCH v6 24/30] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (22 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 23/30] drm/xe: Move queue init before LRC creation Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 25/30] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause Matthew Brost
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Add debug prints that make it possible to manually verify that the GuC
state machine correctly replays state during VF post-migration recovery.
All replay paths have been manually verified as triggered and working
during testing.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 48d5133e76a6..b33a3dd883d7 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2026,21 +2026,27 @@ void xe_guc_submit_stop(struct xe_guc *guc)
 
 }
 
-static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
+static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
+						       struct xe_exec_queue *q)
 {
 	bool pending_enable, pending_disable, pending_resume;
 
 	pending_enable = exec_queue_pending_enable(q);
 	pending_resume = exec_queue_pending_resume(q);
 
-	if (pending_enable && pending_resume)
+	if (pending_enable && pending_resume) {
 		q->guc->needs_resume = true;
+		xe_gt_dbg(guc_to_gt(guc), "Replay RESUME - guc_id=%d",
+			  q->guc->id);
+	}
 
 	if (pending_enable && !pending_resume &&
 	    !exec_queue_pending_tdr_exit(q)) {
 		clear_exec_queue_registered(q);
 		if (xe_exec_queue_is_lr(q))
 			xe_exec_queue_put(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay REGISTER - guc_id=%d",
+			  q->guc->id);
 	}
 
 	if (pending_enable) {
@@ -2048,6 +2054,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 		clear_exec_queue_pending_resume(q);
 		clear_exec_queue_pending_tdr_exit(q);
 		clear_exec_queue_pending_enable(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay ENABLE - guc_id=%d",
+			  q->guc->id);
 	}
 
 	if (exec_queue_destroyed(q) && exec_queue_registered(q)) {
@@ -2057,6 +2065,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 		else
 			q->guc->needs_cleanup = true;
 		clear_exec_queue_extra_ref(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay CLEANUP - guc_id=%d",
+			  q->guc->id);
 	}
 
 	pending_disable = exec_queue_pending_disable(q);
@@ -2064,6 +2074,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 	if (pending_disable && exec_queue_suspended(q)) {
 		clear_exec_queue_suspended(q);
 		q->guc->needs_suspend = true;
+		xe_gt_dbg(guc_to_gt(guc), "Replay SUSPEND - guc_id=%d",
+			  q->guc->id);
 	}
 
 	if (pending_disable) {
@@ -2071,6 +2083,8 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_exec_queue *q)
 			set_exec_queue_enabled(q);
 		clear_exec_queue_pending_disable(q);
 		clear_exec_queue_check_timeout(q);
+		xe_gt_dbg(guc_to_gt(guc), "Replay DISABLE - guc_id=%d",
+			  q->guc->id);
 	}
 
 	q->guc->resume_time = 0;
@@ -2096,7 +2110,7 @@ static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
 	else
 		cancel_delayed_work_sync(&sched->base.work_tdr);
 
-	guc_exec_queue_revert_pending_state_change(q);
+	guc_exec_queue_revert_pending_state_change(guc, q);
 
 	if (xe_exec_queue_is_parallel(q)) {
 		struct xe_device *xe = guc_to_xe(guc);
@@ -2206,6 +2220,9 @@ static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
 	list_for_each_entry(s_job, &sched->base.pending_list, list) {
 		job = to_xe_sched_job(s_job);
 
+		xe_gt_dbg(guc_to_gt(guc), "Replay JOB - guc_id=%d, seqno=%d",
+			  q->guc->id, xe_sched_job_seqno(job));
+
 		q->ring_ops->emit_job(job);
 		job->skip_emit = true;
 	}
-- 
2.34.1



* [PATCH v6 25/30] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (23 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 24/30] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 26/30] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups Matthew Brost
                   ` (9 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

A race condition exists where a paused VF's H2G request can be processed
and subsequently rejected. This rejection results in a FAST_REQ failure
being delivered to the KMD, which then terminates the CT via a dead
worker and triggers a GT reset—an undesirable outcome.

This workaround mitigates the issue by checking if a VF post-migration
recovery is in progress and aborting these adverse actions accordingly.
The GuC firmware will address this bug in an upcoming release. Once that
version is available and VF migration depends on it, this workaround can
be safely removed.
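
The decision described above can be modelled in isolation. The following
is a minimal sketch, not the driver code: struct sketch_ct_gt and
sketch_handle_fast_req_failure() are hypothetical names, and the two
booleans stand in for xe_gt_recovery_pending() and the CT_DEAD() path.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the GT/CT state consulted by the workaround. */
struct sketch_ct_gt {
	bool recovery_pending;	/* models xe_gt_recovery_pending(gt) */
	bool ct_dead;		/* set where the driver would call CT_DEAD() */
};

/*
 * Model of the FAST_REQ failure path: while VF post-migration recovery is
 * pending the failure is ignored (return 0); otherwise the CT is marked
 * dead and -EPROTO is returned, as in the unpatched flow.
 */
int sketch_handle_fast_req_failure(struct sketch_ct_gt *gt)
{
	if (gt->recovery_pending)
		return 0;

	gt->ct_dead = true;
	return -EPROTO;
}
```

In this model the early return is the whole workaround: the dead worker
and the resulting GT reset are only reachable when no recovery is pending.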

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index c0d261abf735..dd593e9b0fe5 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1395,6 +1395,10 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
 
 		fast_req_report(ct, fence);
 
+		/* FIXME: W/A race in the GuC, will get in firmware soon */
+		if (xe_gt_recovery_pending(gt))
+			return 0;
+
 		CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
 
 		return -EPROTO;
-- 
2.34.1



* [PATCH v6 26/30] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (24 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 25/30] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF Matthew Brost
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>

The migrate VM builds the CCS metadata save/restore batch buffer (BB) in
advance and retains it so the GuC can submit it directly when saving a
VM’s state.

When a VM migrates between VFs, the GGTT base can change. Any GGTT-based
addresses embedded in the BB would then have to be parsed and patched.

Use PPGTT addresses in the BB (including for TLB invalidation) so the BB
remains GGTT-agnostic and requires no address fixups during migration.
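
As a worked example of the address this patch writes into the flush
command, here is a small sketch of the arithmetic. NUM_KERNEL_PDE = 15
is taken from the comment in the diff below; the 4 KiB page size and all
sketch_* names are assumptions made only for illustration.

```c
#include <stdint.h>

/* 15 is stated in the patch comment; the 4 KiB page size is an assumption. */
#define SKETCH_NUM_KERNEL_PDE	15
#define SKETCH_PAGE_SIZE	4096ull

/* Mirrors (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE from the patch. */
uint64_t sketch_tlb_inval_addr(void)
{
	return (SKETCH_NUM_KERNEL_PDE - 2) * SKETCH_PAGE_SIZE;
}

/* Split a 64-bit address into the two dwords emitted after the opcode. */
uint32_t sketch_lower_32(uint64_t v)
{
	return (uint32_t)v;
}

uint32_t sketch_upper_32(uint64_t v)
{
	return (uint32_t)(v >> 32);
}
```

Under these assumptions the address is 13 * 4096 = 0xD000, a fixed offset
inside the migrate VM that does not move when the GGTT base shifts. The
diff also drops MI_FLUSH_DW_USE_GTT from the low dword, since the address
is no longer a GGTT address.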

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 1d667fa36cf3..ad03afb5145f 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -980,15 +980,27 @@ struct xe_lrc *xe_migrate_lrc(struct xe_migrate *migrate)
 	return migrate->q->lrc[0];
 }
 
-static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *dw, int i,
-				 u32 flags)
+static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
 {
-	struct xe_lrc *lrc = xe_exec_queue_lrc(q);
+	/*
+	 * The migrate VM is self-referential so it can modify its own PTEs (see
+	 * pte_update_size() or emit_pte() functions). We reserve NUM_KERNEL_PDE
+	 * entries for kernel operations (copies, clears, CCS migrate), and
+	 * suballocate the rest to user operations (binds/unbinds). With
+	 * NUM_KERNEL_PDE = 15, NUM_KERNEL_PDE - 1 is already used for PTE updates,
+	 * so assign NUM_KERNEL_PDE - 2 for TLB invalidation.
+	 */
+	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
+}
+
+static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
+{
+	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
+
 	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
 		  MI_FLUSH_IMM_DW | flags;
-	dw[i++] = lower_32_bits(xe_lrc_start_seqno_ggtt_addr(lrc)) |
-		  MI_FLUSH_DW_USE_GTT;
-	dw[i++] = upper_32_bits(xe_lrc_start_seqno_ggtt_addr(lrc));
+	dw[i++] = lower_32_bits(addr);
+	dw[i++] = upper_32_bits(addr);
 	dw[i++] = MI_NOOP;
 	dw[i++] = MI_NOOP;
 
@@ -1101,11 +1113,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
 
 		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
 
-		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
+		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
 		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
 						  src_L0_ofs, dst_is_pltt,
 						  src_L0, ccs_ofs, true);
-		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
+		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
 
 		size -= src_L0;
 	}
-- 
2.34.1



* [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (25 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 26/30] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 22:24   ` Lucas De Marchi
  2025-10-06 11:10 ` [PATCH v6 28/30] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL Matthew Brost
                   ` (7 subsequent siblings)
  34 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

VF CCS restore is a primary GT operation on which the media GT depends.
Therefore, it doesn't make much sense to run these operations in
parallel. To address this, point the media GT's ordered work queue to
the primary GT's ordered work queue on platforms that require CCS
restore as part of VF post-migration recovery (PTL VFs).
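
The queue selection can be sketched outside the driver as follows. This
is an illustrative model only: struct sketch_wq, struct sketch_gt and
both functions are hypothetical, and a plain pointer stands in for the
workqueue that the real code allocates with
drmm_alloc_ordered_workqueue().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ordered workqueue. */
struct sketch_wq {
	int pending;
};

struct sketch_gt {
	struct sketch_wq *ordered_wq;	/* queue this GT submits work to */
	struct sketch_wq own_wq;	/* models a freshly allocated queue */
};

/* Models xe_gt_alloc(tile, use_primary_wq): share or create the queue. */
void sketch_gt_init(struct sketch_gt *gt, struct sketch_gt *primary,
		    bool use_primary_wq)
{
	gt->own_wq.pending = 0;
	if (use_primary_wq && primary)
		gt->ordered_wq = primary->ordered_wq;
	else
		gt->ordered_wq = &gt->own_wq;
}

/* Models the condition passed for the media GT in xe_info_init(). */
bool sketch_media_shares_wq(bool needs_shared_vf_gt_wq, bool is_vf)
{
	return needs_shared_vf_gt_wq && is_vf;
}
```

Because an ordered work queue runs one item at a time, sharing it means
the primary and media recovery workers can no longer execute concurrently;
a non-VF device in this model keeps one queue per GT.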

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h | 2 ++
 drivers/gpu/drm/xe/xe_gt.c           | 7 +++++--
 drivers/gpu/drm/xe/xe_gt.h           | 2 +-
 drivers/gpu/drm/xe/xe_pci.c          | 6 +++++-
 drivers/gpu/drm/xe/xe_pci_types.h    | 1 +
 drivers/gpu/drm/xe/xe_tile.c         | 2 +-
 6 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index c66523bf4bf0..02c04ad7296e 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -334,6 +334,8 @@ struct xe_device {
 		u8 skip_mtcfg:1;
 		/** @info.skip_pcode: skip access to PCODE uC */
 		u8 skip_pcode:1;
+		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
+		u8 needs_shared_vf_gt_wq:1;
 	} info;
 
 	/** @wa_active: keep track of active workarounds */
diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index cf484a2da35e..05465f358c96 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -65,7 +65,7 @@
 #include "xe_wa.h"
 #include "xe_wopcm.h"
 
-struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
+struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)
 {
 	struct drm_device *drm = &tile_to_xe(tile)->drm;
 	struct xe_gt *gt;
@@ -75,7 +75,10 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
 		return ERR_PTR(-ENOMEM);
 
 	gt->tile = tile;
-	gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
+	if (use_primary_wq)
+		gt->ordered_wq = tile->primary_gt->ordered_wq;
+	else
+		gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
 	if (IS_ERR(gt->ordered_wq))
 		return ERR_CAST(gt->ordered_wq);
 
diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index 5df2ffe3ff83..9545c0c93ab6 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
 }
 
-struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
+struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
 int xe_gt_init_early(struct xe_gt *gt);
 int xe_gt_init(struct xe_gt *gt);
 void xe_gt_mmio_init(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 3f42b91efa28..25a1d96a68e7 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
 	.has_sriov = true,
 	.max_gt_per_tile = 2,
 	.needs_scratch = true,
+	.needs_shared_vf_gt_wq = true,
 };
 
 #undef PLATFORM
@@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
 	xe->info.skip_mtcfg = desc->skip_mtcfg;
 	xe->info.skip_pcode = desc->skip_pcode;
 	xe->info.needs_scratch = desc->needs_scratch;
+	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
 
 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
 				 xe_modparam.probe_display &&
@@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
 		 * Allocate and setup media GT for platforms with standalone
 		 * media.
 		 */
-		tile->media_gt = xe_gt_alloc(tile);
+		tile->media_gt = xe_gt_alloc(tile,
+					     xe->info.needs_shared_vf_gt_wq &&
+					     IS_SRIOV_VF(xe));
 		if (IS_ERR(tile->media_gt))
 			return PTR_ERR(tile->media_gt);
 
diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
index 9b9766a3baa3..b11bf6abda5b 100644
--- a/drivers/gpu/drm/xe/xe_pci_types.h
+++ b/drivers/gpu/drm/xe/xe_pci_types.h
@@ -48,6 +48,7 @@ struct xe_device_desc {
 	u8 skip_guc_pc:1;
 	u8 skip_mtcfg:1;
 	u8 skip_pcode:1;
+	u8 needs_shared_vf_gt_wq:1;
 };
 
 struct xe_graphics_desc {
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 6edb5062c1da..e9bcff2de563 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -157,7 +157,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
 	if (err)
 		return err;
 
-	tile->primary_gt = xe_gt_alloc(tile);
+	tile->primary_gt = xe_gt_alloc(tile, false);
 	if (IS_ERR(tile->primary_gt))
 		return PTR_ERR(tile->primary_gt);
 
-- 
2.34.1



* [PATCH v6 28/30] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (26 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 29/30] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses Matthew Brost
                   ` (6 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

It is possible that the media GT's VF post-migration recovery work item
gets scheduled before the primary GT's work item. Since the media GT
depends on the primary GT's work item to complete CCS restore, if the
media GT's work item is scheduled first, detect this condition and
re-queue the media GT's work item for a later time.
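
The detection step can be sketched as a pure predicate. The types and
names below are hypothetical; recovery_pending models what the patch reads
through xe_gt_sriov_vf_recovery_pending() on the primary GT.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_gt_type { SKETCH_GT_MAIN, SKETCH_GT_MEDIA };

/* Hypothetical model of the per-GT state consulted by the check. */
struct sketch_recovery_gt {
	enum sketch_gt_type type;
	bool recovery_pending;			/* recovery still queued or running */
	struct sketch_recovery_gt *primary;	/* primary GT of the same tile */
};

/*
 * Returns true when the media GT's recovery work must be re-queued because
 * the primary GT has not finished its own recovery yet.
 */
bool sketch_should_requeue(const struct sketch_recovery_gt *gt,
			   bool needs_shared_vf_gt_wq)
{
	if (!needs_shared_vf_gt_wq || gt->type != SKETCH_GT_MEDIA)
		return false;

	return gt->primary != NULL && gt->primary->recovery_pending;
}
```

When the predicate is true, the patch queues the same work item again on
the shared ordered work queue, so it runs after whatever the primary GT
already has queued there.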

v5:
 - Adjust debug message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index bf1806e90370..d43e18bb8f01 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1112,8 +1112,22 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
 		   pf_version->major, pf_version->minor);
 }
 
-static void vf_post_migration_shutdown(struct xe_gt *gt)
+static bool vf_post_migration_shutdown(struct xe_gt *gt)
 {
+	struct xe_device *xe = gt_to_xe(gt);
+
+	/*
+	 * On platforms where CCS must be restored by the primary GT, the media
+	 * GT's VF post-migration recovery must run afterward. Detect this case
+	 * and re-queue the media GT's restore work item if necessary.
+	 */
+	if (xe->info.needs_shared_vf_gt_wq && xe_gt_is_media_type(gt)) {
+		struct xe_gt *primary_gt = gt_to_tile(gt)->primary_gt;
+
+		if (xe_gt_sriov_vf_recovery_pending(primary_gt))
+			return true;
+	}
+
 	spin_lock_irq(&gt->sriov.vf.migration.lock);
 	gt->sriov.vf.migration.recovery_queued = false;
 	spin_unlock_irq(&gt->sriov.vf.migration.lock);
@@ -1121,6 +1135,8 @@ static void vf_post_migration_shutdown(struct xe_gt *gt)
 	xe_guc_ct_flush_and_stop(&gt->uc.guc.ct);
 	xe_guc_submit_pause(&gt->uc.guc);
 	xe_tlb_inval_reset(&gt->tlb_inval);
+
+	return false;
 }
 
 static size_t post_migration_scratch_size(struct xe_device *xe)
@@ -1195,11 +1211,14 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 {
 	struct xe_device *xe = gt_to_xe(gt);
 	int err;
+	bool retry;
 
 	xe_gt_sriov_dbg(gt, "migration recovery in progress\n");
 
 	xe_pm_runtime_get(xe);
-	vf_post_migration_shutdown(gt);
+	retry = vf_post_migration_shutdown(gt);
+	if (retry)
+		goto queue;
 
 	if (!xe_sriov_vf_migration_supported(xe)) {
 		xe_gt_sriov_err(gt, "migration is not supported\n");
@@ -1227,6 +1246,12 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
 	xe_pm_runtime_put(xe);
 	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
 	xe_device_declare_wedged(xe);
+	return;
+
+queue:
+	xe_gt_sriov_info(gt, "Re-queuing migration recovery\n");
+	queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
+	xe_pm_runtime_put(xe);
 }
 
 static void migration_worker_func(struct work_struct *w)
-- 
2.34.1



* [PATCH v6 29/30] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (27 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 28/30] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:10 ` [PATCH v6 30/30] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Matthew Brost
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

Rebase the CCS save/restore BB's GGTT addresses during VF post-migration
recovery by setting the software ring tail to zero, the LRC ring head to
zero, and rewriting the jump-to-BB instructions.
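
A rough model of the three steps named above, under loud assumptions:
struct sketch_ring is hypothetical, head and tail are dword indices here
(not the units the driver uses), and SKETCH_OP_BB_START is a placeholder
value, not the real MI_BATCH_BUFFER_START encoding.

```c
#include <stdint.h>

#define SKETCH_RING_DWORDS	16
#define SKETCH_OP_BB_START	0xB0B0B0B0u	/* placeholder, not a real opcode */

/* Hypothetical ring model; head/tail are dword indices in this sketch. */
struct sketch_ring {
	uint32_t dw[SKETCH_RING_DWORDS];
	uint32_t head;
	uint32_t tail;
};

/*
 * Reset the ring to its start and re-emit the jump-to-BB with the batch
 * buffer's current (post-migration) address.
 */
void sketch_rebase_ring(struct sketch_ring *ring, uint64_t bb_addr)
{
	ring->head = 0;
	ring->tail = 0;

	ring->dw[ring->tail++] = SKETCH_OP_BB_START;
	ring->dw[ring->tail++] = (uint32_t)bb_addr;		/* low dword */
	ring->dw[ring->tail++] = (uint32_t)(bb_addr >> 32);	/* high dword */
}
```

Resetting before rewriting means the jump always lands at the start of
the ring, which matches the observation in the diff below that the GuC
only accepts the save/restore context when the LRC head pointer is zero.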

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c  |  4 ++++
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.c | 28 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index d43e18bb8f01..fb4f848c2936 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -34,6 +34,7 @@
 #include "xe_pm.h"
 #include "xe_sriov.h"
 #include "xe_sriov_vf.h"
+#include "xe_sriov_vf_ccs.h"
 #include "xe_tile_sriov_vf.h"
 #include "xe_tlb_inval.h"
 #include "xe_uc_fw.h"
@@ -1153,6 +1154,9 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 	if (err)
 		return err;
 
+	if (xe_gt_is_main_type(gt))
+		xe_sriov_vf_ccs_rebase(gt_to_xe(gt));
+
 	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
 	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
 	if (err)
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
index 8dec616c37c9..790249801364 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
@@ -175,6 +175,15 @@ static void ccs_rw_update_ring(struct xe_sriov_vf_ccs_ctx *ctx)
 	struct xe_lrc *lrc = xe_exec_queue_lrc(ctx->mig_q);
 	u32 dw[10], i = 0;
 
+	/*
+	 * XXX: Save/restore fixes — for some reason, the GuC only accepts the
+	 * save/restore context if the LRC head pointer is zero. This is evident
+	 * from repeated VF migrations failing when the LRC head pointer is
+	 * non-zero.
+	 */
+	lrc->ring.tail = 0;
+	xe_lrc_set_ring_head(lrc, 0);
+
 	dw[i++] = MI_ARB_ON_OFF | MI_ARB_ENABLE;
 	dw[i++] = MI_BATCH_BUFFER_START | XE_INSTR_NUM_DW(3);
 	dw[i++] = lower_32_bits(addr);
@@ -186,6 +195,25 @@ static void ccs_rw_update_ring(struct xe_sriov_vf_ccs_ctx *ctx)
 	xe_lrc_set_ring_tail(lrc, lrc->ring.tail);
 }
 
+/**
+ * xe_sriov_vf_ccs_rebase - Rebase GGTT addresses for CCS save / restore
+ * @xe: the &xe_device.
+ */
+void xe_sriov_vf_ccs_rebase(struct xe_device *xe)
+{
+	enum xe_sriov_vf_ccs_rw_ctxs ctx_id;
+
+	if (!IS_VF_CCS_READY(xe))
+		return;
+
+	for_each_ccs_rw_ctx(ctx_id) {
+		struct xe_sriov_vf_ccs_ctx *ctx =
+			&xe->sriov.vf.ccs.contexts[ctx_id];
+
+		ccs_rw_update_ring(ctx);
+	}
+}
+
 static int register_save_restore_context(struct xe_sriov_vf_ccs_ctx *ctx)
 {
 	int ctx_type;
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h
index 0745c0ff0228..f8ca6efce9ee 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h
+++ b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.h
@@ -18,6 +18,7 @@ int xe_sriov_vf_ccs_init(struct xe_device *xe);
 int xe_sriov_vf_ccs_attach_bo(struct xe_bo *bo);
 int xe_sriov_vf_ccs_detach_bo(struct xe_bo *bo);
 int xe_sriov_vf_ccs_register_context(struct xe_device *xe);
+void xe_sriov_vf_ccs_rebase(struct xe_device *xe);
 void xe_sriov_vf_ccs_print(struct xe_device *xe, struct drm_printer *p);
 
 static inline bool xe_sriov_vf_ccs_ready(struct xe_device *xe)
-- 
2.34.1



* [PATCH v6 30/30] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (28 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 29/30] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses Matthew Brost
@ 2025-10-06 11:10 ` Matthew Brost
  2025-10-06 11:17 ` ✗ CI.checkpatch: warning for VF migration redesign (rev6) Patchwork
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 11:10 UTC (permalink / raw)
  To: intel-xe

From: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>

Some VF2GUC actions may take longer to process. Increase the default
timeout after receiving a BUSY indication to 2 seconds to cover all
worst-case scenarios.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index c016a11b6ab1..f0de1fa61898 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -1439,7 +1439,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
 		BUILD_BUG_ON((GUC_HXG_TYPE_RESPONSE_SUCCESS ^ GUC_HXG_TYPE_RESPONSE_FAILURE) != 1);
 
 		ret = xe_mmio_wait32(mmio, reply_reg, resp_mask, resp_mask,
-				     1000000, &header, false);
+				     2000000, &header, false);
 
 		if (unlikely(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, header) !=
 			     GUC_HXG_ORIGIN_GUC))
-- 
2.34.1



* ✗ CI.checkpatch: warning for VF migration redesign (rev6)
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (29 preceding siblings ...)
  2025-10-06 11:10 ` [PATCH v6 30/30] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Matthew Brost
@ 2025-10-06 11:17 ` Patchwork
  2025-10-06 11:18 ` ✓ CI.KUnit: success " Patchwork
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Patchwork @ 2025-10-06 11:17 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: VF migration redesign (rev6)
URL   : https://patchwork.freedesktop.org/series/154627/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
fbd08a78c3a3bb17964db2a326514c69c1dca660
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit d4759addc85b126761f61f0746bb213790160f66
Author: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Date:   Mon Oct 6 04:10:38 2025 -0700

    drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
    
    Some VF2GUC actions may take longer to process. Increase default timeout
    after received BUSY indication to 2sec to cover all worst case scenarios.
    
    Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Reviewed-by: Matthew Brost <matthew.brost@intel.com>
+ /mt/dim checkpatch dba1fd9754c6ee58b05564ffa50bbe7be5ddf37d drm-intel
f5b70b7cfec4 drm/xe: Add NULL checks to scratch LRC allocation
ca675112c21d drm/xe: Save off position in ring in which a job was programmed
7da6cff179f4 drm/xe/guc: Track pending-enable source in submission state
2dea60eaffe1 drm/xe: Track LR jobs in DRM scheduler pending list
c87ce9b15930 drm/xe: Don't change LRC ring head on job resubmission
c889686e214a drm/xe: Make LRC W/A scratch buffer usage consistent
e875983f0656 drm/xe/vf: Add xe_gt_recovery_pending helper
f747fdc90f0c drm/xe/vf: Make VF recovery run on per-GT worker
acfa39e8a04b drm/xe/vf: Abort H2G sends during VF post-migration recovery
bccb34ccd708 drm/xe/vf: Remove memory allocations from VF post migration recovery
b460e9174a76 drm/xe/vf: Close multi-GT GGTT shift race
-:69: WARNING:REPEATED_WORD: Possible repeated word: 'only'
#69: FILE: drivers/gpu/drm/xe/xe_gt_sriov_vf.c:452:
+	 * We only only take the GGTT lock when potentially shifting GGTTs to

-:429: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#429: 
new file mode 100644

total: 0 errors, 2 warnings, 0 checks, 403 lines checked
6672f397cc28 drm/xe/vf: Teardown VF post migration worker on driver unload
171a7854b8f3 drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery
f48a6c5608a5 drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
5f42e32f0716 drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration
1b9dd2c745db drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
2363bebb60f9 drm/xe/vf: Flush and stop CTs in VF post migration recovery
f0780f442fbc drm/xe/vf: Reset TLB invalidations during VF post migration recovery
59a35e08d952 drm/xe/vf: Kickstart after resfix in VF post migration recovery
344aa31a9eef drm/xe/vf: Start CTs before resfix VF post migration recovery
b7d04e9d3b33 drm/xe/vf: Abort VF post migration recovery on failure
c2675b6f6e64 drm/xe/vf: Replay GuC submission state on pause / unpause
c6fd88c2c92a drm/xe: Move queue init before LRC creation
ccb66a9ede9b drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
941f15925a5a drm/xe/vf: Workaround for race condition in GuC firmware during VF pause
ffb772d5eb6f drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
9b1a2a5586c9 drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
a2bd04823ccb drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
1204307420e2 drm/xe/vf: Rebase CCS save/restore BB GGTT addresses
d4759addc85b drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC




* ✓ CI.KUnit: success for VF migration redesign (rev6)
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (30 preceding siblings ...)
  2025-10-06 11:17 ` ✗ CI.checkpatch: warning for VF migration redesign (rev6) Patchwork
@ 2025-10-06 11:18 ` Patchwork
  2025-10-06 12:24 ` ✗ Xe.CI.BAT: failure " Patchwork
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 58+ messages in thread
From: Patchwork @ 2025-10-06 11:18 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: VF migration redesign (rev6)
URL   : https://patchwork.freedesktop.org/series/154627/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[11:17:45] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[11:17:49] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[11:18:18] Starting KUnit Kernel (1/1)...
[11:18:18] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:18:18] ================== guc_buf (11 subtests) ===================
[11:18:18] [PASSED] test_smallest
[11:18:18] [PASSED] test_largest
[11:18:18] [PASSED] test_granular
[11:18:18] [PASSED] test_unique
[11:18:18] [PASSED] test_overlap
[11:18:18] [PASSED] test_reusable
[11:18:18] [PASSED] test_too_big
[11:18:18] [PASSED] test_flush
[11:18:18] [PASSED] test_lookup
[11:18:18] [PASSED] test_data
[11:18:18] [PASSED] test_class
[11:18:18] ===================== [PASSED] guc_buf =====================
[11:18:18] =================== guc_dbm (7 subtests) ===================
[11:18:18] [PASSED] test_empty
[11:18:18] [PASSED] test_default
[11:18:18] ======================== test_size  ========================
[11:18:18] [PASSED] 4
[11:18:18] [PASSED] 8
[11:18:18] [PASSED] 32
[11:18:18] [PASSED] 256
[11:18:18] ==================== [PASSED] test_size ====================
[11:18:18] ======================= test_reuse  ========================
[11:18:18] [PASSED] 4
[11:18:18] [PASSED] 8
[11:18:18] [PASSED] 32
[11:18:18] [PASSED] 256
[11:18:18] =================== [PASSED] test_reuse ====================
[11:18:18] =================== test_range_overlap  ====================
[11:18:18] [PASSED] 4
[11:18:18] [PASSED] 8
[11:18:18] [PASSED] 32
[11:18:18] [PASSED] 256
[11:18:18] =============== [PASSED] test_range_overlap ================
[11:18:18] =================== test_range_compact  ====================
[11:18:18] [PASSED] 4
[11:18:18] [PASSED] 8
[11:18:18] [PASSED] 32
[11:18:18] [PASSED] 256
[11:18:18] =============== [PASSED] test_range_compact ================
[11:18:18] ==================== test_range_spare  =====================
[11:18:18] [PASSED] 4
[11:18:18] [PASSED] 8
[11:18:18] [PASSED] 32
[11:18:18] [PASSED] 256
[11:18:18] ================ [PASSED] test_range_spare =================
[11:18:18] ===================== [PASSED] guc_dbm =====================
[11:18:18] =================== guc_idm (6 subtests) ===================
[11:18:18] [PASSED] bad_init
[11:18:18] [PASSED] no_init
[11:18:18] [PASSED] init_fini
[11:18:18] [PASSED] check_used
[11:18:18] [PASSED] check_quota
[11:18:18] [PASSED] check_all
[11:18:18] ===================== [PASSED] guc_idm =====================
[11:18:18] ================== no_relay (3 subtests) ===================
[11:18:18] [PASSED] xe_drops_guc2pf_if_not_ready
[11:18:18] [PASSED] xe_drops_guc2vf_if_not_ready
[11:18:18] [PASSED] xe_rejects_send_if_not_ready
[11:18:18] ==================== [PASSED] no_relay =====================
[11:18:18] ================== pf_relay (14 subtests) ==================
[11:18:18] [PASSED] pf_rejects_guc2pf_too_short
[11:18:18] [PASSED] pf_rejects_guc2pf_too_long
[11:18:18] [PASSED] pf_rejects_guc2pf_no_payload
[11:18:18] [PASSED] pf_fails_no_payload
[11:18:18] [PASSED] pf_fails_bad_origin
[11:18:18] [PASSED] pf_fails_bad_type
[11:18:18] [PASSED] pf_txn_reports_error
[11:18:18] [PASSED] pf_txn_sends_pf2guc
[11:18:18] [PASSED] pf_sends_pf2guc
[11:18:18] [SKIPPED] pf_loopback_nop
[11:18:18] [SKIPPED] pf_loopback_echo
[11:18:18] [SKIPPED] pf_loopback_fail
[11:18:18] [SKIPPED] pf_loopback_busy
[11:18:18] [SKIPPED] pf_loopback_retry
[11:18:18] ==================== [PASSED] pf_relay =====================
[11:18:18] ================== vf_relay (3 subtests) ===================
[11:18:18] [PASSED] vf_rejects_guc2vf_too_short
[11:18:18] [PASSED] vf_rejects_guc2vf_too_long
[11:18:18] [PASSED] vf_rejects_guc2vf_no_payload
[11:18:18] ==================== [PASSED] vf_relay =====================
[11:18:18] ===================== lmtt (1 subtest) =====================
[11:18:18] ======================== test_ops  =========================
[11:18:18] [PASSED] 2-level
[11:18:18] [PASSED] multi-level
[11:18:18] ==================== [PASSED] test_ops =====================
[11:18:18] ====================== [PASSED] lmtt =======================
[11:18:18] ================= pf_service (11 subtests) =================
[11:18:18] [PASSED] pf_negotiate_any
[11:18:18] [PASSED] pf_negotiate_base_match
[11:18:18] [PASSED] pf_negotiate_base_newer
[11:18:18] [PASSED] pf_negotiate_base_next
[11:18:18] [SKIPPED] pf_negotiate_base_older
[11:18:18] [PASSED] pf_negotiate_base_prev
[11:18:18] [PASSED] pf_negotiate_latest_match
[11:18:18] [PASSED] pf_negotiate_latest_newer
[11:18:18] [PASSED] pf_negotiate_latest_next
[11:18:18] [SKIPPED] pf_negotiate_latest_older
[11:18:18] [SKIPPED] pf_negotiate_latest_prev
[11:18:18] =================== [PASSED] pf_service ====================
[11:18:18] ================= xe_guc_g2g (2 subtests) ==================
[11:18:18] ============== xe_live_guc_g2g_kunit_default  ==============
[11:18:18] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[11:18:18] ============== xe_live_guc_g2g_kunit_allmem  ===============
[11:18:18] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[11:18:18] =================== [SKIPPED] xe_guc_g2g ===================
[11:18:18] =================== xe_mocs (2 subtests) ===================
[11:18:18] ================ xe_live_mocs_kernel_kunit  ================
[11:18:18] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[11:18:18] ================ xe_live_mocs_reset_kunit  =================
[11:18:18] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[11:18:18] ==================== [SKIPPED] xe_mocs =====================
[11:18:18] ================= xe_migrate (2 subtests) ==================
[11:18:18] ================= xe_migrate_sanity_kunit  =================
[11:18:18] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[11:18:18] ================== xe_validate_ccs_kunit  ==================
[11:18:18] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[11:18:18] =================== [SKIPPED] xe_migrate ===================
[11:18:18] ================== xe_dma_buf (1 subtest) ==================
[11:18:18] ==================== xe_dma_buf_kunit  =====================
[11:18:18] ================ [SKIPPED] xe_dma_buf_kunit ================
[11:18:18] =================== [SKIPPED] xe_dma_buf ===================
[11:18:18] ================= xe_bo_shrink (1 subtest) =================
[11:18:18] =================== xe_bo_shrink_kunit  ====================
[11:18:18] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[11:18:18] ================== [SKIPPED] xe_bo_shrink ==================
[11:18:18] ==================== xe_bo (2 subtests) ====================
[11:18:18] ================== xe_ccs_migrate_kunit  ===================
[11:18:18] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[11:18:18] ==================== xe_bo_evict_kunit  ====================
[11:18:18] =============== [SKIPPED] xe_bo_evict_kunit ================
[11:18:18] ===================== [SKIPPED] xe_bo ======================
[11:18:18] ==================== args (11 subtests) ====================
[11:18:18] [PASSED] count_args_test
[11:18:18] [PASSED] call_args_example
[11:18:18] [PASSED] call_args_test
[11:18:18] [PASSED] drop_first_arg_example
[11:18:18] [PASSED] drop_first_arg_test
[11:18:18] [PASSED] first_arg_example
[11:18:18] [PASSED] first_arg_test
[11:18:18] [PASSED] last_arg_example
[11:18:18] [PASSED] last_arg_test
[11:18:18] [PASSED] pick_arg_example
[11:18:18] [PASSED] sep_comma_example
[11:18:18] ====================== [PASSED] args =======================
[11:18:18] =================== xe_pci (3 subtests) ====================
[11:18:18] ==================== check_graphics_ip  ====================
[11:18:18] [PASSED] 12.00 Xe_LP
[11:18:18] [PASSED] 12.10 Xe_LP+
[11:18:18] [PASSED] 12.55 Xe_HPG
[11:18:18] [PASSED] 12.60 Xe_HPC
[11:18:18] [PASSED] 12.70 Xe_LPG
[11:18:18] [PASSED] 12.71 Xe_LPG
[11:18:18] [PASSED] 12.74 Xe_LPG+
[11:18:18] [PASSED] 20.01 Xe2_HPG
[11:18:18] [PASSED] 20.02 Xe2_HPG
[11:18:18] [PASSED] 20.04 Xe2_LPG
[11:18:18] [PASSED] 30.00 Xe3_LPG
[11:18:18] [PASSED] 30.01 Xe3_LPG
[11:18:18] [PASSED] 30.03 Xe3_LPG
[11:18:18] ================ [PASSED] check_graphics_ip ================
[11:18:18] ===================== check_media_ip  ======================
[11:18:18] [PASSED] 12.00 Xe_M
[11:18:18] [PASSED] 12.55 Xe_HPM
[11:18:18] [PASSED] 13.00 Xe_LPM+
[11:18:18] [PASSED] 13.01 Xe2_HPM
[11:18:18] [PASSED] 20.00 Xe2_LPM
[11:18:18] [PASSED] 30.00 Xe3_LPM
[11:18:18] [PASSED] 30.02 Xe3_LPM
[11:18:18] ================= [PASSED] check_media_ip ==================
[11:18:18] ================= check_platform_gt_count  =================
[11:18:18] [PASSED] 0x9A60 (TIGERLAKE)
[11:18:18] [PASSED] 0x9A68 (TIGERLAKE)
[11:18:18] [PASSED] 0x9A70 (TIGERLAKE)
[11:18:18] [PASSED] 0x9A40 (TIGERLAKE)
[11:18:18] [PASSED] 0x9A49 (TIGERLAKE)
[11:18:18] [PASSED] 0x9A59 (TIGERLAKE)
[11:18:18] [PASSED] 0x9A78 (TIGERLAKE)
[11:18:18] [PASSED] 0x9AC0 (TIGERLAKE)
[11:18:18] [PASSED] 0x9AC9 (TIGERLAKE)
[11:18:18] [PASSED] 0x9AD9 (TIGERLAKE)
[11:18:18] [PASSED] 0x9AF8 (TIGERLAKE)
[11:18:18] [PASSED] 0x4C80 (ROCKETLAKE)
[11:18:18] [PASSED] 0x4C8A (ROCKETLAKE)
[11:18:18] [PASSED] 0x4C8B (ROCKETLAKE)
[11:18:18] [PASSED] 0x4C8C (ROCKETLAKE)
[11:18:18] [PASSED] 0x4C90 (ROCKETLAKE)
[11:18:18] [PASSED] 0x4C9A (ROCKETLAKE)
[11:18:18] [PASSED] 0x4680 (ALDERLAKE_S)
[11:18:18] [PASSED] 0x4682 (ALDERLAKE_S)
[11:18:18] [PASSED] 0x4688 (ALDERLAKE_S)
[11:18:18] [PASSED] 0x468A (ALDERLAKE_S)
[11:18:18] [PASSED] 0x468B (ALDERLAKE_S)
[11:18:18] [PASSED] 0x4690 (ALDERLAKE_S)
[11:18:18] [PASSED] 0x4692 (ALDERLAKE_S)
[11:18:18] [PASSED] 0x4693 (ALDERLAKE_S)
[11:18:18] [PASSED] 0x46A0 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46A1 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46A2 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46A3 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46A6 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46A8 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46AA (ALDERLAKE_P)
[11:18:18] [PASSED] 0x462A (ALDERLAKE_P)
[11:18:18] [PASSED] 0x4626 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x4628 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46B0 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46B1 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46B2 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46B3 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46C0 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46C1 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46C2 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46C3 (ALDERLAKE_P)
[11:18:18] [PASSED] 0x46D0 (ALDERLAKE_N)
[11:18:18] [PASSED] 0x46D1 (ALDERLAKE_N)
[11:18:18] [PASSED] 0x46D2 (ALDERLAKE_N)
[11:18:18] [PASSED] 0x46D3 (ALDERLAKE_N)
[11:18:18] [PASSED] 0x46D4 (ALDERLAKE_N)
[11:18:18] [PASSED] 0xA721 (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7A1 (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7A9 (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7AC (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7AD (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA720 (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7A0 (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7A8 (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7AA (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA7AB (ALDERLAKE_P)
[11:18:18] [PASSED] 0xA780 (ALDERLAKE_S)
[11:18:18] [PASSED] 0xA781 (ALDERLAKE_S)
[11:18:18] [PASSED] 0xA782 (ALDERLAKE_S)
[11:18:18] [PASSED] 0xA783 (ALDERLAKE_S)
[11:18:18] [PASSED] 0xA788 (ALDERLAKE_S)
[11:18:18] [PASSED] 0xA789 (ALDERLAKE_S)
[11:18:18] [PASSED] 0xA78A (ALDERLAKE_S)
[11:18:18] [PASSED] 0xA78B (ALDERLAKE_S)
[11:18:18] [PASSED] 0x4905 (DG1)
[11:18:18] [PASSED] 0x4906 (DG1)
[11:18:18] [PASSED] 0x4907 (DG1)
[11:18:18] [PASSED] 0x4908 (DG1)
[11:18:18] [PASSED] 0x4909 (DG1)
[11:18:18] [PASSED] 0x56C0 (DG2)
[11:18:18] [PASSED] 0x56C2 (DG2)
[11:18:18] [PASSED] 0x56C1 (DG2)
[11:18:18] [PASSED] 0x7D51 (METEORLAKE)
[11:18:18] [PASSED] 0x7DD1 (METEORLAKE)
[11:18:18] [PASSED] 0x7D41 (METEORLAKE)
[11:18:18] [PASSED] 0x7D67 (METEORLAKE)
[11:18:18] [PASSED] 0xB640 (METEORLAKE)
[11:18:18] [PASSED] 0x56A0 (DG2)
[11:18:18] [PASSED] 0x56A1 (DG2)
[11:18:18] [PASSED] 0x56A2 (DG2)
[11:18:18] [PASSED] 0x56BE (DG2)
[11:18:18] [PASSED] 0x56BF (DG2)
[11:18:18] [PASSED] 0x5690 (DG2)
[11:18:18] [PASSED] 0x5691 (DG2)
[11:18:18] [PASSED] 0x5692 (DG2)
[11:18:18] [PASSED] 0x56A5 (DG2)
[11:18:18] [PASSED] 0x56A6 (DG2)
[11:18:18] [PASSED] 0x56B0 (DG2)
[11:18:18] [PASSED] 0x56B1 (DG2)
[11:18:18] [PASSED] 0x56BA (DG2)
[11:18:18] [PASSED] 0x56BB (DG2)
[11:18:18] [PASSED] 0x56BC (DG2)
[11:18:18] [PASSED] 0x56BD (DG2)
[11:18:18] [PASSED] 0x5693 (DG2)
[11:18:18] [PASSED] 0x5694 (DG2)
[11:18:18] [PASSED] 0x5695 (DG2)
[11:18:18] [PASSED] 0x56A3 (DG2)
[11:18:18] [PASSED] 0x56A4 (DG2)
[11:18:18] [PASSED] 0x56B2 (DG2)
[11:18:18] [PASSED] 0x56B3 (DG2)
[11:18:18] [PASSED] 0x5696 (DG2)
[11:18:18] [PASSED] 0x5697 (DG2)
[11:18:18] [PASSED] 0xB69 (PVC)
[11:18:18] [PASSED] 0xB6E (PVC)
[11:18:18] [PASSED] 0xBD4 (PVC)
[11:18:18] [PASSED] 0xBD5 (PVC)
[11:18:18] [PASSED] 0xBD6 (PVC)
[11:18:18] [PASSED] 0xBD7 (PVC)
[11:18:18] [PASSED] 0xBD8 (PVC)
[11:18:18] [PASSED] 0xBD9 (PVC)
[11:18:18] [PASSED] 0xBDA (PVC)
[11:18:18] [PASSED] 0xBDB (PVC)
[11:18:18] [PASSED] 0xBE0 (PVC)
[11:18:18] [PASSED] 0xBE1 (PVC)
[11:18:18] [PASSED] 0xBE5 (PVC)
[11:18:18] [PASSED] 0x7D40 (METEORLAKE)
[11:18:18] [PASSED] 0x7D45 (METEORLAKE)
[11:18:18] [PASSED] 0x7D55 (METEORLAKE)
[11:18:18] [PASSED] 0x7D60 (METEORLAKE)
[11:18:18] [PASSED] 0x7DD5 (METEORLAKE)
[11:18:18] [PASSED] 0x6420 (LUNARLAKE)
[11:18:18] [PASSED] 0x64A0 (LUNARLAKE)
[11:18:18] [PASSED] 0x64B0 (LUNARLAKE)
[11:18:18] [PASSED] 0xE202 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE209 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE20B (BATTLEMAGE)
[11:18:18] [PASSED] 0xE20C (BATTLEMAGE)
[11:18:18] [PASSED] 0xE20D (BATTLEMAGE)
[11:18:18] [PASSED] 0xE210 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE211 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE212 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE216 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE220 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE221 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE222 (BATTLEMAGE)
[11:18:18] [PASSED] 0xE223 (BATTLEMAGE)
[11:18:18] [PASSED] 0xB080 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB081 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB082 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB083 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB084 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB085 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB086 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB087 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB08F (PANTHERLAKE)
[11:18:18] [PASSED] 0xB090 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB0A0 (PANTHERLAKE)
[11:18:18] [PASSED] 0xB0B0 (PANTHERLAKE)
[11:18:18] [PASSED] 0xFD80 (PANTHERLAKE)
[11:18:18] [PASSED] 0xFD81 (PANTHERLAKE)
[11:18:18] ============= [PASSED] check_platform_gt_count =============
[11:18:18] ===================== [PASSED] xe_pci ======================
[11:18:18] =================== xe_rtp (2 subtests) ====================
[11:18:18] =============== xe_rtp_process_to_sr_tests  ================
[11:18:18] [PASSED] coalesce-same-reg
[11:18:18] [PASSED] no-match-no-add
[11:18:18] [PASSED] match-or
[11:18:18] [PASSED] match-or-xfail
[11:18:18] [PASSED] no-match-no-add-multiple-rules
[11:18:18] [PASSED] two-regs-two-entries
[11:18:18] [PASSED] clr-one-set-other
[11:18:18] [PASSED] set-field
[11:18:18] [PASSED] conflict-duplicate
[11:18:18] [PASSED] conflict-not-disjoint
[11:18:18] [PASSED] conflict-reg-type
[11:18:18] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[11:18:18] ================== xe_rtp_process_tests  ===================
[11:18:18] [PASSED] active1
[11:18:18] [PASSED] active2
[11:18:18] [PASSED] active-inactive
[11:18:18] [PASSED] inactive-active
[11:18:18] [PASSED] inactive-1st_or_active-inactive
[11:18:18] [PASSED] inactive-2nd_or_active-inactive
[11:18:18] [PASSED] inactive-last_or_active-inactive
[11:18:18] [PASSED] inactive-no_or_active-inactive
[11:18:18] ============== [PASSED] xe_rtp_process_tests ===============
[11:18:18] ===================== [PASSED] xe_rtp ======================
[11:18:18] ==================== xe_wa (1 subtest) =====================
[11:18:18] ======================== xe_wa_gt  =========================
[11:18:18] [PASSED] TIGERLAKE B0
[11:18:18] [PASSED] DG1 A0
[11:18:18] [PASSED] DG1 B0
[11:18:18] [PASSED] ALDERLAKE_S A0
[11:18:18] [PASSED] ALDERLAKE_S B0
[11:18:18] [PASSED] ALDERLAKE_S C0
[11:18:18] [PASSED] ALDERLAKE_S D0
[11:18:18] [PASSED] ALDERLAKE_P A0
[11:18:18] [PASSED] ALDERLAKE_P B0
[11:18:18] [PASSED] ALDERLAKE_P C0
[11:18:18] [PASSED] ALDERLAKE_S RPLS D0
[11:18:18] [PASSED] ALDERLAKE_P RPLU E0
[11:18:18] [PASSED] DG2 G10 C0
[11:18:18] [PASSED] DG2 G11 B1
[11:18:18] [PASSED] DG2 G12 A1
[11:18:18] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[11:18:18] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[11:18:18] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[11:18:18] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[11:18:18] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[11:18:18] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[11:18:18] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[11:18:18] ==================== [PASSED] xe_wa_gt =====================
[11:18:18] ====================== [PASSED] xe_wa ======================
[11:18:18] ============================================================
[11:18:18] Testing complete. Ran 306 tests: passed: 288, skipped: 18
[11:18:18] Elapsed time: 33.802s total, 4.306s configuring, 29.127s building, 0.323s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[11:18:19] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[11:18:20] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[11:18:44] Starting KUnit Kernel (1/1)...
[11:18:44] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:18:44] ============ drm_test_pick_cmdline (2 subtests) ============
[11:18:44] [PASSED] drm_test_pick_cmdline_res_1920_1080_60
[11:18:44] =============== drm_test_pick_cmdline_named  ===============
[11:18:44] [PASSED] NTSC
[11:18:44] [PASSED] NTSC-J
[11:18:44] [PASSED] PAL
[11:18:44] [PASSED] PAL-M
[11:18:44] =========== [PASSED] drm_test_pick_cmdline_named ===========
[11:18:44] ============== [PASSED] drm_test_pick_cmdline ==============
[11:18:44] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[11:18:44] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[11:18:44] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[11:18:44] =========== drm_validate_clone_mode (2 subtests) ===========
[11:18:44] ============== drm_test_check_in_clone_mode  ===============
[11:18:44] [PASSED] in_clone_mode
[11:18:44] [PASSED] not_in_clone_mode
[11:18:44] ========== [PASSED] drm_test_check_in_clone_mode ===========
[11:18:44] =============== drm_test_check_valid_clones  ===============
[11:18:44] [PASSED] not_in_clone_mode
[11:18:44] [PASSED] valid_clone
[11:18:44] [PASSED] invalid_clone
[11:18:44] =========== [PASSED] drm_test_check_valid_clones ===========
[11:18:44] ============= [PASSED] drm_validate_clone_mode =============
[11:18:44] ============= drm_validate_modeset (1 subtest) =============
[11:18:44] [PASSED] drm_test_check_connector_changed_modeset
[11:18:44] ============== [PASSED] drm_validate_modeset ===============
[11:18:44] ====== drm_test_bridge_get_current_state (2 subtests) ======
[11:18:44] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[11:18:44] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[11:18:44] ======== [PASSED] drm_test_bridge_get_current_state ========
[11:18:44] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[11:18:44] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[11:18:44] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[11:18:44] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[11:18:44] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[11:18:44] ============== drm_bridge_alloc (2 subtests) ===============
[11:18:44] [PASSED] drm_test_drm_bridge_alloc_basic
[11:18:44] [PASSED] drm_test_drm_bridge_alloc_get_put
[11:18:44] ================ [PASSED] drm_bridge_alloc =================
[11:18:44] ================== drm_buddy (7 subtests) ==================
[11:18:44] [PASSED] drm_test_buddy_alloc_limit
[11:18:44] [PASSED] drm_test_buddy_alloc_optimistic
[11:18:44] [PASSED] drm_test_buddy_alloc_pessimistic
[11:18:44] [PASSED] drm_test_buddy_alloc_pathological
[11:18:44] [PASSED] drm_test_buddy_alloc_contiguous
[11:18:44] [PASSED] drm_test_buddy_alloc_clear
[11:18:44] [PASSED] drm_test_buddy_alloc_range_bias
[11:18:44] ==================== [PASSED] drm_buddy ====================
[11:18:44] ============= drm_cmdline_parser (40 subtests) =============
[11:18:44] [PASSED] drm_test_cmdline_force_d_only
[11:18:44] [PASSED] drm_test_cmdline_force_D_only_dvi
[11:18:44] [PASSED] drm_test_cmdline_force_D_only_hdmi
[11:18:44] [PASSED] drm_test_cmdline_force_D_only_not_digital
[11:18:44] [PASSED] drm_test_cmdline_force_e_only
[11:18:44] [PASSED] drm_test_cmdline_res
[11:18:44] [PASSED] drm_test_cmdline_res_vesa
[11:18:44] [PASSED] drm_test_cmdline_res_vesa_rblank
[11:18:44] [PASSED] drm_test_cmdline_res_rblank
[11:18:44] [PASSED] drm_test_cmdline_res_bpp
[11:18:44] [PASSED] drm_test_cmdline_res_refresh
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[11:18:44] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[11:18:44] [PASSED] drm_test_cmdline_res_margins_force_on
[11:18:44] [PASSED] drm_test_cmdline_res_vesa_margins
[11:18:44] [PASSED] drm_test_cmdline_name
[11:18:44] [PASSED] drm_test_cmdline_name_bpp
[11:18:44] [PASSED] drm_test_cmdline_name_option
[11:18:44] [PASSED] drm_test_cmdline_name_bpp_option
[11:18:44] [PASSED] drm_test_cmdline_rotate_0
[11:18:44] [PASSED] drm_test_cmdline_rotate_90
[11:18:44] [PASSED] drm_test_cmdline_rotate_180
[11:18:44] [PASSED] drm_test_cmdline_rotate_270
[11:18:44] [PASSED] drm_test_cmdline_hmirror
[11:18:44] [PASSED] drm_test_cmdline_vmirror
[11:18:44] [PASSED] drm_test_cmdline_margin_options
[11:18:44] [PASSED] drm_test_cmdline_multiple_options
[11:18:44] [PASSED] drm_test_cmdline_bpp_extra_and_option
[11:18:44] [PASSED] drm_test_cmdline_extra_and_option
[11:18:44] [PASSED] drm_test_cmdline_freestanding_options
[11:18:44] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[11:18:44] [PASSED] drm_test_cmdline_panel_orientation
[11:18:44] ================ drm_test_cmdline_invalid  =================
[11:18:44] [PASSED] margin_only
[11:18:44] [PASSED] interlace_only
[11:18:44] [PASSED] res_missing_x
[11:18:44] [PASSED] res_missing_y
[11:18:44] [PASSED] res_bad_y
[11:18:44] [PASSED] res_missing_y_bpp
[11:18:44] [PASSED] res_bad_bpp
[11:18:44] [PASSED] res_bad_refresh
[11:18:44] [PASSED] res_bpp_refresh_force_on_off
[11:18:44] [PASSED] res_invalid_mode
[11:18:44] [PASSED] res_bpp_wrong_place_mode
[11:18:44] [PASSED] name_bpp_refresh
[11:18:44] [PASSED] name_refresh
[11:18:44] [PASSED] name_refresh_wrong_mode
[11:18:44] [PASSED] name_refresh_invalid_mode
[11:18:44] [PASSED] rotate_multiple
[11:18:44] [PASSED] rotate_invalid_val
[11:18:44] [PASSED] rotate_truncated
[11:18:44] [PASSED] invalid_option
[11:18:44] [PASSED] invalid_tv_option
[11:18:44] [PASSED] truncated_tv_option
[11:18:44] ============ [PASSED] drm_test_cmdline_invalid =============
[11:18:44] =============== drm_test_cmdline_tv_options  ===============
[11:18:44] [PASSED] NTSC
[11:18:44] [PASSED] NTSC_443
[11:18:44] [PASSED] NTSC_J
[11:18:44] [PASSED] PAL
[11:18:44] [PASSED] PAL_M
[11:18:44] [PASSED] PAL_N
[11:18:44] [PASSED] SECAM
[11:18:44] [PASSED] MONO_525
[11:18:44] [PASSED] MONO_625
[11:18:44] =========== [PASSED] drm_test_cmdline_tv_options ===========
[11:18:44] =============== [PASSED] drm_cmdline_parser ================
[11:18:44] ========== drmm_connector_hdmi_init (20 subtests) ==========
[11:18:44] [PASSED] drm_test_connector_hdmi_init_valid
[11:18:44] [PASSED] drm_test_connector_hdmi_init_bpc_8
[11:18:44] [PASSED] drm_test_connector_hdmi_init_bpc_10
[11:18:44] [PASSED] drm_test_connector_hdmi_init_bpc_12
[11:18:44] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[11:18:44] [PASSED] drm_test_connector_hdmi_init_bpc_null
[11:18:44] [PASSED] drm_test_connector_hdmi_init_formats_empty
[11:18:44] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[11:18:44] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[11:18:44] [PASSED] supported_formats=0x9 yuv420_allowed=1
[11:18:44] [PASSED] supported_formats=0x9 yuv420_allowed=0
[11:18:44] [PASSED] supported_formats=0x3 yuv420_allowed=1
[11:18:44] [PASSED] supported_formats=0x3 yuv420_allowed=0
[11:18:44] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[11:18:44] [PASSED] drm_test_connector_hdmi_init_null_ddc
[11:18:44] [PASSED] drm_test_connector_hdmi_init_null_product
[11:18:44] [PASSED] drm_test_connector_hdmi_init_null_vendor
[11:18:44] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[11:18:44] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[11:18:44] [PASSED] drm_test_connector_hdmi_init_product_valid
[11:18:44] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[11:18:44] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[11:18:44] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[11:18:44] ========= drm_test_connector_hdmi_init_type_valid  =========
[11:18:44] [PASSED] HDMI-A
[11:18:44] [PASSED] HDMI-B
[11:18:44] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[11:18:44] ======== drm_test_connector_hdmi_init_type_invalid  ========
[11:18:44] [PASSED] Unknown
[11:18:44] [PASSED] VGA
[11:18:44] [PASSED] DVI-I
[11:18:44] [PASSED] DVI-D
[11:18:44] [PASSED] DVI-A
[11:18:44] [PASSED] Composite
[11:18:44] [PASSED] SVIDEO
[11:18:44] [PASSED] LVDS
[11:18:44] [PASSED] Component
[11:18:44] [PASSED] DIN
[11:18:44] [PASSED] DP
[11:18:44] [PASSED] TV
[11:18:44] [PASSED] eDP
[11:18:44] [PASSED] Virtual
[11:18:44] [PASSED] DSI
[11:18:44] [PASSED] DPI
[11:18:44] [PASSED] Writeback
[11:18:44] [PASSED] SPI
[11:18:44] [PASSED] USB
[11:18:44] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[11:18:44] ============ [PASSED] drmm_connector_hdmi_init =============
[11:18:44] ============= drmm_connector_init (3 subtests) =============
[11:18:44] [PASSED] drm_test_drmm_connector_init
[11:18:44] [PASSED] drm_test_drmm_connector_init_null_ddc
[11:18:44] ========= drm_test_drmm_connector_init_type_valid  =========
[11:18:44] [PASSED] Unknown
[11:18:44] [PASSED] VGA
[11:18:44] [PASSED] DVI-I
[11:18:44] [PASSED] DVI-D
[11:18:44] [PASSED] DVI-A
[11:18:44] [PASSED] Composite
[11:18:44] [PASSED] SVIDEO
[11:18:44] [PASSED] LVDS
[11:18:44] [PASSED] Component
[11:18:44] [PASSED] DIN
[11:18:44] [PASSED] DP
[11:18:44] [PASSED] HDMI-A
[11:18:44] [PASSED] HDMI-B
[11:18:44] [PASSED] TV
[11:18:44] [PASSED] eDP
[11:18:44] [PASSED] Virtual
[11:18:44] [PASSED] DSI
[11:18:44] [PASSED] DPI
[11:18:44] [PASSED] Writeback
[11:18:44] [PASSED] SPI
[11:18:44] [PASSED] USB
[11:18:44] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[11:18:44] =============== [PASSED] drmm_connector_init ===============
[11:18:44] ========= drm_connector_dynamic_init (6 subtests) ==========
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_init
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_init_properties
[11:18:44] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[11:18:44] [PASSED] Unknown
[11:18:44] [PASSED] VGA
[11:18:44] [PASSED] DVI-I
[11:18:44] [PASSED] DVI-D
[11:18:44] [PASSED] DVI-A
[11:18:44] [PASSED] Composite
[11:18:44] [PASSED] SVIDEO
[11:18:44] [PASSED] LVDS
[11:18:44] [PASSED] Component
[11:18:44] [PASSED] DIN
[11:18:44] [PASSED] DP
[11:18:44] [PASSED] HDMI-A
[11:18:44] [PASSED] HDMI-B
[11:18:44] [PASSED] TV
[11:18:44] [PASSED] eDP
[11:18:44] [PASSED] Virtual
[11:18:44] [PASSED] DSI
[11:18:44] [PASSED] DPI
[11:18:44] [PASSED] Writeback
[11:18:44] [PASSED] SPI
[11:18:44] [PASSED] USB
[11:18:44] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[11:18:44] ======== drm_test_drm_connector_dynamic_init_name  =========
[11:18:44] [PASSED] Unknown
[11:18:44] [PASSED] VGA
[11:18:44] [PASSED] DVI-I
[11:18:44] [PASSED] DVI-D
[11:18:44] [PASSED] DVI-A
[11:18:44] [PASSED] Composite
[11:18:44] [PASSED] SVIDEO
[11:18:44] [PASSED] LVDS
[11:18:44] [PASSED] Component
[11:18:44] [PASSED] DIN
[11:18:44] [PASSED] DP
[11:18:44] [PASSED] HDMI-A
[11:18:44] [PASSED] HDMI-B
[11:18:44] [PASSED] TV
[11:18:44] [PASSED] eDP
[11:18:44] [PASSED] Virtual
[11:18:44] [PASSED] DSI
[11:18:44] [PASSED] DPI
[11:18:44] [PASSED] Writeback
[11:18:44] [PASSED] SPI
[11:18:44] [PASSED] USB
[11:18:44] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[11:18:44] =========== [PASSED] drm_connector_dynamic_init ============
[11:18:44] ==== drm_connector_dynamic_register_early (4 subtests) =====
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[11:18:44] ====== [PASSED] drm_connector_dynamic_register_early =======
[11:18:44] ======= drm_connector_dynamic_register (7 subtests) ========
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[11:18:44] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[11:18:44] ========= [PASSED] drm_connector_dynamic_register ==========
[11:18:44] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[11:18:44] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[11:18:44] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[11:18:44] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[11:18:44] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[11:18:44] ========== drm_test_get_tv_mode_from_name_valid  ===========
[11:18:44] [PASSED] NTSC
[11:18:44] [PASSED] NTSC-443
[11:18:44] [PASSED] NTSC-J
[11:18:44] [PASSED] PAL
[11:18:44] [PASSED] PAL-M
[11:18:44] [PASSED] PAL-N
[11:18:44] [PASSED] SECAM
[11:18:44] [PASSED] Mono
[11:18:44] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[11:18:44] [PASSED] drm_test_get_tv_mode_from_name_truncated
[11:18:44] ============ [PASSED] drm_get_tv_mode_from_name ============
[11:18:44] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[11:18:44] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[11:18:44] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[11:18:44] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[11:18:44] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[11:18:44] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[11:18:44] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[11:18:44] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[11:18:44] [PASSED] VIC 96
[11:18:44] [PASSED] VIC 97
[11:18:44] [PASSED] VIC 101
[11:18:44] [PASSED] VIC 102
[11:18:44] [PASSED] VIC 106
[11:18:44] [PASSED] VIC 107
[11:18:44] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[11:18:44] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[11:18:44] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[11:18:44] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[11:18:44] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[11:18:44] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[11:18:44] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[11:18:44] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[11:18:44] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[11:18:44] [PASSED] Automatic
[11:18:44] [PASSED] Full
[11:18:44] [PASSED] Limited 16:235
[11:18:44] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[11:18:44] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[11:18:44] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[11:18:44] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[11:18:44] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[11:18:44] [PASSED] RGB
[11:18:44] [PASSED] YUV 4:2:0
[11:18:44] [PASSED] YUV 4:2:2
[11:18:44] [PASSED] YUV 4:4:4
[11:18:44] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[11:18:44] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[11:18:44] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[11:18:44] ============= drm_damage_helper (21 subtests) ==============
[11:18:44] [PASSED] drm_test_damage_iter_no_damage
[11:18:44] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[11:18:44] [PASSED] drm_test_damage_iter_no_damage_src_moved
[11:18:44] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[11:18:44] [PASSED] drm_test_damage_iter_no_damage_not_visible
[11:18:44] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[11:18:44] [PASSED] drm_test_damage_iter_no_damage_no_fb
[11:18:44] [PASSED] drm_test_damage_iter_simple_damage
[11:18:44] [PASSED] drm_test_damage_iter_single_damage
[11:18:44] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[11:18:44] [PASSED] drm_test_damage_iter_single_damage_outside_src
[11:18:44] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[11:18:44] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[11:18:44] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[11:18:44] [PASSED] drm_test_damage_iter_single_damage_src_moved
[11:18:44] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[11:18:44] [PASSED] drm_test_damage_iter_damage
[11:18:44] [PASSED] drm_test_damage_iter_damage_one_intersect
[11:18:44] [PASSED] drm_test_damage_iter_damage_one_outside
[11:18:44] [PASSED] drm_test_damage_iter_damage_src_moved
[11:18:44] [PASSED] drm_test_damage_iter_damage_not_visible
[11:18:44] ================ [PASSED] drm_damage_helper ================
[11:18:44] ============== drm_dp_mst_helper (3 subtests) ==============
[11:18:44] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[11:18:44] [PASSED] Clock 154000 BPP 30 DSC disabled
[11:18:44] [PASSED] Clock 234000 BPP 30 DSC disabled
[11:18:44] [PASSED] Clock 297000 BPP 24 DSC disabled
[11:18:44] [PASSED] Clock 332880 BPP 24 DSC enabled
[11:18:44] [PASSED] Clock 324540 BPP 24 DSC enabled
[11:18:44] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[11:18:44] ============== drm_test_dp_mst_calc_pbn_div  ===============
[11:18:44] [PASSED] Link rate 2000000 lane count 4
[11:18:44] [PASSED] Link rate 2000000 lane count 2
[11:18:44] [PASSED] Link rate 2000000 lane count 1
[11:18:44] [PASSED] Link rate 1350000 lane count 4
[11:18:44] [PASSED] Link rate 1350000 lane count 2
[11:18:44] [PASSED] Link rate 1350000 lane count 1
[11:18:44] [PASSED] Link rate 1000000 lane count 4
[11:18:44] [PASSED] Link rate 1000000 lane count 2
[11:18:44] [PASSED] Link rate 1000000 lane count 1
[11:18:44] [PASSED] Link rate 810000 lane count 4
[11:18:44] [PASSED] Link rate 810000 lane count 2
[11:18:44] [PASSED] Link rate 810000 lane count 1
[11:18:44] [PASSED] Link rate 540000 lane count 4
[11:18:44] [PASSED] Link rate 540000 lane count 2
[11:18:44] [PASSED] Link rate 540000 lane count 1
[11:18:44] [PASSED] Link rate 270000 lane count 4
[11:18:44] [PASSED] Link rate 270000 lane count 2
[11:18:44] [PASSED] Link rate 270000 lane count 1
[11:18:44] [PASSED] Link rate 162000 lane count 4
[11:18:44] [PASSED] Link rate 162000 lane count 2
[11:18:44] [PASSED] Link rate 162000 lane count 1
[11:18:44] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[11:18:44] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[11:18:44] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[11:18:44] [PASSED] DP_POWER_UP_PHY with port number
[11:18:44] [PASSED] DP_POWER_DOWN_PHY with port number
[11:18:44] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[11:18:44] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[11:18:44] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[11:18:44] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[11:18:44] [PASSED] DP_QUERY_PAYLOAD with port number
[11:18:44] [PASSED] DP_QUERY_PAYLOAD with VCPI
[11:18:44] [PASSED] DP_REMOTE_DPCD_READ with port number
[11:18:44] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[11:18:44] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[11:18:44] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[11:18:44] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[11:18:44] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[11:18:44] [PASSED] DP_REMOTE_I2C_READ with port number
[11:18:44] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[11:18:44] [PASSED] DP_REMOTE_I2C_READ with transactions array
[11:18:44] [PASSED] DP_REMOTE_I2C_WRITE with port number
[11:18:44] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[11:18:44] [PASSED] DP_REMOTE_I2C_WRITE with data array
[11:18:44] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[11:18:44] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[11:18:44] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[11:18:44] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[11:18:44] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[11:18:44] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[11:18:44] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[11:18:44] ================ [PASSED] drm_dp_mst_helper ================
[11:18:44] ================== drm_exec (7 subtests) ===================
[11:18:44] [PASSED] sanitycheck
[11:18:44] [PASSED] test_lock
[11:18:44] [PASSED] test_lock_unlock
[11:18:44] [PASSED] test_duplicates
[11:18:44] [PASSED] test_prepare
[11:18:44] [PASSED] test_prepare_array
[11:18:44] [PASSED] test_multiple_loops
[11:18:44] ==================== [PASSED] drm_exec =====================
[11:18:44] =========== drm_format_helper_test (17 subtests) ===========
[11:18:44] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[11:18:44] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[11:18:44] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[11:18:44] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[11:18:44] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[11:18:44] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[11:18:44] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[11:18:44] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[11:18:44] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[11:18:44] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[11:18:44] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[11:18:44] ============== drm_test_fb_xrgb8888_to_mono  ===============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[11:18:44] ==================== drm_test_fb_swab  =====================
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ================ [PASSED] drm_test_fb_swab =================
[11:18:44] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[11:18:44] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[11:18:44] [PASSED] single_pixel_source_buffer
[11:18:44] [PASSED] single_pixel_clip_rectangle
[11:18:44] [PASSED] well_known_colors
[11:18:44] [PASSED] destination_pitch
[11:18:44] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[11:18:44] ================= drm_test_fb_clip_offset  =================
[11:18:44] [PASSED] pass through
[11:18:44] [PASSED] horizontal offset
[11:18:44] [PASSED] vertical offset
[11:18:44] [PASSED] horizontal and vertical offset
[11:18:44] [PASSED] horizontal offset (custom pitch)
[11:18:44] [PASSED] vertical offset (custom pitch)
[11:18:44] [PASSED] horizontal and vertical offset (custom pitch)
[11:18:44] ============= [PASSED] drm_test_fb_clip_offset =============
[11:18:44] =================== drm_test_fb_memcpy  ====================
[11:18:44] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[11:18:44] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[11:18:44] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[11:18:44] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[11:18:44] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[11:18:44] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[11:18:44] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[11:18:44] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[11:18:44] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[11:18:44] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[11:18:44] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[11:18:44] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[11:18:44] =============== [PASSED] drm_test_fb_memcpy ================
[11:18:44] ============= [PASSED] drm_format_helper_test ==============
[11:18:44] ================= drm_format (18 subtests) =================
[11:18:44] [PASSED] drm_test_format_block_width_invalid
[11:18:44] [PASSED] drm_test_format_block_width_one_plane
[11:18:44] [PASSED] drm_test_format_block_width_two_plane
[11:18:44] [PASSED] drm_test_format_block_width_three_plane
[11:18:44] [PASSED] drm_test_format_block_width_tiled
[11:18:44] [PASSED] drm_test_format_block_height_invalid
[11:18:44] [PASSED] drm_test_format_block_height_one_plane
[11:18:44] [PASSED] drm_test_format_block_height_two_plane
[11:18:44] [PASSED] drm_test_format_block_height_three_plane
[11:18:44] [PASSED] drm_test_format_block_height_tiled
[11:18:44] [PASSED] drm_test_format_min_pitch_invalid
[11:18:44] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[11:18:44] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[11:18:44] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[11:18:44] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[11:18:44] [PASSED] drm_test_format_min_pitch_two_plane
[11:18:44] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[11:18:44] [PASSED] drm_test_format_min_pitch_tiled
[11:18:44] =================== [PASSED] drm_format ====================
[11:18:44] ============== drm_framebuffer (10 subtests) ===============
[11:18:44] ========== drm_test_framebuffer_check_src_coords  ==========
[11:18:44] [PASSED] Success: source fits into fb
[11:18:44] [PASSED] Fail: overflowing fb with x-axis coordinate
[11:18:44] [PASSED] Fail: overflowing fb with y-axis coordinate
[11:18:44] [PASSED] Fail: overflowing fb with source width
[11:18:44] [PASSED] Fail: overflowing fb with source height
[11:18:44] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[11:18:44] [PASSED] drm_test_framebuffer_cleanup
[11:18:44] =============== drm_test_framebuffer_create  ===============
[11:18:44] [PASSED] ABGR8888 normal sizes
[11:18:44] [PASSED] ABGR8888 max sizes
[11:18:44] [PASSED] ABGR8888 pitch greater than min required
[11:18:44] [PASSED] ABGR8888 pitch less than min required
[11:18:44] [PASSED] ABGR8888 Invalid width
[11:18:44] [PASSED] ABGR8888 Invalid buffer handle
[11:18:44] [PASSED] No pixel format
[11:18:44] [PASSED] ABGR8888 Width 0
[11:18:44] [PASSED] ABGR8888 Height 0
[11:18:44] [PASSED] ABGR8888 Out of bound height * pitch combination
[11:18:44] [PASSED] ABGR8888 Large buffer offset
[11:18:44] [PASSED] ABGR8888 Buffer offset for inexistent plane
[11:18:44] [PASSED] ABGR8888 Invalid flag
[11:18:44] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[11:18:44] [PASSED] ABGR8888 Valid buffer modifier
[11:18:44] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[11:18:44] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[11:18:44] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[11:18:44] [PASSED] NV12 Normal sizes
[11:18:44] [PASSED] NV12 Max sizes
[11:18:44] [PASSED] NV12 Invalid pitch
[11:18:44] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[11:18:44] [PASSED] NV12 different  modifier per-plane
[11:18:44] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[11:18:44] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[11:18:44] [PASSED] NV12 Modifier for inexistent plane
[11:18:44] [PASSED] NV12 Handle for inexistent plane
[11:18:44] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[11:18:44] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[11:18:44] [PASSED] YVU420 Normal sizes
[11:18:44] [PASSED] YVU420 Max sizes
[11:18:44] [PASSED] YVU420 Invalid pitch
[11:18:44] [PASSED] YVU420 Different pitches
[11:18:44] [PASSED] YVU420 Different buffer offsets/pitches
[11:18:44] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[11:18:44] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[11:18:44] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[11:18:44] [PASSED] YVU420 Valid modifier
[11:18:44] [PASSED] YVU420 Different modifiers per plane
[11:18:44] [PASSED] YVU420 Modifier for inexistent plane
[11:18:44] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[11:18:44] [PASSED] X0L2 Normal sizes
[11:18:44] [PASSED] X0L2 Max sizes
[11:18:44] [PASSED] X0L2 Invalid pitch
[11:18:44] [PASSED] X0L2 Pitch greater than minimum required
[11:18:44] [PASSED] X0L2 Handle for inexistent plane
[11:18:44] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[11:18:44] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[11:18:44] [PASSED] X0L2 Valid modifier
[11:18:44] [PASSED] X0L2 Modifier for inexistent plane
[11:18:44] =========== [PASSED] drm_test_framebuffer_create ===========
[11:18:44] [PASSED] drm_test_framebuffer_free
[11:18:44] [PASSED] drm_test_framebuffer_init
[11:18:44] [PASSED] drm_test_framebuffer_init_bad_format
[11:18:44] [PASSED] drm_test_framebuffer_init_dev_mismatch
[11:18:44] [PASSED] drm_test_framebuffer_lookup
[11:18:44] [PASSED] drm_test_framebuffer_lookup_inexistent
[11:18:44] [PASSED] drm_test_framebuffer_modifiers_not_supported
[11:18:44] ================= [PASSED] drm_framebuffer =================
[11:18:44] ================ drm_gem_shmem (8 subtests) ================
[11:18:44] [PASSED] drm_gem_shmem_test_obj_create
[11:18:44] [PASSED] drm_gem_shmem_test_obj_create_private
[11:18:44] [PASSED] drm_gem_shmem_test_pin_pages
[11:18:44] [PASSED] drm_gem_shmem_test_vmap
[11:18:44] [PASSED] drm_gem_shmem_test_get_pages_sgt
[11:18:44] [PASSED] drm_gem_shmem_test_get_sg_table
[11:18:44] [PASSED] drm_gem_shmem_test_madvise
[11:18:44] [PASSED] drm_gem_shmem_test_purge
[11:18:44] ================== [PASSED] drm_gem_shmem ==================
[11:18:44] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[11:18:44] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[11:18:44] [PASSED] Automatic
[11:18:44] [PASSED] Full
[11:18:44] [PASSED] Limited 16:235
[11:18:44] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[11:18:44] [PASSED] drm_test_check_disable_connector
[11:18:44] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[11:18:44] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[11:18:44] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[11:18:44] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[11:18:44] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[11:18:44] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[11:18:44] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[11:18:44] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[11:18:44] [PASSED] drm_test_check_output_bpc_dvi
[11:18:44] [PASSED] drm_test_check_output_bpc_format_vic_1
[11:18:44] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[11:18:44] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[11:18:44] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[11:18:44] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[11:18:44] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[11:18:44] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[11:18:44] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[11:18:44] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[11:18:44] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[11:18:44] [PASSED] drm_test_check_broadcast_rgb_value
[11:18:44] [PASSED] drm_test_check_bpc_8_value
[11:18:44] [PASSED] drm_test_check_bpc_10_value
[11:18:44] [PASSED] drm_test_check_bpc_12_value
[11:18:44] [PASSED] drm_test_check_format_value
[11:18:44] [PASSED] drm_test_check_tmds_char_value
[11:18:44] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[11:18:44] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[11:18:44] [PASSED] drm_test_check_mode_valid
[11:18:44] [PASSED] drm_test_check_mode_valid_reject
[11:18:44] [PASSED] drm_test_check_mode_valid_reject_rate
[11:18:44] [PASSED] drm_test_check_mode_valid_reject_max_clock
[11:18:44] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[11:18:44] ================= drm_managed (2 subtests) =================
[11:18:44] [PASSED] drm_test_managed_release_action
[11:18:44] [PASSED] drm_test_managed_run_action
[11:18:44] =================== [PASSED] drm_managed ===================
[11:18:44] =================== drm_mm (6 subtests) ====================
[11:18:44] [PASSED] drm_test_mm_init
[11:18:44] [PASSED] drm_test_mm_debug
[11:18:44] [PASSED] drm_test_mm_align32
[11:18:44] [PASSED] drm_test_mm_align64
[11:18:44] [PASSED] drm_test_mm_lowest
[11:18:44] [PASSED] drm_test_mm_highest
[11:18:44] ===================== [PASSED] drm_mm ======================
[11:18:44] ============= drm_modes_analog_tv (5 subtests) =============
[11:18:44] [PASSED] drm_test_modes_analog_tv_mono_576i
[11:18:44] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[11:18:44] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[11:18:44] [PASSED] drm_test_modes_analog_tv_pal_576i
[11:18:44] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[11:18:44] =============== [PASSED] drm_modes_analog_tv ===============
[11:18:44] ============== drm_plane_helper (2 subtests) ===============
[11:18:44] =============== drm_test_check_plane_state  ================
[11:18:44] [PASSED] clipping_simple
[11:18:44] [PASSED] clipping_rotate_reflect
[11:18:44] [PASSED] positioning_simple
[11:18:44] [PASSED] upscaling
[11:18:44] [PASSED] downscaling
[11:18:44] [PASSED] rounding1
[11:18:44] [PASSED] rounding2
[11:18:44] [PASSED] rounding3
[11:18:44] [PASSED] rounding4
[11:18:44] =========== [PASSED] drm_test_check_plane_state ============
[11:18:44] =========== drm_test_check_invalid_plane_state  ============
[11:18:44] [PASSED] positioning_invalid
[11:18:44] [PASSED] upscaling_invalid
[11:18:44] [PASSED] downscaling_invalid
[11:18:44] ======= [PASSED] drm_test_check_invalid_plane_state ========
[11:18:44] ================ [PASSED] drm_plane_helper =================
[11:18:44] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[11:18:44] ====== drm_test_connector_helper_tv_get_modes_check  =======
[11:18:44] [PASSED] None
[11:18:44] [PASSED] PAL
[11:18:44] [PASSED] NTSC
[11:18:44] [PASSED] Both, NTSC Default
[11:18:44] [PASSED] Both, PAL Default
[11:18:44] [PASSED] Both, NTSC Default, with PAL on command-line
[11:18:44] [PASSED] Both, PAL Default, with NTSC on command-line
[11:18:44] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[11:18:44] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[11:18:44] ================== drm_rect (9 subtests) ===================
[11:18:44] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[11:18:44] [PASSED] drm_test_rect_clip_scaled_not_clipped
[11:18:44] [PASSED] drm_test_rect_clip_scaled_clipped
[11:18:44] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[11:18:44] ================= drm_test_rect_intersect  =================
[11:18:44] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[11:18:44] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[11:18:44] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[11:18:44] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[11:18:44] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[11:18:44] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[11:18:44] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[11:18:44] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[11:18:44] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[11:18:44] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[11:18:44] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[11:18:44] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[11:18:44] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[11:18:44] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[11:18:44] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[11:18:44] ============= [PASSED] drm_test_rect_intersect =============
[11:18:44] ================ drm_test_rect_calc_hscale  ================
[11:18:44] [PASSED] normal use
[11:18:44] [PASSED] out of max range
[11:18:44] [PASSED] out of min range
[11:18:44] [PASSED] zero dst
[11:18:44] [PASSED] negative src
[11:18:44] [PASSED] negative dst
[11:18:44] ============ [PASSED] drm_test_rect_calc_hscale ============
[11:18:44] ================ drm_test_rect_calc_vscale  ================
[11:18:44] [PASSED] normal use
[11:18:44] [PASSED] out of max range
[11:18:44] [PASSED] out of min range
[11:18:44] [PASSED] zero dst
[11:18:44] [PASSED] negative src
[11:18:44] [PASSED] negative dst
[11:18:44] ============ [PASSED] drm_test_rect_calc_vscale ============
[11:18:44] ================== drm_test_rect_rotate  ===================
[11:18:44] [PASSED] reflect-x
[11:18:44] [PASSED] reflect-y
[11:18:44] [PASSED] rotate-0
[11:18:44] [PASSED] rotate-90
[11:18:44] [PASSED] rotate-180
[11:18:44] [PASSED] rotate-270
[11:18:44] ============== [PASSED] drm_test_rect_rotate ===============
[11:18:44] ================ drm_test_rect_rotate_inv  =================
[11:18:44] [PASSED] reflect-x
[11:18:44] [PASSED] reflect-y
[11:18:44] [PASSED] rotate-0
[11:18:44] [PASSED] rotate-90
[11:18:44] [PASSED] rotate-180
[11:18:44] [PASSED] rotate-270
[11:18:44] ============ [PASSED] drm_test_rect_rotate_inv =============
[11:18:44] ==================== [PASSED] drm_rect =====================
[11:18:44] ============ drm_sysfb_modeset_test (1 subtest) ============
[11:18:44] ============ drm_test_sysfb_build_fourcc_list  =============
[11:18:44] [PASSED] no native formats
[11:18:44] [PASSED] XRGB8888 as native format
[11:18:44] [PASSED] remove duplicates
[11:18:44] [PASSED] convert alpha formats
[11:18:44] [PASSED] random formats
[11:18:44] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[11:18:44] ============= [PASSED] drm_sysfb_modeset_test ==============
[11:18:44] ============================================================
[11:18:44] Testing complete. Ran 621 tests: passed: 621
[11:18:44] Elapsed time: 25.437s total, 1.778s configuring, 23.489s building, 0.153s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[11:18:44] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[11:18:46] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[11:18:55] Starting KUnit Kernel (1/1)...
[11:18:55] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:18:55] ================= ttm_device (5 subtests) ==================
[11:18:55] [PASSED] ttm_device_init_basic
[11:18:55] [PASSED] ttm_device_init_multiple
[11:18:55] [PASSED] ttm_device_fini_basic
[11:18:55] [PASSED] ttm_device_init_no_vma_man
[11:18:55] ================== ttm_device_init_pools  ==================
[11:18:55] [PASSED] No DMA allocations, no DMA32 required
[11:18:55] [PASSED] DMA allocations, DMA32 required
[11:18:55] [PASSED] No DMA allocations, DMA32 required
[11:18:55] [PASSED] DMA allocations, no DMA32 required
[11:18:55] ============== [PASSED] ttm_device_init_pools ==============
[11:18:55] =================== [PASSED] ttm_device ====================
[11:18:55] ================== ttm_pool (8 subtests) ===================
[11:18:55] ================== ttm_pool_alloc_basic  ===================
[11:18:55] [PASSED] One page
[11:18:55] [PASSED] More than one page
[11:18:55] [PASSED] Above the allocation limit
[11:18:55] [PASSED] One page, with coherent DMA mappings enabled
[11:18:55] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[11:18:55] ============== [PASSED] ttm_pool_alloc_basic ===============
[11:18:55] ============== ttm_pool_alloc_basic_dma_addr  ==============
[11:18:55] [PASSED] One page
[11:18:55] [PASSED] More than one page
[11:18:55] [PASSED] Above the allocation limit
[11:18:55] [PASSED] One page, with coherent DMA mappings enabled
[11:18:55] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[11:18:55] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[11:18:55] [PASSED] ttm_pool_alloc_order_caching_match
[11:18:55] [PASSED] ttm_pool_alloc_caching_mismatch
[11:18:55] [PASSED] ttm_pool_alloc_order_mismatch
[11:18:55] [PASSED] ttm_pool_free_dma_alloc
[11:18:55] [PASSED] ttm_pool_free_no_dma_alloc
[11:18:55] [PASSED] ttm_pool_fini_basic
[11:18:55] ==================== [PASSED] ttm_pool =====================
[11:18:55] ================ ttm_resource (8 subtests) =================
[11:18:55] ================= ttm_resource_init_basic  =================
[11:18:55] [PASSED] Init resource in TTM_PL_SYSTEM
[11:18:55] [PASSED] Init resource in TTM_PL_VRAM
[11:18:55] [PASSED] Init resource in a private placement
[11:18:55] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[11:18:55] ============= [PASSED] ttm_resource_init_basic =============
[11:18:55] [PASSED] ttm_resource_init_pinned
[11:18:55] [PASSED] ttm_resource_fini_basic
[11:18:55] [PASSED] ttm_resource_manager_init_basic
[11:18:55] [PASSED] ttm_resource_manager_usage_basic
[11:18:55] [PASSED] ttm_resource_manager_set_used_basic
[11:18:55] [PASSED] ttm_sys_man_alloc_basic
[11:18:55] [PASSED] ttm_sys_man_free_basic
[11:18:55] ================== [PASSED] ttm_resource ===================
[11:18:55] =================== ttm_tt (15 subtests) ===================
[11:18:55] ==================== ttm_tt_init_basic  ====================
[11:18:55] [PASSED] Page-aligned size
[11:18:55] [PASSED] Extra pages requested
[11:18:55] ================ [PASSED] ttm_tt_init_basic ================
[11:18:55] [PASSED] ttm_tt_init_misaligned
[11:18:55] [PASSED] ttm_tt_fini_basic
[11:18:55] [PASSED] ttm_tt_fini_sg
[11:18:55] [PASSED] ttm_tt_fini_shmem
[11:18:55] [PASSED] ttm_tt_create_basic
[11:18:55] [PASSED] ttm_tt_create_invalid_bo_type
[11:18:55] [PASSED] ttm_tt_create_ttm_exists
[11:18:55] [PASSED] ttm_tt_create_failed
[11:18:55] [PASSED] ttm_tt_destroy_basic
[11:18:55] [PASSED] ttm_tt_populate_null_ttm
[11:18:55] [PASSED] ttm_tt_populate_populated_ttm
[11:18:55] [PASSED] ttm_tt_unpopulate_basic
[11:18:55] [PASSED] ttm_tt_unpopulate_empty_ttm
[11:18:55] [PASSED] ttm_tt_swapin_basic
[11:18:55] ===================== [PASSED] ttm_tt ======================
[11:18:55] =================== ttm_bo (14 subtests) ===================
[11:18:55] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[11:18:55] [PASSED] Cannot be interrupted and sleeps
[11:18:55] [PASSED] Cannot be interrupted, locks straight away
[11:18:55] [PASSED] Can be interrupted, sleeps
[11:18:55] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[11:18:55] [PASSED] ttm_bo_reserve_locked_no_sleep
[11:18:55] [PASSED] ttm_bo_reserve_no_wait_ticket
[11:18:55] [PASSED] ttm_bo_reserve_double_resv
[11:18:55] [PASSED] ttm_bo_reserve_interrupted
[11:18:55] [PASSED] ttm_bo_reserve_deadlock
[11:18:55] [PASSED] ttm_bo_unreserve_basic
[11:18:55] [PASSED] ttm_bo_unreserve_pinned
[11:18:55] [PASSED] ttm_bo_unreserve_bulk
[11:18:55] [PASSED] ttm_bo_fini_basic
[11:18:55] [PASSED] ttm_bo_fini_shared_resv
[11:18:55] [PASSED] ttm_bo_pin_basic
[11:18:55] [PASSED] ttm_bo_pin_unpin_resource
[11:18:55] [PASSED] ttm_bo_multiple_pin_one_unpin
[11:18:55] ===================== [PASSED] ttm_bo ======================
[11:18:55] ============== ttm_bo_validate (21 subtests) ===============
[11:18:55] ============== ttm_bo_init_reserved_sys_man  ===============
[11:18:55] [PASSED] Buffer object for userspace
[11:18:55] [PASSED] Kernel buffer object
[11:18:55] [PASSED] Shared buffer object
[11:18:55] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[11:18:55] ============== ttm_bo_init_reserved_mock_man  ==============
[11:18:55] [PASSED] Buffer object for userspace
[11:18:55] [PASSED] Kernel buffer object
[11:18:55] [PASSED] Shared buffer object
[11:18:55] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[11:18:55] [PASSED] ttm_bo_init_reserved_resv
[11:18:55] ================== ttm_bo_validate_basic  ==================
[11:18:55] [PASSED] Buffer object for userspace
[11:18:55] [PASSED] Kernel buffer object
[11:18:55] [PASSED] Shared buffer object
[11:18:55] ============== [PASSED] ttm_bo_validate_basic ==============
[11:18:55] [PASSED] ttm_bo_validate_invalid_placement
[11:18:55] ============= ttm_bo_validate_same_placement  ==============
[11:18:55] [PASSED] System manager
[11:18:55] [PASSED] VRAM manager
[11:18:55] ========= [PASSED] ttm_bo_validate_same_placement ==========
[11:18:55] [PASSED] ttm_bo_validate_failed_alloc
[11:18:55] [PASSED] ttm_bo_validate_pinned
[11:18:55] [PASSED] ttm_bo_validate_busy_placement
[11:18:55] ================ ttm_bo_validate_multihop  =================
[11:18:55] [PASSED] Buffer object for userspace
[11:18:55] [PASSED] Kernel buffer object
[11:18:55] [PASSED] Shared buffer object
[11:18:55] ============ [PASSED] ttm_bo_validate_multihop =============
[11:18:55] ========== ttm_bo_validate_no_placement_signaled  ==========
[11:18:55] [PASSED] Buffer object in system domain, no page vector
[11:18:55] [PASSED] Buffer object in system domain with an existing page vector
[11:18:55] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[11:18:55] ======== ttm_bo_validate_no_placement_not_signaled  ========
[11:18:55] [PASSED] Buffer object for userspace
[11:18:55] [PASSED] Kernel buffer object
[11:18:55] [PASSED] Shared buffer object
[11:18:55] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[11:18:55] [PASSED] ttm_bo_validate_move_fence_signaled
[11:18:55] ========= ttm_bo_validate_move_fence_not_signaled  =========
[11:18:55] [PASSED] Waits for GPU
[11:18:55] [PASSED] Tries to lock straight away
[11:18:55] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[11:18:55] [PASSED] ttm_bo_validate_happy_evict
[11:18:55] [PASSED] ttm_bo_validate_all_pinned_evict
[11:18:55] [PASSED] ttm_bo_validate_allowed_only_evict
[11:18:55] [PASSED] ttm_bo_validate_deleted_evict
[11:18:55] [PASSED] ttm_bo_validate_busy_domain_evict
[11:18:55] [PASSED] ttm_bo_validate_evict_gutting
[11:18:55] [PASSED] ttm_bo_validate_recrusive_evict
[11:18:55] ================= [PASSED] ttm_bo_validate =================
[11:18:55] ============================================================
[11:18:55] Testing complete. Ran 101 tests: passed: 101
[11:18:55] Elapsed time: 11.235s total, 1.731s configuring, 9.238s building, 0.226s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 58+ messages in thread

* ✗ Xe.CI.BAT: failure for VF migration redesign (rev6)
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (31 preceding siblings ...)
  2025-10-06 11:18 ` ✓ CI.KUnit: success " Patchwork
@ 2025-10-06 12:24 ` Patchwork
  2025-10-06 14:28 ` ✗ Xe.CI.Full: " Patchwork
  2025-10-07  0:20 ` [PATCH v6 00/30] VF migration redesign Niranjana Vishwanathapura
  34 siblings, 0 replies; 58+ messages in thread
From: Patchwork @ 2025-10-06 12:24 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 2569 bytes --]

== Series Details ==

Series: VF migration redesign (rev6)
URL   : https://patchwork.freedesktop.org/series/154627/
State : failure

== Summary ==

CI Bug Log - changes from xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c_BAT -> xe-pw-154627v6_BAT
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with xe-pw-154627v6_BAT absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in xe-pw-154627v6_BAT, please notify your bug team
  (I915-ci-infra@lists.freedesktop.org) to allow them to document this new
  failure mode, which will reduce false positives in CI.

  

Participating hosts (11 -> 11)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in xe-pw-154627v6_BAT:

### IGT changes ###

#### Possible regressions ####

  * igt@sriov_basic@enable-vfs-autoprobe-on:
    - bat-adlp-7:         [PASS][1] -> [ABORT][2] +1 other test abort
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/bat-adlp-7/igt@sriov_basic@enable-vfs-autoprobe-on.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/bat-adlp-7/igt@sriov_basic@enable-vfs-autoprobe-on.html

  * igt@sriov_basic@enable-vfs-autoprobe-on@numvfs-1:
    - bat-atsm-2:         [PASS][3] -> [ABORT][4] +1 other test abort
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/bat-atsm-2/igt@sriov_basic@enable-vfs-autoprobe-on@numvfs-1.html
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/bat-atsm-2/igt@sriov_basic@enable-vfs-autoprobe-on@numvfs-1.html

  * igt@xe_module_load@load:
    - bat-adlp-vm:        [PASS][5] -> [ABORT][6]
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/bat-adlp-vm/igt@xe_module_load@load.html
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/bat-adlp-vm/igt@xe_module_load@load.html

  


Build changes
-------------

  * Linux: xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c -> xe-pw-154627v6

  IGT_8574: 44a15713124663a622c6eddf7c6ee5ba732e0d41 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c: d03c82f71d60dc1434040ca679c683ab3b1b034c
  xe-pw-154627v6: 154627v6

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/index.html

[-- Attachment #2: Type: text/html, Size: 3227 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 07/30] drm/xe/vf: Add xe_gt_recovery_pending helper
  2025-10-06 11:10 ` [PATCH v6 07/30] drm/xe/vf: Add xe_gt_recovery_pending helper Matthew Brost
@ 2025-10-06 13:10   ` Michal Wajdeczko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Wajdeczko @ 2025-10-06 13:10 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 10/6/2025 1:10 PM, Matthew Brost wrote:
> Add xe_gt_recovery_pending helper.
> 
> This helper serves as the singular point to determine whether a GT
> recovery is currently in progress. Expected callers include the GuC CT
> layer and the GuC submission layer. Atomically visable as soon as vCPU
> are unhalted until VF recovery completes.
> 
> v3:
>  - Add GT layer xe_gt_recovery_inprogress (Michal)
>  - Don't blow up in memirq not enabled (CI)
>  - Add __memirq_received with clear argument (Michal)
>  - xe_memirq_sw_int_0_irq_pending rename (Michal)
>  - Use offset in xe_memirq_sw_int_0_irq_pending (Michal)
> v4:
>  - Refactor xe_gt_recovery_inprogress logic around memirq (Michal)
> v5:
>  - s/inprogress/pending (Michal)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt.h                | 13 ++++++
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 27 +++++++++++++
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
>  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 10 +++++
>  drivers/gpu/drm/xe/xe_memirq.c            | 48 +++++++++++++++++++++--
>  drivers/gpu/drm/xe/xe_memirq.h            |  2 +
>  6 files changed, 98 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> index 41880979f4de..5df2ffe3ff83 100644
> --- a/drivers/gpu/drm/xe/xe_gt.h
> +++ b/drivers/gpu/drm/xe/xe_gt.h
> @@ -12,6 +12,7 @@
>  
>  #include "xe_device.h"
>  #include "xe_device_types.h"
> +#include "xe_gt_sriov_vf.h"
>  #include "xe_hw_engine.h"
>  
>  #define for_each_hw_engine(hwe__, gt__, id__) \
> @@ -124,4 +125,16 @@ static inline bool xe_gt_is_usm_hwe(struct xe_gt *gt, struct xe_hw_engine *hwe)
>  		hwe->instance == gt->usm.reserved_bcs_instance;
>  }
>  
> +/**
> + * xe_gt_recovery_pending() - GT recovery pending
> + * @gt: the &xe_gt
> + *
> + * Return: True if GT recovery in pending, False otherwise
> + */
> +static inline bool xe_gt_recovery_pending(struct xe_gt *gt)
> +{
> +	return IS_SRIOV_VF(gt_to_xe(gt)) &&
> +		xe_gt_sriov_vf_recovery_pending(gt);
> +}
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 0461d5513487..86131ee481dc 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -26,6 +26,7 @@
>  #include "xe_guc_hxg_helpers.h"
>  #include "xe_guc_relay.h"
>  #include "xe_lrc.h"
> +#include "xe_memirq.h"
>  #include "xe_mmio.h"
>  #include "xe_sriov.h"
>  #include "xe_sriov_vf.h"
> @@ -776,6 +777,7 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
>  	struct xe_device *xe = gt_to_xe(gt);
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(xe));
> +	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_pending(gt));
>  
>  	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
>  	/*
> @@ -1118,3 +1120,28 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
>  	drm_printf(p, "\thandshake:\t%u.%u\n",
>  		   pf_version->major, pf_version->minor);
>  }
> +
> +/**
> + * xe_gt_sriov_vf_recovery_pending() - VF post migration recovery pending
> + * @gt: the &xe_gt
> + *
> + * This function's return value must be immediately visable upon vCPU unhalt and
> + * be persisent until RESFIX_DONE is issued. This guarnetee is only coded for

2x typos

> + * platforms which support memirq, if non-memirq platforms support VF migration
> + * this function will need to be updated.
> + *
> + * Return: True if VF post migration recovery in pending, False otherwise

"is pending" ?

> + */
> +bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt)
> +{
> +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
> +
> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> +	/* early detection until recovery starts */
> +	if (xe_device_uses_memirq(gt_to_xe(gt)) &&
> +	    xe_memirq_guc_sw_int_0_irq_pending(memirq, &gt->uc.guc))
> +		return true;
> +
> +	return READ_ONCE(gt->sriov.vf.migration.recovery_inprogress);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index 0af1dc769fe0..b91ae857e983 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -25,6 +25,8 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
>  int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
>  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
>  
> +bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
> +
>  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
>  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 298dedf4b009..1dfef60ec044 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -46,6 +46,14 @@ struct xe_gt_sriov_vf_runtime {
>  	} *regs;
>  };
>  
> +/**
> + * xe_gt_sriov_vf_migration - VF migration data.
> + */
> +struct xe_gt_sriov_vf_migration {
> +	/** @recovery_inprogress: VF post migration recovery in progress */
> +	bool recovery_inprogress;
> +};
> +
>  /**
>   * struct xe_gt_sriov_vf - GT level VF virtualization data.
>   */
> @@ -58,6 +66,8 @@ struct xe_gt_sriov_vf {
>  	struct xe_gt_sriov_vf_selfconfig self_config;
>  	/** @runtime: runtime data retrieved from the PF. */
>  	struct xe_gt_sriov_vf_runtime runtime;
> +	/** @migration: migration data for the VF. */
> +	struct xe_gt_sriov_vf_migration migration;
>  };
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
> index 49c45ec3e83c..56acfdd77266 100644
> --- a/drivers/gpu/drm/xe/xe_memirq.c
> +++ b/drivers/gpu/drm/xe/xe_memirq.c
> @@ -398,8 +398,9 @@ void xe_memirq_postinstall(struct xe_memirq *memirq)
>  		memirq_set_enable(memirq, true);
>  }
>  
> -static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
> -			    u16 offset, const char *name)
> +static bool __memirq_received(struct xe_memirq *memirq,
> +			      struct iosys_map *vector, u16 offset,
> +			      const char *name, bool clear)
>  {
>  	u8 value;
>  
> @@ -409,12 +410,26 @@ static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
>  			memirq_err_ratelimited(memirq,
>  					       "Unexpected memirq value %#x from %s at %u\n",
>  					       value, name, offset);
> -		iosys_map_wr(vector, offset, u8, 0x00);
> +		if (clear)
> +			iosys_map_wr(vector, offset, u8, 0x00);
>  	}
>  
>  	return value;
>  }
>  
> +static bool memirq_received_noclear(struct xe_memirq *memirq,
> +				    struct iosys_map *vector,
> +				    u16 offset, const char *name)
> +{
> +	return __memirq_received(memirq, vector, offset, name, false);
> +}
> +
> +static bool memirq_received(struct xe_memirq *memirq, struct iosys_map *vector,
> +			    u16 offset, const char *name)
> +{
> +	return __memirq_received(memirq, vector, offset, name, true);
> +}
> +
>  static void memirq_dispatch_engine(struct xe_memirq *memirq, struct iosys_map *status,
>  				   struct xe_hw_engine *hwe)
>  {
> @@ -434,8 +449,16 @@ static void memirq_dispatch_guc(struct xe_memirq *memirq, struct iosys_map *stat
>  	if (memirq_received(memirq, status, ilog2(GUC_INTR_GUC2HOST), name))
>  		xe_guc_irq_handler(guc, GUC_INTR_GUC2HOST);
>  
> -	if (memirq_received(memirq, status, ilog2(GUC_INTR_SW_INT_0), name))
> +	/*
> +	 * We must wait to perform the clear operation until after
> +	 * xe_gt_sriov_vf_start_migration_recovery() runs, to avoid race
> +	 * conditions where xe_gt_sriov_vf_recovery_pending() returns false.

nit: we were trying to make memirq changes VF agnostic, but we failed with this comment ;(

also the "avoid race where X returns false" isn't very explanatory

maybe a more general comment could be:

"this is a software interrupt that must be cleared _after_ it's consumed to avoid races"

nit: and since we have 

	xe_memirq_guc_sw_int_0_irq_pending()

maybe an alternative flow could be that clearing is done in an explicit call to:

	xe_memirq_guc_sw_int_0_irq_clear()

so from the memirq API POV it will be visible that sw_int_0 is not cleared in the usual way
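
The pending/clear split suggested above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct fake_status, sw_int_0_pending(), sw_int_0_clear(), handle_sw_int_0() and recovery_pending() are hypothetical stand-ins rather than the memirq API, a plain byte array stands in for the status page, and the slot offset is made up. It also assumes that the interrupt handler is what marks recovery as in progress, which is what the quoted comment implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one memirq status vector: one byte per source. */
struct fake_status {
	uint8_t bytes[16];
};

#define FAKE_SW_INT_0_OFFSET 15	/* assumed slot, for illustration only */

/* Peek at the source without consuming it. */
static bool sw_int_0_pending(const struct fake_status *st)
{
	return st->bytes[FAKE_SW_INT_0_OFFSET] != 0;
}

/* Explicit clear, meant to be called only after the event was consumed. */
static void sw_int_0_clear(struct fake_status *st)
{
	st->bytes[FAKE_SW_INT_0_OFFSET] = 0x00;
}

/* Model of the recovery check: early detection OR the in-progress flag. */
static bool recovery_pending(const struct fake_status *st, bool in_progress)
{
	return sw_int_0_pending(st) || in_progress;
}

/*
 * Dispatch order: run the handler first (here it just sets the flag),
 * then clear the status byte. Because the flag is set before the byte is
 * cleared, recovery_pending() never observes both as false in between.
 */
static bool handle_sw_int_0(struct fake_status *st, bool *in_progress)
{
	if (!sw_int_0_pending(st))
		return false;

	*in_progress = true;	/* handler consumes the event */
	sw_int_0_clear(st);	/* SW interrupt cleared after processing */
	return true;
}
```

With this shape, the fact that sw_int_0 is not cleared by the generic "received" helper is visible in the API itself instead of being hidden behind a boolean argument.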

> +	 */
> +	if (memirq_received_noclear(memirq, status, ilog2(GUC_INTR_SW_INT_0),
> +				    name)) {
>  		xe_guc_irq_handler(guc, GUC_INTR_SW_INT_0);

		/* SW interrupts are cleared _after_ being processed */

> +		iosys_map_wr(status, ilog2(GUC_INTR_SW_INT_0), u8, 0x00);
> +	}
>  }
>  
>  /**
> @@ -460,6 +483,23 @@ void xe_memirq_hwe_handler(struct xe_memirq *memirq, struct xe_hw_engine *hwe)
>  	}
>  }
>  
> +/**
> + * xe_memirq_guc__sw_int_0_irq_pending() - SW_INT_0 IRQ is pending

double mid underscore

> + * @memirq: the &xe_memirq
> + * @guc: the &xe_guc to check for IRQ
> + *
> + * Return: True if SW_INT_0 IRQ is pending on @guc, False otherwise
> + */
> +bool xe_memirq_guc_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc)
> +{
> +	struct xe_gt *gt = guc_to_gt(guc);
> +	u32 offset = xe_gt_is_media_type(gt) ? ilog2(INTR_MGUC) : ilog2(INTR_GUC);
> +	struct iosys_map map = IOSYS_MAP_INIT_OFFSET(&memirq->status, offset * SZ_16);
> +
> +	return memirq_received_noclear(memirq, &map, ilog2(GUC_INTR_SW_INT_0),
> +				       guc_name(guc));
> +}
> +
>  /**
>   * xe_memirq_handler - The `Memory Based Interrupts`_ Handler.
>   * @memirq: the &xe_memirq
> diff --git a/drivers/gpu/drm/xe/xe_memirq.h b/drivers/gpu/drm/xe/xe_memirq.h
> index 06130650e9d6..e25d2234ab87 100644
> --- a/drivers/gpu/drm/xe/xe_memirq.h
> +++ b/drivers/gpu/drm/xe/xe_memirq.h
> @@ -25,4 +25,6 @@ void xe_memirq_handler(struct xe_memirq *memirq);
>  
>  int xe_memirq_init_guc(struct xe_memirq *memirq, struct xe_guc *guc);
>  
> +bool xe_memirq_guc_sw_int_0_irq_pending(struct xe_memirq *memirq, struct xe_guc *guc);
> +
>  #endif

a few typos in comments; with those fixed:

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 11/30] drm/xe/vf: Close multi-GT GGTT shift race
  2025-10-06 11:10 ` [PATCH v6 11/30] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
@ 2025-10-06 14:27   ` Michal Wajdeczko
  2025-10-06 14:56     ` Matthew Brost
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Wajdeczko @ 2025-10-06 14:27 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 10/6/2025 1:10 PM, Matthew Brost wrote:
> As multi-GT VF post-migration recovery can run in parallel on different
> workqueues, but both GTs point to the same GGTT, only one GT needs to
> shift the GGTT. However, both GTs need to know when this step has
> completed. To coordinate this, perform the GGTT shift under the GGTT
> lock. With shift being done under the lock, storing the shift value
> becomes unnecessary.
> 
> v3:
>  - Update commmit message (Tomasz)
> v4:
>  - Move GGTT values to tile state (Michal)
>  - Use GGTT lock (Michal)
> v5:
>  - Only take GGTT lock during recovery (CI)
>  - Drop goto in vf_get_submission_cfg (Michal)
>  - Add kernel doc around recovery in xe_gt_sriov_vf_query_config (Michal)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device_types.h        |   3 +
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c         | 153 +++++++-------------
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.h         |   5 +-
>  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h   |   7 +-
>  drivers/gpu/drm/xe/xe_guc.c                 |   2 +-
>  drivers/gpu/drm/xe/xe_tile_sriov_vf.c       |  30 +++-
>  drivers/gpu/drm/xe/xe_tile_sriov_vf.h       |   2 +-
>  drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h |  23 +++
>  drivers/gpu/drm/xe/xe_vram.c                |   6 +-
>  9 files changed, 112 insertions(+), 119 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 1d2718b70a5c..c66523bf4bf0 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -27,6 +27,7 @@
>  #include "xe_sriov_vf_ccs_types.h"
>  #include "xe_step_types.h"
>  #include "xe_survivability_mode_types.h"
> +#include "xe_tile_sriov_vf_types.h"
>  #include "xe_validation.h"
>  
>  #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> @@ -193,6 +194,8 @@ struct xe_tile {
>  		struct {
>  			/** @sriov.vf.ggtt_balloon: GGTT regions excluded from use. */
>  			struct xe_ggtt_node *ggtt_balloon[2];
> +			/** @sriov.vf.self_config: VF configuration data */
> +			struct xe_tile_sriov_vf_selfconfig self_config;
>  		} vf;
>  	} sriov;
>  
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 55a1ebbbf47f..d227c8a3ec81 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -436,42 +436,65 @@ u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt)
>  	return value;
>  }
>  
> -static int vf_get_ggtt_info(struct xe_gt *gt)
> +static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
>  {
> -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	struct xe_tile_sriov_vf_selfconfig *config =
> +		&gt_to_tile(gt)->sriov.vf.self_config;

maybe
	struct xe_tile *tile = gt_to_tile(gt);
	struct xe_tile_sriov_vf_selfconfig *config = &tile->sriov.vf.self_config;

to avoid line split

> +	struct xe_ggtt *ggtt = gt_to_tile(gt)->mem.ggtt;

then
	struct xe_ggtt *ggtt = tile->mem.ggtt;

>  	struct xe_guc *guc = &gt->uc.guc;
>  	u64 start, size;
> +	s64 shift;
>  	int err;
>  
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
> +	/*
> +	 * We only only take the GGTT lock when potentially shifting GGTTs to
> +	 * make this step visable to all GTs which share a GGTT. Also the GGTT
> +	 * lock is not initialized during xe_gt_init_early when this function
> +	 * can also be called.

hmm, the real fix should be that the GGTT lock is initialized right after the GGTT is allocated
it looks like the split between GGTT alloc() and __init_early() was not ideal

note that while a similar pattern was used for the tile, xe_tile_init_early() does initialize the pcode mutex

alternatively, we can change the VF to not perform a full query during early bootstrap, as it is only looking for the GMDID

> +	 */
> +	if (recovery)
> +		mutex_lock(&ggtt->lock);

then we could use

	guard(mutex)(&ggtt->lock)
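
For context, the kernel's guard() is built on the compiler's cleanup attribute, so a lock taken this way is dropped on every return path and the "goto out" labels become unnecessary. A minimal userspace sketch of the idea follows; it assumes the lock is taken unconditionally, and struct toy_lock, TOY_GUARD() and query_with_guard() are hypothetical names that only count acquisitions instead of using a real mutex.

```c
#include <assert.h>

/* Toy lock that only counts acquisitions; stands in for a mutex. */
struct toy_lock {
	int held;
	int releases;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->held++;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held--;
	l->releases++;
}

/* Cleanup callback: receives a pointer to the guarded variable. */
static void toy_guard_exit(struct toy_lock **lp)
{
	if (*lp)
		toy_lock_release(*lp);
}

/* Hypothetical guard macro: lock now, unlock when the scope is left. */
#define TOY_GUARD(lock) \
	struct toy_lock *guard_ __attribute__((cleanup(toy_guard_exit))) = \
		(toy_lock_acquire(lock), (lock))

/* Early returns replace the "goto out" pattern; the lock is still dropped. */
static int query_with_guard(struct toy_lock *lock, int fail_step)
{
	TOY_GUARD(lock);

	if (fail_step == 1)
		return -1;	/* first query failed */
	if (fail_step == 2)
		return -2;	/* second query failed */
	return 0;
}
```

The cleanup attribute is a GCC/Clang extension, which is what the kernel relies on as well.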

> +
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
>  	if (unlikely(err))
> -		return err;
> +		goto out;
>  
>  	if (config->ggtt_size && config->ggtt_size != size) {
>  		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
>  				size / SZ_1K, config->ggtt_size / SZ_1K);
> -		return -EREMCHG;
> +		err = -EREMCHG;
> +		goto out;
>  	}
>  
>  	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
>  				start, start + size - 1, size / SZ_1K);
>  
> -	config->ggtt_shift = start - (s64)config->ggtt_base;
> +	shift = start - (s64)config->ggtt_base;
>  	config->ggtt_base = start;
>  	config->ggtt_size = size;
> +	err = config->ggtt_size ? 0 : -ENODATA;
>  
> -	return config->ggtt_size ? 0 : -ENODATA;
> +	if (!err && shift && recovery) {

maybe "recovery" is not needed:

	if (!err && shift && shift != start)
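
The arithmetic behind that check, as a sketch: it assumes the stored base is zero before the first query, so the first "shift" equals the new start and is not a real migration shift. The helper names ggtt_shift() and ggtt_needs_fixup() are hypothetical and only mirror the quoted computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signed distance between the newly reported start and the stored base. */
static int64_t ggtt_shift(uint64_t old_base, uint64_t new_start)
{
	return (int64_t)(new_start - old_base);
}

/*
 * A fixup is needed only for a non-zero shift that is not the initial
 * assignment (stored base still zero, so shift == start).
 */
static bool ggtt_needs_fixup(uint64_t old_base, uint64_t new_start)
{
	int64_t shift = ggtt_shift(old_base, new_start);

	return shift && shift != (int64_t)new_start;
}
```

For example, a first query reporting start 0x100000 gives shift == start and no fixup, while a later move from 0x100000 to 0x300000 gives a shift of 0x200000 and requires one. A move in the other direction yields a negative shift, which is why the value is signed.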

> +		xe_gt_sriov_info(gt, "Shifting GGTT base by %lld to 0x%016llx\n",
> +				 shift, config->ggtt_base);
> +		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> +	}
> +out:
> +	if (recovery)
> +		mutex_unlock(&ggtt->lock);
> +	return err;
>  }
>  
>  static int vf_get_lmem_info(struct xe_gt *gt)
>  {
> -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	struct xe_tile_sriov_vf_selfconfig *config =
> +		&gt_to_tile(gt)->sriov.vf.self_config;
>  	struct xe_guc *guc = &gt->uc.guc;
>  	char size_str[10];
>  	u64 size;
> @@ -544,17 +567,20 @@ static void vf_cache_gmdid(struct xe_gt *gt)
>  /**
>   * xe_gt_sriov_vf_query_config - Query SR-IOV config data over MMIO.
>   * @gt: the &xe_gt
> + * @recovery: VF post migration recovery path
>   *
> - * This function is for VF use only.
> + * This function is for VF use only. If recovery is set, the GGTT shift will be
> + * performed under GGTT lock making this step visable to all GTs which share a
> + * GGTT.

hmm, the question is: why can't the GGTT query be done under the lock even without 'recovery'?

>   *
>   * Return: 0 on success or a negative error code on failure.
>   */
> -int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
> +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery)
>  {
>  	struct xe_device *xe = gt_to_xe(gt);
>  	int err;
>  
> -	err = vf_get_ggtt_info(gt);
> +	err = vf_get_ggtt_info(gt, recovery);
>  	if (unlikely(err))
>  		return err;
>  
> @@ -584,80 +610,16 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
>   */
>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
>  {
> -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
> -
> -	return gt->sriov.vf.self_config.num_ctxs;
> -}
> -
> -/**
> - * xe_gt_sriov_vf_lmem - VF LMEM configuration.
> - * @gt: the &xe_gt
> - *
> - * This function is for VF use only.
> - *
> - * Return: size of the LMEM assigned to VF.
> - */
> -u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
> -{
> -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
> -
> -	return gt->sriov.vf.self_config.lmem_size;
> -}
> -
> -/**
> - * xe_gt_sriov_vf_ggtt - VF GGTT configuration.
> - * @gt: the &xe_gt
> - *
> - * This function is for VF use only.
> - *
> - * Return: size of the GGTT assigned to VF.
> - */
> -u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
> -{
> -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
> -
> -	return gt->sriov.vf.self_config.ggtt_size;
> -}
> +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	u16 val;
>  
> -/**
> - * xe_gt_sriov_vf_ggtt_base - VF GGTT base offset.
> - * @gt: the &xe_gt
> - *
> - * This function is for VF use only.
> - *
> - * Return: base offset of the GGTT assigned to VF.
> - */
> -u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
> -{
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
> -
> -	return gt->sriov.vf.self_config.ggtt_base;
> -}
>  
> -/**
> - * xe_gt_sriov_vf_ggtt_shift - Return shift in GGTT range due to VF migration
> - * @gt: the &xe_gt struct instance
> - *
> - * This function is for VF use only.
> - *
> - * Return: The shift value; could be negative
> - */
> -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
> -{
> -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> +	xe_gt_assert(gt, config->num_ctxs);
> +	val = config->num_ctxs;
>  
> -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> -	xe_gt_assert(gt, xe_gt_is_main_type(gt));
> -
> -	return config->ggtt_shift;
> +	return val;
>  }
>  
>  static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
> @@ -1057,6 +1019,8 @@ void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val)
>   */
>  void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>  {
> +	struct xe_tile_sriov_vf_selfconfig *tconfig =
> +		&gt_to_tile(gt)->sriov.vf.self_config;
>  	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
>  	struct xe_device *xe = gt_to_xe(gt);
>  	char buf[10];
> @@ -1064,17 +1028,15 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
>  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>  
>  	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
> -		   config->ggtt_base,
> -		   config->ggtt_base + config->ggtt_size - 1);
> -
> -	string_get_size(config->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> -	drm_printf(p, "GGTT size:\t%llu (%s)\n", config->ggtt_size, buf);
> +		   tconfig->ggtt_base,
> +		   tconfig->ggtt_base + tconfig->ggtt_size - 1);
>  
> -	drm_printf(p, "GGTT shift on last restore:\t%lld\n", config->ggtt_shift);
> +	string_get_size(tconfig->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> +	drm_printf(p, "GGTT size:\t%llu (%s)\n", tconfig->ggtt_size, buf);
>  
>  	if (IS_DGFX(xe) && xe_gt_is_main_type(gt)) {
> -		string_get_size(config->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> -		drm_printf(p, "LMEM size:\t%llu (%s)\n", config->lmem_size, buf);
> +		string_get_size(tconfig->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> +		drm_printf(p, "LMEM size:\t%llu (%s)\n", tconfig->lmem_size, buf);
>  	}
>  
>  	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
> @@ -1161,21 +1123,16 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
>  static int vf_post_migration_fixups(struct xe_gt *gt)
>  {
>  	void *buf = gt->sriov.vf.migration.scratch;
> -	s64 shift;
>  	int err;
>  
> -	err = xe_gt_sriov_vf_query_config(gt);
> +	err = xe_gt_sriov_vf_query_config(gt, true);
>  	if (err)
>  		return err;
>  
> -	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> -	if (shift) {
> -		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> -		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> -		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> -		if (err)
> -			return err;
> -	}
> +	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> +	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> +	if (err)
> +		return err;
>  
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index 0adebf8aa419..47ed8d513571 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -18,7 +18,7 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt);
>  void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
>  				 struct xe_uc_fw_version *wanted,
>  				 struct xe_uc_fw_version *found);
> -int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
> +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery);
>  int xe_gt_sriov_vf_connect(struct xe_gt *gt);
>  int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
>  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
> @@ -29,9 +29,6 @@ bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
>  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
>  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
> -u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt);
> -u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt);
> -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt);
>  
>  u32 xe_gt_sriov_vf_read32(struct xe_gt *gt, struct xe_reg reg);
>  void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val);
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index e753646debc4..1796d4caf62f 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -6,6 +6,7 @@
>  #ifndef _XE_GT_SRIOV_VF_TYPES_H_
>  #define _XE_GT_SRIOV_VF_TYPES_H_
>  
> +#include <linux/rwsem.h>
>  #include <linux/types.h>
>  #include <linux/workqueue.h>
>  #include "xe_uc_fw_types.h"
> @@ -14,12 +15,6 @@
>   * struct xe_gt_sriov_vf_selfconfig - VF configuration data.
>   */
>  struct xe_gt_sriov_vf_selfconfig {
> -	/** @ggtt_base: assigned base offset of the GGTT region. */
> -	u64 ggtt_base;
> -	/** @ggtt_size: assigned size of the GGTT region. */
> -	u64 ggtt_size;
> -	/** @ggtt_shift: difference in ggtt_base on last migration */
> -	s64 ggtt_shift;
>  	/** @lmem_size: assigned size of the LMEM. */
>  	u64 lmem_size;
>  	/** @num_ctxs: assigned number of GuC submission context IDs. */
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index d5adbbb013ec..c016a11b6ab1 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -713,7 +713,7 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
>  	if (err)
>  		return err;
>  
> -	err = xe_gt_sriov_vf_query_config(gt);
> +	err = xe_gt_sriov_vf_query_config(gt, false);
>  	if (err)
>  		return err;
>  
> diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> index f221dbed16f0..074981e2ef07 100644
> --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> @@ -9,7 +9,6 @@
>  
>  #include "xe_assert.h"
>  #include "xe_ggtt.h"
> -#include "xe_gt_sriov_vf.h"
>  #include "xe_sriov.h"
>  #include "xe_sriov_printk.h"
>  #include "xe_tile_sriov_vf.h"
> @@ -40,10 +39,10 @@ static int vf_init_ggtt_balloons(struct xe_tile *tile)
>   *
>   * Return: 0 on success or a negative error code on failure.
>   */
> -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
> +static int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
>  {
> -	u64 ggtt_base = xe_gt_sriov_vf_ggtt_base(tile->primary_gt);
> -	u64 ggtt_size = xe_gt_sriov_vf_ggtt(tile->primary_gt);
> +	u64 ggtt_base = tile->sriov.vf.self_config.ggtt_base;
> +	u64 ggtt_size = tile->sriov.vf.self_config.ggtt_size;
>  	struct xe_device *xe = tile_to_xe(tile);
>  	u64 wopcm = xe_wopcm_size(xe);
>  	u64 start, end;
> @@ -244,11 +243,30 @@ void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift)

what about following the naming style of a _locked suffix in the function name, since it expects to be already protected?
>  {
>  	struct xe_ggtt *ggtt = tile->mem.ggtt;
>  
> -	mutex_lock(&ggtt->lock);
> +	lockdep_assert_held(&ggtt->lock);
>  
>  	xe_tile_sriov_vf_deballoon_ggtt_locked(tile);
>  	xe_ggtt_shift_nodes_locked(ggtt, shift);
>  	xe_tile_sriov_vf_balloon_ggtt_locked(tile);
> +}
>  
> -	mutex_unlock(&ggtt->lock);
> +/**
> + * xe_tile_sriov_vf_lmem - VF LMEM configuration.
> + * @tile: the &xe_tile
> + *
> + * This function is for VF use only.
> + *
> + * Return: size of the LMEM assigned to VF.
> + */
> +u64 xe_tile_sriov_vf_lmem(struct xe_tile *tile)
> +{
> +	struct xe_tile_sriov_vf_selfconfig *config = &tile->sriov.vf.self_config;
> +	u64 val;
> +
> +	xe_tile_assert(tile, IS_SRIOV_VF(tile_to_xe(tile)));
> +
> +	xe_tile_assert(tile, config->lmem_size);
> +	val = config->lmem_size;
> +
> +	return val;
>  }
> diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> index 93eb043171e8..54e7f2a5c4e4 100644
> --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> @@ -11,8 +11,8 @@
>  struct xe_tile;
>  
>  int xe_tile_sriov_vf_prepare_ggtt(struct xe_tile *tile);
> -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile);
>  void xe_tile_sriov_vf_deballoon_ggtt_locked(struct xe_tile *tile);
>  void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift);
> +u64 xe_tile_sriov_vf_lmem(struct xe_tile *tile);
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
> new file mode 100644
> index 000000000000..140717f81d8f
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_TILE_SRIOV_VF_TYPES_H_
> +#define _XE_TILE_SRIOV_VF_TYPES_H_
> +
> +#include <linux/mutex.h>
> +
> +/**
> + * struct xe_tile_sriov_vf_selfconfig - VF configuration data.
> + */
> +struct xe_tile_sriov_vf_selfconfig {
> +	/** @ggtt_base: assigned base offset of the GGTT region. */
> +	u64 ggtt_base;
> +	/** @ggtt_size: assigned size of the GGTT region. */
> +	u64 ggtt_size;
> +	/** @lmem_size: assigned size of the LMEM. */
> +	u64 lmem_size;
> +};
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_vram.c b/drivers/gpu/drm/xe/xe_vram.c
> index 7adfccf68e4c..70bcbb188867 100644
> --- a/drivers/gpu/drm/xe/xe_vram.c
> +++ b/drivers/gpu/drm/xe/xe_vram.c
> @@ -17,10 +17,10 @@
>  #include "xe_device.h"
>  #include "xe_force_wake.h"
>  #include "xe_gt_mcr.h"
> -#include "xe_gt_sriov_vf.h"
>  #include "xe_mmio.h"
>  #include "xe_module.h"
>  #include "xe_sriov.h"
> +#include "xe_tile_sriov_vf.h"
>  #include "xe_ttm_vram_mgr.h"
>  #include "xe_vram.h"
>  #include "xe_vram_types.h"
> @@ -238,9 +238,9 @@ static int tile_vram_size(struct xe_tile *tile, u64 *vram_size,
>  		offset = 0;
>  		for_each_tile(t, xe, id)
>  			for_each_if(t->id < tile->id)
> -				offset += xe_gt_sriov_vf_lmem(t->primary_gt);
> +				offset += xe_tile_sriov_vf_lmem(t);
>  
> -		*tile_size = xe_gt_sriov_vf_lmem(gt);
> +		*tile_size = xe_tile_sriov_vf_lmem(tile);
>  		*vram_size = *tile_size;
>  		*tile_offset = offset;
>  


^ permalink raw reply	[flat|nested] 58+ messages in thread

* ✗ Xe.CI.Full: failure for VF migration redesign (rev6)
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (32 preceding siblings ...)
  2025-10-06 12:24 ` ✗ Xe.CI.BAT: failure " Patchwork
@ 2025-10-06 14:28 ` Patchwork
  2025-10-07  0:20 ` [PATCH v6 00/30] VF migration redesign Niranjana Vishwanathapura
  34 siblings, 0 replies; 58+ messages in thread
From: Patchwork @ 2025-10-06 14:28 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 57846 bytes --]

== Series Details ==

Series: VF migration redesign (rev6)
URL   : https://patchwork.freedesktop.org/series/154627/
State : failure

== Summary ==

CI Bug Log - changes from xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c_FULL -> xe-pw-154627v6_FULL
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes introduced by xe-pw-154627v6_FULL must be verified
  manually.

  If you think the reported changes have nothing to do with the changes
  introduced in xe-pw-154627v6_FULL, please notify your bug team
  (I915-ci-infra@lists.freedesktop.org) so they can document this new failure
  mode, which will reduce false positives in CI.

  

Participating hosts (4 -> 4)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in xe-pw-154627v6_FULL:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_pm_rpm@universal-planes:
    - shard-dg2-set2:     [PASS][1] -> [SKIP][2]
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-436/igt@kms_pm_rpm@universal-planes.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-434/igt@kms_pm_rpm@universal-planes.html
    - shard-lnl:          [PASS][3] -> [SKIP][4]
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-lnl-2/igt@kms_pm_rpm@universal-planes.html
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-lnl-3/igt@kms_pm_rpm@universal-planes.html
    - shard-bmg:          [PASS][5] -> [SKIP][6]
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@kms_pm_rpm@universal-planes.html
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-4/igt@kms_pm_rpm@universal-planes.html

  * igt@xe_exec_compute_mode@non-blocking:
    - shard-bmg:          [PASS][7] -> [INCOMPLETE][8]
   [7]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@xe_exec_compute_mode@non-blocking.html
   [8]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-4/igt@xe_exec_compute_mode@non-blocking.html
    - shard-dg2-set2:     [PASS][9] -> [INCOMPLETE][10]
   [9]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-436/igt@xe_exec_compute_mode@non-blocking.html
   [10]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-434/igt@xe_exec_compute_mode@non-blocking.html

  * igt@xe_exec_reset@cm-gt-reset:
    - shard-adlp:         [PASS][11] -> [FAIL][12]
   [11]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-4/igt@xe_exec_reset@cm-gt-reset.html
   [12]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-1/igt@xe_exec_reset@cm-gt-reset.html

  * igt@xe_fault_injection@vm-bind-fail-vm_bind_ioctl_ops_create:
    - shard-dg2-set2:     [PASS][13] -> [ABORT][14]
   [13]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-436/igt@xe_fault_injection@vm-bind-fail-vm_bind_ioctl_ops_create.html
   [14]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-434/igt@xe_fault_injection@vm-bind-fail-vm_bind_ioctl_ops_create.html
    - shard-lnl:          [PASS][15] -> [ABORT][16]
   [15]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-lnl-2/igt@xe_fault_injection@vm-bind-fail-vm_bind_ioctl_ops_create.html
   [16]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-lnl-3/igt@xe_fault_injection@vm-bind-fail-vm_bind_ioctl_ops_create.html
    - shard-bmg:          [PASS][17] -> [ABORT][18]
   [17]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@xe_fault_injection@vm-bind-fail-vm_bind_ioctl_ops_create.html
   [18]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-4/igt@xe_fault_injection@vm-bind-fail-vm_bind_ioctl_ops_create.html

  * igt@xe_pmu@engine-activity-most-load-idle:
    - shard-adlp:         [PASS][19] -> [ABORT][20] +21 other tests abort
   [19]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-1/igt@xe_pmu@engine-activity-most-load-idle.html
   [20]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@xe_pmu@engine-activity-most-load-idle.html

  * igt@xe_vm@large-userptr-binds-2147483648:
    - shard-dg2-set2:     [PASS][21] -> [FAIL][22]
   [21]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-436/igt@xe_vm@large-userptr-binds-2147483648.html
   [22]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-434/igt@xe_vm@large-userptr-binds-2147483648.html
    - shard-lnl:          [PASS][23] -> [FAIL][24] +1 other test fail
   [23]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-lnl-2/igt@xe_vm@large-userptr-binds-2147483648.html
   [24]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-lnl-3/igt@xe_vm@large-userptr-binds-2147483648.html
    - shard-bmg:          [PASS][25] -> [FAIL][26]
   [25]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@xe_vm@large-userptr-binds-2147483648.html
   [26]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-4/igt@xe_vm@large-userptr-binds-2147483648.html

  
#### Warnings ####

  * igt@xe_pmu@all-fn-engine-activity-load:
    - shard-adlp:         [TIMEOUT][27] ([Intel XE#5213]) -> [ABORT][28]
   [27]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-8/igt@xe_pmu@all-fn-engine-activity-load.html
   [28]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-1/igt@xe_pmu@all-fn-engine-activity-load.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * {igt@xe_pmu@engine-activity-accuracy-90@engine-drm_xe_engine_class_video_enhance0}:
    - shard-adlp:         [PASS][29] -> [ABORT][30] +11 other tests abort
   [29]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-6/igt@xe_pmu@engine-activity-accuracy-90@engine-drm_xe_engine_class_video_enhance0.html
   [30]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-2/igt@xe_pmu@engine-activity-accuracy-90@engine-drm_xe_engine_class_video_enhance0.html

  
Known issues
------------

  Here are the changes found in xe-pw-154627v6_FULL that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_async_flips@async-flip-suspend-resume@pipe-d-hdmi-a-1:
    - shard-adlp:         [PASS][31] -> [DMESG-WARN][32] ([Intel XE#4543])
   [31]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@kms_async_flips@async-flip-suspend-resume@pipe-d-hdmi-a-1.html
   [32]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@kms_async_flips@async-flip-suspend-resume@pipe-d-hdmi-a-1.html

  * igt@kms_big_fb@y-tiled-16bpp-rotate-0:
    - shard-bmg:          NOTRUN -> [SKIP][33] ([Intel XE#1124]) +2 other tests skip
   [33]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_big_fb@y-tiled-16bpp-rotate-0.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-0-hflip-async-flip:
    - shard-dg2-set2:     NOTRUN -> [SKIP][34] ([Intel XE#1124]) +4 other tests skip
   [34]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-0-hflip-async-flip.html

  * igt@kms_bw@connected-linear-tiling-2-displays-2560x1440p:
    - shard-bmg:          [PASS][35] -> [SKIP][36] ([Intel XE#2314] / [Intel XE#2894])
   [35]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@kms_bw@connected-linear-tiling-2-displays-2560x1440p.html
   [36]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_bw@connected-linear-tiling-2-displays-2560x1440p.html

  * igt@kms_bw@connected-linear-tiling-4-displays-2160x1440p:
    - shard-bmg:          NOTRUN -> [SKIP][37] ([Intel XE#2314] / [Intel XE#2894])
   [37]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_bw@connected-linear-tiling-4-displays-2160x1440p.html

  * igt@kms_bw@linear-tiling-1-displays-2160x1440p:
    - shard-bmg:          NOTRUN -> [SKIP][38] ([Intel XE#367])
   [38]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_bw@linear-tiling-1-displays-2160x1440p.html
    - shard-dg2-set2:     NOTRUN -> [SKIP][39] ([Intel XE#367])
   [39]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_bw@linear-tiling-1-displays-2160x1440p.html

  * igt@kms_ccs@bad-aux-stride-y-tiled-gen12-mc-ccs@pipe-c-hdmi-a-6:
    - shard-dg2-set2:     NOTRUN -> [SKIP][40] ([Intel XE#787]) +237 other tests skip
   [40]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-434/igt@kms_ccs@bad-aux-stride-y-tiled-gen12-mc-ccs@pipe-c-hdmi-a-6.html

  * igt@kms_ccs@crc-primary-basic-4-tiled-mtl-mc-ccs:
    - shard-bmg:          NOTRUN -> [SKIP][41] ([Intel XE#2887]) +5 other tests skip
   [41]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_ccs@crc-primary-basic-4-tiled-mtl-mc-ccs.html

  * igt@kms_ccs@crc-primary-rotation-180-4-tiled-bmg-ccs:
    - shard-dg2-set2:     NOTRUN -> [SKIP][42] ([Intel XE#2907])
   [42]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_ccs@crc-primary-rotation-180-4-tiled-bmg-ccs.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc:
    - shard-dg2-set2:     [PASS][43] -> [INCOMPLETE][44] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#4345] / [Intel XE#6168])
   [43]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-436/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc.html
   [44]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-436/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-d-dp-4:
    - shard-dg2-set2:     [PASS][45] -> [INCOMPLETE][46] ([Intel XE#6168] / [i915#14968])
   [45]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-436/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-d-dp-4.html
   [46]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-436/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-d-dp-4.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-d-hdmi-a-6:
    - shard-dg2-set2:     [PASS][47] -> [DMESG-WARN][48] ([Intel XE#1727] / [Intel XE#3113])
   [47]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-436/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-d-hdmi-a-6.html
   [48]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-436/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-d-hdmi-a-6.html

  * igt@kms_ccs@random-ccs-data-yf-tiled-ccs@pipe-d-dp-4:
    - shard-dg2-set2:     NOTRUN -> [SKIP][49] ([Intel XE#455] / [Intel XE#787]) +37 other tests skip
   [49]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-434/igt@kms_ccs@random-ccs-data-yf-tiled-ccs@pipe-d-dp-4.html

  * igt@kms_chamelium_color@ctm-negative:
    - shard-bmg:          NOTRUN -> [SKIP][50] ([Intel XE#2325])
   [50]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_chamelium_color@ctm-negative.html
    - shard-dg2-set2:     NOTRUN -> [SKIP][51] ([Intel XE#306])
   [51]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_chamelium_color@ctm-negative.html

  * igt@kms_chamelium_frames@hdmi-crc-nonplanar-formats:
    - shard-dg2-set2:     NOTRUN -> [SKIP][52] ([Intel XE#373]) +3 other tests skip
   [52]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@kms_chamelium_frames@hdmi-crc-nonplanar-formats.html

  * igt@kms_chamelium_hpd@dp-hpd:
    - shard-bmg:          NOTRUN -> [SKIP][53] ([Intel XE#2252]) +3 other tests skip
   [53]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_chamelium_hpd@dp-hpd.html

  * igt@kms_content_protection@srm@pipe-a-dp-4:
    - shard-dg2-set2:     NOTRUN -> [FAIL][54] ([Intel XE#1178]) +2 other tests fail
   [54]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_content_protection@srm@pipe-a-dp-4.html

  * igt@kms_cursor_crc@cursor-sliding-256x85:
    - shard-bmg:          NOTRUN -> [SKIP][55] ([Intel XE#2320]) +1 other test skip
   [55]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_cursor_crc@cursor-sliding-256x85.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size:
    - shard-bmg:          [PASS][56] -> [SKIP][57] ([Intel XE#2291]) +3 other tests skip
   [56]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size.html
   [57]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size.html

  * igt@kms_cursor_legacy@cursorb-vs-flipb-atomic-transitions-varying-size:
    - shard-bmg:          NOTRUN -> [SKIP][58] ([Intel XE#2291]) +1 other test skip
   [58]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_cursor_legacy@cursorb-vs-flipb-atomic-transitions-varying-size.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
    - shard-bmg:          [PASS][59] -> [FAIL][60] ([Intel XE#1475])
   [59]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-3/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
   [60]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html

  * igt@kms_dither@fb-8bpc-vs-panel-6bpc@pipe-a-hdmi-a-2:
    - shard-dg2-set2:     NOTRUN -> [SKIP][61] ([Intel XE#4494])
   [61]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-432/igt@kms_dither@fb-8bpc-vs-panel-6bpc@pipe-a-hdmi-a-2.html

  * igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-different-formats:
    - shard-dg2-set2:     NOTRUN -> [SKIP][62] ([Intel XE#4422]) +1 other test skip
   [62]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-different-formats.html

  * igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-out-visible-area:
    - shard-bmg:          NOTRUN -> [SKIP][63] ([Intel XE#4422]) +1 other test skip
   [63]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-out-visible-area.html

  * igt@kms_flip@2x-flip-vs-suspend-interruptible:
    - shard-bmg:          NOTRUN -> [SKIP][64] ([Intel XE#2316])
   [64]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_flip@2x-flip-vs-suspend-interruptible.html

  * igt@kms_flip@2x-plain-flip-fb-recreate:
    - shard-bmg:          [PASS][65] -> [SKIP][66] ([Intel XE#2316]) +3 other tests skip
   [65]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-3/igt@kms_flip@2x-plain-flip-fb-recreate.html
   [66]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_flip@2x-plain-flip-fb-recreate.html

  * igt@kms_flip@flip-vs-expired-vblank:
    - shard-dg2-set2:     [PASS][67] -> [FAIL][68] ([Intel XE#301])
   [67]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-432/igt@kms_flip@flip-vs-expired-vblank.html
   [68]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_flip@flip-vs-expired-vblank.html

  * igt@kms_flip@flip-vs-expired-vblank@d-hdmi-a6:
    - shard-dg2-set2:     NOTRUN -> [FAIL][69] ([Intel XE#301])
   [69]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_flip@flip-vs-expired-vblank@d-hdmi-a6.html

  * igt@kms_flip_scaled_crc@flip-32bpp-yftileccs-to-64bpp-yftile-upscaling:
    - shard-dg2-set2:     NOTRUN -> [SKIP][70] ([Intel XE#455]) +3 other tests skip
   [70]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@kms_flip_scaled_crc@flip-32bpp-yftileccs-to-64bpp-yftile-upscaling.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs-downscaling:
    - shard-bmg:          NOTRUN -> [SKIP][71] ([Intel XE#2293] / [Intel XE#2380]) +1 other test skip
   [71]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs-downscaling.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs-downscaling@pipe-a-valid-mode:
    - shard-bmg:          NOTRUN -> [SKIP][72] ([Intel XE#2293]) +1 other test skip
   [72]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs-downscaling@pipe-a-valid-mode.html

  * igt@kms_frontbuffer_tracking@drrs-2p-primscrn-cur-indfb-draw-blt:
    - shard-bmg:          NOTRUN -> [SKIP][73] ([Intel XE#2312]) +7 other tests skip
   [73]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_frontbuffer_tracking@drrs-2p-primscrn-cur-indfb-draw-blt.html

  * igt@kms_frontbuffer_tracking@drrs-indfb-scaledprimary:
    - shard-dg2-set2:     NOTRUN -> [SKIP][74] ([Intel XE#651]) +10 other tests skip
   [74]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_frontbuffer_tracking@drrs-indfb-scaledprimary.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-mmap-wc:
    - shard-bmg:          NOTRUN -> [SKIP][75] ([Intel XE#5390]) +6 other tests skip
   [75]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-indfb-msflip-blt:
    - shard-bmg:          NOTRUN -> [SKIP][76] ([Intel XE#2311]) +11 other tests skip
   [76]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-indfb-msflip-blt.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-pri-indfb-draw-render:
    - shard-bmg:          NOTRUN -> [SKIP][77] ([Intel XE#2313]) +10 other tests skip
   [77]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-pri-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@fbcpsr-slowdraw:
    - shard-dg2-set2:     NOTRUN -> [SKIP][78] ([Intel XE#653]) +14 other tests skip
   [78]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_frontbuffer_tracking@fbcpsr-slowdraw.html

  * igt@kms_frontbuffer_tracking@fbcpsr-tiling-y:
    - shard-dg2-set2:     NOTRUN -> [SKIP][79] ([Intel XE#658])
   [79]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@kms_frontbuffer_tracking@fbcpsr-tiling-y.html

  * igt@kms_hdr@invalid-metadata-sizes:
    - shard-bmg:          [PASS][80] -> [SKIP][81] ([Intel XE#1503])
   [80]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@kms_hdr@invalid-metadata-sizes.html
   [81]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_hdr@invalid-metadata-sizes.html

  * igt@kms_hdr@static-toggle-suspend:
    - shard-bmg:          NOTRUN -> [SKIP][82] ([Intel XE#1503])
   [82]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_hdr@static-toggle-suspend.html

  * igt@kms_joiner@basic-force-big-joiner:
    - shard-bmg:          [PASS][83] -> [SKIP][84] ([Intel XE#3012])
   [83]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-1/igt@kms_joiner@basic-force-big-joiner.html
   [84]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_joiner@basic-force-big-joiner.html

  * igt@kms_multipipe_modeset@basic-max-pipe-crc-check:
    - shard-bmg:          NOTRUN -> [SKIP][85] ([Intel XE#2501])
   [85]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_multipipe_modeset@basic-max-pipe-crc-check.html

  * igt@kms_plane_cursor@primary:
    - shard-dg2-set2:     NOTRUN -> [FAIL][86] ([Intel XE#616])
   [86]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_plane_cursor@primary.html

  * igt@kms_plane_multiple@2x-tiling-x:
    - shard-bmg:          [PASS][87] -> [SKIP][88] ([Intel XE#4596])
   [87]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-1/igt@kms_plane_multiple@2x-tiling-x.html
   [88]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_plane_multiple@2x-tiling-x.html

  * igt@kms_plane_multiple@2x-tiling-y:
    - shard-dg2-set2:     NOTRUN -> [SKIP][89] ([Intel XE#5021])
   [89]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-463/igt@kms_plane_multiple@2x-tiling-y.html

  * igt@kms_plane_multiple@2x-tiling-yf:
    - shard-bmg:          NOTRUN -> [SKIP][90] ([Intel XE#4596])
   [90]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_plane_multiple@2x-tiling-yf.html

  * igt@kms_pm_dc@dc6-dpms:
    - shard-dg2-set2:     NOTRUN -> [SKIP][91] ([Intel XE#908])
   [91]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@kms_pm_dc@dc6-dpms.html

  * igt@kms_pm_rpm@modeset-lpsp-stress-no-wait:
    - shard-bmg:          NOTRUN -> [SKIP][92] ([Intel XE#1439] / [Intel XE#3141] / [Intel XE#836])
   [92]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_pm_rpm@modeset-lpsp-stress-no-wait.html

  * igt@kms_pm_rpm@universal-planes-dpms:
    - shard-adlp:         [PASS][93] -> [ABORT][94] ([Intel XE#5545])
   [93]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-4/igt@kms_pm_rpm@universal-planes-dpms.html
   [94]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-1/igt@kms_pm_rpm@universal-planes-dpms.html

  * igt@kms_psr2_sf@psr2-overlay-plane-move-continuous-sf:
    - shard-bmg:          NOTRUN -> [SKIP][95] ([Intel XE#1406] / [Intel XE#1489]) +2 other tests skip
   [95]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_psr2_sf@psr2-overlay-plane-move-continuous-sf.html
    - shard-dg2-set2:     NOTRUN -> [SKIP][96] ([Intel XE#1406] / [Intel XE#1489]) +2 other tests skip
   [96]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_psr2_sf@psr2-overlay-plane-move-continuous-sf.html

  * igt@kms_psr@fbc-psr2-sprite-plane-onoff:
    - shard-dg2-set2:     NOTRUN -> [SKIP][97] ([Intel XE#1406] / [Intel XE#2850] / [Intel XE#929]) +6 other tests skip
   [97]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_psr@fbc-psr2-sprite-plane-onoff.html

  * igt@kms_psr@pr-sprite-plane-onoff:
    - shard-bmg:          NOTRUN -> [SKIP][98] ([Intel XE#1406] / [Intel XE#2234] / [Intel XE#2850]) +5 other tests skip
   [98]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_psr@pr-sprite-plane-onoff.html

  * igt@kms_rotation_crc@primary-yf-tiled-reflect-x-180:
    - shard-dg2-set2:     NOTRUN -> [SKIP][99] ([Intel XE#1127])
   [99]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@kms_rotation_crc@primary-yf-tiled-reflect-x-180.html

  * igt@kms_rotation_crc@sprite-rotation-90:
    - shard-bmg:          NOTRUN -> [SKIP][100] ([Intel XE#3414] / [Intel XE#3904])
   [100]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_rotation_crc@sprite-rotation-90.html
    - shard-dg2-set2:     NOTRUN -> [SKIP][101] ([Intel XE#3414])
   [101]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_rotation_crc@sprite-rotation-90.html

  * igt@kms_setmode@invalid-clone-single-crtc-stealing:
    - shard-bmg:          [PASS][102] -> [SKIP][103] ([Intel XE#1435])
   [102]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-3/igt@kms_setmode@invalid-clone-single-crtc-stealing.html
   [103]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_setmode@invalid-clone-single-crtc-stealing.html

  * igt@kms_vblank@ts-continuation-suspend:
    - shard-adlp:         [PASS][104] -> [DMESG-WARN][105] ([Intel XE#2953] / [Intel XE#4173]) +2 other tests dmesg-warn
   [104]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@kms_vblank@ts-continuation-suspend.html
   [105]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-4/igt@kms_vblank@ts-continuation-suspend.html

  * igt@kms_vrr@cmrr:
    - shard-dg2-set2:     NOTRUN -> [SKIP][106] ([Intel XE#2168])
   [106]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@kms_vrr@cmrr.html

  * igt@kms_vrr@flip-suspend:
    - shard-bmg:          NOTRUN -> [SKIP][107] ([Intel XE#1499])
   [107]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_vrr@flip-suspend.html

  * igt@xe_create@multigpu-create-massive-size:
    - shard-bmg:          NOTRUN -> [SKIP][108] ([Intel XE#2504])
   [108]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@xe_create@multigpu-create-massive-size.html

  * igt@xe_eu_stall@invalid-gt-id:
    - shard-dg2-set2:     NOTRUN -> [SKIP][109] ([Intel XE#5626])
   [109]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@xe_eu_stall@invalid-gt-id.html

  * igt@xe_eudebug@basic-vm-bind-discovery:
    - shard-dg2-set2:     NOTRUN -> [SKIP][110] ([Intel XE#4837]) +1 other test skip
   [110]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@xe_eudebug@basic-vm-bind-discovery.html

  * igt@xe_eudebug_online@single-step-one:
    - shard-bmg:          NOTRUN -> [SKIP][111] ([Intel XE#4837]) +3 other tests skip
   [111]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@xe_eudebug_online@single-step-one.html

  * igt@xe_exec_basic@multigpu-once-bindexecqueue-userptr-invalidate:
    - shard-dg2-set2:     [PASS][112] -> [SKIP][113] ([Intel XE#1392]) +6 other tests skip
   [112]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-463/igt@xe_exec_basic@multigpu-once-bindexecqueue-userptr-invalidate.html
   [113]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-432/igt@xe_exec_basic@multigpu-once-bindexecqueue-userptr-invalidate.html

  * igt@xe_exec_basic@multigpu-once-null-defer-bind:
    - shard-bmg:          NOTRUN -> [SKIP][114] ([Intel XE#2322]) +3 other tests skip
   [114]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@xe_exec_basic@multigpu-once-null-defer-bind.html

  * igt@xe_exec_fault_mode@once-bindexecqueue-imm:
    - shard-dg2-set2:     NOTRUN -> [SKIP][115] ([Intel XE#288]) +9 other tests skip
   [115]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@xe_exec_fault_mode@once-bindexecqueue-imm.html

  * igt@xe_exec_mix_modes@exec-simple-batch-store-dma-fence:
    - shard-dg2-set2:     NOTRUN -> [SKIP][116] ([Intel XE#2360])
   [116]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@xe_exec_mix_modes@exec-simple-batch-store-dma-fence.html

  * igt@xe_exec_system_allocator@many-execqueues-mmap-huge-nomemset:
    - shard-bmg:          NOTRUN -> [SKIP][117] ([Intel XE#4943]) +11 other tests skip
   [117]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@xe_exec_system_allocator@many-execqueues-mmap-huge-nomemset.html

  * igt@xe_exec_system_allocator@threads-many-large-mmap-shared-remap-dontunmap-eocheck:
    - shard-dg2-set2:     NOTRUN -> [SKIP][118] ([Intel XE#4915]) +105 other tests skip
   [118]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@xe_exec_system_allocator@threads-many-large-mmap-shared-remap-dontunmap-eocheck.html

  * igt@xe_oa@missing-sample-flags:
    - shard-dg2-set2:     NOTRUN -> [SKIP][119] ([Intel XE#3573]) +3 other tests skip
   [119]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@xe_oa@missing-sample-flags.html

  * igt@xe_pat@pat-index-xehpc:
    - shard-bmg:          NOTRUN -> [SKIP][120] ([Intel XE#1420])
   [120]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@xe_pat@pat-index-xehpc.html

  * igt@xe_pm@d3cold-mmap-system:
    - shard-dg2-set2:     NOTRUN -> [SKIP][121] ([Intel XE#2284] / [Intel XE#366])
   [121]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@xe_pm@d3cold-mmap-system.html

  * igt@xe_pm@d3hot-i2c:
    - shard-dg2-set2:     NOTRUN -> [SKIP][122] ([Intel XE#5742])
   [122]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@xe_pm@d3hot-i2c.html
    - shard-bmg:          NOTRUN -> [SKIP][123] ([Intel XE#5742])
   [123]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@xe_pm@d3hot-i2c.html

  * igt@xe_pmu@gt-frequency:
    - shard-dg2-set2:     [PASS][124] -> [FAIL][125] ([Intel XE#4819]) +1 other test fail
   [124]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-464/igt@xe_pmu@gt-frequency.html
   [125]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@xe_pmu@gt-frequency.html

  * igt@xe_pxp@pxp-stale-queue-post-suspend:
    - shard-dg2-set2:     NOTRUN -> [SKIP][126] ([Intel XE#4733])
   [126]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@xe_pxp@pxp-stale-queue-post-suspend.html

  * igt@xe_query@multigpu-query-uc-fw-version-guc:
    - shard-dg2-set2:     NOTRUN -> [SKIP][127] ([Intel XE#944]) +2 other tests skip
   [127]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@xe_query@multigpu-query-uc-fw-version-guc.html
    - shard-bmg:          NOTRUN -> [SKIP][128] ([Intel XE#944])
   [128]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@xe_query@multigpu-query-uc-fw-version-guc.html

  * igt@xe_render_copy@render-stress-1-copies:
    - shard-dg2-set2:     NOTRUN -> [SKIP][129] ([Intel XE#4814])
   [129]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@xe_render_copy@render-stress-1-copies.html

  
#### Possible fixes ####

  * igt@kms_async_flips@async-flip-suspend-resume@pipe-c-hdmi-a-1:
    - shard-adlp:         [DMESG-WARN][130] ([Intel XE#4543]) -> [PASS][131] +2 other tests pass
   [130]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@kms_async_flips@async-flip-suspend-resume@pipe-c-hdmi-a-1.html
   [131]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@kms_async_flips@async-flip-suspend-resume@pipe-c-hdmi-a-1.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs:
    - shard-dg2-set2:     [INCOMPLETE][132] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#4212] / [Intel XE#4345] / [Intel XE#4522]) -> [PASS][133]
   [132]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-466/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs.html
   [133]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-b-dp-4:
    - shard-dg2-set2:     [INCOMPLETE][134] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#4212] / [Intel XE#4522]) -> [PASS][135]
   [134]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-466/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-b-dp-4.html
   [135]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-433/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-b-dp-4.html

  * igt@kms_cursor_legacy@2x-long-flip-vs-cursor-atomic:
    - shard-bmg:          [SKIP][136] ([Intel XE#2291]) -> [PASS][137] +1 other test pass
   [136]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_cursor_legacy@2x-long-flip-vs-cursor-atomic.html
   [137]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_cursor_legacy@2x-long-flip-vs-cursor-atomic.html

  * igt@kms_cursor_legacy@cursorb-vs-flipb-atomic-transitions-varying-size:
    - shard-dg2-set2:     [INCOMPLETE][138] ([Intel XE#3226]) -> [PASS][139]
   [138]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-464/igt@kms_cursor_legacy@cursorb-vs-flipb-atomic-transitions-varying-size.html
   [139]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@kms_cursor_legacy@cursorb-vs-flipb-atomic-transitions-varying-size.html

  * igt@kms_dp_link_training@non-uhbr-sst:
    - shard-bmg:          [SKIP][140] ([Intel XE#4354]) -> [PASS][141]
   [140]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_dp_link_training@non-uhbr-sst.html
   [141]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_dp_link_training@non-uhbr-sst.html

  * igt@kms_flip@2x-nonexisting-fb:
    - shard-bmg:          [SKIP][142] ([Intel XE#2316]) -> [PASS][143] +5 other tests pass
   [142]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_flip@2x-nonexisting-fb.html
   [143]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_flip@2x-nonexisting-fb.html

  * igt@kms_flip@flip-vs-suspend-interruptible:
    - shard-bmg:          [INCOMPLETE][144] ([Intel XE#2049] / [Intel XE#2597]) -> [PASS][145] +1 other test pass
   [144]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-2/igt@kms_flip@flip-vs-suspend-interruptible.html
   [145]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-3/igt@kms_flip@flip-vs-suspend-interruptible.html

  * igt@kms_frontbuffer_tracking@fbc-suspend:
    - shard-adlp:         [DMESG-WARN][146] ([Intel XE#2953] / [Intel XE#4173]) -> [PASS][147] +1 other test pass
   [146]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@kms_frontbuffer_tracking@fbc-suspend.html
   [147]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@kms_frontbuffer_tracking@fbc-suspend.html

  * {igt@kms_pm_rpm@system-suspend-idle}:
    - shard-bmg:          [ABORT][148] ([Intel XE#4760] / [Intel XE#5545]) -> [PASS][149]
   [148]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@kms_pm_rpm@system-suspend-idle.html
   [149]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_pm_rpm@system-suspend-idle.html

  * igt@kms_vrr@negative-basic:
    - shard-bmg:          [SKIP][150] ([Intel XE#1499]) -> [PASS][151]
   [150]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_vrr@negative-basic.html
   [151]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_vrr@negative-basic.html

  * igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-rebind:
    - shard-dg2-set2:     [SKIP][152] ([Intel XE#1392]) -> [PASS][153] +6 other tests pass
   [152]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-432/igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-rebind.html
   [153]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-466/igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-rebind.html

  * igt@xe_exec_reset@parallel-gt-reset:
    - shard-bmg:          [DMESG-WARN][154] ([Intel XE#3876]) -> [PASS][155]
   [154]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-4/igt@xe_exec_reset@parallel-gt-reset.html
   [155]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@xe_exec_reset@parallel-gt-reset.html
    - shard-dg2-set2:     [DMESG-WARN][156] ([Intel XE#3876]) -> [PASS][157]
   [156]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-dg2-464/igt@xe_exec_reset@parallel-gt-reset.html
   [157]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-dg2-435/igt@xe_exec_reset@parallel-gt-reset.html

  * igt@xe_module_load@load:
    - shard-adlp:         ([PASS][158], [PASS][159], [PASS][160], [PASS][161], [PASS][162], [PASS][163], [PASS][164], [PASS][165], [PASS][166], [PASS][167], [PASS][168], [PASS][169], [PASS][170], [PASS][171], [PASS][172], [PASS][173], [PASS][174], [PASS][175], [PASS][176], [PASS][177], [PASS][178], [PASS][179], [SKIP][180], [PASS][181], [PASS][182], [PASS][183]) ([Intel XE#378] / [Intel XE#5612]) -> ([PASS][184], [PASS][185], [PASS][186], [PASS][187], [PASS][188], [PASS][189], [PASS][190], [PASS][191], [PASS][192], [PASS][193], [PASS][194], [PASS][195], [PASS][196], [PASS][197], [PASS][198], [PASS][199], [PASS][200], [PASS][201], [PASS][202], [PASS][203], [PASS][204], [PASS][205], [PASS][206], [PASS][207], [PASS][208])
   [158]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-3/igt@xe_module_load@load.html
   [159]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@xe_module_load@load.html
   [160]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@xe_module_load@load.html
   [161]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-6/igt@xe_module_load@load.html
   [162]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-1/igt@xe_module_load@load.html
   [163]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-1/igt@xe_module_load@load.html
   [164]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-9/igt@xe_module_load@load.html
   [165]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-9/igt@xe_module_load@load.html
   [166]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-9/igt@xe_module_load@load.html
   [167]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-4/igt@xe_module_load@load.html
   [168]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-8/igt@xe_module_load@load.html
   [169]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-8/igt@xe_module_load@load.html
   [170]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@xe_module_load@load.html
   [171]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-1/igt@xe_module_load@load.html
   [172]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-1/igt@xe_module_load@load.html
   [173]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-4/igt@xe_module_load@load.html
   [174]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@xe_module_load@load.html
   [175]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-4/igt@xe_module_load@load.html
   [176]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-6/igt@xe_module_load@load.html
   [177]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-6/igt@xe_module_load@load.html
   [178]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-6/igt@xe_module_load@load.html
   [179]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-8/igt@xe_module_load@load.html
   [180]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-6/igt@xe_module_load@load.html
   [181]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-3/igt@xe_module_load@load.html
   [182]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-3/igt@xe_module_load@load.html
   [183]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-8/igt@xe_module_load@load.html
   [184]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@xe_module_load@load.html
   [185]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-6/igt@xe_module_load@load.html
   [186]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-6/igt@xe_module_load@load.html
   [187]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-1/igt@xe_module_load@load.html
   [188]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-1/igt@xe_module_load@load.html
   [189]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-1/igt@xe_module_load@load.html
   [190]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-1/igt@xe_module_load@load.html
   [191]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-6/igt@xe_module_load@load.html
   [192]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-2/igt@xe_module_load@load.html
   [193]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-8/igt@xe_module_load@load.html
   [194]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-2/igt@xe_module_load@load.html
   [195]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-8/igt@xe_module_load@load.html
   [196]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@xe_module_load@load.html
   [197]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@xe_module_load@load.html
   [198]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-2/igt@xe_module_load@load.html
   [199]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-2/igt@xe_module_load@load.html
   [200]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-4/igt@xe_module_load@load.html
   [201]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-8/igt@xe_module_load@load.html
   [202]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-9/igt@xe_module_load@load.html
   [203]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-6/igt@xe_module_load@load.html
   [204]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-9/igt@xe_module_load@load.html
   [205]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-9/igt@xe_module_load@load.html
   [206]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-3/igt@xe_module_load@load.html
   [207]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-4/igt@xe_module_load@load.html
   [208]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-4/igt@xe_module_load@load.html

  
#### Warnings ####

  * igt@kms_content_protection@uevent:
    - shard-bmg:          [FAIL][209] ([Intel XE#1188]) -> [SKIP][210] ([Intel XE#2341])
   [209]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-3/igt@kms_content_protection@uevent.html
   [210]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_content_protection@uevent.html

  * igt@kms_frontbuffer_tracking@drrs-2p-primscrn-cur-indfb-draw-render:
    - shard-bmg:          [SKIP][211] ([Intel XE#2311]) -> [SKIP][212] ([Intel XE#2312]) +11 other tests skip
   [211]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-3/igt@kms_frontbuffer_tracking@drrs-2p-primscrn-cur-indfb-draw-render.html
   [212]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_frontbuffer_tracking@drrs-2p-primscrn-cur-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-spr-indfb-draw-mmap-wc:
    - shard-bmg:          [SKIP][213] ([Intel XE#2312]) -> [SKIP][214] ([Intel XE#2311]) +13 other tests skip
   [213]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-spr-indfb-draw-mmap-wc.html
   [214]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-2/igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-spr-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt:
    - shard-bmg:          [SKIP][215] ([Intel XE#2312]) -> [SKIP][216] ([Intel XE#5390]) +7 other tests skip
   [215]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt.html
   [216]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-render:
    - shard-bmg:          [SKIP][217] ([Intel XE#5390]) -> [SKIP][218] ([Intel XE#2312]) +6 other tests skip
   [217]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-3/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-render.html
   [218]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff:
    - shard-bmg:          [SKIP][219] ([Intel XE#2312]) -> [SKIP][220] ([Intel XE#2313]) +11 other tests skip
   [219]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff.html
   [220]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff.html

  * igt@kms_frontbuffer_tracking@psr-2p-primscrn-indfb-plflip-blt:
    - shard-bmg:          [SKIP][221] ([Intel XE#2313]) -> [SKIP][222] ([Intel XE#2312]) +8 other tests skip
   [221]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-1/igt@kms_frontbuffer_tracking@psr-2p-primscrn-indfb-plflip-blt.html
   [222]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_frontbuffer_tracking@psr-2p-primscrn-indfb-plflip-blt.html

  * igt@kms_hdr@brightness-with-hdr:
    - shard-bmg:          [SKIP][223] ([Intel XE#3374] / [Intel XE#3544]) -> [SKIP][224] ([Intel XE#3544])
   [223]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-6/igt@kms_hdr@brightness-with-hdr.html
   [224]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_hdr@brightness-with-hdr.html

  * igt@kms_tiled_display@basic-test-pattern:
    - shard-bmg:          [FAIL][225] ([Intel XE#1729]) -> [SKIP][226] ([Intel XE#2426])
   [225]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-1/igt@kms_tiled_display@basic-test-pattern.html
   [226]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-6/igt@kms_tiled_display@basic-test-pattern.html

  * igt@kms_tiled_display@basic-test-pattern-with-chamelium:
    - shard-bmg:          [SKIP][227] ([Intel XE#2426]) -> [SKIP][228] ([Intel XE#2509])
   [227]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-bmg-3/igt@kms_tiled_display@basic-test-pattern-with-chamelium.html
   [228]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-bmg-5/igt@kms_tiled_display@basic-test-pattern-with-chamelium.html

  * igt@xe_exec_reset@cm-cat-error:
    - shard-adlp:         [DMESG-FAIL][229] ([Intel XE#3868]) -> [DMESG-WARN][230] ([Intel XE#3868])
   [229]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c/shard-adlp-2/igt@xe_exec_reset@cm-cat-error.html
   [230]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/shard-adlp-4/igt@xe_exec_reset@cm-cat-error.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [Intel XE#1124]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1124
  [Intel XE#1127]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1127
  [Intel XE#1178]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1178
  [Intel XE#1188]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1188
  [Intel XE#1392]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1392
  [Intel XE#1406]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1406
  [Intel XE#1420]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1420
  [Intel XE#1435]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1435
  [Intel XE#1439]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1439
  [Intel XE#1475]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1475
  [Intel XE#1489]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1489
  [Intel XE#1499]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1499
  [Intel XE#1503]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1503
  [Intel XE#1727]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1727
  [Intel XE#1729]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1729
  [Intel XE#2049]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2049
  [Intel XE#2168]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2168
  [Intel XE#2234]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2234
  [Intel XE#2252]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2252
  [Intel XE#2284]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2284
  [Intel XE#2291]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2291
  [Intel XE#2293]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2293
  [Intel XE#2311]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2311
  [Intel XE#2312]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2312
  [Intel XE#2313]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2313
  [Intel XE#2314]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2314
  [Intel XE#2316]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2316
  [Intel XE#2320]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2320
  [Intel XE#2322]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2322
  [Intel XE#2325]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2325
  [Intel XE#2341]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2341
  [Intel XE#2360]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2360
  [Intel XE#2380]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2380
  [Intel XE#2426]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2426
  [Intel XE#2501]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2501
  [Intel XE#2504]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2504
  [Intel XE#2509]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2509
  [Intel XE#2597]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2597
  [Intel XE#2850]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2850
  [Intel XE#288]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/288
  [Intel XE#2887]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2887
  [Intel XE#2894]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2894
  [Intel XE#2907]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2907
  [Intel XE#2953]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2953
  [Intel XE#301]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/301
  [Intel XE#3012]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3012
  [Intel XE#306]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/306
  [Intel XE#3113]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3113
  [Intel XE#3141]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3141
  [Intel XE#3226]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3226
  [Intel XE#3374]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3374
  [Intel XE#3414]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3414
  [Intel XE#3544]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3544
  [Intel XE#3573]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3573
  [Intel XE#366]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/366
  [Intel XE#367]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/367
  [Intel XE#373]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/373
  [Intel XE#378]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/378
  [Intel XE#3868]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3868
  [Intel XE#3876]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3876
  [Intel XE#3904]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3904
  [Intel XE#4173]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4173
  [Intel XE#4212]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4212
  [Intel XE#4345]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4345
  [Intel XE#4354]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4354
  [Intel XE#4422]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4422
  [Intel XE#4494]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4494
  [Intel XE#4522]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4522
  [Intel XE#4543]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4543
  [Intel XE#455]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/455
  [Intel XE#4596]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4596
  [Intel XE#4733]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4733
  [Intel XE#4760]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4760
  [Intel XE#4814]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4814
  [Intel XE#4819]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4819
  [Intel XE#4837]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4837
  [Intel XE#4915]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4915
  [Intel XE#4943]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4943
  [Intel XE#5021]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5021
  [Intel XE#5213]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5213
  [Intel XE#5390]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5390
  [Intel XE#5503]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5503
  [Intel XE#5545]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5545
  [Intel XE#5612]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5612
  [Intel XE#5626]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5626
  [Intel XE#5742]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5742
  [Intel XE#5890]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5890
  [Intel XE#616]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/616
  [Intel XE#6168]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6168
  [Intel XE#651]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/651
  [Intel XE#653]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/653
  [Intel XE#658]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/658
  [Intel XE#787]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/787
  [Intel XE#836]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/836
  [Intel XE#908]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/908
  [Intel XE#929]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/929
  [Intel XE#944]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/944
  [i915#14968]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14968


Build changes
-------------

  * Linux: xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c -> xe-pw-154627v6

  IGT_8574: 44a15713124663a622c6eddf7c6ee5ba732e0d41 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  xe-3869-d03c82f71d60dc1434040ca679c683ab3b1b034c: d03c82f71d60dc1434040ca679c683ab3b1b034c
  xe-pw-154627v6: 154627v6

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-154627v6/index.html

[-- Attachment #2: Type: text/html, Size: 65110 bytes --]


* Re: [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  2025-10-06 11:10 ` [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
@ 2025-10-06 14:35   ` Michal Wajdeczko
  2025-10-06 15:54     ` Matthew Brost
  2025-10-06 22:27   ` Lis, Tomasz
  1 sibling, 1 reply; 58+ messages in thread
From: Michal Wajdeczko @ 2025-10-06 14:35 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 10/6/2025 1:10 PM, Matthew Brost wrote:
> If VF post-migration recovery is in progress, the recovery flow will
> rebuild all GuC submission state. In this case, exit all waiters to
> ensure that submission queue scheduling can also be paused. Avoid taking
> any adverse actions after aborting the wait.
> 
> As part of waking up the GuC backend, suspend_wait can now return
> -EAGAIN, indicating that the wait should be retried. If the caller is
> running in a work item, that work item needs to be requeued to avoid a
> deadlock in which it blocks the VF migration recovery work item.
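
[Editor's illustration] The retry-and-requeue pattern described above can be
sketched in plain userspace C. This is a minimal model under stated
assumptions, not the driver code: struct demo_queue, struct demo_work,
demo_suspend_wait() and demo_work_fn() are hypothetical names that do not
appear in the patch, and the "requeue" is modelled as a boolean return value
rather than a real workqueue call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
struct demo_queue {
	bool recovery_in_progress;	/* models VF post-migration recovery */
	bool suspended;			/* models the suspend having completed */
};

struct demo_work {
	struct demo_queue *q;
	int requeue_count;		/* how often the item asked to be requeued */
	bool done;
};

/* Models a suspend_wait()-style callback: 0 on success, -EAGAIN to retry. */
static int demo_suspend_wait(struct demo_queue *q)
{
	if (q->recovery_in_progress)
		return -EAGAIN;	/* wait aborted, no adverse action taken */

	return q->suspended ? 0 : -ETIME;
}

/*
 * Models a work item body. Returns true if the item must be queued again.
 * On -EAGAIN it returns instead of blocking, so the worker stays free to
 * run the recovery work item.
 */
static bool demo_work_fn(struct demo_work *w)
{
	int err;

	assert(w && w->q);

	err = demo_suspend_wait(w->q);
	if (err == -EAGAIN) {
		w->requeue_count++;
		return true;
	}

	w->done = true;
	return false;
}
```

A caller would keep resubmitting the item while demo_work_fn() returns true.
The point of the sketch is only that the work item gives up the worker
rather than sleeping while recovery is pending.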
> 
> v3:
>  - Don't block in preempt fence work queue as this can interfere with VF
>    post-migration work queue scheduling leading to deadlock (Testing)
>  - Use xe_gt_recovery_inprogress (Michal)
> v5:
>  - Use static function for vf_recovery (Michal)
>  - Add helper to wake CT waiters (Michal)
>  - Move some code to following patch (Michal)
>  - Adjust commit message to explain suspend_wait returning -EAGAIN (Michal)
>  - Add kernel doc to suspend_wait around returning -EAGAIN
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_exec_queue_types.h |  3 +
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c      |  4 ++
>  drivers/gpu/drm/xe/xe_guc_ct.h           |  9 +++
>  drivers/gpu/drm/xe/xe_guc_submit.c       | 82 ++++++++++++++++++------
>  drivers/gpu/drm/xe/xe_preempt_fence.c    | 11 ++++
>  5 files changed, 88 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> index 27b76cf9da89..282505fa1377 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> +++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> @@ -207,6 +207,9 @@ struct xe_exec_queue_ops {
>  	 * call after suspend. In dma-fencing path thus must return within a
>  	 * reasonable amount of time. -ETIME return shall indicate an error
>  	 * waiting for suspend resulting in associated VM getting killed.
> +	 * -EAGAIN return indicates the wait should be tried again; if the wait
> +	 * is within a work item, the work item should be requeued as a deadlock
> +	 * avoidance mechanism.
>  	 */
>  	int (*suspend_wait)(struct xe_exec_queue *q);
>  	/**
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 7057260175f3..7f703336d692 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -23,6 +23,7 @@
>  #include "xe_gt_sriov_vf.h"
>  #include "xe_gt_sriov_vf_types.h"
>  #include "xe_guc.h"
> +#include "xe_guc_ct.h"
>  #include "xe_guc_hxg_helpers.h"
>  #include "xe_guc_relay.h"
>  #include "xe_guc_submit.h"
> @@ -743,6 +744,9 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
>  	    !gt->sriov.vf.migration.recovery_teardown) {
>  		gt->sriov.vf.migration.recovery_queued = true;
>  		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> +		smp_wmb();	/* Ensure above write visible before wake */
> +
> +		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
>  
>  		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
>  		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index d6c81325a76c..ca0ec938edac 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -72,4 +72,13 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
>  
>  long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
>  
> +/**
> + * xe_guc_ct_wake_waiters() - GuC CT wake up waiters
> + * @ct: GuC CT object
> + */
> +static inline void xe_guc_ct_wake_waiters(struct xe_guc_ct *ct)
> +{
> +	wake_up_all(&ct->wq);
> +}
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 59371b7cc8a4..b2ca4911efe9 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -27,7 +27,6 @@
>  #include "xe_gt.h"
>  #include "xe_gt_clock.h"
>  #include "xe_gt_printk.h"
> -#include "xe_gt_sriov_vf.h"
>  #include "xe_guc.h"
>  #include "xe_guc_capture.h"
>  #include "xe_guc_ct.h"
> @@ -702,6 +701,11 @@ static u32 wq_space_until_wrap(struct xe_exec_queue *q)
>  	return (WQ_SIZE - q->guc->wqi_tail);
>  }
>  
> +static bool vf_recovery(struct xe_guc *guc)
> +{
> +	return xe_gt_recovery_pending(guc_to_gt(guc));
> +}
> +
>  static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
>  {
>  	struct xe_guc *guc = exec_queue_to_guc(q);
> @@ -711,7 +715,7 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
>  
>  #define AVAILABLE_SPACE \
>  	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
> -	if (wqi_size > AVAILABLE_SPACE) {
> +	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
>  try_again:
>  		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
>  		if (wqi_size > AVAILABLE_SPACE) {
> @@ -910,9 +914,10 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
>  	ret = wait_event_timeout(guc->ct.wq,
>  				 (!exec_queue_pending_enable(q) &&
>  				  !exec_queue_pending_disable(q)) ||
> -					 xe_guc_read_stopped(guc),
> +					 xe_guc_read_stopped(guc) ||
> +					 vf_recovery(guc),
>  				 HZ * 5);
> -	if (!ret) {
> +	if (!ret && !vf_recovery(guc)) {
>  		struct xe_gpu_scheduler *sched = &q->guc->sched;
>  
>  		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
> @@ -1015,6 +1020,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>  	bool wedged = false;
>  
>  	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
> +
> +	if (vf_recovery(guc))
> +		return;
> +
>  	trace_xe_exec_queue_lr_cleanup(q);
>  
>  	if (!exec_queue_killed(q))
> @@ -1047,7 +1056,11 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>  		 */
>  		ret = wait_event_timeout(guc->ct.wq,
>  					 !exec_queue_pending_disable(q) ||
> -					 xe_guc_read_stopped(guc), HZ * 5);
> +					 xe_guc_read_stopped(guc) ||
> +					 vf_recovery(guc), HZ * 5);
> +		if (vf_recovery(guc))
> +			return;
> +
>  		if (!ret) {
>  			xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
>  				   q->guc->id);
> @@ -1137,8 +1150,9 @@ static void enable_scheduling(struct xe_exec_queue *q)
>  
>  	ret = wait_event_timeout(guc->ct.wq,
>  				 !exec_queue_pending_enable(q) ||
> -				 xe_guc_read_stopped(guc), HZ * 5);
> -	if (!ret || xe_guc_read_stopped(guc)) {
> +				 xe_guc_read_stopped(guc) ||
> +				 vf_recovery(guc), HZ * 5);
> +	if ((!ret && !vf_recovery(guc)) || xe_guc_read_stopped(guc)) {
>  		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
>  		set_exec_queue_banned(q);
>  		xe_gt_reset_async(q->gt);
> @@ -1209,7 +1223,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  	 * list so job can be freed and kick scheduler ensuring free job is not
>  	 * lost.
>  	 */
> -	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
> +	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags) ||
> +	    vf_recovery(guc))
>  		return DRM_GPU_SCHED_STAT_NO_HANG;
>  
>  	/* Kill the run_job entry point */
> @@ -1261,7 +1276,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  			ret = wait_event_timeout(guc->ct.wq,
>  						 (!exec_queue_pending_enable(q) &&
>  						  !exec_queue_pending_disable(q)) ||
> -						 xe_guc_read_stopped(guc), HZ * 5);
> +						 xe_guc_read_stopped(guc) ||
> +						 vf_recovery(guc), HZ * 5);
> +			if (vf_recovery(guc))
> +				goto handle_vf_resume;
>  			if (!ret || xe_guc_read_stopped(guc))
>  				goto trigger_reset;
>  
> @@ -1286,7 +1304,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  		smp_rmb();
>  		ret = wait_event_timeout(guc->ct.wq,
>  					 !exec_queue_pending_disable(q) ||
> -					 xe_guc_read_stopped(guc), HZ * 5);
> +					 xe_guc_read_stopped(guc) ||
> +					 vf_recovery(guc), HZ * 5);
> +		if (vf_recovery(guc))
> +			goto handle_vf_resume;
>  		if (!ret || xe_guc_read_stopped(guc)) {
>  trigger_reset:
>  			if (!ret)
> @@ -1391,6 +1412,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  	 * some thought, do this in a follow up.
>  	 */
>  	xe_sched_submission_start(sched);
> +handle_vf_resume:
>  	return DRM_GPU_SCHED_STAT_NO_HANG;
>  }
>  
> @@ -1487,11 +1509,17 @@ static void __guc_exec_queue_process_msg_set_sched_props(struct xe_sched_msg *ms
>  
>  static void __suspend_fence_signal(struct xe_exec_queue *q)
>  {
> +	struct xe_guc *guc = exec_queue_to_guc(q);
> +	struct xe_device *xe = guc_to_xe(guc);
> +
>  	if (!q->guc->suspend_pending)
>  		return;
>  
>  	WRITE_ONCE(q->guc->suspend_pending, false);
> -	wake_up(&q->guc->suspend_wait);
> +	if (IS_SRIOV_VF(xe))
> +		wake_up_all(&guc->ct.wq);

maybe xe_guc_ct_wake_waiters() ?

and I guess a small in-source comment explaining why we differentiate between the VF and !VF cases would be beneficial

> +	else
> +		wake_up(&q->guc->suspend_wait);
>  }
>  
>  static void suspend_fence_signal(struct xe_exec_queue *q)
> @@ -1512,8 +1540,9 @@ static void __guc_exec_queue_process_msg_suspend(struct xe_sched_msg *msg)
>  
>  	if (guc_exec_queue_allowed_to_change_state(q) && !exec_queue_suspended(q) &&
>  	    exec_queue_enabled(q)) {
> -		wait_event(guc->ct.wq, (q->guc->resume_time != RESUME_PENDING ||
> -			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q));
> +		wait_event(guc->ct.wq, vf_recovery(guc) ||
> +			   ((q->guc->resume_time != RESUME_PENDING ||
> +			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q)));
>  
>  		if (!xe_guc_read_stopped(guc)) {
>  			s64 since_resume_ms =
> @@ -1640,7 +1669,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
>  
>  	q->entity = &ge->entity;
>  
> -	if (xe_guc_read_stopped(guc))
> +	if (xe_guc_read_stopped(guc) || vf_recovery(guc))
>  		xe_sched_stop(sched);
>  
>  	mutex_unlock(&guc->submission_state.lock);
> @@ -1786,6 +1815,7 @@ static int guc_exec_queue_suspend(struct xe_exec_queue *q)
>  static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
>  {
>  	struct xe_guc *guc = exec_queue_to_guc(q);
> +	struct xe_device *xe = guc_to_xe(guc);
>  	int ret;
>  
>  	/*
> @@ -1793,11 +1823,22 @@ static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
>  	 * suspend_pending upon kill but to be paranoid but races in which
>  	 * suspend_pending is set after kill also check kill here.
>  	 */
> -	ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> -					       !READ_ONCE(q->guc->suspend_pending) ||
> -					       exec_queue_killed(q) ||
> -					       xe_guc_read_stopped(guc),
> -					       HZ * 5);
> +	if (IS_SRIOV_VF(xe))
> +		ret = wait_event_interruptible_timeout(guc->ct.wq,
> +						       !READ_ONCE(q->guc->suspend_pending) ||
> +						       exec_queue_killed(q) ||
> +						       xe_guc_read_stopped(guc) ||
> +						       vf_recovery(guc),
> +						       HZ * 5);
> +	else
> +		ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> +						       !READ_ONCE(q->guc->suspend_pending) ||
> +						       exec_queue_killed(q) ||
> +						       xe_guc_read_stopped(guc),
> +						       HZ * 5);

nit: maybe both magic 5sec timeouts deserve some comment?
> +
> +	if (vf_recovery(guc) && !xe_device_wedged((guc_to_xe(guc))))
> +		return -EAGAIN;
>  
>  	if (!ret) {
>  		xe_gt_warn(guc_to_gt(guc),
> @@ -1905,8 +1946,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>  {
>  	int ret;
>  
> -	if (xe_gt_WARN_ON(guc_to_gt(guc),
> -			  xe_gt_sriov_vf_recovery_pending(guc_to_gt(guc))))
> +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>  		return 0;
>  
>  	if (!guc->submission_state.initialized)
> diff --git a/drivers/gpu/drm/xe/xe_preempt_fence.c b/drivers/gpu/drm/xe/xe_preempt_fence.c
> index 83fbeea5aa20..7f587ca3947d 100644
> --- a/drivers/gpu/drm/xe/xe_preempt_fence.c
> +++ b/drivers/gpu/drm/xe/xe_preempt_fence.c
> @@ -8,6 +8,8 @@
>  #include <linux/slab.h>
>  
>  #include "xe_exec_queue.h"
> +#include "xe_gt_printk.h"
> +#include "xe_guc_exec_queue_types.h"
>  #include "xe_vm.h"
>  
>  static void preempt_fence_work_func(struct work_struct *w)
> @@ -22,6 +24,15 @@ static void preempt_fence_work_func(struct work_struct *w)
>  	} else if (!q->ops->reset_status(q)) {
>  		int err = q->ops->suspend_wait(q);
>  
> +		if (err == -EAGAIN) {
> +			xe_gt_dbg(q->gt, "PREEMPT FENCE RETRY guc_id=%d",
> +				  q->guc->id);
> +			queue_work(q->vm->xe->preempt_fence_wq,
> +				   &pfence->preempt_work);
> +			dma_fence_end_signalling(cookie);
> +			return;
> +		}
> +
>  		if (err)
>  			dma_fence_set_error(&pfence->base, err);
>  	} else {

just a few suggestions, but overall LGTM, trusting you (and CI) that it works, so

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
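
For readers following the -EAGAIN path, here is a minimal user-space sketch of the retry flow discussed above: suspend_wait bails out with -EAGAIN while VF recovery is pending, and the preempt fence worker re-queues itself instead of signalling. All names (model_queue, model_suspend_wait, model_preempt_work) are hypothetical stand-ins, not xe functions, and the wedged-device check from the patch is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for driver state; these names do not exist in xe. */
struct model_queue {
	bool recovery_pending;	/* VF post-migration recovery in progress */
	int requeues;		/* how many times the worker re-armed itself */
};

/*
 * Models the tail of suspend_wait(): recovery is checked before the timeout,
 * so an interrupted wait asks the caller to retry instead of reporting an
 * error. Returns 0 on success, -EAGAIN on recovery, -ETIME on timeout.
 */
static int model_suspend_wait(const struct model_queue *q, bool timed_out)
{
	if (q->recovery_pending)
		return -EAGAIN;
	if (timed_out)
		return -ETIME;
	return 0;
}

/*
 * Models the preempt fence worker: returns true when the fence may signal,
 * false when the work item was re-queued to try again after recovery.
 */
static bool model_preempt_work(struct model_queue *q, bool timed_out,
			       int *fence_error)
{
	int err = model_suspend_wait(q, timed_out);

	if (err == -EAGAIN) {
		q->requeues++;	/* stands in for queue_work() */
		return false;	/* fence stays unsignalled until the retry */
	}

	*fence_error = err;
	return true;
}
```

The point of the sketch is only the ordering: recovery wins over the timeout, so a wait that was cut short by migration never turns into a fence error.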




* Re: [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  2025-10-06 11:10 ` [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
@ 2025-10-06 14:51   ` Michal Wajdeczko
  2025-10-06 16:02     ` Matthew Brost
  2025-10-06 22:21   ` Lis, Tomasz
  1 sibling, 1 reply; 58+ messages in thread
From: Michal Wajdeczko @ 2025-10-06 14:51 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 10/6/2025 1:10 PM, Matthew Brost wrote:
> The only case where the GuC submission backend cannot reason 100%
> correctly is when a GuC context is registered during VF post-migration
> recovery. In this scenario, it's possible that the GuC context register
> H2G is processed, but the immediately following schedule-enable H2G gets
> lost.

hmm, isn't that the other way around ?

We should know whether schedule-enable was processed or not, since we should receive corresponding DONE message (or not).
But if it wasn't processed, then we do not know whether context registration was processed or not (as we look only at G2H).

and the schedule-enable H2G is not "lost" by GuC but rather it will be made void by our explicit CTB clearance
> 
> A double register is harmless when using `GUC_HXG_TYPE_EVENT`, as GuC
> simply drops the duplicate H2G. To keep things simple, use
> `GUC_HXG_TYPE_EVENT` for all context registrations on VFs.
> 
> v5:
>  - Check for xe_sriov_vf_migration_supported (Tomasz)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_ct.c | 33 +++++++++++++++++++++++++--------
>  1 file changed, 25 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 9f0090ae64a6..3ac654cebc79 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -32,6 +32,7 @@
>  #include "xe_guc_tlb_inval.h"
>  #include "xe_map.h"
>  #include "xe_pm.h"
> +#include "xe_sriov_vf.h"
>  #include "xe_trace_guc.h"
>  
>  static void receive_g2h(struct xe_guc_ct *ct);
> @@ -736,6 +737,26 @@ static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
>  	return seqno;
>  }
>  
> +#define MAKE_ACTION(type, __action)				\
> +({								\
> +	FIELD_PREP(GUC_HXG_MSG_0_TYPE, type) |			\
> +	FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |			\
> +		   GUC_HXG_EVENT_MSG_0_DATA0, __action);	\
> +})
> +
> +static bool vf_action_can_safely_fail(struct xe_device *xe, u32 action)
> +{
> +	/*
> +	 * If we are VF resuming, we can't exactly track if a context
> +	 * registration has been completed in the GuC state machine, it is
> +	 * harmless to resend as it will just fail silently if
> +	 * GUC_HXG_TYPE_EVENT is used.
> +	 */
> +	return IS_SRIOV_VF(xe) && xe_sriov_vf_migration_supported(xe) &&
> +		(action == XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC ||
> +		 action == XE_GUC_ACTION_REGISTER_CONTEXT);
> +}

not a big fan of this hack, but since it may work and speed our goal,
with re-checked/reworded commit message,

	Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>

> +
>  #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one DW HxG header */
>  
>  static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
> @@ -807,18 +828,14 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
>  		FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len) |
>  		FIELD_PREP(GUC_CTB_MSG_0_FENCE, ct_fence_value);
>  	if (want_response) {
> -		cmd[1] =
> -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_REQUEST, action[0]);
> +	} else if (vf_action_can_safely_fail(xe, action[0])) {
> +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_EVENT, action[0]);
>  	} else {
>  		fast_req_track(ct, ct_fence_value,
>  			       FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
>  
> -		cmd[1] =
> -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_FAST_REQUEST) |
> -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_FAST_REQUEST, action[0]);
>  	}
>  
>  	/* H2G header in cmd[1] replaces action[0] so: */
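
The header selection in the hunk above can be modelled in plain C as below. This is a sketch under assumptions: the type values, bit positions, and action ids are illustrative placeholders rather than values taken from the GuC ABI headers, and h2g_header() and header_type() are hypothetical helpers that only mirror the three-way branch in h2g_write().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real ones live in the GuC ABI headers. */
enum hxg_type { HXG_REQUEST = 0, HXG_EVENT = 1, HXG_FAST_REQUEST = 2 };

#define HXG_TYPE_SHIFT	28
#define HXG_TYPE_MASK	0x7u
#define HXG_ACTION_MASK	0x0fffffffu	/* action and data0 fields combined */

#define ACT_REGISTER_CONTEXT		0x4502u	/* placeholder id */
#define ACT_REGISTER_CONTEXT_MULTI_LRC	0x4601u	/* placeholder id */

/* Pack a message type and an action word into one header dword. */
static uint32_t make_action(enum hxg_type type, uint32_t action)
{
	return (((uint32_t)type & HXG_TYPE_MASK) << HXG_TYPE_SHIFT) |
	       (action & HXG_ACTION_MASK);
}

/* Only context registration on a migration-capable VF may fail silently. */
static bool vf_action_can_safely_fail(bool is_vf, bool migration_supported,
				      uint32_t action)
{
	return is_vf && migration_supported &&
	       (action == ACT_REGISTER_CONTEXT ||
		action == ACT_REGISTER_CONTEXT_MULTI_LRC);
}

/* Mirrors the branch order in the patch: REQUEST, then EVENT, then FAST_REQUEST. */
static uint32_t h2g_header(bool want_response, bool is_vf,
			   bool migration_supported, uint32_t action)
{
	if (want_response)
		return make_action(HXG_REQUEST, action);
	if (vf_action_can_safely_fail(is_vf, migration_supported, action))
		return make_action(HXG_EVENT, action);
	return make_action(HXG_FAST_REQUEST, action);
}

/* Extract the type field back out of a header dword. */
static enum hxg_type header_type(uint32_t hdr)
{
	return (enum hxg_type)((hdr >> HXG_TYPE_SHIFT) & HXG_TYPE_MASK);
}
```

Note that in the patch the EVENT branch also skips fast_req_track(); the sketch does not model that side effect.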



* Re: [PATCH v6 11/30] drm/xe/vf: Close multi-GT GGTT shift race
  2025-10-06 14:27   ` Michal Wajdeczko
@ 2025-10-06 14:56     ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 14:56 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Oct 06, 2025 at 04:27:36PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 10/6/2025 1:10 PM, Matthew Brost wrote:
> > As multi-GT VF post-migration recovery can run in parallel on different
> > workqueues, but both GTs point to the same GGTT, only one GT needs to
> > shift the GGTT. However, both GTs need to know when this step has
> > completed. To coordinate this, perform the GGTT shift under the GGTT
> > lock. With shift being done under the lock, storing the shift value
> > becomes unnecessary.
> > 
> > v3:
> > >  - Update commit message (Tomasz)
> > v4:
> >  - Move GGTT values to tile state (Michal)
> >  - Use GGTT lock (Michal)
> > v5:
> >  - Only take GGTT lock during recovery (CI)
> >  - Drop goto in vf_get_submission_cfg (Michal)
> >  - Add kernel doc around recovery in xe_gt_sriov_vf_query_config (Michal)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device_types.h        |   3 +
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c         | 153 +++++++-------------
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.h         |   5 +-
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h   |   7 +-
> >  drivers/gpu/drm/xe/xe_guc.c                 |   2 +-
> >  drivers/gpu/drm/xe/xe_tile_sriov_vf.c       |  30 +++-
> >  drivers/gpu/drm/xe/xe_tile_sriov_vf.h       |   2 +-
> >  drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h |  23 +++
> >  drivers/gpu/drm/xe/xe_vram.c                |   6 +-
> >  9 files changed, 112 insertions(+), 119 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > index 1d2718b70a5c..c66523bf4bf0 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -27,6 +27,7 @@
> >  #include "xe_sriov_vf_ccs_types.h"
> >  #include "xe_step_types.h"
> >  #include "xe_survivability_mode_types.h"
> > +#include "xe_tile_sriov_vf_types.h"
> >  #include "xe_validation.h"
> >  
> >  #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > @@ -193,6 +194,8 @@ struct xe_tile {
> >  		struct {
> >  			/** @sriov.vf.ggtt_balloon: GGTT regions excluded from use. */
> >  			struct xe_ggtt_node *ggtt_balloon[2];
> > +			/** @sriov.vf.self_config: VF configuration data */
> > +			struct xe_tile_sriov_vf_selfconfig self_config;
> >  		} vf;
> >  	} sriov;
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 55a1ebbbf47f..d227c8a3ec81 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -436,42 +436,65 @@ u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt)
> >  	return value;
> >  }
> >  
> > -static int vf_get_ggtt_info(struct xe_gt *gt)
> > +static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
> >  {
> > -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	struct xe_tile_sriov_vf_selfconfig *config =
> > +		&gt_to_tile(gt)->sriov.vf.self_config;
> 
> maybe
> 	struct xe_tile *tile = gt_to_tile(gt);
> 	struct xe_tile_sriov_vf_selfconfig *config = &tile->sriov.vf.self_config;
> 
> to avoid line split
> 
> > +	struct xe_ggtt *ggtt = gt_to_tile(gt)->mem.ggtt;
> 
> then
> 	struct xe_ggtt *ggtt = tile->mem.ggtt;
> 

Ok.

> >  	struct xe_guc *guc = &gt->uc.guc;
> >  	u64 start, size;
> > +	s64 shift;
> >  	int err;
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > +	/*
> > +	 * We only only take the GGTT lock when potentially shifting GGTTs to
> > +	 * make this step visable to all GTs which share a GGTT. Also the GGTT
> > +	 * lock is not initialized during xe_gt_init_early when this function
> > +	 * can also be called.
> 
> hmm, the real fix should be that the GGTT lock is initialized right after the GGTT is allocated
> it looks like the split between GGTT alloc() and __init_early() was not ideal
> 
> note that while an almost similar pattern was used for the tile, in xe_tile_init_early() the pcode mutex is initialized
> 
> alternatively we can change the VF to not perform a full query during early bootstrap, as it is only looking for the GMDID
> 

I looked at that but the GGTT init early relies on GT W/A being applied here:

        if (GRAPHICS_VERx100(xe) >= 1270)
                ggtt->pt_ops = (ggtt->tile->media_gt &&
                               XE_GT_WA(ggtt->tile->media_gt, 22019338487)) ||
                               XE_GT_WA(ggtt->tile->primary_gt, 22019338487) ?
                               &xelpg_pt_wa_ops : &xelpg_pt_ops;
        else
                ggtt->pt_ops = &xelp_pt_ops;

GT W/A are applied in GT init early, thus we have a circular dependency.

But I think moving the lock init to xe_gt_alloc() should work.

> > +	 */
> > +	if (recovery)
> > +		mutex_lock(&ggtt->lock);
> 
> then we could use
> 
> 	guard(mutex)(&ggtt->lock)
> 
> > +
> >  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_START_KEY, &start);
> >  	if (unlikely(err))
> > -		return err;
> > +		goto out;
> >  
> >  	err = guc_action_query_single_klv64(guc, GUC_KLV_VF_CFG_GGTT_SIZE_KEY, &size);
> >  	if (unlikely(err))
> > -		return err;
> > +		goto out;
> >  
> >  	if (config->ggtt_size && config->ggtt_size != size) {
> >  		xe_gt_sriov_err(gt, "Unexpected GGTT reassignment: %lluK != %lluK\n",
> >  				size / SZ_1K, config->ggtt_size / SZ_1K);
> > -		return -EREMCHG;
> > +		err = -EREMCHG;
> > +		goto out;
> >  	}
> >  
> >  	xe_gt_sriov_dbg_verbose(gt, "GGTT %#llx-%#llx = %lluK\n",
> >  				start, start + size - 1, size / SZ_1K);
> >  
> > -	config->ggtt_shift = start - (s64)config->ggtt_base;
> > +	shift = start - (s64)config->ggtt_base;
> >  	config->ggtt_base = start;
> >  	config->ggtt_size = size;
> > +	err = config->ggtt_size ? 0 : -ENODATA;
> >  
> > -	return config->ggtt_size ? 0 : -ENODATA;
> > +	if (!err && shift && recovery) {
> 
> maybe "recovery" is not needed:
> 
> 	if (!err && shift && shift != start)
> 

Do you mean remove the recovery argument altogether?

That might work...

> > +		xe_gt_sriov_info(gt, "Shifting GGTT base by %lld to 0x%016llx\n",
> > +				 shift, config->ggtt_base);
> > +		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> > +	}
> > +out:
> > +	if (recovery)
> > +		mutex_unlock(&ggtt->lock);
> > +	return err;
> >  }
> >  
> >  static int vf_get_lmem_info(struct xe_gt *gt)
> >  {
> > -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	struct xe_tile_sriov_vf_selfconfig *config =
> > +		&gt_to_tile(gt)->sriov.vf.self_config;
> >  	struct xe_guc *guc = &gt->uc.guc;
> >  	char size_str[10];
> >  	u64 size;
> > @@ -544,17 +567,20 @@ static void vf_cache_gmdid(struct xe_gt *gt)
> >  /**
> >   * xe_gt_sriov_vf_query_config - Query SR-IOV config data over MMIO.
> >   * @gt: the &xe_gt
> > + * @recovery: VF post migration recovery path
> >   *
> > - * This function is for VF use only.
> > + * This function is for VF use only. If recovery is set, the GGTT shift will be
> > > + * performed under GGTT lock making this step visible to all GTs which share a
> > + * GGTT.
> 
> hmm, the question is: why can't the GGTT query be done under the lock even without 'recovery'?
> 

See above, I think we can fix that one.

> >   *
> >   * Return: 0 on success or a negative error code on failure.
> >   */
> > -int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
> > +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery)
> >  {
> >  	struct xe_device *xe = gt_to_xe(gt);
> >  	int err;
> >  
> > -	err = vf_get_ggtt_info(gt);
> > +	err = vf_get_ggtt_info(gt, recovery);
> >  	if (unlikely(err))
> >  		return err;
> >  
> > @@ -584,80 +610,16 @@ int xe_gt_sriov_vf_query_config(struct xe_gt *gt)
> >   */
> >  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt)
> >  {
> > -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.num_ctxs);
> > -
> > -	return gt->sriov.vf.self_config.num_ctxs;
> > -}
> > -
> > -/**
> > - * xe_gt_sriov_vf_lmem - VF LMEM configuration.
> > - * @gt: the &xe_gt
> > - *
> > - * This function is for VF use only.
> > - *
> > - * Return: size of the LMEM assigned to VF.
> > - */
> > -u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt)
> > -{
> > -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.lmem_size);
> > -
> > -	return gt->sriov.vf.self_config.lmem_size;
> > -}
> > -
> > -/**
> > - * xe_gt_sriov_vf_ggtt - VF GGTT configuration.
> > - * @gt: the &xe_gt
> > - *
> > - * This function is for VF use only.
> > - *
> > - * Return: size of the GGTT assigned to VF.
> > - */
> > -u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt)
> > -{
> > -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > -	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
> > -
> > -	return gt->sriov.vf.self_config.ggtt_size;
> > -}
> > +	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	u16 val;
> >  
> > -/**
> > - * xe_gt_sriov_vf_ggtt_base - VF GGTT base offset.
> > - * @gt: the &xe_gt
> > - *
> > - * This function is for VF use only.
> > - *
> > - * Return: base offset of the GGTT assigned to VF.
> > - */
> > -u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt)
> > -{
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  	xe_gt_assert(gt, gt->sriov.vf.guc_version.major);
> > -	xe_gt_assert(gt, gt->sriov.vf.self_config.ggtt_size);
> > -
> > -	return gt->sriov.vf.self_config.ggtt_base;
> > -}
> >  
> > -/**
> > - * xe_gt_sriov_vf_ggtt_shift - Return shift in GGTT range due to VF migration
> > - * @gt: the &xe_gt struct instance
> > - *
> > - * This function is for VF use only.
> > - *
> > - * Return: The shift value; could be negative
> > - */
> > -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt)
> > -{
> > -	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> > +	xe_gt_assert(gt, config->num_ctxs);
> > +	val = config->num_ctxs;
> >  
> > -	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > -	xe_gt_assert(gt, xe_gt_is_main_type(gt));
> > -
> > -	return config->ggtt_shift;
> > +	return val;
> >  }
> >  
> >  static int relay_action_handshake(struct xe_gt *gt, u32 *major, u32 *minor)
> > @@ -1057,6 +1019,8 @@ void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val)
> >   */
> >  void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
> >  {
> > +	struct xe_tile_sriov_vf_selfconfig *tconfig =
> > +		&gt_to_tile(gt)->sriov.vf.self_config;
> >  	struct xe_gt_sriov_vf_selfconfig *config = &gt->sriov.vf.self_config;
> >  	struct xe_device *xe = gt_to_xe(gt);
> >  	char buf[10];
> > @@ -1064,17 +1028,15 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p)
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> >  	drm_printf(p, "GGTT range:\t%#llx-%#llx\n",
> > -		   config->ggtt_base,
> > -		   config->ggtt_base + config->ggtt_size - 1);
> > -
> > -	string_get_size(config->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> > -	drm_printf(p, "GGTT size:\t%llu (%s)\n", config->ggtt_size, buf);
> > +		   tconfig->ggtt_base,
> > +		   tconfig->ggtt_base + tconfig->ggtt_size - 1);
> >  
> > -	drm_printf(p, "GGTT shift on last restore:\t%lld\n", config->ggtt_shift);
> > +	string_get_size(tconfig->ggtt_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> > +	drm_printf(p, "GGTT size:\t%llu (%s)\n", tconfig->ggtt_size, buf);
> >  
> >  	if (IS_DGFX(xe) && xe_gt_is_main_type(gt)) {
> > -		string_get_size(config->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> > -		drm_printf(p, "LMEM size:\t%llu (%s)\n", config->lmem_size, buf);
> > +		string_get_size(tconfig->lmem_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> > +		drm_printf(p, "LMEM size:\t%llu (%s)\n", tconfig->lmem_size, buf);
> >  	}
> >  
> >  	drm_printf(p, "GuC contexts:\t%u\n", config->num_ctxs);
> > @@ -1161,21 +1123,16 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
> >  static int vf_post_migration_fixups(struct xe_gt *gt)
> >  {
> >  	void *buf = gt->sriov.vf.migration.scratch;
> > -	s64 shift;
> >  	int err;
> >  
> > -	err = xe_gt_sriov_vf_query_config(gt);
> > +	err = xe_gt_sriov_vf_query_config(gt, true);
> >  	if (err)
> >  		return err;
> >  
> > -	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> > -	if (shift) {
> > -		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> > -		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> > -		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> > -		if (err)
> > -			return err;
> > -	}
> > +	xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> > +	err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> > +	if (err)
> > +		return err;
> >  
> >  	return 0;
> >  }
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > index 0adebf8aa419..47ed8d513571 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > @@ -18,7 +18,7 @@ int xe_gt_sriov_vf_bootstrap(struct xe_gt *gt);
> >  void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> >  				 struct xe_uc_fw_version *wanted,
> >  				 struct xe_uc_fw_version *found);
> > -int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
> > +int xe_gt_sriov_vf_query_config(struct xe_gt *gt, bool recovery);
> >  int xe_gt_sriov_vf_connect(struct xe_gt *gt);
> >  int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
> >  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
> > @@ -29,9 +29,6 @@ bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
> >  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
> >  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
> >  u64 xe_gt_sriov_vf_lmem(struct xe_gt *gt);
> > -u64 xe_gt_sriov_vf_ggtt(struct xe_gt *gt);
> > -u64 xe_gt_sriov_vf_ggtt_base(struct xe_gt *gt);
> > -s64 xe_gt_sriov_vf_ggtt_shift(struct xe_gt *gt);
> >  
> >  u32 xe_gt_sriov_vf_read32(struct xe_gt *gt, struct xe_reg reg);
> >  void xe_gt_sriov_vf_write32(struct xe_gt *gt, struct xe_reg reg, u32 val);
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index e753646debc4..1796d4caf62f 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -6,6 +6,7 @@
> >  #ifndef _XE_GT_SRIOV_VF_TYPES_H_
> >  #define _XE_GT_SRIOV_VF_TYPES_H_
> >  
> > +#include <linux/rwsem.h>
> >  #include <linux/types.h>
> >  #include <linux/workqueue.h>
> >  #include "xe_uc_fw_types.h"
> > @@ -14,12 +15,6 @@
> >   * struct xe_gt_sriov_vf_selfconfig - VF configuration data.
> >   */
> >  struct xe_gt_sriov_vf_selfconfig {
> > -	/** @ggtt_base: assigned base offset of the GGTT region. */
> > -	u64 ggtt_base;
> > -	/** @ggtt_size: assigned size of the GGTT region. */
> > -	u64 ggtt_size;
> > -	/** @ggtt_shift: difference in ggtt_base on last migration */
> > -	s64 ggtt_shift;
> >  	/** @lmem_size: assigned size of the LMEM. */
> >  	u64 lmem_size;
> >  	/** @num_ctxs: assigned number of GuC submission context IDs. */
> > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > index d5adbbb013ec..c016a11b6ab1 100644
> > --- a/drivers/gpu/drm/xe/xe_guc.c
> > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > @@ -713,7 +713,7 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
> >  	if (err)
> >  		return err;
> >  
> > -	err = xe_gt_sriov_vf_query_config(gt);
> > +	err = xe_gt_sriov_vf_query_config(gt, false);
> >  	if (err)
> >  		return err;
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> > index f221dbed16f0..074981e2ef07 100644
> > --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.c
> > @@ -9,7 +9,6 @@
> >  
> >  #include "xe_assert.h"
> >  #include "xe_ggtt.h"
> > -#include "xe_gt_sriov_vf.h"
> >  #include "xe_sriov.h"
> >  #include "xe_sriov_printk.h"
> >  #include "xe_tile_sriov_vf.h"
> > @@ -40,10 +39,10 @@ static int vf_init_ggtt_balloons(struct xe_tile *tile)
> >   *
> >   * Return: 0 on success or a negative error code on failure.
> >   */
> > -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
> > +static int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile)
> >  {
> > -	u64 ggtt_base = xe_gt_sriov_vf_ggtt_base(tile->primary_gt);
> > -	u64 ggtt_size = xe_gt_sriov_vf_ggtt(tile->primary_gt);
> > +	u64 ggtt_base = tile->sriov.vf.self_config.ggtt_base;
> > +	u64 ggtt_size = tile->sriov.vf.self_config.ggtt_size;
> >  	struct xe_device *xe = tile_to_xe(tile);
> >  	u64 wopcm = xe_wopcm_size(xe);
> >  	u64 start, end;
> > @@ -244,11 +243,30 @@ void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift)
> 
> what about following the naming style of a _locked suffix in the function name if it expects to already be protected?

Sure.

Matt

> >  {
> >  	struct xe_ggtt *ggtt = tile->mem.ggtt;
> >  
> > -	mutex_lock(&ggtt->lock);
> > +	lockdep_assert_held(&ggtt->lock);
> >  
> >  	xe_tile_sriov_vf_deballoon_ggtt_locked(tile);
> >  	xe_ggtt_shift_nodes_locked(ggtt, shift);
> >  	xe_tile_sriov_vf_balloon_ggtt_locked(tile);
> > +}
> >  
> > -	mutex_unlock(&ggtt->lock);
> > +/**
> > + * xe_tile_sriov_vf_lmem - VF LMEM configuration.
> > + * @tile: the &xe_tile
> > + *
> > + * This function is for VF use only.
> > + *
> > + * Return: size of the LMEM assigned to VF.
> > + */
> > +u64 xe_tile_sriov_vf_lmem(struct xe_tile *tile)
> > +{
> > +	struct xe_tile_sriov_vf_selfconfig *config = &tile->sriov.vf.self_config;
> > +	u64 val;
> > +
> > +	xe_tile_assert(tile, IS_SRIOV_VF(tile_to_xe(tile)));
> > +
> > +	xe_tile_assert(tile, config->lmem_size);
> > +	val = config->lmem_size;
> > +
> > +	return val;
> >  }
> > diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> > index 93eb043171e8..54e7f2a5c4e4 100644
> > --- a/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf.h
> > @@ -11,8 +11,8 @@
> >  struct xe_tile;
> >  
> >  int xe_tile_sriov_vf_prepare_ggtt(struct xe_tile *tile);
> > -int xe_tile_sriov_vf_balloon_ggtt_locked(struct xe_tile *tile);
> >  void xe_tile_sriov_vf_deballoon_ggtt_locked(struct xe_tile *tile);
> >  void xe_tile_sriov_vf_fixup_ggtt_nodes(struct xe_tile *tile, s64 shift);
> > +u64 xe_tile_sriov_vf_lmem(struct xe_tile *tile);
> >  
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
> > new file mode 100644
> > index 000000000000..140717f81d8f
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
> > @@ -0,0 +1,23 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_TILE_SRIOV_VF_TYPES_H_
> > +#define _XE_TILE_SRIOV_VF_TYPES_H_
> > +
> > +#include <linux/mutex.h>
> > +
> > +/**
> > + * struct xe_tile_sriov_vf_selfconfig - VF configuration data.
> > + */
> > +struct xe_tile_sriov_vf_selfconfig {
> > +	/** @ggtt_base: assigned base offset of the GGTT region. */
> > +	u64 ggtt_base;
> > +	/** @ggtt_size: assigned size of the GGTT region. */
> > +	u64 ggtt_size;
> > +	/** @lmem_size: assigned size of the LMEM. */
> > +	u64 lmem_size;
> > +};
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_vram.c b/drivers/gpu/drm/xe/xe_vram.c
> > index 7adfccf68e4c..70bcbb188867 100644
> > --- a/drivers/gpu/drm/xe/xe_vram.c
> > +++ b/drivers/gpu/drm/xe/xe_vram.c
> > @@ -17,10 +17,10 @@
> >  #include "xe_device.h"
> >  #include "xe_force_wake.h"
> >  #include "xe_gt_mcr.h"
> > -#include "xe_gt_sriov_vf.h"
> >  #include "xe_mmio.h"
> >  #include "xe_module.h"
> >  #include "xe_sriov.h"
> > +#include "xe_tile_sriov_vf.h"
> >  #include "xe_ttm_vram_mgr.h"
> >  #include "xe_vram.h"
> >  #include "xe_vram_types.h"
> > @@ -238,9 +238,9 @@ static int tile_vram_size(struct xe_tile *tile, u64 *vram_size,
> >  		offset = 0;
> >  		for_each_tile(t, xe, id)
> >  			for_each_if(t->id < tile->id)
> > -				offset += xe_gt_sriov_vf_lmem(t->primary_gt);
> > +				offset += xe_tile_sriov_vf_lmem(t);
> >  
> > -		*tile_size = xe_gt_sriov_vf_lmem(gt);
> > +		*tile_size = xe_tile_sriov_vf_lmem(tile);
> >  		*vram_size = *tile_size;
> >  		*tile_offset = offset;
> >  
> 
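
To illustrate why storing the shift becomes unnecessary once it is computed and applied under the GGTT lock, here is a small user-space model. It is a sketch under assumptions: model_tile and model_ggtt_recover() are hypothetical names, a pthread mutex stands in for the GGTT lock, and a counter stands in for the node fixup.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical per-tile state shared by every GT on the tile. */
struct model_tile {
	pthread_mutex_t ggtt_lock;
	uint64_t ggtt_base;	/* last base applied to the GGTT nodes */
	int fixups;		/* how many times nodes were actually shifted */
};

/*
 * Called once per GT with the base reported after migration. The shift is
 * derived from the shared base while holding the lock, so whichever GT
 * arrives second computes a shift of zero and skips the fixup.
 * Returns the shift this caller applied.
 */
static int64_t model_ggtt_recover(struct model_tile *tile, uint64_t new_base)
{
	int64_t shift;

	pthread_mutex_lock(&tile->ggtt_lock);
	shift = (int64_t)(new_base - tile->ggtt_base);
	tile->ggtt_base = new_base;
	if (shift)
		tile->fixups++;	/* stands in for the GGTT node fixup */
	pthread_mutex_unlock(&tile->ggtt_lock);

	return shift;
}
```

Because both GTs serialize on the same lock and update the same base, the second caller returns only after the first caller's fixup is complete, which is the visibility property the commit message asks for.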


* Re: [PATCH v6 23/30] drm/xe: Move queue init before LRC creation
  2025-10-06 11:10 ` [PATCH v6 23/30] drm/xe: Move queue init before LRC creation Matthew Brost
@ 2025-10-06 15:22   ` Michal Wajdeczko
  2025-10-06 21:33   ` Lis, Tomasz
  1 sibling, 0 replies; 58+ messages in thread
From: Michal Wajdeczko @ 2025-10-06 15:22 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 10/6/2025 1:10 PM, Matthew Brost wrote:
> A queue must be in the submission backend's tracking state before the
> LRC is created to avoid a race condition where the LRC's GGTT addresses
> are not properly fixed up during VF post-migration recovery.
> 
> Move the queue initialization—which adds the queue to the submission
> backend's tracking state—before LRC creation.
> 
> v2:
>  - Wait on VF GGTT fixes before creating LRC (testing)
> v5:
>  - Adjust comment in code (Tomasz)
>  - Reduce race window
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_exec_queue.c        | 45 ++++++++++++++++++-----
>  drivers/gpu/drm/xe/xe_execlist.c          |  2 +-
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 39 +++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
>  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  5 +++
>  drivers/gpu/drm/xe/xe_guc_submit.c        |  2 +-
>  drivers/gpu/drm/xe/xe_lrc.h               | 10 +++++
>  7 files changed, 92 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 7621089a47fe..90cbc95f8e2e 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -15,6 +15,7 @@
>  #include "xe_dep_scheduler.h"
>  #include "xe_device.h"
>  #include "xe_gt.h"
> +#include "xe_gt_sriov_vf.h"
>  #include "xe_hw_engine_class_sysfs.h"
>  #include "xe_hw_engine_group.h"
>  #include "xe_hw_fence.h"
> @@ -205,17 +206,34 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q, u32 exec_queue_flags)
>  	if (!(exec_queue_flags & EXEC_QUEUE_FLAG_KERNEL))
>  		flags |= XE_LRC_CREATE_USER_CTX;
>  
> +	err = q->ops->init(q);
> +	if (err)
> +		return err;
> +
> +	/*
> +	 * This must occur after q->ops->init to avoid race conditions during VF
> +	 * post-migration recovery, as the fixups for the LRC GGTT addresses
> +	 * depend on the queue being present in the backend tracking structure.
> +	 *
> +	 * In addition to above, we must wait on inflight GGTT changes to avoid
> +	 * writing out stale values here. Such wait provides a solid solution
> +	 * (without a race) only if the function can detect migration instantly
> +	 * from the moment vCPU resumes execution.
> +	 */
>  	for (i = 0; i < q->width; ++i) {
> -		q->lrc[i] = xe_lrc_create(q->hwe, q->vm, SZ_16K, q->msix_vec, flags);
> -		if (IS_ERR(q->lrc[i])) {
> -			err = PTR_ERR(q->lrc[i]);
> +		struct xe_lrc *lrc;
> +
> +		xe_gt_sriov_vf_wait_valid_ggtt(q->gt);
> +		lrc = xe_lrc_create(q->hwe, q->vm, xe_lrc_ring_size(),
> +				    q->msix_vec, flags);
> +		if (IS_ERR(lrc)) {
> +			err = PTR_ERR(lrc);
>  			goto err_lrc;
>  		}
> -	}
>  
> -	err = q->ops->init(q);
> -	if (err)
> -		goto err_lrc;
> +		/* Pairs with READ_ONCE to xe_exec_queue_contexts_hwsp_rebase */
> +		WRITE_ONCE(q->lrc[i], lrc);
> +	}
>  
>  	return 0;
>  
> @@ -1121,9 +1139,16 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
>  	int err = 0;
>  
>  	for (i = 0; i < q->width; ++i) {
> -		xe_lrc_update_memirq_regs_with_address(q->lrc[i], q->hwe, scratch);
> -		xe_lrc_update_hwctx_regs_with_address(q->lrc[i]);
> -		err = xe_lrc_setup_wa_bb_with_scratch(q->lrc[i], q->hwe, scratch);
> +		struct xe_lrc *lrc;
> +
> +		/* Pairs with WRITE_ONCE in __xe_exec_queue_init  */
> +		lrc = READ_ONCE(q->lrc[i]);
> +		if (!lrc)
> +			continue;
> +
> +		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
> +		xe_lrc_update_hwctx_regs_with_address(lrc);
> +		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
>  		if (err)
>  			break;
>  	}
> diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
> index f83d421ac9d3..769d05517f93 100644
> --- a/drivers/gpu/drm/xe/xe_execlist.c
> +++ b/drivers/gpu/drm/xe/xe_execlist.c
> @@ -339,7 +339,7 @@ static int execlist_exec_queue_init(struct xe_exec_queue *q)
>  	const struct drm_sched_init_args args = {
>  		.ops = &drm_sched_ops,
>  		.num_rqs = 1,
> -		.credit_limit = q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES,
> +		.credit_limit = xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES,
>  		.hang_limit = XE_SCHED_HANG_LIMIT,
>  		.timeout = XE_SCHED_JOB_TIMEOUT,
>  		.name = q->hwe->name,
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 8074ffb924ce..bf1806e90370 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -487,6 +487,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
>  				 shift, config->ggtt_base);
>  		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
>  	}
> +
> +	WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
> +	smp_wmb();	/* Ensure above write visible before wake */
> +	wake_up_all(&gt->sriov.vf.migration.wq);
> +
>  out:
>  	if (recovery)
>  		mutex_unlock(&ggtt->lock);
> @@ -745,7 +750,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
>  	    !gt->sriov.vf.migration.recovery_teardown) {
>  		gt->sriov.vf.migration.recovery_queued = true;
>  		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> -		smp_wmb();	/* Ensure above write visable before wake */
> +		WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, true);
> +		smp_wmb();	/* Ensure above writes visable before wake */
>  
>  		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
>  
> @@ -1264,6 +1270,7 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>  	gt->sriov.vf.migration.scratch = buf;
>  	spin_lock_init(&gt->sriov.vf.migration.lock);
>  	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> +	init_waitqueue_head(&gt->sriov.vf.migration.wq);
>  
>  	return 0;
>  }
> @@ -1312,3 +1319,33 @@ bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt)
>  
>  	return READ_ONCE(gt->sriov.vf.migration.recovery_inprogress);
>  }
> +
> +static bool vf_valid_ggtt(struct xe_gt *gt)
> +{
> +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
> +
> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> +	if (xe_memirq_guc_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
> +	    READ_ONCE(gt->sriov.vf.migration.ggtt_need_fixes))
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
> + * @gt: the &xe_gt
> + */
> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
> +{
> +	int ret;
> +
> +	if (!IS_SRIOV_VF(gt_to_xe(gt)))
> +		return;
> +
> +	ret = wait_event_interruptible_timeout(gt->sriov.vf.migration.wq,
> +					       vf_valid_ggtt(gt),
> +					       HZ * 5);
> +	XE_WARN_ON(!ret);

use gt-oriented warn:

	xe_gt_WARN_ON(gt, !ret);

> +}
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index 8c9679414565..63102029d624 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -38,4 +38,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p);
>  void xe_gt_sriov_vf_print_runtime(struct xe_gt *gt, struct drm_printer *p);
>  void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
>  
> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt);
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index c1bd6fdd9ab1..f0bc45a782a4 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -8,6 +8,7 @@
>  
>  #include <linux/rwsem.h>
>  #include <linux/types.h>
> +#include <linux/wait.h>
>  #include <linux/workqueue.h>
>  #include "xe_uc_fw_types.h"
>  
> @@ -50,6 +51,8 @@ struct xe_gt_sriov_vf_migration {
>  	struct work_struct worker;
>  	/** @lock: Protects recovery_queued, teardown */
>  	spinlock_t lock;
> +	/** @wq: wait queue for migration fixes */
> +	wait_queue_head_t wq;
>  	/** @scratch: Scratch memory for VF recovery */
>  	void *scratch;
>  	/** @recovery_teardown: VF post migration recovery is being torn down */
> @@ -58,6 +61,8 @@ struct xe_gt_sriov_vf_migration {
>  	bool recovery_queued;
>  	/** @recovery_inprogress: VF post migration recovery in progress */
>  	bool recovery_inprogress;
> +	/** @ggtt_need_fixes: VF GGTT needs fixes */
> +	bool ggtt_need_fixes;
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 9dbdb0b54c8b..48d5133e76a6 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1663,7 +1663,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
>  	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
>  		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
>  	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
> -			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
> +			    NULL, xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES, 64,
>  			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
>  			    q->name, gt_to_xe(q->gt)->drm.dev);
>  	if (err)
> diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
> index 21a3daab0154..c4a33b135101 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.h
> +++ b/drivers/gpu/drm/xe/xe_lrc.h
> @@ -76,6 +76,16 @@ static inline void xe_lrc_put(struct xe_lrc *lrc)
>  	kref_put(&lrc->refcount, xe_lrc_destroy);
>  }
>  
> +/**
> + * xe_lrc_ring_size() - Xe LRC ring size
> + *
> + * Return: Size of LRC size
> + */
> +static inline size_t xe_lrc_ring_size(void)
> +{
> +	return SZ_16K;
> +}
> +
>  size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class);
>  u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc);
>  u32 xe_lrc_regs_offset(struct xe_lrc *lrc);
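
The ordering this patch relies on (publish each LRC into its slot only once it is fully built, and have the recovery path skip slots that are still empty) can be illustrated outside the kernel. What follows is only a userspace sketch under stated assumptions: `lrc_model`, `queue_model`, `queue_publish_lrc` and `queue_rebase` are hypothetical names, and C11 release/acquire atomics stand in for the patch's WRITE_ONCE/READ_ONCE pairing, which is a stronger ordering than those kernel macros provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_WIDTH 4

/* Hypothetical stand-in for struct xe_lrc: only one GGTT address is modelled. */
struct lrc_model {
	uint64_t ggtt_addr;
};

/* Hypothetical stand-in for struct xe_exec_queue. */
struct queue_model {
	int width;
	_Atomic(struct lrc_model *) lrc[MAX_WIDTH];
};

/* Creation path: publish a fully built LRC into slot i. */
static void queue_publish_lrc(struct queue_model *q, int i,
			      struct lrc_model *lrc)
{
	assert(i >= 0 && i < q->width);
	atomic_store_explicit(&q->lrc[i], lrc, memory_order_release);
}

/*
 * Recovery path: apply the GGTT shift to every published LRC and skip
 * slots that have not been filled yet. Returns the number of LRCs rebased.
 */
static int queue_rebase(struct queue_model *q, int64_t shift)
{
	int rebased = 0;

	for (int i = 0; i < q->width; ++i) {
		struct lrc_model *lrc =
			atomic_load_explicit(&q->lrc[i], memory_order_acquire);

		if (!lrc)
			continue;	/* slot not populated yet */

		lrc->ggtt_addr += (uint64_t)shift;
		rebased++;
	}

	return rebased;
}
```

In this model a queue that is already tracked but has no LRCs yet is simply a no-op for the rebase, which is the property the NULL check in xe_exec_queue_contexts_hwsp_rebase() appears to provide.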


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  2025-10-06 14:35   ` Michal Wajdeczko
@ 2025-10-06 15:54     ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 15:54 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Oct 06, 2025 at 04:35:51PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 10/6/2025 1:10 PM, Matthew Brost wrote:
> > If VF post-migration recovery is in progress, the recovery flow will
> > rebuild all GuC submission state. In this case, exit all waiters to
> > ensure that submission queue scheduling can also be paused. Avoid taking
> > any adverse actions after aborting the wait.
> > 
> > As part of waking up the GuC backend, suspend_wait can now return
> > -EAGAIN indicating the waiter should be retried. If the caller is
> > running on work item, that work item need to be requeued to avoid a
> > deadlock for the work item blocking the VF migration recovery work item.
> > 
> > v3:
> >  - Don't block in preempt fence work queue as this can interfere with VF
> >    post-migration work queue scheduling leading to deadlock (Testing)
> >  - Use xe_gt_recovery_inprogress (Michal)
> > v5:
> >  - Use static function for vf_recovery (Michal)
> >  - Add helper to wake CT waiters (Michal)
> >  - Move some code to following patch (Michal)
> >  - Adjust commit message to explain suspend_wait returning -EAGAIN (Michal)
> >  - Add kernel doc to suspend_wait around returning -EAGAIN
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_exec_queue_types.h |  3 +
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c      |  4 ++
> >  drivers/gpu/drm/xe/xe_guc_ct.h           |  9 +++
> >  drivers/gpu/drm/xe/xe_guc_submit.c       | 82 ++++++++++++++++++------
> >  drivers/gpu/drm/xe/xe_preempt_fence.c    | 11 ++++
> >  5 files changed, 88 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > index 27b76cf9da89..282505fa1377 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > @@ -207,6 +207,9 @@ struct xe_exec_queue_ops {
> >  	 * call after suspend. In dma-fencing path thus must return within a
> >  	 * reasonable amount of time. -ETIME return shall indicate an error
> >  	 * waiting for suspend resulting in associated VM getting killed.
> > +	 * -EAGAIN return indicates the wait should be tried again, if the wait
> > +	 * is within a work item, the work item should be requeued as deadlock
> > +	 * avoidance mechanism.
> >  	 */
> >  	int (*suspend_wait)(struct xe_exec_queue *q);
> >  	/**
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 7057260175f3..7f703336d692 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -23,6 +23,7 @@
> >  #include "xe_gt_sriov_vf.h"
> >  #include "xe_gt_sriov_vf_types.h"
> >  #include "xe_guc.h"
> > +#include "xe_guc_ct.h"
> >  #include "xe_guc_hxg_helpers.h"
> >  #include "xe_guc_relay.h"
> >  #include "xe_guc_submit.h"
> > @@ -743,6 +744,9 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
> >  	    !gt->sriov.vf.migration.recovery_teardown) {
> >  		gt->sriov.vf.migration.recovery_queued = true;
> >  		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> > +		smp_wmb();	/* Ensure above write visable before wake */
> > +
> > +		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
> >  
> >  		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
> >  		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> > index d6c81325a76c..ca0ec938edac 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> > @@ -72,4 +72,13 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
> >  
> >  long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
> >  
> > +/**
> > + * xe_guc_ct_wake_waiters() - GuC CT wake up waiters
> > + * @guc: GuC CT object
> > + */
> > +static inline void xe_guc_ct_wake_waiters(struct xe_guc_ct *ct)
> > +{
> > +	wake_up_all(&ct->wq);
> > +}
> > +
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 59371b7cc8a4..b2ca4911efe9 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -27,7 +27,6 @@
> >  #include "xe_gt.h"
> >  #include "xe_gt_clock.h"
> >  #include "xe_gt_printk.h"
> > -#include "xe_gt_sriov_vf.h"
> >  #include "xe_guc.h"
> >  #include "xe_guc_capture.h"
> >  #include "xe_guc_ct.h"
> > @@ -702,6 +701,11 @@ static u32 wq_space_until_wrap(struct xe_exec_queue *q)
> >  	return (WQ_SIZE - q->guc->wqi_tail);
> >  }
> >  
> > +static bool vf_recovery(struct xe_guc *guc)
> > +{
> > +	return xe_gt_recovery_pending(guc_to_gt(guc));
> > +}
> > +
> >  static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
> >  {
> >  	struct xe_guc *guc = exec_queue_to_guc(q);
> > @@ -711,7 +715,7 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
> >  
> >  #define AVAILABLE_SPACE \
> >  	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
> > -	if (wqi_size > AVAILABLE_SPACE) {
> > +	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
> >  try_again:
> >  		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
> >  		if (wqi_size > AVAILABLE_SPACE) {
> > @@ -910,9 +914,10 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
> >  	ret = wait_event_timeout(guc->ct.wq,
> >  				 (!exec_queue_pending_enable(q) &&
> >  				  !exec_queue_pending_disable(q)) ||
> > -					 xe_guc_read_stopped(guc),
> > +					 xe_guc_read_stopped(guc) ||
> > +					 vf_recovery(guc),
> >  				 HZ * 5);
> > -	if (!ret) {
> > +	if (!ret && !vf_recovery(guc)) {
> >  		struct xe_gpu_scheduler *sched = &q->guc->sched;
> >  
> >  		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
> > @@ -1015,6 +1020,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> >  	bool wedged = false;
> >  
> >  	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
> > +
> > +	if (vf_recovery(guc))
> > +		return;
> > +
> >  	trace_xe_exec_queue_lr_cleanup(q);
> >  
> >  	if (!exec_queue_killed(q))
> > @@ -1047,7 +1056,11 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> >  		 */
> >  		ret = wait_event_timeout(guc->ct.wq,
> >  					 !exec_queue_pending_disable(q) ||
> > -					 xe_guc_read_stopped(guc), HZ * 5);
> > +					 xe_guc_read_stopped(guc) ||
> > +					 vf_recovery(guc), HZ * 5);
> > +		if (vf_recovery(guc))
> > +			return;
> > +
> >  		if (!ret) {
> >  			xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
> >  				   q->guc->id);
> > @@ -1137,8 +1150,9 @@ static void enable_scheduling(struct xe_exec_queue *q)
> >  
> >  	ret = wait_event_timeout(guc->ct.wq,
> >  				 !exec_queue_pending_enable(q) ||
> > -				 xe_guc_read_stopped(guc), HZ * 5);
> > -	if (!ret || xe_guc_read_stopped(guc)) {
> > +				 xe_guc_read_stopped(guc) ||
> > +				 vf_recovery(guc), HZ * 5);
> > +	if ((!ret && !vf_recovery(guc)) || xe_guc_read_stopped(guc)) {
> >  		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
> >  		set_exec_queue_banned(q);
> >  		xe_gt_reset_async(q->gt);
> > @@ -1209,7 +1223,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  	 * list so job can be freed and kick scheduler ensuring free job is not
> >  	 * lost.
> >  	 */
> > -	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
> > +	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags) ||
> > +	    vf_recovery(guc))
> >  		return DRM_GPU_SCHED_STAT_NO_HANG;
> >  
> >  	/* Kill the run_job entry point */
> > @@ -1261,7 +1276,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  			ret = wait_event_timeout(guc->ct.wq,
> >  						 (!exec_queue_pending_enable(q) &&
> >  						  !exec_queue_pending_disable(q)) ||
> > -						 xe_guc_read_stopped(guc), HZ * 5);
> > +						 xe_guc_read_stopped(guc) ||
> > +						 vf_recovery(guc), HZ * 5);
> > +			if (vf_recovery(guc))
> > +				goto handle_vf_resume;
> >  			if (!ret || xe_guc_read_stopped(guc))
> >  				goto trigger_reset;
> >  
> > @@ -1286,7 +1304,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  		smp_rmb();
> >  		ret = wait_event_timeout(guc->ct.wq,
> >  					 !exec_queue_pending_disable(q) ||
> > -					 xe_guc_read_stopped(guc), HZ * 5);
> > +					 xe_guc_read_stopped(guc) ||
> > +					 vf_recovery(guc), HZ * 5);
> > +		if (vf_recovery(guc))
> > +			goto handle_vf_resume;
> >  		if (!ret || xe_guc_read_stopped(guc)) {
> >  trigger_reset:
> >  			if (!ret)
> > @@ -1391,6 +1412,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  	 * some thought, do this in a follow up.
> >  	 */
> >  	xe_sched_submission_start(sched);
> > +handle_vf_resume:
> >  	return DRM_GPU_SCHED_STAT_NO_HANG;
> >  }
> >  
> > @@ -1487,11 +1509,17 @@ static void __guc_exec_queue_process_msg_set_sched_props(struct xe_sched_msg *ms
> >  
> >  static void __suspend_fence_signal(struct xe_exec_queue *q)
> >  {
> > +	struct xe_guc *guc = exec_queue_to_guc(q);
> > +	struct xe_device *xe = guc_to_xe(guc);
> > +
> >  	if (!q->guc->suspend_pending)
> >  		return;
> >  
> >  	WRITE_ONCE(q->guc->suspend_pending, false);
> > -	wake_up(&q->guc->suspend_wait);
> > +	if (IS_SRIOV_VF(xe))
> > +		wake_up_all(&guc->ct.wq);
> 
> maybe xe_guc_ct_wake_waiters() ?
> 

We have roughly 10 other calls of wake_up_all(&guc->ct.wq) elsewhere
that need fixing. I suggest we fix up the entire driver in a follow-on
patch to this series.

> and I guess some small in source comment why we differentiate between VF and !VF case would be beneficial
> 

I've added this.

> > +	else
> > +		wake_up(&q->guc->suspend_wait);
> >  }
> >  
> >  static void suspend_fence_signal(struct xe_exec_queue *q)
> > @@ -1512,8 +1540,9 @@ static void __guc_exec_queue_process_msg_suspend(struct xe_sched_msg *msg)
> >  
> >  	if (guc_exec_queue_allowed_to_change_state(q) && !exec_queue_suspended(q) &&
> >  	    exec_queue_enabled(q)) {
> > -		wait_event(guc->ct.wq, (q->guc->resume_time != RESUME_PENDING ||
> > -			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q));
> > +		wait_event(guc->ct.wq, vf_recovery(guc) ||
> > +			   ((q->guc->resume_time != RESUME_PENDING ||
> > +			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q)));
> >  
> >  		if (!xe_guc_read_stopped(guc)) {
> >  			s64 since_resume_ms =
> > @@ -1640,7 +1669,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
> >  
> >  	q->entity = &ge->entity;
> >  
> > -	if (xe_guc_read_stopped(guc))
> > +	if (xe_guc_read_stopped(guc) || vf_recovery(guc))
> >  		xe_sched_stop(sched);
> >  
> >  	mutex_unlock(&guc->submission_state.lock);
> > @@ -1786,6 +1815,7 @@ static int guc_exec_queue_suspend(struct xe_exec_queue *q)
> >  static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
> >  {
> >  	struct xe_guc *guc = exec_queue_to_guc(q);
> > +	struct xe_device *xe = guc_to_xe(guc);
> >  	int ret;
> >  
> >  	/*
> > @@ -1793,11 +1823,22 @@ static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
> >  	 * suspend_pending upon kill but to be paranoid but races in which
> >  	 * suspend_pending is set after kill also check kill here.
> >  	 */
> > -	ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> > -					       !READ_ONCE(q->guc->suspend_pending) ||
> > -					       exec_queue_killed(q) ||
> > -					       xe_guc_read_stopped(guc),
> > -					       HZ * 5);
> > +	if (IS_SRIOV_VF(xe))
> > +		ret = wait_event_interruptible_timeout(guc->ct.wq,
> > +						       !READ_ONCE(q->guc->suspend_pending) ||
> > +						       exec_queue_killed(q) ||
> > +						       xe_guc_read_stopped(guc) ||
> > +						       vf_recovery(guc),
> > +						       HZ * 5);
> > +	else
> > +		ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> > +						       !READ_ONCE(q->guc->suspend_pending) ||
> > +						       exec_queue_killed(q) ||
> > +						       xe_guc_read_stopped(guc),
> > +						       HZ * 5);
> 
> nit: maybe both magic 5sec timeouts deserve some comment?

That's just the standard time we pick for dma-fences to signal
everywhere in Xe. Again, perhaps we do a follow-up and replace HZ * 5
with a global dma-fence timeout value.

Matt

> > +
> > +	if (vf_recovery(guc) && !xe_device_wedged((guc_to_xe(guc))))
> > +		return -EAGAIN;
> >  
> >  	if (!ret) {
> >  		xe_gt_warn(guc_to_gt(guc),
> > @@ -1905,8 +1946,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> >  {
> >  	int ret;
> >  
> > -	if (xe_gt_WARN_ON(guc_to_gt(guc),
> > -			  xe_gt_sriov_vf_recovery_pending(guc_to_gt(guc))))
> > +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> >  		return 0;
> >  
> >  	if (!guc->submission_state.initialized)
> > diff --git a/drivers/gpu/drm/xe/xe_preempt_fence.c b/drivers/gpu/drm/xe/xe_preempt_fence.c
> > index 83fbeea5aa20..7f587ca3947d 100644
> > --- a/drivers/gpu/drm/xe/xe_preempt_fence.c
> > +++ b/drivers/gpu/drm/xe/xe_preempt_fence.c
> > @@ -8,6 +8,8 @@
> >  #include <linux/slab.h>
> >  
> >  #include "xe_exec_queue.h"
> > +#include "xe_gt_printk.h"
> > +#include "xe_guc_exec_queue_types.h"
> >  #include "xe_vm.h"
> >  
> >  static void preempt_fence_work_func(struct work_struct *w)
> > @@ -22,6 +24,15 @@ static void preempt_fence_work_func(struct work_struct *w)
> >  	} else if (!q->ops->reset_status(q)) {
> >  		int err = q->ops->suspend_wait(q);
> >  
> > +		if (err == -EAGAIN) {
> > +			xe_gt_dbg(q->gt, "PREEMPT FENCE RETRY guc_id=%d",
> > +				  q->guc->id);
> > +			queue_work(q->vm->xe->preempt_fence_wq,
> > +				   &pfence->preempt_work);
> > +			dma_fence_end_signalling(cookie);
> > +			return;
> > +		}
> > +
> >  		if (err)
> >  			dma_fence_set_error(&pfence->base, err);
> >  	} else {
> 
> just few suggestions, but overall LGTM, trusting you (and CI) that it works, so
> 
> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> 
> 
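
The -EAGAIN contract discussed in this patch (a waiter interrupted by VF recovery must be retried, and a work item must requeue itself rather than block) can be modelled in a few lines. This is a userspace sketch only: `fence_work_model`, `suspend_wait_model` and `preempt_work_model` are hypothetical names, the booleans replace the real wait queues, and the requeue is reduced to a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state the preempt fence worker looks at. */
struct fence_work_model {
	bool recovery_pending;	/* models vf_recovery() returning true */
	bool suspend_done;	/* models suspend_pending having been cleared */
	int requeues;		/* how often the work item requeued itself */
	int fence_error;	/* error recorded on the fence, 0 if none */
	bool signalled;		/* fence signalled */
};

/* Models suspend_wait: -EAGAIN during recovery, -ETIME on timeout. */
static int suspend_wait_model(const struct fence_work_model *w)
{
	assert(w);

	if (w->recovery_pending)
		return -EAGAIN;

	return w->suspend_done ? 0 : -ETIME;
}

/*
 * Models the worker: on -EAGAIN requeue and return without signalling,
 * otherwise record any error and signal. Returns true once signalled.
 */
static bool preempt_work_model(struct fence_work_model *w)
{
	int err = suspend_wait_model(w);

	if (err == -EAGAIN) {
		w->requeues++;	/* stands in for queue_work() */
		return false;
	}

	if (err)
		w->fence_error = err;

	w->signalled = true;
	return true;
}
```

The point of returning instead of looping inside the worker is that the work item gives up its execution slot, so in this model nothing prevents the recovery work from running before the retry.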

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  2025-10-06 14:51   ` Michal Wajdeczko
@ 2025-10-06 16:02     ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 16:02 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

On Mon, Oct 06, 2025 at 04:51:52PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 10/6/2025 1:10 PM, Matthew Brost wrote:
> > The only case where the GuC submission backend cannot reason 100%
> > correctly is when a GuC context is registered during VF post-migration
> > recovery. In this scenario, it's possible that the GuC context register
> > H2G is processed, but the immediately following schedule-enable H2G gets
> > lost.
> 
> hmm, isn't that the other way around ?
> 
> We should know whether schedule-enable was processed or not, since we should receive corresponding DONE message (or not).
> But if it wasn't processed, then we do not know whether context registration was processed or not (as we look only at G2H).
> 
> and the schedule-enable H2G is not "lost" by GuC but rather it will be made void by our explicit CTB clearance

If it isn't clear - we use the pending schedule-enable state to determine
if a context is registered, i.e., if a pending enable is outstanding we
assume the context didn't get registered, as the enable is sent
immediately after registering the context. The race would be that the
register completes but the subsequent schedule enable doesn't get
processed. Let me see if I can clear this up in the commit message.

Matt

> > 
> > A double register is harmless when using `GUC_HXG_TYPE_EVENT`, as GuC
> > simply drops the duplicate H2G. To keep things simple, use
> > `GUC_HXG_TYPE_EVENT` for all context registrations on VFs.
> > 
> > v5:
> >  - Check for xe_sriov_vf_migration_supported (Tomasz)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_guc_ct.c | 33 +++++++++++++++++++++++++--------
> >  1 file changed, 25 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 9f0090ae64a6..3ac654cebc79 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -32,6 +32,7 @@
> >  #include "xe_guc_tlb_inval.h"
> >  #include "xe_map.h"
> >  #include "xe_pm.h"
> > +#include "xe_sriov_vf.h"
> >  #include "xe_trace_guc.h"
> >  
> >  static void receive_g2h(struct xe_guc_ct *ct);
> > @@ -736,6 +737,26 @@ static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
> >  	return seqno;
> >  }
> >  
> > +#define MAKE_ACTION(type, __action)				\
> > +({								\
> > +	FIELD_PREP(GUC_HXG_MSG_0_TYPE, type) |			\
> > +	FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |			\
> > +		   GUC_HXG_EVENT_MSG_0_DATA0, __action);	\
> > +})
> > +
> > +static bool vf_action_can_safely_fail(struct xe_device *xe, u32 action)
> > +{
> > +	/*
> > +	 * If we are VF resuming, we can't exactly track if a context
> > +	 * registration has been completed in the GuC state machine, it is
> > +	 * harmless to resend as it will just fail silently if
> > +	 * GUC_HXG_TYPE_EVENT is used.
> > +	 */
> > +	return IS_SRIOV_VF(xe) && xe_sriov_vf_migration_supported(xe) &&
> > +		(action == XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC ||
> > +		 action == XE_GUC_ACTION_REGISTER_CONTEXT);
> > +}
> 
> not a big fan of this hack, but since it may work and speed our goal,
> with re-checked/reworded commit message,
> 
> 	Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> 
> > +
> >  #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one DW HxG header */
> >  
> >  static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
> > @@ -807,18 +828,14 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
> >  		FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len) |
> >  		FIELD_PREP(GUC_CTB_MSG_0_FENCE, ct_fence_value);
> >  	if (want_response) {
> > -		cmd[1] =
> > -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> > -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> > -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> > +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_REQUEST, action[0]);
> > +	} else if (vf_action_can_safely_fail(xe, action[0])) {
> > +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_EVENT, action[0]);
> >  	} else {
> >  		fast_req_track(ct, ct_fence_value,
> >  			       FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
> >  
> > -		cmd[1] =
> > -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_FAST_REQUEST) |
> > -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> > -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> > +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_FAST_REQUEST, action[0]);
> >  	}
> >  
> >  	/* H2G header in cmd[1] replaces action[0] so: */
> 
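
The header selection in h2g_write() after this patch reduces to a three-way choice, sketched below in plain C. Everything here is illustrative: the type codes, the action numbers, the bit layout (`TYPE_SHIFT`, `PAYLOAD_MASK`) and the `*_model` names are placeholders and are not taken from the GuC ABI headers; only the decision order mirrors the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder message types; real values live in the GuC ABI headers. */
enum hxg_type_model { HXG_REQUEST, HXG_EVENT, HXG_FAST_REQUEST };

/* Placeholder action codes. */
enum action_model {
	ACT_REGISTER_CONTEXT = 1,
	ACT_REGISTER_CONTEXT_MULTI_LRC,
	ACT_OTHER,
};

/* Placeholder layout: type in the top bits, action/data in the rest. */
#define TYPE_SHIFT	28
#define PAYLOAD_MASK	0x0fffffffu

static uint32_t make_action(enum hxg_type_model type, uint32_t action)
{
	assert((action & ~PAYLOAD_MASK) == 0);
	return ((uint32_t)type << TYPE_SHIFT) | (action & PAYLOAD_MASK);
}

/* Context registration may be resent blindly on a migratable VF. */
static bool vf_action_can_safely_fail(bool is_vf, bool migration_supported,
				      uint32_t action)
{
	return is_vf && migration_supported &&
	       (action == ACT_REGISTER_CONTEXT_MULTI_LRC ||
		action == ACT_REGISTER_CONTEXT);
}

/* Same decision order as the patched h2g_write(). */
static uint32_t h2g_header(bool want_response, bool is_vf,
			   bool migration_supported, uint32_t action)
{
	if (want_response)
		return make_action(HXG_REQUEST, action);
	if (vf_action_can_safely_fail(is_vf, migration_supported, action))
		return make_action(HXG_EVENT, action);
	return make_action(HXG_FAST_REQUEST, action);
}
```

Note that in this ordering a request that wants a response is never downgraded to an event, and on anything other than a migration-capable VF the registration still goes out as a fast request.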

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 23/30] drm/xe: Move queue init before LRC creation
  2025-10-06 11:10 ` [PATCH v6 23/30] drm/xe: Move queue init before LRC creation Matthew Brost
  2025-10-06 15:22   ` Michal Wajdeczko
@ 2025-10-06 21:33   ` Lis, Tomasz
  1 sibling, 0 replies; 58+ messages in thread
From: Lis, Tomasz @ 2025-10-06 21:33 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 10/6/2025 1:10 PM, Matthew Brost wrote:
> A queue must be in the submission backend's tracking state before the
> LRC is created to avoid a race condition where the LRC's GGTT addresses
> are not properly fixed up during VF post-migration recovery.
>
> Move the queue initialization—which adds the queue to the submission
> backend's tracking state—before LRC creation.
>
> v2:
>   - Wait on VF GGTT fixes before creating LRC (testing)
> v5:
>   - Adjust comment in code (Tomasz)
>   - Reduce race window
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_exec_queue.c        | 45 ++++++++++++++++++-----
>   drivers/gpu/drm/xe/xe_execlist.c          |  2 +-
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 39 +++++++++++++++++++-
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |  2 +
>   drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |  5 +++
>   drivers/gpu/drm/xe/xe_guc_submit.c        |  2 +-
>   drivers/gpu/drm/xe/xe_lrc.h               | 10 +++++
>   7 files changed, 92 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 7621089a47fe..90cbc95f8e2e 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -15,6 +15,7 @@
>   #include "xe_dep_scheduler.h"
>   #include "xe_device.h"
>   #include "xe_gt.h"
> +#include "xe_gt_sriov_vf.h"
>   #include "xe_hw_engine_class_sysfs.h"
>   #include "xe_hw_engine_group.h"
>   #include "xe_hw_fence.h"
> @@ -205,17 +206,34 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q, u32 exec_queue_flags)
>   	if (!(exec_queue_flags & EXEC_QUEUE_FLAG_KERNEL))
>   		flags |= XE_LRC_CREATE_USER_CTX;
>   
> +	err = q->ops->init(q);
> +	if (err)
> +		return err;
> +
> +	/*
> +	 * This must occur after q->ops->init to avoid race conditions during VF
> +	 * post-migration recovery, as the fixups for the LRC GGTT addresses
> +	 * depend on the queue being present in the backend tracking structure.
> +	 *
> +	 * In addition to above, we must wait on inflight GGTT changes to avoid
> +	 * writing out stale values here. Such wait provides a solid solution
> +	 * (without a race) only if the function can detect migration instantly
> +	 * from the moment vCPU resumes execution.
> +	 */
>   	for (i = 0; i < q->width; ++i) {
> -		q->lrc[i] = xe_lrc_create(q->hwe, q->vm, SZ_16K, q->msix_vec, flags);
> -		if (IS_ERR(q->lrc[i])) {
> -			err = PTR_ERR(q->lrc[i]);
> +		struct xe_lrc *lrc;
> +
> +		xe_gt_sriov_vf_wait_valid_ggtt(q->gt);
> +		lrc = xe_lrc_create(q->hwe, q->vm, xe_lrc_ring_size(),
> +				    q->msix_vec, flags);
> +		if (IS_ERR(lrc)) {
> +			err = PTR_ERR(lrc);
>   			goto err_lrc;
>   		}
> -	}
>   
> -	err = q->ops->init(q);
> -	if (err)
> -		goto err_lrc;
> +		/* Pairs with READ_ONCE to xe_exec_queue_contexts_hwsp_rebase */
> +		WRITE_ONCE(q->lrc[i], lrc);
> +	}
>   
>   	return 0;
>   
> @@ -1121,9 +1139,16 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
>   	int err = 0;
>   
>   	for (i = 0; i < q->width; ++i) {
> -		xe_lrc_update_memirq_regs_with_address(q->lrc[i], q->hwe, scratch);
> -		xe_lrc_update_hwctx_regs_with_address(q->lrc[i]);
> -		err = xe_lrc_setup_wa_bb_with_scratch(q->lrc[i], q->hwe, scratch);
> +		struct xe_lrc *lrc;
> +
> +		/* Pairs with WRITE_ONCE in __xe_exec_queue_init  */
> +		lrc = READ_ONCE(q->lrc[i]);
> +		if (!lrc)
> +			continue;
> +
> +		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
> +		xe_lrc_update_hwctx_regs_with_address(lrc);
> +		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
>   		if (err)
>   			break;
>   	}
> diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
> index f83d421ac9d3..769d05517f93 100644
> --- a/drivers/gpu/drm/xe/xe_execlist.c
> +++ b/drivers/gpu/drm/xe/xe_execlist.c
> @@ -339,7 +339,7 @@ static int execlist_exec_queue_init(struct xe_exec_queue *q)
>   	const struct drm_sched_init_args args = {
>   		.ops = &drm_sched_ops,
>   		.num_rqs = 1,
> -		.credit_limit = q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES,
> +		.credit_limit = xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES,
>   		.hang_limit = XE_SCHED_HANG_LIMIT,
>   		.timeout = XE_SCHED_JOB_TIMEOUT,
>   		.name = q->hwe->name,
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 8074ffb924ce..bf1806e90370 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -487,6 +487,11 @@ static int vf_get_ggtt_info(struct xe_gt *gt, bool recovery)
>   				 shift, config->ggtt_base);
>   		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
>   	}
> +
> +	WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, false);
> +	smp_wmb();	/* Ensure above write visible before wake */
> +	wake_up_all(&gt->sriov.vf.migration.wq);
> +
>   out:
>   	if (recovery)
>   		mutex_unlock(&ggtt->lock);
> @@ -745,7 +750,8 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
>   	    !gt->sriov.vf.migration.recovery_teardown) {
>   		gt->sriov.vf.migration.recovery_queued = true;
>   		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> -		smp_wmb();	/* Ensure above write visable before wake */
> +		WRITE_ONCE(gt->sriov.vf.migration.ggtt_need_fixes, true);
> +		smp_wmb();	/* Ensure above writes visable before wake */
>   
>   		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
>   
> @@ -1264,6 +1270,7 @@ int xe_gt_sriov_vf_init_early(struct xe_gt *gt)
>   	gt->sriov.vf.migration.scratch = buf;
>   	spin_lock_init(&gt->sriov.vf.migration.lock);
>   	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> +	init_waitqueue_head(&gt->sriov.vf.migration.wq);
>   
>   	return 0;
>   }
> @@ -1312,3 +1319,33 @@ bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt)
>   
>   	return READ_ONCE(gt->sriov.vf.migration.recovery_inprogress);
>   }
> +
> +static bool vf_valid_ggtt(struct xe_gt *gt)
> +{
> +	struct xe_memirq *memirq = &gt_to_tile(gt)->memirq;
> +
> +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> +	if (xe_memirq_guc_sw_int_0_irq_pending(memirq, &gt->uc.guc) ||
> +	    READ_ONCE(gt->sriov.vf.migration.ggtt_need_fixes))
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * xe_gt_sriov_vf_wait_valid_ggtt() - VF wait for valid GGTT addresses
> + * @gt: the &xe_gt
> + */
> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt)
> +{
> +	int ret;
> +
> +	if (!IS_SRIOV_VF(gt_to_xe(gt)))
> +		return;
> +
> +	ret = wait_event_interruptible_timeout(gt->sriov.vf.migration.wq,
> +					       vf_valid_ggtt(gt),
> +					       HZ * 5);
> +	XE_WARN_ON(!ret);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> index 8c9679414565..63102029d624 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> @@ -38,4 +38,6 @@ void xe_gt_sriov_vf_print_config(struct xe_gt *gt, struct drm_printer *p);
>   void xe_gt_sriov_vf_print_runtime(struct xe_gt *gt, struct drm_printer *p);
>   void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
>   
> +void xe_gt_sriov_vf_wait_valid_ggtt(struct xe_gt *gt);
> +
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index c1bd6fdd9ab1..f0bc45a782a4 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -8,6 +8,7 @@
>   
>   #include <linux/rwsem.h>
>   #include <linux/types.h>
> +#include <linux/wait.h>
>   #include <linux/workqueue.h>
>   #include "xe_uc_fw_types.h"
>   
> @@ -50,6 +51,8 @@ struct xe_gt_sriov_vf_migration {
>   	struct work_struct worker;
>   	/** @lock: Protects recovery_queued, teardown */
>   	spinlock_t lock;
> +	/** @wq: wait queue for migration fixes */
> +	wait_queue_head_t wq;
>   	/** @scratch: Scratch memory for VF recovery */
>   	void *scratch;
>   	/** @recovery_teardown: VF post migration recovery is being torn down */
> @@ -58,6 +61,8 @@ struct xe_gt_sriov_vf_migration {
>   	bool recovery_queued;
>   	/** @recovery_inprogress: VF post migration recovery in progress */
>   	bool recovery_inprogress;
> +	/** @ggtt_need_fixes: VF GGTT needs fixes */
> +	bool ggtt_need_fixes;
>   };
>   
>   /**
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 9dbdb0b54c8b..48d5133e76a6 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1663,7 +1663,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
>   	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
>   		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
>   	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
> -			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
> +			    NULL, xe_lrc_ring_size() / MAX_JOB_SIZE_BYTES, 64,
>   			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
>   			    q->name, gt_to_xe(q->gt)->drm.dev);
>   	if (err)
> diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
> index 21a3daab0154..c4a33b135101 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.h
> +++ b/drivers/gpu/drm/xe/xe_lrc.h
> @@ -76,6 +76,16 @@ static inline void xe_lrc_put(struct xe_lrc *lrc)
>   	kref_put(&lrc->refcount, xe_lrc_destroy);
>   }
>   
> +/**
> + * xe_lrc_ring_size() - Xe LRC ring size
> + *
> + * Return: Size of LRC size

I remember commenting on this before:

"Size of LRC ring buffer"

-Tomasz

> + */
> +static inline size_t xe_lrc_ring_size(void)
> +{
> +	return SZ_16K;
> +}
> +
>   size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class);
>   u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc);
>   u32 xe_lrc_regs_offset(struct xe_lrc *lrc);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 20/30] drm/xe/vf: Start CTs before resfix VF post migration recovery
  2025-10-06 11:10 ` [PATCH v6 20/30] drm/xe/vf: Start CTs before resfix " Matthew Brost
@ 2025-10-06 21:50   ` Lis, Tomasz
  0 siblings, 0 replies; 58+ messages in thread
From: Lis, Tomasz @ 2025-10-06 21:50 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 10/6/2025 1:10 PM, Matthew Brost wrote:
> Before RESFIX_DONE, all CTs stuck in the H2G queue need to be squashed,
> as they may contain actions which contain invalid GGTT references or are
> unnecessary after HW change.
>
> Starting the CTs clears all H2Gs in the queue. Any lost H2Gs are
> resubmitted by the GuC submission state machine.
>
> v3:
>   - Don't mess with head / tail values (Michal)
> v4:
>   - Don't mess with broke (Michal)
>   - Add CTB_H2G_BUFFER_OFFSET (Michal)
> v5:
>   - Adjust commit message (Tomasz)
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c |  7 +++
>   drivers/gpu/drm/xe/xe_guc_ct.c      | 70 +++++++++++++++++++++--------
>   drivers/gpu/drm/xe/xe_guc_ct.h      |  1 +
>   3 files changed, 60 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 2a988eb3e904..6052c7302cc6 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -1139,6 +1139,11 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
>   	return 0;
>   }
>   
> +static void vf_post_migration_rearm(struct xe_gt *gt)
> +{
> +	xe_guc_ct_restart(&gt->uc.guc.ct);
> +}
> +
>   static void vf_post_migration_kickstart(struct xe_gt *gt)
>   {
>   	xe_guc_submit_unpause(&gt->uc.guc);
> @@ -1190,6 +1195,8 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
>   	if (err)
>   		goto fail;
>   
> +	vf_post_migration_rearm(gt);
> +
>   	err = vf_post_migration_notify_resfix_done(gt);
>   	if (err && err != -EAGAIN)
>   		goto fail;
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index f67575b1ed79..c0d261abf735 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -167,6 +167,7 @@ ct_to_xe(struct xe_guc_ct *ct)
>    */
>   
>   #define CTB_DESC_SIZE		ALIGN(sizeof(struct guc_ct_buffer_desc), SZ_2K)
> +#define CTB_H2G_BUFFER_OFFSET	(CTB_DESC_SIZE * 2)

We agreed today to split the CTB_H2G_BUFFER_OFFSET introduction into
its own patch within this series.

(as requested in the comments on rev4)

-Tomasz
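
For reference while splitting the patch, here is a minimal standalone
sketch of the CT buffer layout that the quoted macros describe. It is
not driver code: the function names are hypothetical, and it assumes
the descriptor struct fits in 2K so that CTB_DESC_SIZE evaluates to
SZ_2K. The assert cross-checks that the new CTB_H2G_BUFFER_OFFSET form
of guc_ct_size() gives the same total as the old "2 * CTB_DESC_SIZE"
form.

```c
#include <assert.h>
#include <stddef.h>

#define SZ_2K   0x800u
#define SZ_4K   0x1000u
#define SZ_128K 0x20000u

/* Assumption: ALIGN(sizeof(struct guc_ct_buffer_desc), SZ_2K) == SZ_2K */
#define CTB_DESC_SIZE         SZ_2K
#define CTB_H2G_BUFFER_OFFSET (CTB_DESC_SIZE * 2)
#define CTB_H2G_BUFFER_SIZE   SZ_4K
#define CTB_G2H_BUFFER_SIZE   SZ_128K

/* H2G descriptor sits at the start of the BO */
size_t ctb_h2g_desc_offset(void)
{
	return 0;
}

/* G2H descriptor follows the H2G descriptor */
size_t ctb_g2h_desc_offset(void)
{
	return CTB_DESC_SIZE;
}

/* H2G command buffer starts after both descriptors */
size_t ctb_h2g_cmds_offset(void)
{
	return CTB_H2G_BUFFER_OFFSET;
}

/* G2H command buffer starts after the H2G command buffer */
size_t ctb_g2h_cmds_offset(void)
{
	return CTB_H2G_BUFFER_OFFSET + CTB_H2G_BUFFER_SIZE;
}

/* Total BO size, written the way the patch rewrites guc_ct_size() */
size_t ctb_total_size(void)
{
	size_t total = CTB_H2G_BUFFER_OFFSET + CTB_H2G_BUFFER_SIZE +
		       CTB_G2H_BUFFER_SIZE;

	/* Same value as the pre-patch expression */
	assert(total == 2 * CTB_DESC_SIZE + CTB_H2G_BUFFER_SIZE +
			CTB_G2H_BUFFER_SIZE);
	return total;
}
```

Under that assumption, the restart path in the quoted hunk clears only
the 4K region starting at ctb_h2g_cmds_offset() and leaves both
descriptors and the G2H buffer untouched.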

>   #define CTB_H2G_BUFFER_SIZE	(SZ_4K)
>   #define CTB_G2H_BUFFER_SIZE	(SZ_128K)
>   #define G2H_ROOM_BUFFER_SIZE	(CTB_G2H_BUFFER_SIZE / 2)
> @@ -190,7 +191,7 @@ long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct)
>   
>   static size_t guc_ct_size(void)
>   {
> -	return 2 * CTB_DESC_SIZE + CTB_H2G_BUFFER_SIZE +
> +	return CTB_H2G_BUFFER_OFFSET + CTB_H2G_BUFFER_SIZE +
>   		CTB_G2H_BUFFER_SIZE;
>   }
>   
> @@ -331,7 +332,7 @@ static void guc_ct_ctb_h2g_init(struct xe_device *xe, struct guc_ctb *h2g,
>   	h2g->desc = *map;
>   	xe_map_memset(xe, &h2g->desc, 0, 0, sizeof(struct guc_ct_buffer_desc));
>   
> -	h2g->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_DESC_SIZE * 2);
> +	h2g->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_H2G_BUFFER_OFFSET);
>   }
>   
>   static void guc_ct_ctb_g2h_init(struct xe_device *xe, struct guc_ctb *g2h,
> @@ -349,7 +350,7 @@ static void guc_ct_ctb_g2h_init(struct xe_device *xe, struct guc_ctb *g2h,
>   	g2h->desc = IOSYS_MAP_INIT_OFFSET(map, CTB_DESC_SIZE);
>   	xe_map_memset(xe, &g2h->desc, 0, 0, sizeof(struct guc_ct_buffer_desc));
>   
> -	g2h->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_DESC_SIZE * 2 +
> +	g2h->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_H2G_BUFFER_OFFSET +
>   					    CTB_H2G_BUFFER_SIZE);
>   }
>   
> @@ -360,7 +361,7 @@ static int guc_ct_ctb_h2g_register(struct xe_guc_ct *ct)
>   	int err;
>   
>   	desc_addr = xe_bo_ggtt_addr(ct->bo);
> -	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_DESC_SIZE * 2;
> +	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_H2G_BUFFER_OFFSET;
>   	size = ct->ctbs.h2g.info.size * sizeof(u32);
>   
>   	err = xe_guc_self_cfg64(guc,
> @@ -387,7 +388,7 @@ static int guc_ct_ctb_g2h_register(struct xe_guc_ct *ct)
>   	int err;
>   
>   	desc_addr = xe_bo_ggtt_addr(ct->bo) + CTB_DESC_SIZE;
> -	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_DESC_SIZE * 2 +
> +	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_H2G_BUFFER_OFFSET +
>   		CTB_H2G_BUFFER_SIZE;
>   	size = ct->ctbs.g2h.info.size * sizeof(u32);
>   
> @@ -501,7 +502,7 @@ static void ct_exit_safe_mode(struct xe_guc_ct *ct)
>   		xe_gt_dbg(ct_to_gt(ct), "GuC CT safe-mode disabled\n");
>   }
>   
> -int xe_guc_ct_enable(struct xe_guc_ct *ct)
> +static int __xe_guc_ct_start(struct xe_guc_ct *ct, bool needs_register)
>   {
>   	struct xe_device *xe = ct_to_xe(ct);
>   	struct xe_gt *gt = ct_to_gt(ct);
> @@ -509,21 +510,28 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>   
>   	xe_gt_assert(gt, !xe_guc_ct_enabled(ct));
>   
> -	xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
> -	guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
> -	guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
> +	if (needs_register) {
> +		xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
> +		guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
> +		guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
>   
> -	err = guc_ct_ctb_h2g_register(ct);
> -	if (err)
> -		goto err_out;
> +		err = guc_ct_ctb_h2g_register(ct);
> +		if (err)
> +			goto err_out;
>   
> -	err = guc_ct_ctb_g2h_register(ct);
> -	if (err)
> -		goto err_out;
> +		err = guc_ct_ctb_g2h_register(ct);
> +		if (err)
> +			goto err_out;
>   
> -	err = guc_ct_control_toggle(ct, true);
> -	if (err)
> -		goto err_out;
> +		err = guc_ct_control_toggle(ct, true);
> +		if (err)
> +			goto err_out;
> +	} else {
> +		ct->ctbs.h2g.info.broken = false;
> +		ct->ctbs.g2h.info.broken = false;
> +		xe_map_memset(xe, &ct->bo->vmap, CTB_H2G_BUFFER_OFFSET, 0,
> +			      CTB_H2G_BUFFER_SIZE);
> +	}
>   
>   	guc_ct_change_state(ct, XE_GUC_CT_STATE_ENABLED);
>   
> @@ -555,6 +563,32 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>   	return err;
>   }
>   
> +/**
> + * xe_guc_ct_restart() - Restart GuC CT
> + * @ct: the &xe_guc_ct
> + *
> + * Restart GuC CT to an empty state without issuing a CT register MMIO command.
> + *
> + * Return: 0 on success, or a negative errno on failure.
> + */
> +int xe_guc_ct_restart(struct xe_guc_ct *ct)
> +{
> +	return __xe_guc_ct_start(ct, false);
> +}
> +
> +/**
> + * xe_guc_ct_enable() - Enable GuC CT
> + * @ct: the &xe_guc_ct
> + *
> + * Enable GuC CT to an empty state and issue a CT register MMIO command.
> + *
> + * Return: 0 on success, or a negative errno on failure.
> + */
> +int xe_guc_ct_enable(struct xe_guc_ct *ct)
> +{
> +	return __xe_guc_ct_start(ct, true);
> +}
> +
>   static void stop_g2h_handler(struct xe_guc_ct *ct)
>   {
>   	cancel_work_sync(&ct->g2h_worker);
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index 02eaa452b400..10d05193e51c 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -15,6 +15,7 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct);
>   int xe_guc_ct_init(struct xe_guc_ct *ct);
>   int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct);
>   int xe_guc_ct_enable(struct xe_guc_ct *ct);
> +int xe_guc_ct_restart(struct xe_guc_ct *ct);
>   void xe_guc_ct_disable(struct xe_guc_ct *ct);
>   void xe_guc_ct_stop(struct xe_guc_ct *ct);
>   void xe_guc_ct_flush_and_stop(struct xe_guc_ct *ct);


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 01/30] drm/xe: Add NULL checks to scratch LRC allocation
  2025-10-06 11:10 ` [PATCH v6 01/30] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
@ 2025-10-06 21:51   ` Lis, Tomasz
  0 siblings, 0 replies; 58+ messages in thread
From: Lis, Tomasz @ 2025-10-06 21:51 UTC (permalink / raw)
  To: Matthew Brost, intel-xe



On 10/6/2025 1:10 PM, Matthew Brost wrote:
> kmalloc can fail, the returned value must have a NULL check. This should
> be immediately after kmalloc for clarity.
>
> v5:
>   - Assert state->buffer in setup_bo if buffer is iomem (Tomasz)
Repeating as I sent it to rev5 late:

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>

-Tomasz

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_lrc.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> index af09f70f6e78..2c6eae2de1f2 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.c
> +++ b/drivers/gpu/drm/xe/xe_lrc.c
> @@ -1214,8 +1214,7 @@ static int setup_bo(struct bo_setup_state *state)
>   	ssize_t remain;
>   
>   	if (state->lrc->bo->vmap.is_iomem) {
> -		if (!state->buffer)
> -			return -ENOMEM;
> +		xe_gt_assert(state->hwe->gt, state->buffer);
>   		state->ptr = state->buffer;
>   	} else {
>   		state->ptr = state->lrc->bo->vmap.vaddr + state->offset;
> @@ -1303,8 +1302,11 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
>   	u32 *buf = NULL;
>   	int ret;
>   
> -	if (lrc->bo->vmap.is_iomem)
> +	if (lrc->bo->vmap.is_iomem) {
>   		buf = kmalloc(LRC_WA_BB_SIZE, GFP_KERNEL);
> +		if (!buf)
> +			return -ENOMEM;
> +	}
>   
>   	ret = xe_lrc_setup_wa_bb_with_scratch(lrc, hwe, buf);
>   
> @@ -1347,8 +1349,11 @@ setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
>   	if (xe_gt_WARN_ON(lrc->gt, !state.funcs))
>   		return 0;
>   
> -	if (lrc->bo->vmap.is_iomem)
> +	if (lrc->bo->vmap.is_iomem) {
>   		state.buffer = kmalloc(state.max_size, GFP_KERNEL);
> +		if (!state.buffer)
> +			return -ENOMEM;
> +	}
>   
>   	ret = setup_bo(&state);
>   	if (ret) {


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  2025-10-06 11:10 ` [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
  2025-10-06 14:51   ` Michal Wajdeczko
@ 2025-10-06 22:21   ` Lis, Tomasz
  2025-10-06 22:57     ` Matthew Brost
  1 sibling, 1 reply; 58+ messages in thread
From: Lis, Tomasz @ 2025-10-06 22:21 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 10/6/2025 1:10 PM, Matthew Brost wrote:
> The only case where the GuC submission backend cannot reason 100%
> correctly is when a GuC context is registered during VF post-migration
> recovery. In this scenario, it's possible that the GuC context register
> H2G is processed, but the immediately following schedule-enable H2G gets
> lost.
>
> A double register is harmless when using `GUC_HXG_TYPE_EVENT`, as GuC
> simply drops the duplicate H2G. To keep things simple, use
> `GUC_HXG_TYPE_EVENT` for all context registrations on VFs.
>
> v5:
>   - Check for xe_sriov_vf_migration_supported (Tomasz)
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_guc_ct.c | 33 +++++++++++++++++++++++++--------
>   1 file changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 9f0090ae64a6..3ac654cebc79 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -32,6 +32,7 @@
>   #include "xe_guc_tlb_inval.h"
>   #include "xe_map.h"
>   #include "xe_pm.h"
> +#include "xe_sriov_vf.h"
>   #include "xe_trace_guc.h"
>   
>   static void receive_g2h(struct xe_guc_ct *ct);
> @@ -736,6 +737,26 @@ static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
>   	return seqno;
>   }
>   
> +#define MAKE_ACTION(type, __action)				\
> +({								\
> +	FIELD_PREP(GUC_HXG_MSG_0_TYPE, type) |			\
> +	FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |			\
> +		   GUC_HXG_EVENT_MSG_0_DATA0, __action);	\
> +})
> +
> +static bool vf_action_can_safely_fail(struct xe_device *xe, u32 action)
> +{
> +	/*
> +	 * If we are VF resuming, we can't exactly track if a context
> +	 * registration has been completed in the GuC state machine, it is
> +	 * harmless to resend as it will just fail silently if
> +	 * GUC_HXG_TYPE_EVENT is used.

Maybe add:

If the registration H2G fails with an error other than ALREADY_REGISTERED,
we will know because the schedule-enable H2G that follows shortly will fail.

Other than that, seconding Michal:

Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
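
To make the resulting three-way choice in h2g_write() easier to follow,
here is a small standalone sketch of the decision as I read it from the
quoted hunk. Everything prefixed demo_/DEMO_ is hypothetical, and the
action IDs are placeholders, not the real GuC ABI values. The sketch
models only the selection of the HXG type, not the header encoding or
fast_req_track().

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder action IDs for illustration only; NOT real GuC ABI values */
enum demo_action {
	DEMO_ACTION_REGISTER_CONTEXT = 1,
	DEMO_ACTION_REGISTER_CONTEXT_MULTI_LRC,
	DEMO_ACTION_OTHER,
};

/* Placeholder for the three HXG message types used by h2g_write() */
enum demo_hxg_type {
	DEMO_HXG_REQUEST,
	DEMO_HXG_EVENT,
	DEMO_HXG_FAST_REQUEST,
};

/* Mirrors vf_action_can_safely_fail(): only context registration on a
 * migration-capable VF may be resent and dropped silently. */
bool demo_action_can_safely_fail(bool is_vf, bool migration_supported,
				 enum demo_action action)
{
	return is_vf && migration_supported &&
	       (action == DEMO_ACTION_REGISTER_CONTEXT_MULTI_LRC ||
		action == DEMO_ACTION_REGISTER_CONTEXT);
}

/* Mirrors the if / else-if / else chain that fills cmd[1] */
enum demo_hxg_type demo_pick_hxg_type(bool want_response, bool is_vf,
				      bool migration_supported,
				      enum demo_action action)
{
	if (want_response)
		return DEMO_HXG_REQUEST;
	if (demo_action_can_safely_fail(is_vf, migration_supported, action))
		return DEMO_HXG_EVENT;
	return DEMO_HXG_FAST_REQUEST;
}
```

Note that want_response is checked first, so a registration sent with a
response requested still goes out as a request even on a VF.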

> +	 */
> +	return IS_SRIOV_VF(xe) && xe_sriov_vf_migration_supported(xe) &&
> +		(action == XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC ||
> +		 action == XE_GUC_ACTION_REGISTER_CONTEXT);
> +}
> +
>   #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one DW HxG header */
>   
>   static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
> @@ -807,18 +828,14 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
>   		FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len) |
>   		FIELD_PREP(GUC_CTB_MSG_0_FENCE, ct_fence_value);
>   	if (want_response) {
> -		cmd[1] =
> -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_REQUEST, action[0]);
> +	} else if (vf_action_can_safely_fail(xe, action[0])) {
> +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_EVENT, action[0]);
>   	} else {
>   		fast_req_track(ct, ct_fence_value,
>   			       FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
>   
> -		cmd[1] =
> -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_FAST_REQUEST) |
> -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_FAST_REQUEST, action[0]);
>   	}
>   
>   	/* H2G header in cmd[1] replaces action[0] so: */

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-10-06 11:10 ` [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF Matthew Brost
@ 2025-10-06 22:24   ` Lucas De Marchi
  2025-10-06 22:51     ` Matthew Brost
  0 siblings, 1 reply; 58+ messages in thread
From: Lucas De Marchi @ 2025-10-06 22:24 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, Matt Roper

On Mon, Oct 06, 2025 at 04:10:35AM -0700, Matthew Brost wrote:
>VF CCS restore is a primary GT operation on which the media GT depends.
>Therefore, it doesn't make much sense to run these operations in

I'd need to double check the previous patches to see the entire
picture, but this seems weird at first glance. The VF CCS restore is
not the only work we queue in gt->ordered_wq. To me the question is more
like "in which ordered queue are we going to queue the VF CCS restore?" If
it's global per device, why are we not using the device wq rather than
making all the GT wqs point to the same thing?

>parallel. To address this, point the media GT's ordered work queue to
>the primary GT's ordered work queue on platforms that require (PTL VFs)
>CCS restore as part of VF post-migration recovery.
>
>Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>---
> drivers/gpu/drm/xe/xe_device_types.h | 2 ++
> drivers/gpu/drm/xe/xe_gt.c           | 7 +++++--
> drivers/gpu/drm/xe/xe_gt.h           | 2 +-
> drivers/gpu/drm/xe/xe_pci.c          | 6 +++++-
> drivers/gpu/drm/xe/xe_pci_types.h    | 1 +
> drivers/gpu/drm/xe/xe_tile.c         | 2 +-
> 6 files changed, 15 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index c66523bf4bf0..02c04ad7296e 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -334,6 +334,8 @@ struct xe_device {
> 		u8 skip_mtcfg:1;
> 		/** @info.skip_pcode: skip access to PCODE uC */
> 		u8 skip_pcode:1;
>+		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
>+		u8 needs_shared_vf_gt_wq:1;
> 	} info;
>
> 	/** @wa_active: keep track of active workarounds */
>diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>index cf484a2da35e..05465f358c96 100644
>--- a/drivers/gpu/drm/xe/xe_gt.c
>+++ b/drivers/gpu/drm/xe/xe_gt.c
>@@ -65,7 +65,7 @@
> #include "xe_wa.h"
> #include "xe_wopcm.h"
>
>-struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
>+struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)

If using the device wq is not an option (possibly because it would queue
with other undesired work going on there), then I'd rather drop this
bool passing here and make the decision inside this function:

> {
> 	struct drm_device *drm = &tile_to_xe(tile)->drm;
> 	struct xe_gt *gt;
>@@ -75,7 +75,10 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> 		return ERR_PTR(-ENOMEM);
>
> 	gt->tile = tile;

	if (!xe->info.needs_shared_gt_wq || !tile->primary_gt->ordered_wq)
		ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
	else
		ordered_wq = tile->primary_gt->ordered_wq;
	if (IS_ERR_OR_NULL(ordered_wq))
		return ordered_wq ? ERR_CAST(ordered_wq) : ERR_PTR(-EINVAL);

	gt->ordered_wq = ordered_wq;
	
... or something like that so you use the xe info to decide it here
rather than passing it down as a function arg.


>-	gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
>+	if (use_primary_wq)
>+		gt->ordered_wq = tile->primary_gt->ordered_wq;
>+	else
>+		gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> 	if (IS_ERR(gt->ordered_wq))
> 		return ERR_CAST(gt->ordered_wq);
>
>diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
>index 5df2ffe3ff83..9545c0c93ab6 100644
>--- a/drivers/gpu/drm/xe/xe_gt.h
>+++ b/drivers/gpu/drm/xe/xe_gt.h
>@@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
> 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
> }
>
>-struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
>+struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
> int xe_gt_init_early(struct xe_gt *gt);
> int xe_gt_init(struct xe_gt *gt);
> void xe_gt_mmio_init(struct xe_gt *gt);
>diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>index 3f42b91efa28..25a1d96a68e7 100644
>--- a/drivers/gpu/drm/xe/xe_pci.c
>+++ b/drivers/gpu/drm/xe/xe_pci.c
>@@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
> 	.has_sriov = true,
> 	.max_gt_per_tile = 2,
> 	.needs_scratch = true,
>+	.needs_shared_vf_gt_wq = true,

as per above... I think this needs to be detached from vf. There may be
other reasons the wq needs to be shared.

If we just make them point to a device wq as suggested above, then
there's no extra issue with the ongoing work to disable GTs that Matt
Roper is doing (https://patchwork.freedesktop.org/series/154739/).
Otherwise we will need to think about how to reconcile them.

Lucas De Marchi

> };
>
> #undef PLATFORM
>@@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
> 	xe->info.skip_mtcfg = desc->skip_mtcfg;
> 	xe->info.skip_pcode = desc->skip_pcode;
> 	xe->info.needs_scratch = desc->needs_scratch;
>+	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
>
> 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
> 				 xe_modparam.probe_display &&
>@@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
> 		 * Allocate and setup media GT for platforms with standalone
> 		 * media.
> 		 */
>-		tile->media_gt = xe_gt_alloc(tile);
>+		tile->media_gt = xe_gt_alloc(tile,
>+					     xe->info.needs_shared_vf_gt_wq &&
>+					     IS_SRIOV_VF(xe));
> 		if (IS_ERR(tile->media_gt))
> 			return PTR_ERR(tile->media_gt);
>
>diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
>index 9b9766a3baa3..b11bf6abda5b 100644
>--- a/drivers/gpu/drm/xe/xe_pci_types.h
>+++ b/drivers/gpu/drm/xe/xe_pci_types.h
>@@ -48,6 +48,7 @@ struct xe_device_desc {
> 	u8 skip_guc_pc:1;
> 	u8 skip_mtcfg:1;
> 	u8 skip_pcode:1;
>+	u8 needs_shared_vf_gt_wq:1;
> };
>
> struct xe_graphics_desc {
>diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
>index 6edb5062c1da..e9bcff2de563 100644
>--- a/drivers/gpu/drm/xe/xe_tile.c
>+++ b/drivers/gpu/drm/xe/xe_tile.c
>@@ -157,7 +157,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> 	if (err)
> 		return err;
>
>-	tile->primary_gt = xe_gt_alloc(tile);
>+	tile->primary_gt = xe_gt_alloc(tile, false);
> 	if (IS_ERR(tile->primary_gt))
> 		return PTR_ERR(tile->primary_gt);
>
>-- 
>2.34.1
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  2025-10-06 11:10 ` [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
  2025-10-06 14:35   ` Michal Wajdeczko
@ 2025-10-06 22:27   ` Lis, Tomasz
  2025-10-06 23:07     ` Matthew Brost
  1 sibling, 1 reply; 58+ messages in thread
From: Lis, Tomasz @ 2025-10-06 22:27 UTC (permalink / raw)
  To: Matthew Brost, intel-xe


On 10/6/2025 1:10 PM, Matthew Brost wrote:
> If VF post-migration recovery is in progress, the recovery flow will
> rebuild all GuC submission state. In this case, exit all waiters to
> ensure that submission queue scheduling can also be paused. Avoid taking
> any adverse actions after aborting the wait.
>
> As part of waking up the GuC backend, suspend_wait can now return
> -EAGAIN indicating the waiter should be retried. If the caller is
> running on work item, that work item need to be requeued to avoid a
> deadlock for the work item blocking the VF migration recovery work item.
>
> v3:
>   - Don't block in preempt fence work queue as this can interfere with VF
>     post-migration work queue scheduling leading to deadlock (Testing)
>   - Use xe_gt_recovery_inprogress (Michal)
> v5:
>   - Use static function for vf_recovery (Michal)
>   - Add helper to wake CT waiters (Michal)
>   - Move some code to following patch (Michal)
>   - Adjust commit message to explain suspend_wait returning -EAGAIN (Michal)
>   - Add kernel doc to suspend_wait around returning -EAGAIN
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_exec_queue_types.h |  3 +
>   drivers/gpu/drm/xe/xe_gt_sriov_vf.c      |  4 ++
>   drivers/gpu/drm/xe/xe_guc_ct.h           |  9 +++
>   drivers/gpu/drm/xe/xe_guc_submit.c       | 82 ++++++++++++++++++------
>   drivers/gpu/drm/xe/xe_preempt_fence.c    | 11 ++++
>   5 files changed, 88 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> index 27b76cf9da89..282505fa1377 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> +++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> @@ -207,6 +207,9 @@ struct xe_exec_queue_ops {
>   	 * call after suspend. In dma-fencing path thus must return within a
>   	 * reasonable amount of time. -ETIME return shall indicate an error
>   	 * waiting for suspend resulting in associated VM getting killed.
> +	 * -EAGAIN return indicates the wait should be tried again, if the wait
> +	 * is within a work item, the work item should be requeued as deadlock
> +	 * avoidance mechanism.
>   	 */
>   	int (*suspend_wait)(struct xe_exec_queue *q);
>   	/**
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 7057260175f3..7f703336d692 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -23,6 +23,7 @@
>   #include "xe_gt_sriov_vf.h"
>   #include "xe_gt_sriov_vf_types.h"
>   #include "xe_guc.h"
> +#include "xe_guc_ct.h"
>   #include "xe_guc_hxg_helpers.h"
>   #include "xe_guc_relay.h"
>   #include "xe_guc_submit.h"
> @@ -743,6 +744,9 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
>   	    !gt->sriov.vf.migration.recovery_teardown) {
>   		gt->sriov.vf.migration.recovery_queued = true;
>   		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> +		smp_wmb();	/* Ensure above write visable before wake */
> +
> +		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
>   
>   		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
>   		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index d6c81325a76c..ca0ec938edac 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -72,4 +72,13 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
>   
>   long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
>   
> +/**
> + * xe_guc_ct_wake_waiters() - GuC CT wake up waiters
> + * @guc: GuC CT object
> + */
> +static inline void xe_guc_ct_wake_waiters(struct xe_guc_ct *ct)
> +{
> +	wake_up_all(&ct->wq);
> +}
> +
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 59371b7cc8a4..b2ca4911efe9 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -27,7 +27,6 @@
>   #include "xe_gt.h"
>   #include "xe_gt_clock.h"
>   #include "xe_gt_printk.h"
> -#include "xe_gt_sriov_vf.h"
>   #include "xe_guc.h"
>   #include "xe_guc_capture.h"
>   #include "xe_guc_ct.h"
> @@ -702,6 +701,11 @@ static u32 wq_space_until_wrap(struct xe_exec_queue *q)
>   	return (WQ_SIZE - q->guc->wqi_tail);
>   }
>   
> +static bool vf_recovery(struct xe_guc *guc)
> +{
> +	return xe_gt_recovery_pending(guc_to_gt(guc));
> +}
> +
>   static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
>   {
>   	struct xe_guc *guc = exec_queue_to_guc(q);
> @@ -711,7 +715,7 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
>   
>   #define AVAILABLE_SPACE \
>   	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
> -	if (wqi_size > AVAILABLE_SPACE) {
> +	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
>   try_again:
>   		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
>   		if (wqi_size > AVAILABLE_SPACE) {
> @@ -910,9 +914,10 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
>   	ret = wait_event_timeout(guc->ct.wq,
>   				 (!exec_queue_pending_enable(q) &&
>   				  !exec_queue_pending_disable(q)) ||
> -					 xe_guc_read_stopped(guc),
> +					 xe_guc_read_stopped(guc) ||
> +					 vf_recovery(guc),
>   				 HZ * 5);
> -	if (!ret) {
> +	if (!ret && !vf_recovery(guc)) {

Is it possible for vf_recovery() to change its return value between the above
lines? We would end the wait because of recovery and then forget that it happened.

Maybe we should assign to a local?

(concerns all places where we do the check this way)
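
To make the concern concrete, here is a minimal userspace sketch (plain C,
not the xe code) of latching the reason a wait ended instead of re-reading
the flag afterwards. The names model_guc, model_wait and the poll hook are
made up for illustration; whether the same idea can be expressed with
wait_event_timeout() is a separate question this sketch does not answer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the state the real wait conditions read. */
struct model_guc {
	bool recovery;	/* what vf_recovery() would report */
	bool pending;	/* what exec_queue_pending_disable() would report */
};

/* Optional hook run once per poll so a caller can flip flags mid-wait. */
typedef void (*poll_hook)(struct model_guc *guc, int iteration);

enum wait_reason { WAIT_TIMED_OUT, WAIT_COMPLETED, WAIT_RECOVERY };

/* Poll up to max_polls times and report *why* the wait ended. */
static enum wait_reason model_wait(struct model_guc *guc, int max_polls,
				   poll_hook hook)
{
	for (int i = 0; i < max_polls; i++) {
		if (hook)
			hook(guc, i);
		/* Latch the recovery flag at the moment it ends the wait. */
		if (guc->recovery)
			return WAIT_RECOVERY;
		if (!guc->pending)
			return WAIT_COMPLETED;
	}
	return WAIT_TIMED_OUT;
}

/* Example hook: recovery starts on the third poll. */
static void raise_recovery_on_third_poll(struct model_guc *guc, int iteration)
{
	if (iteration == 2)
		guc->recovery = true;
}
```

A caller would branch on the returned reason, so a flag that clears right
after the wake-up cannot turn a recovery exit into a timeout warning.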

-Tomasz

>   		struct xe_gpu_scheduler *sched = &q->guc->sched;
>   
>   		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
> @@ -1015,6 +1020,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>   	bool wedged = false;
>   
>   	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
> +
> +	if (vf_recovery(guc))
> +		return;
> +
>   	trace_xe_exec_queue_lr_cleanup(q);
>   
>   	if (!exec_queue_killed(q))
> @@ -1047,7 +1056,11 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>   		 */
>   		ret = wait_event_timeout(guc->ct.wq,
>   					 !exec_queue_pending_disable(q) ||
> -					 xe_guc_read_stopped(guc), HZ * 5);
> +					 xe_guc_read_stopped(guc) ||
> +					 vf_recovery(guc), HZ * 5);
> +		if (vf_recovery(guc))
> +			return;
> +
>   		if (!ret) {
>   			xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
>   				   q->guc->id);
> @@ -1137,8 +1150,9 @@ static void enable_scheduling(struct xe_exec_queue *q)
>   
>   	ret = wait_event_timeout(guc->ct.wq,
>   				 !exec_queue_pending_enable(q) ||
> -				 xe_guc_read_stopped(guc), HZ * 5);
> -	if (!ret || xe_guc_read_stopped(guc)) {
> +				 xe_guc_read_stopped(guc) ||
> +				 vf_recovery(guc), HZ * 5);
> +	if ((!ret && !vf_recovery(guc)) || xe_guc_read_stopped(guc)) {
>   		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
>   		set_exec_queue_banned(q);
>   		xe_gt_reset_async(q->gt);
> @@ -1209,7 +1223,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   	 * list so job can be freed and kick scheduler ensuring free job is not
>   	 * lost.
>   	 */
> -	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
> +	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags) ||
> +	    vf_recovery(guc))
>   		return DRM_GPU_SCHED_STAT_NO_HANG;
>   
>   	/* Kill the run_job entry point */
> @@ -1261,7 +1276,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   			ret = wait_event_timeout(guc->ct.wq,
>   						 (!exec_queue_pending_enable(q) &&
>   						  !exec_queue_pending_disable(q)) ||
> -						 xe_guc_read_stopped(guc), HZ * 5);
> +						 xe_guc_read_stopped(guc) ||
> +						 vf_recovery(guc), HZ * 5);
> +			if (vf_recovery(guc))
> +				goto handle_vf_resume;
>   			if (!ret || xe_guc_read_stopped(guc))
>   				goto trigger_reset;
>   
> @@ -1286,7 +1304,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   		smp_rmb();
>   		ret = wait_event_timeout(guc->ct.wq,
>   					 !exec_queue_pending_disable(q) ||
> -					 xe_guc_read_stopped(guc), HZ * 5);
> +					 xe_guc_read_stopped(guc) ||
> +					 vf_recovery(guc), HZ * 5);
> +		if (vf_recovery(guc))
> +			goto handle_vf_resume;
>   		if (!ret || xe_guc_read_stopped(guc)) {
>   trigger_reset:
>   			if (!ret)
> @@ -1391,6 +1412,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   	 * some thought, do this in a follow up.
>   	 */
>   	xe_sched_submission_start(sched);
> +handle_vf_resume:
>   	return DRM_GPU_SCHED_STAT_NO_HANG;
>   }
>   
> @@ -1487,11 +1509,17 @@ static void __guc_exec_queue_process_msg_set_sched_props(struct xe_sched_msg *ms
>   
>   static void __suspend_fence_signal(struct xe_exec_queue *q)
>   {
> +	struct xe_guc *guc = exec_queue_to_guc(q);
> +	struct xe_device *xe = guc_to_xe(guc);
> +
>   	if (!q->guc->suspend_pending)
>   		return;
>   
>   	WRITE_ONCE(q->guc->suspend_pending, false);
> -	wake_up(&q->guc->suspend_wait);
> +	if (IS_SRIOV_VF(xe))
> +		wake_up_all(&guc->ct.wq);
> +	else
> +		wake_up(&q->guc->suspend_wait);
>   }
>   
>   static void suspend_fence_signal(struct xe_exec_queue *q)
> @@ -1512,8 +1540,9 @@ static void __guc_exec_queue_process_msg_suspend(struct xe_sched_msg *msg)
>   
>   	if (guc_exec_queue_allowed_to_change_state(q) && !exec_queue_suspended(q) &&
>   	    exec_queue_enabled(q)) {
> -		wait_event(guc->ct.wq, (q->guc->resume_time != RESUME_PENDING ||
> -			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q));
> +		wait_event(guc->ct.wq, vf_recovery(guc) ||
> +			   ((q->guc->resume_time != RESUME_PENDING ||
> +			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q)));
>   
>   		if (!xe_guc_read_stopped(guc)) {
>   			s64 since_resume_ms =
> @@ -1640,7 +1669,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
>   
>   	q->entity = &ge->entity;
>   
> -	if (xe_guc_read_stopped(guc))
> +	if (xe_guc_read_stopped(guc) || vf_recovery(guc))
>   		xe_sched_stop(sched);
>   
>   	mutex_unlock(&guc->submission_state.lock);
> @@ -1786,6 +1815,7 @@ static int guc_exec_queue_suspend(struct xe_exec_queue *q)
>   static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
>   {
>   	struct xe_guc *guc = exec_queue_to_guc(q);
> +	struct xe_device *xe = guc_to_xe(guc);
>   	int ret;
>   
>   	/*
> @@ -1793,11 +1823,22 @@ static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
>   	 * suspend_pending upon kill but to be paranoid but races in which
>   	 * suspend_pending is set after kill also check kill here.
>   	 */
> -	ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> -					       !READ_ONCE(q->guc->suspend_pending) ||
> -					       exec_queue_killed(q) ||
> -					       xe_guc_read_stopped(guc),
> -					       HZ * 5);
> +	if (IS_SRIOV_VF(xe))
> +		ret = wait_event_interruptible_timeout(guc->ct.wq,
> +						       !READ_ONCE(q->guc->suspend_pending) ||
> +						       exec_queue_killed(q) ||
> +						       xe_guc_read_stopped(guc) ||
> +						       vf_recovery(guc),
> +						       HZ * 5);
> +	else
> +		ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> +						       !READ_ONCE(q->guc->suspend_pending) ||
> +						       exec_queue_killed(q) ||
> +						       xe_guc_read_stopped(guc),
> +						       HZ * 5);
> +
> +	if (vf_recovery(guc) && !xe_device_wedged((guc_to_xe(guc))))
> +		return -EAGAIN;
>   
>   	if (!ret) {
>   		xe_gt_warn(guc_to_gt(guc),
> @@ -1905,8 +1946,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>   {
>   	int ret;
>   
> -	if (xe_gt_WARN_ON(guc_to_gt(guc),
> -			  xe_gt_sriov_vf_recovery_pending(guc_to_gt(guc))))
> +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>   		return 0;
>   
>   	if (!guc->submission_state.initialized)
> diff --git a/drivers/gpu/drm/xe/xe_preempt_fence.c b/drivers/gpu/drm/xe/xe_preempt_fence.c
> index 83fbeea5aa20..7f587ca3947d 100644
> --- a/drivers/gpu/drm/xe/xe_preempt_fence.c
> +++ b/drivers/gpu/drm/xe/xe_preempt_fence.c
> @@ -8,6 +8,8 @@
>   #include <linux/slab.h>
>   
>   #include "xe_exec_queue.h"
> +#include "xe_gt_printk.h"
> +#include "xe_guc_exec_queue_types.h"
>   #include "xe_vm.h"
>   
>   static void preempt_fence_work_func(struct work_struct *w)
> @@ -22,6 +24,15 @@ static void preempt_fence_work_func(struct work_struct *w)
>   	} else if (!q->ops->reset_status(q)) {
>   		int err = q->ops->suspend_wait(q);
>   
> +		if (err == -EAGAIN) {
> +			xe_gt_dbg(q->gt, "PREEMPT FENCE RETRY guc_id=%d",
> +				  q->guc->id);
> +			queue_work(q->vm->xe->preempt_fence_wq,
> +				   &pfence->preempt_work);
> +			dma_fence_end_signalling(cookie);
> +			return;
> +		}
> +
>   		if (err)
>   			dma_fence_set_error(&pfence->base, err);
>   	} else {


* Re: [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-10-06 22:24   ` Lucas De Marchi
@ 2025-10-06 22:51     ` Matthew Brost
  2025-10-07 17:00       ` Lucas De Marchi
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 22:51 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: intel-xe, Matt Roper

On Mon, Oct 06, 2025 at 05:24:55PM -0500, Lucas De Marchi wrote:
> On Mon, Oct 06, 2025 at 04:10:35AM -0700, Matthew Brost wrote:
> > VF CCS restore is a primary GT operation on which the media GT depends.
> > Therefore, it doesn't make much sense to run these operations in
> 
> I'd need to double check the previous patches to see the entire
> picture, but this seems weird at first glance. The VF CCS restore is
> not the only work we queue in gt->ordered_wq. To me the question is more
> like "in what ordered queue are we going to queue the VF CCS restore?" If
> it's global per device, why are we not using the device wq rather than
> making all the GT wqs point to the same thing?
> 

This is where things get convoluted. Four mechanisms manipulate the
scheduling state by starting or stopping the DRM scheduler:

- Job timeouts
- GT resets
- VF restore
- PM resume

All of these paths require mutual exclusion, or the scheduler design
breaks.

The first three ensure ordering by scheduling on the same ordered work
queue (GT-ordered WQ). The last one is guaranteed by holding PM
references in all the right places.

Another issue is that the first three items are all in the reclaim path,
which the GT-ordered work queue (allocated with WQ_MEM_RECLAIM) is designed
to handle.

This patch [1] explains the entire scheduler design in detail.

Only PTL has the cross-VF/GT restore ordering requirement, so I figured
the path of least resistance is to just point all GTs to the primary
work queue.

[1] https://patchwork.freedesktop.org/patch/677980/?series=154627&rev=4
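
As a rough userspace model of that last point (not driver code; model_wq,
model_gt and the helpers below are invented names), pointing the media GT at
the primary GT's queue puts the work from both GTs into one FIFO, so items
from the two GTs are serialized in submission order:

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_WQ_DEPTH 8

/* Toy ordered queue: items are just integer IDs, run strictly in order. */
struct model_wq {
	int items[MODEL_WQ_DEPTH];
	size_t count;
};

struct model_gt {
	struct model_wq own_wq;		/* storage for a private queue */
	struct model_wq *ordered_wq;	/* queue this GT actually uses */
};

/* Use the primary GT's queue when sharing is requested, else our own. */
static void model_gt_init(struct model_gt *gt, struct model_gt *primary,
			  bool share)
{
	gt->own_wq.count = 0;
	gt->ordered_wq = (share && primary) ? primary->ordered_wq : &gt->own_wq;
}

/* Append a work ID to whichever queue the GT points at. */
static bool model_queue_work(struct model_gt *gt, int id)
{
	struct model_wq *wq = gt->ordered_wq;

	if (wq->count == MODEL_WQ_DEPTH)
		return false;
	wq->items[wq->count++] = id;
	return true;
}

/* "Run" all queued items in FIFO order, copying their IDs to out[]. */
static size_t model_drain(struct model_wq *wq, int *out)
{
	size_t n = wq->count;

	for (size_t i = 0; i < n; i++)
		out[i] = wq->items[i];
	wq->count = 0;
	return n;
}
```

With sharing enabled, work queued alternately on the primary and media GT
drains from a single queue in the order it was submitted, which is the
ordering the PTL VF restore path relies on in this model.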

> > parallel. To address this, point the media GT's ordered work queue to
> > the primary GT's ordered work queue on platforms that require (PTL VFs)
> > CCS restore as part of VF post-migration recovery.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_device_types.h | 2 ++
> > drivers/gpu/drm/xe/xe_gt.c           | 7 +++++--
> > drivers/gpu/drm/xe/xe_gt.h           | 2 +-
> > drivers/gpu/drm/xe/xe_pci.c          | 6 +++++-
> > drivers/gpu/drm/xe/xe_pci_types.h    | 1 +
> > drivers/gpu/drm/xe/xe_tile.c         | 2 +-
> > 6 files changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > index c66523bf4bf0..02c04ad7296e 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -334,6 +334,8 @@ struct xe_device {
> > 		u8 skip_mtcfg:1;
> > 		/** @info.skip_pcode: skip access to PCODE uC */
> > 		u8 skip_pcode:1;
> > +		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
> > +		u8 needs_shared_vf_gt_wq:1;
> > 	} info;
> > 
> > 	/** @wa_active: keep track of active workarounds */
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index cf484a2da35e..05465f358c96 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -65,7 +65,7 @@
> > #include "xe_wa.h"
> > #include "xe_wopcm.h"
> > 
> > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)
> 
> If using the device wq is not an option (possibly because it would queue
> with other undesired work going on there), then I'd rather drop this
> bool passing here and make the decision inside this function:
> 
> > {
> > 	struct drm_device *drm = &tile_to_xe(tile)->drm;
> > 	struct xe_gt *gt;
> > @@ -75,7 +75,10 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> > 		return ERR_PTR(-ENOMEM);
> > 
> > 	gt->tile = tile;
> 
> 	if (!xe->info.needs_shared_gt_wq || !tile->primary_gt->ordered_wq)
> 		ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> 	else
> 		ordered_wq = tile->primary_gt->ordered_wq;
> 	if (IS_ERR_OR_NULL(ordered_wq))
> 		return ordered_wq ? ERR_CAST(ordered_wq) : ERR_PTR(-EINVAL);
> 
> 	gt->ordered_wq = ordered_wq;
> 	
> ... or something like that so you use the xe info to decide it here
> rather than passing it down as a function arg.
> 

Sure, I can refactor this as you suggest.

> 
> > -	gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > +	if (use_primary_wq)
> > +		gt->ordered_wq = tile->primary_gt->ordered_wq;
> > +	else
> > +		gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > 	if (IS_ERR(gt->ordered_wq))
> > 		return ERR_CAST(gt->ordered_wq);
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> > index 5df2ffe3ff83..9545c0c93ab6 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.h
> > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > @@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
> > 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
> > }
> > 
> > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
> > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
> > int xe_gt_init_early(struct xe_gt *gt);
> > int xe_gt_init(struct xe_gt *gt);
> > void xe_gt_mmio_init(struct xe_gt *gt);
> > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > index 3f42b91efa28..25a1d96a68e7 100644
> > --- a/drivers/gpu/drm/xe/xe_pci.c
> > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > @@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
> > 	.has_sriov = true,
> > 	.max_gt_per_tile = 2,
> > 	.needs_scratch = true,
> > +	.needs_shared_vf_gt_wq = true,
> 
> as per above... I think this needs to be detached from vf. There may be
> other reasons the wq needs to be shared.
> 

I’d prefer to share the work queue only when absolutely necessary, such
as when PTL is on a VF and migration is supported.

> If we just make them point to a device wq as suggested above, then
> there's no extra issue with the ongoing work to disable GTs that Matt
> Roper is doing (https://patchwork.freedesktop.org/series/154739/).
> Otherwise we will need to think about how to reconcile them.
> 

I don’t see how it would ever be possible to disable the primary GT and
still do anything meaningful. You can’t allocate memory (e.g., for
clears), perform VM binds, or handle page faults without a copy engine.

Matt

> Lucas De Marchi
> 
> > };
> > 
> > #undef PLATFORM
> > @@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
> > 	xe->info.skip_mtcfg = desc->skip_mtcfg;
> > 	xe->info.skip_pcode = desc->skip_pcode;
> > 	xe->info.needs_scratch = desc->needs_scratch;
> > +	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
> > 
> > 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
> > 				 xe_modparam.probe_display &&
> > @@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
> > 		 * Allocate and setup media GT for platforms with standalone
> > 		 * media.
> > 		 */
> > -		tile->media_gt = xe_gt_alloc(tile);
> > +		tile->media_gt = xe_gt_alloc(tile,
> > +					     xe->info.needs_shared_vf_gt_wq &&
> > +					     IS_SRIOV_VF(xe));
> > 		if (IS_ERR(tile->media_gt))
> > 			return PTR_ERR(tile->media_gt);
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
> > index 9b9766a3baa3..b11bf6abda5b 100644
> > --- a/drivers/gpu/drm/xe/xe_pci_types.h
> > +++ b/drivers/gpu/drm/xe/xe_pci_types.h
> > @@ -48,6 +48,7 @@ struct xe_device_desc {
> > 	u8 skip_guc_pc:1;
> > 	u8 skip_mtcfg:1;
> > 	u8 skip_pcode:1;
> > +	u8 needs_shared_vf_gt_wq:1;
> > };
> > 
> > struct xe_graphics_desc {
> > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> > index 6edb5062c1da..e9bcff2de563 100644
> > --- a/drivers/gpu/drm/xe/xe_tile.c
> > +++ b/drivers/gpu/drm/xe/xe_tile.c
> > @@ -157,7 +157,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> > 	if (err)
> > 		return err;
> > 
> > -	tile->primary_gt = xe_gt_alloc(tile);
> > +	tile->primary_gt = xe_gt_alloc(tile, false);
> > 	if (IS_ERR(tile->primary_gt))
> > 		return PTR_ERR(tile->primary_gt);
> > 
> > -- 
> > 2.34.1
> > 


* Re: [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
  2025-10-06 22:21   ` Lis, Tomasz
@ 2025-10-06 22:57     ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 22:57 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-xe

On Tue, Oct 07, 2025 at 12:21:02AM +0200, Lis, Tomasz wrote:
> 
> On 10/6/2025 1:10 PM, Matthew Brost wrote:
> > The only case where the GuC submission backend cannot reason 100%
> > correctly is when a GuC context is registered during VF post-migration
> > recovery. In this scenario, it's possible that the GuC context register
> > H2G is processed, but the immediately following schedule-enable H2G gets
> > lost.
> > 
> > A double register is harmless when using `GUC_HXG_TYPE_EVENT`, as GuC
> > simply drops the duplicate H2G. To keep things simple, use
> > `GUC_HXG_TYPE_EVENT` for all context registrations on VFs.
> > 
> > v5:
> >   - Check for xe_sriov_vf_migration_supported (Tomasz)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_guc_ct.c | 33 +++++++++++++++++++++++++--------
> >   1 file changed, 25 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 9f0090ae64a6..3ac654cebc79 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -32,6 +32,7 @@
> >   #include "xe_guc_tlb_inval.h"
> >   #include "xe_map.h"
> >   #include "xe_pm.h"
> > +#include "xe_sriov_vf.h"
> >   #include "xe_trace_guc.h"
> >   static void receive_g2h(struct xe_guc_ct *ct);
> > @@ -736,6 +737,26 @@ static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
> >   	return seqno;
> >   }
> > +#define MAKE_ACTION(type, __action)				\
> > +({								\
> > +	FIELD_PREP(GUC_HXG_MSG_0_TYPE, type) |			\
> > +	FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |			\
> > +		   GUC_HXG_EVENT_MSG_0_DATA0, __action);	\
> > +})
> > +
> > +static bool vf_action_can_safely_fail(struct xe_device *xe, u32 action)
> > +{
> > +	/*
> > +	 * If we are VF resuming, we can't exactly track if a context
> > +	 * registration has been completed in the GuC state machine, it is
> > +	 * harmless to resend as it will just fail silently if
> > +	 * GUC_HXG_TYPE_EVENT is used.
> 
> Maybe add:
> 
> If the registration H2G fails with error other than ALREADY_REGISTERED, we
> will know due to the shortly following schedule-enable H2G failing.
> 

I've added this.
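
For reference, the header-type choice in h2g_write() boils down to the
following, shown here as a standalone sketch rather than the driver code.
The model_* names and enum values are invented for illustration and do not
match the GuC ABI; fast-request tracking is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the numeric values are NOT the GuC ABI values. */
enum model_hxg_type {
	MODEL_TYPE_REQUEST,
	MODEL_TYPE_EVENT,
	MODEL_TYPE_FAST_REQUEST,
};

enum model_action {
	MODEL_ACTION_REGISTER_CONTEXT,
	MODEL_ACTION_REGISTER_CONTEXT_MULTI_LRC,
	MODEL_ACTION_SCHED_ENABLE,
};

struct model_dev {
	bool is_vf;
	bool migration_supported;
};

/* Only context registration on a migration-capable VF may fail silently. */
static bool model_action_can_safely_fail(const struct model_dev *dev,
					 enum model_action action)
{
	return dev->is_vf && dev->migration_supported &&
	       (action == MODEL_ACTION_REGISTER_CONTEXT ||
		action == MODEL_ACTION_REGISTER_CONTEXT_MULTI_LRC);
}

/* Mirror the if / else-if / else ladder from the patch. */
static enum model_hxg_type model_pick_type(const struct model_dev *dev,
					   enum model_action action,
					   bool want_response)
{
	if (want_response)
		return MODEL_TYPE_REQUEST;
	if (model_action_can_safely_fail(dev, action))
		return MODEL_TYPE_EVENT;
	return MODEL_TYPE_FAST_REQUEST;
}
```

Note that a request for a response takes priority, so a registration that
wants a reply is still sent as a request even on a VF.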

Matt

> Other than that, a second to Michal:
> 
> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
> 
> > +	 */
> > +	return IS_SRIOV_VF(xe) && xe_sriov_vf_migration_supported(xe) &&
> > +		(action == XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC ||
> > +		 action == XE_GUC_ACTION_REGISTER_CONTEXT);
> > +}
> > +
> >   #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one DW HxG header */
> >   static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
> > @@ -807,18 +828,14 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
> >   		FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len) |
> >   		FIELD_PREP(GUC_CTB_MSG_0_FENCE, ct_fence_value);
> >   	if (want_response) {
> > -		cmd[1] =
> > -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> > -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> > -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> > +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_REQUEST, action[0]);
> > +	} else if (vf_action_can_safely_fail(xe, action[0])) {
> > +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_EVENT, action[0]);
> >   	} else {
> >   		fast_req_track(ct, ct_fence_value,
> >   			       FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
> > -		cmd[1] =
> > -			FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_FAST_REQUEST) |
> > -			FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
> > -				   GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
> > +		cmd[1] = MAKE_ACTION(GUC_HXG_TYPE_FAST_REQUEST, action[0]);
> >   	}
> >   	/* H2G header in cmd[1] replaces action[0] so: */


* Re: [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
  2025-10-06 22:27   ` Lis, Tomasz
@ 2025-10-06 23:07     ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-06 23:07 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-xe

On Tue, Oct 07, 2025 at 12:27:06AM +0200, Lis, Tomasz wrote:
> 
> On 10/6/2025 1:10 PM, Matthew Brost wrote:
> > If VF post-migration recovery is in progress, the recovery flow will
> > rebuild all GuC submission state. In this case, exit all waiters to
> > ensure that submission queue scheduling can also be paused. Avoid taking
> > any adverse actions after aborting the wait.
> > 
> > As part of waking up the GuC backend, suspend_wait can now return
> > -EAGAIN indicating the waiter should be retried. If the caller is
> > running in a work item, that work item needs to be requeued to avoid a
> > deadlock in which it blocks the VF migration recovery work item.
> > 
> > v3:
> >   - Don't block in preempt fence work queue as this can interfere with VF
> >     post-migration work queue scheduling leading to deadlock (Testing)
> >   - Use xe_gt_recovery_inprogress (Michal)
> > v5:
> >   - Use static function for vf_recovery (Michal)
> >   - Add helper to wake CT waiters (Michal)
> >   - Move some code to following patch (Michal)
> >   - Adjust commit message to explain suspend_wait returning -EAGAIN (Michal)
> >   - Add kernel doc to suspend_wait around returning -EAGAIN
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_exec_queue_types.h |  3 +
> >   drivers/gpu/drm/xe/xe_gt_sriov_vf.c      |  4 ++
> >   drivers/gpu/drm/xe/xe_guc_ct.h           |  9 +++
> >   drivers/gpu/drm/xe/xe_guc_submit.c       | 82 ++++++++++++++++++------
> >   drivers/gpu/drm/xe/xe_preempt_fence.c    | 11 ++++
> >   5 files changed, 88 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > index 27b76cf9da89..282505fa1377 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > @@ -207,6 +207,9 @@ struct xe_exec_queue_ops {
> >   	 * call after suspend. In dma-fencing path thus must return within a
> >   	 * reasonable amount of time. -ETIME return shall indicate an error
> >   	 * waiting for suspend resulting in associated VM getting killed.
> > +	 * -EAGAIN return indicates the wait should be tried again, if the wait
> > +	 * is within a work item, the work item should be requeued as deadlock
> > +	 * avoidance mechanism.
> >   	 */
> >   	int (*suspend_wait)(struct xe_exec_queue *q);
> >   	/**
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 7057260175f3..7f703336d692 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -23,6 +23,7 @@
> >   #include "xe_gt_sriov_vf.h"
> >   #include "xe_gt_sriov_vf_types.h"
> >   #include "xe_guc.h"
> > +#include "xe_guc_ct.h"
> >   #include "xe_guc_hxg_helpers.h"
> >   #include "xe_guc_relay.h"
> >   #include "xe_guc_submit.h"
> > @@ -743,6 +744,9 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
> >   	    !gt->sriov.vf.migration.recovery_teardown) {
> >   		gt->sriov.vf.migration.recovery_queued = true;
> >   		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> > +		smp_wmb();	/* Ensure above write visible before wake */
> > +
> > +		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
> >   		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
> >   		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> > index d6c81325a76c..ca0ec938edac 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> > @@ -72,4 +72,13 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
> >   long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
> > +/**
> > + * xe_guc_ct_wake_waiters() - GuC CT wake up waiters
> > + * @ct: GuC CT object
> > + */
> > +static inline void xe_guc_ct_wake_waiters(struct xe_guc_ct *ct)
> > +{
> > +	wake_up_all(&ct->wq);
> > +}
> > +
> >   #endif
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 59371b7cc8a4..b2ca4911efe9 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -27,7 +27,6 @@
> >   #include "xe_gt.h"
> >   #include "xe_gt_clock.h"
> >   #include "xe_gt_printk.h"
> > -#include "xe_gt_sriov_vf.h"
> >   #include "xe_guc.h"
> >   #include "xe_guc_capture.h"
> >   #include "xe_guc_ct.h"
> > @@ -702,6 +701,11 @@ static u32 wq_space_until_wrap(struct xe_exec_queue *q)
> >   	return (WQ_SIZE - q->guc->wqi_tail);
> >   }
> > +static bool vf_recovery(struct xe_guc *guc)
> > +{
> > +	return xe_gt_recovery_pending(guc_to_gt(guc));
> > +}
> > +
> >   static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
> >   {
> >   	struct xe_guc *guc = exec_queue_to_guc(q);
> > @@ -711,7 +715,7 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
> >   #define AVAILABLE_SPACE \
> >   	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
> > -	if (wqi_size > AVAILABLE_SPACE) {
> > +	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
> >   try_again:
> >   		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
> >   		if (wqi_size > AVAILABLE_SPACE) {
> > @@ -910,9 +914,10 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
> >   	ret = wait_event_timeout(guc->ct.wq,
> >   				 (!exec_queue_pending_enable(q) &&
> >   				  !exec_queue_pending_disable(q)) ||
> > -					 xe_guc_read_stopped(guc),
> > +					 xe_guc_read_stopped(guc) ||
> > +					 vf_recovery(guc),
> >   				 HZ * 5);
> > -	if (!ret) {
> > +	if (!ret && !vf_recovery(guc)) {
> 
> Is it possible for vf_recovery() to change its retval between the above
> lines? Ending the wait due to recovery, and then forgetting that happened?
> 

I don't think this can change in practice. The first thing the resfix
IRQ does is wake up all waiters, so they should pop out immediately. Most
of the waiters are in the queue-stopping path, which VF recovery itself
triggers, so vf_recovery() shouldn't be able to change under them. The one
waiter that is not in that path is the suspend fence; I think I need to add
a little extra logic there to fix up that path.

> Maybe we should assign to a local?
> 

I don't think that is possible with how wait_event_timeout is designed.

Matt

> (concerns all places where we do the check this way)
> 
> -Tomasz
> 
> >   		struct xe_gpu_scheduler *sched = &q->guc->sched;
> >   		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
> > @@ -1015,6 +1020,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> >   	bool wedged = false;
> >   	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
> > +
> > +	if (vf_recovery(guc))
> > +		return;
> > +
> >   	trace_xe_exec_queue_lr_cleanup(q);
> >   	if (!exec_queue_killed(q))
> > @@ -1047,7 +1056,11 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> >   		 */
> >   		ret = wait_event_timeout(guc->ct.wq,
> >   					 !exec_queue_pending_disable(q) ||
> > -					 xe_guc_read_stopped(guc), HZ * 5);
> > +					 xe_guc_read_stopped(guc) ||
> > +					 vf_recovery(guc), HZ * 5);
> > +		if (vf_recovery(guc))
> > +			return;
> > +
> >   		if (!ret) {
> >   			xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
> >   				   q->guc->id);
> > @@ -1137,8 +1150,9 @@ static void enable_scheduling(struct xe_exec_queue *q)
> >   	ret = wait_event_timeout(guc->ct.wq,
> >   				 !exec_queue_pending_enable(q) ||
> > -				 xe_guc_read_stopped(guc), HZ * 5);
> > -	if (!ret || xe_guc_read_stopped(guc)) {
> > +				 xe_guc_read_stopped(guc) ||
> > +				 vf_recovery(guc), HZ * 5);
> > +	if ((!ret && !vf_recovery(guc)) || xe_guc_read_stopped(guc)) {
> >   		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
> >   		set_exec_queue_banned(q);
> >   		xe_gt_reset_async(q->gt);
> > @@ -1209,7 +1223,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >   	 * list so job can be freed and kick scheduler ensuring free job is not
> >   	 * lost.
> >   	 */
> > -	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
> > +	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags) ||
> > +	    vf_recovery(guc))
> >   		return DRM_GPU_SCHED_STAT_NO_HANG;
> >   	/* Kill the run_job entry point */
> > @@ -1261,7 +1276,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >   			ret = wait_event_timeout(guc->ct.wq,
> >   						 (!exec_queue_pending_enable(q) &&
> >   						  !exec_queue_pending_disable(q)) ||
> > -						 xe_guc_read_stopped(guc), HZ * 5);
> > +						 xe_guc_read_stopped(guc) ||
> > +						 vf_recovery(guc), HZ * 5);
> > +			if (vf_recovery(guc))
> > +				goto handle_vf_resume;
> >   			if (!ret || xe_guc_read_stopped(guc))
> >   				goto trigger_reset;
> > @@ -1286,7 +1304,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >   		smp_rmb();
> >   		ret = wait_event_timeout(guc->ct.wq,
> >   					 !exec_queue_pending_disable(q) ||
> > -					 xe_guc_read_stopped(guc), HZ * 5);
> > +					 xe_guc_read_stopped(guc) ||
> > +					 vf_recovery(guc), HZ * 5);
> > +		if (vf_recovery(guc))
> > +			goto handle_vf_resume;
> >   		if (!ret || xe_guc_read_stopped(guc)) {
> >   trigger_reset:
> >   			if (!ret)
> > @@ -1391,6 +1412,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >   	 * some thought, do this in a follow up.
> >   	 */
> >   	xe_sched_submission_start(sched);
> > +handle_vf_resume:
> >   	return DRM_GPU_SCHED_STAT_NO_HANG;
> >   }
> > @@ -1487,11 +1509,17 @@ static void __guc_exec_queue_process_msg_set_sched_props(struct xe_sched_msg *ms
> >   static void __suspend_fence_signal(struct xe_exec_queue *q)
> >   {
> > +	struct xe_guc *guc = exec_queue_to_guc(q);
> > +	struct xe_device *xe = guc_to_xe(guc);
> > +
> >   	if (!q->guc->suspend_pending)
> >   		return;
> >   	WRITE_ONCE(q->guc->suspend_pending, false);
> > -	wake_up(&q->guc->suspend_wait);
> > +	if (IS_SRIOV_VF(xe))
> > +		wake_up_all(&guc->ct.wq);
> > +	else
> > +		wake_up(&q->guc->suspend_wait);
> >   }
> >   static void suspend_fence_signal(struct xe_exec_queue *q)
> > @@ -1512,8 +1540,9 @@ static void __guc_exec_queue_process_msg_suspend(struct xe_sched_msg *msg)
> >   	if (guc_exec_queue_allowed_to_change_state(q) && !exec_queue_suspended(q) &&
> >   	    exec_queue_enabled(q)) {
> > -		wait_event(guc->ct.wq, (q->guc->resume_time != RESUME_PENDING ||
> > -			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q));
> > +		wait_event(guc->ct.wq, vf_recovery(guc) ||
> > +			   ((q->guc->resume_time != RESUME_PENDING ||
> > +			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q)));
> >   		if (!xe_guc_read_stopped(guc)) {
> >   			s64 since_resume_ms =
> > @@ -1640,7 +1669,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
> >   	q->entity = &ge->entity;
> > -	if (xe_guc_read_stopped(guc))
> > +	if (xe_guc_read_stopped(guc) || vf_recovery(guc))
> >   		xe_sched_stop(sched);
> >   	mutex_unlock(&guc->submission_state.lock);
> > @@ -1786,6 +1815,7 @@ static int guc_exec_queue_suspend(struct xe_exec_queue *q)
> >   static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
> >   {
> >   	struct xe_guc *guc = exec_queue_to_guc(q);
> > +	struct xe_device *xe = guc_to_xe(guc);
> >   	int ret;
> >   	/*
> > @@ -1793,11 +1823,22 @@ static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
> >   	 * suspend_pending upon kill but to be paranoid but races in which
> >   	 * suspend_pending is set after kill also check kill here.
> >   	 */
> > -	ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> > -					       !READ_ONCE(q->guc->suspend_pending) ||
> > -					       exec_queue_killed(q) ||
> > -					       xe_guc_read_stopped(guc),
> > -					       HZ * 5);
> > +	if (IS_SRIOV_VF(xe))
> > +		ret = wait_event_interruptible_timeout(guc->ct.wq,
> > +						       !READ_ONCE(q->guc->suspend_pending) ||
> > +						       exec_queue_killed(q) ||
> > +						       xe_guc_read_stopped(guc) ||
> > +						       vf_recovery(guc),
> > +						       HZ * 5);
> > +	else
> > +		ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> > +						       !READ_ONCE(q->guc->suspend_pending) ||
> > +						       exec_queue_killed(q) ||
> > +						       xe_guc_read_stopped(guc),
> > +						       HZ * 5);
> > +
> > +	if (vf_recovery(guc) && !xe_device_wedged((guc_to_xe(guc))))
> > +		return -EAGAIN;
> >   	if (!ret) {
> >   		xe_gt_warn(guc_to_gt(guc),
> > @@ -1905,8 +1946,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> >   {
> >   	int ret;
> > -	if (xe_gt_WARN_ON(guc_to_gt(guc),
> > -			  xe_gt_sriov_vf_recovery_pending(guc_to_gt(guc))))
> > +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> >   		return 0;
> >   	if (!guc->submission_state.initialized)
> > diff --git a/drivers/gpu/drm/xe/xe_preempt_fence.c b/drivers/gpu/drm/xe/xe_preempt_fence.c
> > index 83fbeea5aa20..7f587ca3947d 100644
> > --- a/drivers/gpu/drm/xe/xe_preempt_fence.c
> > +++ b/drivers/gpu/drm/xe/xe_preempt_fence.c
> > @@ -8,6 +8,8 @@
> >   #include <linux/slab.h>
> >   #include "xe_exec_queue.h"
> > +#include "xe_gt_printk.h"
> > +#include "xe_guc_exec_queue_types.h"
> >   #include "xe_vm.h"
> >   static void preempt_fence_work_func(struct work_struct *w)
> > @@ -22,6 +24,15 @@ static void preempt_fence_work_func(struct work_struct *w)
> >   	} else if (!q->ops->reset_status(q)) {
> >   		int err = q->ops->suspend_wait(q);
> > +		if (err == -EAGAIN) {
> > +			xe_gt_dbg(q->gt, "PREEMPT FENCE RETRY guc_id=%d",
> > +				  q->guc->id);
> > +			queue_work(q->vm->xe->preempt_fence_wq,
> > +				   &pfence->preempt_work);
> > +			dma_fence_end_signalling(cookie);
> > +			return;
> > +		}
> > +
> >   		if (err)
> >   			dma_fence_set_error(&pfence->base, err);
> >   	} else {

* Re: [PATCH v6 00/30] VF migration redesign
  2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
                   ` (33 preceding siblings ...)
  2025-10-06 14:28 ` ✗ Xe.CI.Full: " Patchwork
@ 2025-10-07  0:20 ` Niranjana Vishwanathapura
  2025-10-07  1:11   ` Matthew Brost
  34 siblings, 1 reply; 58+ messages in thread
From: Niranjana Vishwanathapura @ 2025-10-07  0:20 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

On Mon, Oct 06, 2025 at 04:10:08AM -0700, Matthew Brost wrote:
>Rather than modifying buffers in place using GGTT addresses during VF
>migration, this approach relies on the submission backend's stop/start
>mechanism to issue fixups. The patch titled "Document GuC Submission
>Backend" provides a detailed explanation of the design.

I don't see the "Document GuC Submission Backend" patch in this version
of the patch series. I saw it in an older version of the series, but even
there, I think the xe_guc_submit.rst file was missing.

Niranjana

>
>Testing was performed using an out-of-tree PF/VFIO driver with manual
>triggering of VF migration while IGT test cases are running.
>
>IGT test cases:
>
>- A new series [1] that exercises active contexts, job resubmission, and
>  compressed memory.
>
>- A new test [2] that actively creates / destroys queue on each
>  submission
>
>- xe_exec_threads basic sections, which test context registration loss,
>  schedule enable loss, and job resubmission.
>
>- xe_exec_threads balancer sections, which follow the same flows as the
>  basic sections but include a work queue (GGTT address shift).
>
>- xe_exec_threads compute mode user pointer invalidation sections, which
>  exercise the same flow as the basic sections, plus replaying
>  suspend/resume flows.
>
>All code paths in "Replay GuC submission state on pause/unpause" that
>replay state have been manually verified via debug messages "Add debug
>prints for GuC replaying state during VF recovery".
>
>v2:
> - Fix lockdep splat
> - Fix checkpatch
> - Fix PTL issue with LRC W/A buffer
> - Fix race creating / destroying queues across migration exposed by [2]
> - Include a version of Satya's patches in [3] which enable CCS save /
>   restore across VF migration /w GGTT shift
>v3:
> - Address feedback
> - Fix preempt fence mode deadlock /w work queues + VF recovery (Testing)
> - Add NULL checks to scratch LRC allocation
>v4:
> - Fix CI failure
> - Remove config lock
>v5:
> - Fix CI failures related to lockdep
> - Address various comments
>v6:
> - Rebase for CI
>
>Matt
>
>Matthew Brost (28):
>  drm/xe: Add NULL checks to scratch LRC allocation
>  drm/xe: Save off position in ring in which a job was programmed
>  drm/xe/guc: Track pending-enable source in submission state
>  drm/xe: Track LR jobs in DRM scheduler pending list
>  drm/xe: Don't change LRC ring head on job resubmission
>  drm/xe: Make LRC W/A scratch buffer usage consistent
>  drm/xe/vf: Add xe_gt_recovery_pending helper
>  drm/xe/vf: Make VF recovery run on per-GT worker
>  drm/xe/vf: Abort H2G sends during VF post-migration recovery
>  drm/xe/vf: Remove memory allocations from VF post migration recovery
>  drm/xe/vf: Close multi-GT GGTT shift race
>  drm/xe/vf: Teardown VF post migration worker on driver unload
>  drm/xe/vf: Don't allow GT reset to be queued during VF post migration
>    recovery
>  drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
>  drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs
>    supporting migration
>  drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register
>  drm/xe/vf: Flush and stop CTs in VF post migration recovery
>  drm/xe/vf: Reset TLB invalidations during VF post migration recovery
>  drm/xe/vf: Kickstart after resfix in VF post migration recovery
>  drm/xe/vf: Start CTs before resfix VF post migration recovery
>  drm/xe/vf: Abort VF post migration recovery on failure
>  drm/xe/vf: Replay GuC submission state on pause / unpause
>  drm/xe: Move queue init before LRC creation
>  drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
>  drm/xe/vf: Workaround for race condition in GuC firmware during VF
>    pause
>  drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
>  drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
>  drm/xe/vf: Rebase CCS save/restore BB GGTT addresses
>
>Satyanarayana K V P (2):
>  drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
>  drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
>
> drivers/gpu/drm/xe/xe_device_types.h         |   5 +
> drivers/gpu/drm/xe/xe_exec.c                 |  12 +-
> drivers/gpu/drm/xe/xe_exec_queue.c           |  64 +--
> drivers/gpu/drm/xe/xe_exec_queue.h           |   2 -
> drivers/gpu/drm/xe/xe_exec_queue_types.h     |   3 +
> drivers/gpu/drm/xe/xe_execlist.c             |   2 +-
> drivers/gpu/drm/xe/xe_gpu_scheduler.c        |  14 +
> drivers/gpu/drm/xe/xe_gpu_scheduler.h        |   2 +
> drivers/gpu/drm/xe/xe_gt.c                   |  28 +-
> drivers/gpu/drm/xe/xe_gt.h                   |  15 +-
> drivers/gpu/drm/xe/xe_gt_sriov_vf.c          | 458 +++++++++++++----
> drivers/gpu/drm/xe/xe_gt_sriov_vf.h          |  13 +-
> drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h    |  33 +-
> drivers/gpu/drm/xe/xe_guc.c                  |   4 +-
> drivers/gpu/drm/xe/xe_guc_ct.c               | 121 +++--
> drivers/gpu/drm/xe/xe_guc_ct.h               |  11 +
> drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |  15 +
> drivers/gpu/drm/xe/xe_guc_submit.c           | 486 +++++++++++++++----
> drivers/gpu/drm/xe/xe_guc_submit.h           |   5 +-
> drivers/gpu/drm/xe/xe_lrc.c                  |  15 +-
> drivers/gpu/drm/xe/xe_lrc.h                  |  10 +
> drivers/gpu/drm/xe/xe_memirq.c               |  48 +-
> drivers/gpu/drm/xe/xe_memirq.h               |   2 +
> drivers/gpu/drm/xe/xe_migrate.c              |  28 +-
> drivers/gpu/drm/xe/xe_pci.c                  |   6 +-
> drivers/gpu/drm/xe/xe_pci_types.h            |   1 +
> drivers/gpu/drm/xe/xe_preempt_fence.c        |  11 +
> drivers/gpu/drm/xe/xe_ring_ops.c             |  23 +-
> drivers/gpu/drm/xe/xe_sched_job_types.h      |   9 +
> drivers/gpu/drm/xe/xe_sriov_vf.c             | 240 ---------
> drivers/gpu/drm/xe/xe_sriov_vf.h             |   1 -
> drivers/gpu/drm/xe/xe_sriov_vf_ccs.c         |  28 ++
> drivers/gpu/drm/xe/xe_sriov_vf_ccs.h         |   1 +
> drivers/gpu/drm/xe/xe_sriov_vf_types.h       |   4 -
> drivers/gpu/drm/xe/xe_tile.c                 |   2 +-
> drivers/gpu/drm/xe/xe_tile_sriov_vf.c        |  30 +-
> drivers/gpu/drm/xe/xe_tile_sriov_vf.h        |   2 +-
> drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h  |  23 +
> drivers/gpu/drm/xe/xe_vm.c                   |  26 +-
> drivers/gpu/drm/xe/xe_vram.c                 |   6 +-
> 40 files changed, 1250 insertions(+), 559 deletions(-)
> create mode 100644 drivers/gpu/drm/xe/xe_tile_sriov_vf_types.h
>
>-- 
>2.34.1
>

* Re: [PATCH v6 00/30] VF migration redesign
  2025-10-07  0:20 ` [PATCH v6 00/30] VF migration redesign Niranjana Vishwanathapura
@ 2025-10-07  1:11   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-07  1:11 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-xe

On Mon, Oct 06, 2025 at 05:20:08PM -0700, Niranjana Vishwanathapura wrote:
> On Mon, Oct 06, 2025 at 04:10:08AM -0700, Matthew Brost wrote:
> > Rather than modifying buffers in place using GGTT addresses during VF
> > migration, this approach relies on the submission backend's stop/start
> > mechanism to issue fixups. The patch titled "Document GuC Submission
> > Backend" provides a detailed explanation of the design.
> 
> I don't see the "Document GuC Submission Backend" patch in this version
> of the patch series. I saw it in an older version of the series, but even
> there, I think the xe_guc_submit.rst file was missing.
> 

I think that patch will be posted as a standalone patch in a follow-up.

Matt

> Niranjana
> 

* Re: [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-10-06 22:51     ` Matthew Brost
@ 2025-10-07 17:00       ` Lucas De Marchi
  2025-10-07 17:22         ` Matthew Brost
  0 siblings, 1 reply; 58+ messages in thread
From: Lucas De Marchi @ 2025-10-07 17:00 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, Matt Roper

On Mon, Oct 06, 2025 at 03:51:12PM -0700, Matthew Brost wrote:
>On Mon, Oct 06, 2025 at 05:24:55PM -0500, Lucas De Marchi wrote:
>> On Mon, Oct 06, 2025 at 04:10:35AM -0700, Matthew Brost wrote:
>> > VF CCS restore is a primary GT operation on which the media GT depends.
>> > Therefore, it doesn't make much sense to run these operations in
>>
>> I'd need to double check the previous patches to see the entire
>> picture, but this seems weird at first glance. The VF CCS restore is
>> not the only work we queue in gt->ordered_wq. To me the question is more
>> like "in what ordered queue are we going to queue the VF CCS restore?" If
>> it's global per device, why are we not using the device wq rather than
>> making all the GT wqs point to the same thing?
>>
>
>This is where things get convoluted. Four mechanisms manipulate the
>scheduling state by starting or stopping the DRM scheduler:
>
>- Job timeouts
>- GT resets
>- VF restore
>- PM resume
>
>All of these paths require mutual exclusion, or the scheduler design
>breaks.
>
>The first three ensure ordering by scheduling on the same ordered work
>queue (GT-ordered WQ). The last one is guaranteed by holding PM
>references in all the right places.
>
>Another issue is that the first three items are all in the reclaim path
>— the GT-ordered work queue is designed to handle this.
>
>This patch [1] explains the entire scheduler design in detail.
>
>Only PTL has the cross-VF/GT restore ordering requirement, so I figured
>the path of least resistance is to just point all GTs to the primary
>work queue.
>
>[1] https://patchwork.freedesktop.org/patch/677980/?series=154627&rev=4
>
>> > parallel. To address this, point the media GT's ordered work queue to
>> > the primary GT's ordered work queue on platforms that require (PTL VFs)
>> > CCS restore as part of VF post-migration recovery.
>> >
>> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> > ---
>> > drivers/gpu/drm/xe/xe_device_types.h | 2 ++
>> > drivers/gpu/drm/xe/xe_gt.c           | 7 +++++--
>> > drivers/gpu/drm/xe/xe_gt.h           | 2 +-
>> > drivers/gpu/drm/xe/xe_pci.c          | 6 +++++-
>> > drivers/gpu/drm/xe/xe_pci_types.h    | 1 +
>> > drivers/gpu/drm/xe/xe_tile.c         | 2 +-
>> > 6 files changed, 15 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> > index c66523bf4bf0..02c04ad7296e 100644
>> > --- a/drivers/gpu/drm/xe/xe_device_types.h
>> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> > @@ -334,6 +334,8 @@ struct xe_device {
>> > 		u8 skip_mtcfg:1;
>> > 		/** @info.skip_pcode: skip access to PCODE uC */
>> > 		u8 skip_pcode:1;
>> > +		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
>> > +		u8 needs_shared_vf_gt_wq:1;
>> > 	} info;
>> >
>> > 	/** @wa_active: keep track of active workarounds */
>> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>> > index cf484a2da35e..05465f358c96 100644
>> > --- a/drivers/gpu/drm/xe/xe_gt.c
>> > +++ b/drivers/gpu/drm/xe/xe_gt.c
>> > @@ -65,7 +65,7 @@
>> > #include "xe_wa.h"
>> > #include "xe_wopcm.h"
>> >
>> > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
>> > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)
>>
>> If using the device wq is not an option (possibly because it would queue
>> with other undesired work going on there), then I'd rather drop this
>> bool passing here and make the decision inside this function:
>>
>> > {
>> > 	struct drm_device *drm = &tile_to_xe(tile)->drm;
>> > 	struct xe_gt *gt;
>> > @@ -75,7 +75,10 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
>> > 		return ERR_PTR(-ENOMEM);
>> >
>> > 	gt->tile = tile;
>>
>> 	if (!xe->info.needs_shared_gt_wq || !tile->primary_gt->ordered_wq)
>> 		ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
>> 	else
>> 		ordered_wq = tile->primary_gt->ordered_wq;
>> 	if (IS_ERR_OR_NULL(ordered_wq))
>> 		return ordered_wq ? ERR_CAST(ordered_wq) : ERR_PTR(-EINVAL);
>>
>> 	gt->ordered_wq = ordered_wq;
>> 	
>> ... or something like that so you use the xe info to decide it here
>> rather than passing it down as a function arg.
>>
>
>Sure can refactor this as you suggest.


Another option that avoids a bool arg would be to pass the wq in from
the caller:

	wq = NULL;

	if (xe->info.needs_shared_gt_wq)
		wq = tile->primary_gt->ordered_wq;

	xe_gt_alloc(tile, wq);
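
For illustration only, and not part of the posted series: a minimal
userspace sketch of the "pass the wq from the caller" shape. Every mock_*
name is a hypothetical stand-in (mock_alloc_ordered_wq() plays the role of
drmm_alloc_ordered_workqueue()), and the owns_wq field is an assumption
added just to make the two cases visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel types; all names are hypothetical. */
struct mock_wq {
	const char *name;
};

struct mock_gt {
	struct mock_wq *ordered_wq;
	int owns_wq;	/* 1 if this GT allocated the queue itself */
};

/* Stand-in for drmm_alloc_ordered_workqueue(); returns NULL on failure. */
static struct mock_wq *mock_alloc_ordered_wq(const char *name)
{
	struct mock_wq *wq = malloc(sizeof(*wq));

	if (wq)
		wq->name = name;
	return wq;
}

/*
 * Allocate a GT. If the caller supplies @shared_wq, reuse it; otherwise
 * allocate a private ordered queue. Returns NULL on allocation failure.
 */
static struct mock_gt *mock_gt_alloc(struct mock_wq *shared_wq)
{
	struct mock_gt *gt = calloc(1, sizeof(*gt));

	if (!gt)
		return NULL;

	if (shared_wq) {
		/* Media GT case: point at the primary GT's queue. */
		gt->ordered_wq = shared_wq;
	} else {
		/* Primary GT (or no sharing needed): private queue. */
		gt->ordered_wq = mock_alloc_ordered_wq("gt-ordered-wq");
		if (!gt->ordered_wq) {
			free(gt);
			return NULL;
		}
		gt->owns_wq = 1;
	}

	return gt;
}
```

With this shape the callee no longer needs to know why the queue is
shared; the caller decides and the allocator only distinguishes "queue
given" from "queue not given".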

>
>>
>> > -	gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
>> > +	if (use_primary_wq)
>> > +		gt->ordered_wq = tile->primary_gt->ordered_wq;
>> > +	else
>> > +		gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
>> > 	if (IS_ERR(gt->ordered_wq))
>> > 		return ERR_CAST(gt->ordered_wq);
>> >
>> > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
>> > index 5df2ffe3ff83..9545c0c93ab6 100644
>> > --- a/drivers/gpu/drm/xe/xe_gt.h
>> > +++ b/drivers/gpu/drm/xe/xe_gt.h
>> > @@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
>> > 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
>> > }
>> >
>> > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
>> > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
>> > int xe_gt_init_early(struct xe_gt *gt);
>> > int xe_gt_init(struct xe_gt *gt);
>> > void xe_gt_mmio_init(struct xe_gt *gt);
>> > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>> > index 3f42b91efa28..25a1d96a68e7 100644
>> > --- a/drivers/gpu/drm/xe/xe_pci.c
>> > +++ b/drivers/gpu/drm/xe/xe_pci.c
>> > @@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
>> > 	.has_sriov = true,
>> > 	.max_gt_per_tile = 2,
>> > 	.needs_scratch = true,
>> > +	.needs_shared_vf_gt_wq = true,
>>
>> as per above... I think this needs to be detached from vf. There may be
>> other reasons the wq needs to be shared.
>>
>
>I’d prefer to share the work queue only when absolutely necessary — such
>as when PTL is on a VF and migration is supported.


I missed that you left the VF condition out as an && for the allocation:

>> > +                                       xe->info.needs_shared_vf_gt_wq &&
>> > +                                       IS_SRIOV_VF(xe));

IMO a better way would be to call it `needs_shared_gt_wq` and then
override the condition in the vf-specific function:

sriov_update_device_info()

Lucas De Marchi

>
>> If we just make them point to a device wq as suggested above, then
>> there's no extra issue with the ongoing work to disable GTs that Matt
>> Roper is doing (https://patchwork.freedesktop.org/series/154739/).
>> Otherwise we will need to think about how to reconcile them.
>>
>
>I don’t see how it would ever be possible to disable the primary GT and
>still do anything meaningful. You can’t allocate memory (e.g., for
>clears), perform VM binds, or handle page faults without a copy engine.
>
>Matt
>
>> Lucas De Marchi
>>
>> > };
>> >
>> > #undef PLATFORM
>> > @@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
>> > 	xe->info.skip_mtcfg = desc->skip_mtcfg;
>> > 	xe->info.skip_pcode = desc->skip_pcode;
>> > 	xe->info.needs_scratch = desc->needs_scratch;
>> > +	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
>> >
>> > 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
>> > 				 xe_modparam.probe_display &&
>> > @@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
>> > 		 * Allocate and setup media GT for platforms with standalone
>> > 		 * media.
>> > 		 */
>> > -		tile->media_gt = xe_gt_alloc(tile);
>> > +		tile->media_gt = xe_gt_alloc(tile,
>> > +					     xe->info.needs_shared_vf_gt_wq &&
>> > +					     IS_SRIOV_VF(xe));
>> > 		if (IS_ERR(tile->media_gt))
>> > 			return PTR_ERR(tile->media_gt);
>> >
>> > diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
>> > index 9b9766a3baa3..b11bf6abda5b 100644
>> > --- a/drivers/gpu/drm/xe/xe_pci_types.h
>> > +++ b/drivers/gpu/drm/xe/xe_pci_types.h
>> > @@ -48,6 +48,7 @@ struct xe_device_desc {
>> > 	u8 skip_guc_pc:1;
>> > 	u8 skip_mtcfg:1;
>> > 	u8 skip_pcode:1;
>> > +	u8 needs_shared_vf_gt_wq:1;
>> > };
>> >
>> > struct xe_graphics_desc {
>> > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
>> > index 6edb5062c1da..e9bcff2de563 100644
>> > --- a/drivers/gpu/drm/xe/xe_tile.c
>> > +++ b/drivers/gpu/drm/xe/xe_tile.c
>> > @@ -157,7 +157,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
>> > 	if (err)
>> > 		return err;
>> >
>> > -	tile->primary_gt = xe_gt_alloc(tile);
>> > +	tile->primary_gt = xe_gt_alloc(tile, false);
>> > 	if (IS_ERR(tile->primary_gt))
>> > 		return PTR_ERR(tile->primary_gt);
>> >
>> > --
>> > 2.34.1
>> >

* Re: [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-10-07 17:00       ` Lucas De Marchi
@ 2025-10-07 17:22         ` Matthew Brost
  2025-10-07 20:36           ` Lucas De Marchi
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2025-10-07 17:22 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: intel-xe, Matt Roper

On Tue, Oct 07, 2025 at 12:00:32PM -0500, Lucas De Marchi wrote:
> On Mon, Oct 06, 2025 at 03:51:12PM -0700, Matthew Brost wrote:
> > On Mon, Oct 06, 2025 at 05:24:55PM -0500, Lucas De Marchi wrote:
> > > On Mon, Oct 06, 2025 at 04:10:35AM -0700, Matthew Brost wrote:
> > > > VF CCS restore is a primary GT operation on which the media GT depends.
> > > > Therefore, it doesn't make much sense to run these operations in
> > > 
> > > I'd need to double check the previous patches to see the entire
> > > picture, but this seems weird at first glance. The VF CCS restore is
> > > not the only work we queue in gt->ordered_wq. To me the question is more
> > > like "in what ordered queue are we going to queue the VF CCS restore?" If
> > > it's global per device, why are we not using the device wq rather than
> > > making all the GT wqs point to the same thing?
> > > 
> > 
> > This is where things get convoluted. Four mechanisms manipulate the
> > scheduling state by starting or stopping the DRM scheduler:
> > 
> > - Job timeouts
> > - GT resets
> > - VF restore
> > - PM resume
> > 
> > All of these paths require mutual exclusion, or the scheduler design
> > breaks.
> > 
> > The first three ensure ordering by scheduling on the same ordered work
> > queue (GT-ordered WQ). The last one is guaranteed by holding PM
> > references in all the right places.
> > 
> > Another issue is that the first three items are all in the reclaim path
> > — the GT-ordered work queue is designed to handle this.
> > 
> > This patch [1] explains the entire scheduler design in detail.
> > 
> > Only PTL has the cross-VF/GT restore ordering requirement, so I figured
> > the path of least resistance is to just point all GTs to the primary
> > work queue.
> > 
> > [1] https://patchwork.freedesktop.org/patch/677980/?series=154627&rev=4
> > 
> > > > parallel. To address this, point the media GT's ordered work queue to
> > > > the primary GT's ordered work queue on platforms that require (PTL VFs)
> > > > CCS restore as part of VF post-migration recovery.
> > > >
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_device_types.h | 2 ++
> > > > drivers/gpu/drm/xe/xe_gt.c           | 7 +++++--
> > > > drivers/gpu/drm/xe/xe_gt.h           | 2 +-
> > > > drivers/gpu/drm/xe/xe_pci.c          | 6 +++++-
> > > > drivers/gpu/drm/xe/xe_pci_types.h    | 1 +
> > > > drivers/gpu/drm/xe/xe_tile.c         | 2 +-
> > > > 6 files changed, 15 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > > index c66523bf4bf0..02c04ad7296e 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > @@ -334,6 +334,8 @@ struct xe_device {
> > > > 		u8 skip_mtcfg:1;
> > > > 		/** @info.skip_pcode: skip access to PCODE uC */
> > > > 		u8 skip_pcode:1;
> > > > +		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
> > > > +		u8 needs_shared_vf_gt_wq:1;
> > > > 	} info;
> > > >
> > > > 	/** @wa_active: keep track of active workarounds */
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > > index cf484a2da35e..05465f358c96 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > @@ -65,7 +65,7 @@
> > > > #include "xe_wa.h"
> > > > #include "xe_wopcm.h"
> > > >
> > > > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> > > > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)
> > > 
> > > If using the device wq is not an option (possibly because it would queue
> > > with other undesired work going on there), then I'd rather drop this
> > > bool passing here and make the decision inside this function:
> > > 
> > > > {
> > > > 	struct drm_device *drm = &tile_to_xe(tile)->drm;
> > > > 	struct xe_gt *gt;
> > > > @@ -75,7 +75,10 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> > > > 		return ERR_PTR(-ENOMEM);
> > > >
> > > > 	gt->tile = tile;
> > > 
> > > 	if (!xe->info.needs_shared_gt_wq || !tile->primary_gt->ordered_wq)
> > > 		ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > > 	else
> > > 		ordered_wq = tile->primary_gt->ordered_wq;
> > > 	if (IS_ERR_OR_NULL(ordered_wq))
> > > 		return ordered_wq ? ERR_CAST(ordered_wq) : ERR_PTR(-EINVAL);
> > > 
> > > 	gt->ordered_wq = ordered_wq;
> > > 	
> > > ... or something like that so you use the xe info to decide it here
> > > rather than passing it down as a function arg.
> > > 
> > 
> > Sure can refactor this as you suggest.
> 
> 
> another option that would avoid a bool arg would be to actually pass the
> wq from the caller.
> 
> 	wq = NULL;
> 
> 	if (xe->info.needs_shared_gt_wq)
> 		wq = tile->primary_gt->ordered_wq;
> 
> 	xe_gt_alloc(tile, wq);
> 

I posted a v7/v8 which did it like you suggest.
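
For reference, the "pass the wq from the caller" shape can be modelled
in plain userspace C. This is only a sketch under stated assumptions and
not the v7/v8 code: struct wq, struct gt, struct device_info, gt_alloc()
and media_gt_alloc() are hypothetical stand-ins for the xe types and for
xe_gt_alloc(), and calloc() stands in for
drmm_alloc_ordered_workqueue(). The sharing condition is the one from
this patch (needs_shared_vf_gt_wq && IS_SRIOV_VF(xe)).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real xe structs. */
struct wq {
	const char *name;
};

struct gt {
	struct wq *ordered_wq;
	bool owns_wq;	/* true if this GT allocated its own queue */
};

struct device_info {
	bool needs_shared_vf_gt_wq;	/* static: platform descriptor */
	bool is_sriov_vf;		/* dynamic: known at probe time */
};

/* Use the caller's queue if one is passed, else allocate a private one. */
static struct gt *gt_alloc(struct wq *shared_wq)
{
	struct gt *gt = calloc(1, sizeof(*gt));

	if (!gt)
		return NULL;

	if (shared_wq) {
		gt->ordered_wq = shared_wq;
		return gt;
	}

	gt->ordered_wq = calloc(1, sizeof(*gt->ordered_wq));
	if (!gt->ordered_wq) {
		free(gt);
		return NULL;
	}
	gt->ordered_wq->name = "gt-ordered-wq";
	gt->owns_wq = true;

	return gt;
}

/* Caller-side policy: share only when both conditions hold. */
static struct gt *media_gt_alloc(const struct device_info *info,
				 struct gt *primary_gt)
{
	struct wq *wq = NULL;

	if (info->needs_shared_vf_gt_wq && info->is_sriov_vf) {
		assert(primary_gt && primary_gt->ordered_wq);
		wq = primary_gt->ordered_wq;
	}

	return gt_alloc(wq);
}
```

With this shape the allocator takes no policy argument: passing NULL
keeps the current behaviour (a private queue per GT), and the
platform/VF decision stays in the caller.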

> > 
> > > 
> > > > -	gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > > > +	if (use_primary_wq)
> > > > +		gt->ordered_wq = tile->primary_gt->ordered_wq;
> > > > +	else
> > > > +		gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > > > 	if (IS_ERR(gt->ordered_wq))
> > > > 		return ERR_CAST(gt->ordered_wq);
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> > > > index 5df2ffe3ff83..9545c0c93ab6 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt.h
> > > > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > > > @@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
> > > > 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
> > > > }
> > > >
> > > > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
> > > > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
> > > > int xe_gt_init_early(struct xe_gt *gt);
> > > > int xe_gt_init(struct xe_gt *gt);
> > > > void xe_gt_mmio_init(struct xe_gt *gt);
> > > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > > index 3f42b91efa28..25a1d96a68e7 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > > @@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
> > > > 	.has_sriov = true,
> > > > 	.max_gt_per_tile = 2,
> > > > 	.needs_scratch = true,
> > > > +	.needs_shared_vf_gt_wq = true,
> > > 
> > > as per above... I think this needs to be detached from vf. There may be
> > > other reasons the wq needs to be shared.
> > > 
> > 
> > I’d prefer to share the work queue only when absolutely necessary — such
> > as when PTL is on a VF and migration is supported.
> 
> 
> I missed that you left the VF condition out as an && for the allocation:
> 

Yes.

> > > > +                                       xe->info.needs_shared_vf_gt_wq &&
> > > > +                                       IS_SRIOV_VF(xe));
> 
> IMO a better way would be to call it `needs_shared_gt_wq` and then
> override the condition in the vf-specific function:
> 
> sriov_update_device_info()
> 

I'm trying to follow this one.

This is a combination of a platform (static) condition and a VF
(dynamic) condition.

Are you suggesting removing the platform information from xe_pci.c and
just having an info bit in the device that the sriov code sets? I think
we'd then need a platform check in the sriov code, and I thought that in
general we want to avoid inline platform checks.

Maybe I'm misunderstanding here.

Matt 

> Lucas De Marchi
> 
> > 
> > > If we just make them point to a device wq as suggested above, then
> > > there's no extra issue with the ongoing work to disable GTs that Matt
> > > Roper is doing (https://patchwork.freedesktop.org/series/154739/).
> > > Otherwise we will need to think about how to reconcile them.
> > > 
> > 
> > I don’t see how it would ever be possible to disable the primary GT and
> > still do anything meaningful. You can’t allocate memory (e.g., for
> > clears), perform VM binds, or handle page faults without a copy engine.
> > 
> > Matt
> > 
> > > Lucas De Marchi
> > > 
> > > > };
> > > >
> > > > #undef PLATFORM
> > > > @@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
> > > > 	xe->info.skip_mtcfg = desc->skip_mtcfg;
> > > > 	xe->info.skip_pcode = desc->skip_pcode;
> > > > 	xe->info.needs_scratch = desc->needs_scratch;
> > > > +	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
> > > >
> > > > 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
> > > > 				 xe_modparam.probe_display &&
> > > > @@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
> > > > 		 * Allocate and setup media GT for platforms with standalone
> > > > 		 * media.
> > > > 		 */
> > > > -		tile->media_gt = xe_gt_alloc(tile);
> > > > +		tile->media_gt = xe_gt_alloc(tile,
> > > > +					     xe->info.needs_shared_vf_gt_wq &&
> > > > +					     IS_SRIOV_VF(xe));
> > > > 		if (IS_ERR(tile->media_gt))
> > > > 			return PTR_ERR(tile->media_gt);
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
> > > > index 9b9766a3baa3..b11bf6abda5b 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pci_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_pci_types.h
> > > > @@ -48,6 +48,7 @@ struct xe_device_desc {
> > > > 	u8 skip_guc_pc:1;
> > > > 	u8 skip_mtcfg:1;
> > > > 	u8 skip_pcode:1;
> > > > +	u8 needs_shared_vf_gt_wq:1;
> > > > };
> > > >
> > > > struct xe_graphics_desc {
> > > > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> > > > index 6edb5062c1da..e9bcff2de563 100644
> > > > --- a/drivers/gpu/drm/xe/xe_tile.c
> > > > +++ b/drivers/gpu/drm/xe/xe_tile.c
> > > > @@ -157,7 +157,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> > > > 	if (err)
> > > > 		return err;
> > > >
> > > > -	tile->primary_gt = xe_gt_alloc(tile);
> > > > +	tile->primary_gt = xe_gt_alloc(tile, false);
> > > > 	if (IS_ERR(tile->primary_gt))
> > > > 		return PTR_ERR(tile->primary_gt);
> > > >
> > > > --
> > > > 2.34.1
> > > >


* Re: [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-10-07 17:22         ` Matthew Brost
@ 2025-10-07 20:36           ` Lucas De Marchi
  2025-10-07 21:18             ` Matthew Brost
  0 siblings, 1 reply; 58+ messages in thread
From: Lucas De Marchi @ 2025-10-07 20:36 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, Matt Roper

On Tue, Oct 07, 2025 at 10:22:39AM -0700, Matthew Brost wrote:
>On Tue, Oct 07, 2025 at 12:00:32PM -0500, Lucas De Marchi wrote:
>> On Mon, Oct 06, 2025 at 03:51:12PM -0700, Matthew Brost wrote:
>> > On Mon, Oct 06, 2025 at 05:24:55PM -0500, Lucas De Marchi wrote:
>> > > On Mon, Oct 06, 2025 at 04:10:35AM -0700, Matthew Brost wrote:
>> > > > VF CCS restore is a primary GT operation on which the media GT depends.
>> > > > Therefore, it doesn't make much sense to run these operations in
>> > >
>> > > I'd need to double check the previous patches to see the entire
>> > > picture, but this seems weird at a first glance. The VF CCS restore is
>> > > not the single work we queue in gt->ordered_wq.  To me it seems more
>> > > like "in what ordered queue we are going to queue the VF CCS restore. If
>> > > it's global per device, why are we not using the device wq rather than
>> > > making all the GT wq point to the same thing?
>> > >
>> >
>> > This is where things get convoluted. Four mechanisms manipulate the
>> > scheduling state by starting or stopping the DRM scheduler:
>> >
>> > - Job timeouts
>> > - GT resets
>> > - VF restore
>> > - PM resume
>> >
>> > All of these paths require mutual exclusion, or the scheduler design
>> > breaks.
>> >
>> > The first three ensure ordering by scheduling on the same ordered work
>> > queue (GT-ordered WQ). The last one is guaranteed by holding PM
>> > references in all the right places.
>> >
>> > Another issue is that the first three items are all in the reclaim path
>> > — the GT-ordered work queue is designed to handle this.
>> >
>> > This patch [1] explains the entire scheduler design in detail.
>> >
>> > Only PTL has the cross-VF/GT restore ordering requirement, so I figured
>> > the path of least resistance is to just point all GTs to the primary
>> > work queue.
>> >
>> > [1] https://patchwork.freedesktop.org/patch/677980/?series=154627&rev=4
>> >
>> > > > parallel. To address this, point the media GT's ordered work queue to
>> > > > the primary GT's ordered work queue on platforms that require (PTL VFs)
>> > > > CCS restore as part of VF post-migration recovery.
>> > > >
>> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> > > > ---
>> > > > drivers/gpu/drm/xe/xe_device_types.h | 2 ++
>> > > > drivers/gpu/drm/xe/xe_gt.c           | 7 +++++--
>> > > > drivers/gpu/drm/xe/xe_gt.h           | 2 +-
>> > > > drivers/gpu/drm/xe/xe_pci.c          | 6 +++++-
>> > > > drivers/gpu/drm/xe/xe_pci_types.h    | 1 +
>> > > > drivers/gpu/drm/xe/xe_tile.c         | 2 +-
>> > > > 6 files changed, 15 insertions(+), 5 deletions(-)
>> > > >
>> > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> > > > index c66523bf4bf0..02c04ad7296e 100644
>> > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
>> > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> > > > @@ -334,6 +334,8 @@ struct xe_device {
>> > > > 		u8 skip_mtcfg:1;
>> > > > 		/** @info.skip_pcode: skip access to PCODE uC */
>> > > > 		u8 skip_pcode:1;
>> > > > +		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
>> > > > +		u8 needs_shared_vf_gt_wq:1;
>> > > > 	} info;
>> > > >
>> > > > 	/** @wa_active: keep track of active workarounds */
>> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>> > > > index cf484a2da35e..05465f358c96 100644
>> > > > --- a/drivers/gpu/drm/xe/xe_gt.c
>> > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
>> > > > @@ -65,7 +65,7 @@
>> > > > #include "xe_wa.h"
>> > > > #include "xe_wopcm.h"
>> > > >
>> > > > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
>> > > > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)
>> > >
>> > > If using the device wq is not an option (possibly because it would queue
>> > > with other undesired work going on there), then I'd rather drop this
>> > > bool passing here and make the decision inside this function:
>> > >
>> > > > {
>> > > > 	struct drm_device *drm = &tile_to_xe(tile)->drm;
>> > > > 	struct xe_gt *gt;
>> > > > @@ -75,7 +75,10 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
>> > > > 		return ERR_PTR(-ENOMEM);
>> > > >
>> > > > 	gt->tile = tile;
>> > >
>> > > 	if (!xe->info.needs_shared_gt_wq || !tile->primary_gt->ordered_wq)
>> > > 		ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
>> > > 	else
>> > > 		ordered_wq = tile->primary_gt->ordered_wq;
>> > > 	if (IS_ERR_OR_NULL(ordered_wq))
>> > > 		return ordered_wq ? ERR_CAST(ordered_wq) : ERR_PTR(-EINVAL);
>> > >
>> > > 	gt->ordered_wq = ordered_wq;
>> > > 	
>> > > ... or something like that so you use the xe info to decide it here
>> > > rather than passing it down as a function arg.
>> > >
>> >
>> > Sure can refactor this as you suggest.
>>
>>
>> another option that would avoid a bool arg would be to actually pass the
>> wq from the caller.
>>
>> 	wq = NULL;
>>
>> 	if (xe->info.needs_shared_gt_wq)
>> 		wq = tile->primary_gt->ordered_wq;
>>
>> 	xe_gt_alloc(tile, wq);
>>

>I posted a v7/v8 which did it like you suggest.

good, will take a look

>
>> >
>> > >
>> > > > -	gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
>> > > > +	if (use_primary_wq)
>> > > > +		gt->ordered_wq = tile->primary_gt->ordered_wq;
>> > > > +	else
>> > > > +		gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
>> > > > 	if (IS_ERR(gt->ordered_wq))
>> > > > 		return ERR_CAST(gt->ordered_wq);
>> > > >
>> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
>> > > > index 5df2ffe3ff83..9545c0c93ab6 100644
>> > > > --- a/drivers/gpu/drm/xe/xe_gt.h
>> > > > +++ b/drivers/gpu/drm/xe/xe_gt.h
>> > > > @@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
>> > > > 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
>> > > > }
>> > > >
>> > > > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
>> > > > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
>> > > > int xe_gt_init_early(struct xe_gt *gt);
>> > > > int xe_gt_init(struct xe_gt *gt);
>> > > > void xe_gt_mmio_init(struct xe_gt *gt);
>> > > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>> > > > index 3f42b91efa28..25a1d96a68e7 100644
>> > > > --- a/drivers/gpu/drm/xe/xe_pci.c
>> > > > +++ b/drivers/gpu/drm/xe/xe_pci.c
>> > > > @@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
>> > > > 	.has_sriov = true,
>> > > > 	.max_gt_per_tile = 2,
>> > > > 	.needs_scratch = true,
>> > > > +	.needs_shared_vf_gt_wq = true,
>> > >
>> > > as per above... I think this needs to be detached from vf. There may be
>> > > other reasons the wq needs to be shared.
>> > >
>> >
>> > I’d prefer to share the work queue only when absolutely necessary — such
>> > as when PTL is on a VF and migration is supported.
>>
>>
>> I missed that you left the VF condition out as an && for the allocation:
>>
>
>Yes.
>
>> > > > +                                       xe->info.needs_shared_vf_gt_wq &&
>> > > > +                                       IS_SRIOV_VF(xe));
>>
>> IMO a better way would be to call it `needs_shared_gt_wq` and then
>> override the condition in the vf-specific function:
>>
>> sriov_update_device_info()
>>
>
>I'm trying to follow this one.
>
>This is a combination of a platform (static) condition and a VF
>(dynamic) condition.
>
>Are you suggesting removing the platform information from xe_pci.c and
>just having an info bit in the device that the sriov code sets? I think
>we'd then need a platform check in the sriov code, and I thought that in
>general we want to avoid inline platform checks.

yeah, true... for that to be done without hardcoding the vf it would be
more verbose and there isn't much need if the VF is the only expected
one.


>
>Maybe I'm misunderstanding here.
>
>Matt
>
>> Lucas De Marchi
>>
>> >
>> > > If we just make them point to a device wq as suggested above, then
>> > > there's no extra issue with the ongoing work to disable GTs that Matt
>> > > Roper is doing (https://patchwork.freedesktop.org/series/154739/).
>> > > Otherwise we will need to think about how to reconcile them.
>> > >
>> >
>> > I don’t see how it would ever be possible to disable the primary GT and
>> > still do anything meaningful. You can’t allocate memory (e.g., for
>> > clears), perform VM binds, or handle page faults without a copy engine.

it's actually useful for platform bringup as we can enable the
individual pieces in parallel. We know what should and shouldn't
work and workloads we can execute in this case.

Lucas De Marchi

>> >
>> > Matt
>> >
>> > > Lucas De Marchi
>> > >
>> > > > };
>> > > >
>> > > > #undef PLATFORM
>> > > > @@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
>> > > > 	xe->info.skip_mtcfg = desc->skip_mtcfg;
>> > > > 	xe->info.skip_pcode = desc->skip_pcode;
>> > > > 	xe->info.needs_scratch = desc->needs_scratch;
>> > > > +	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
>> > > >
>> > > > 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
>> > > > 				 xe_modparam.probe_display &&
>> > > > @@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
>> > > > 		 * Allocate and setup media GT for platforms with standalone
>> > > > 		 * media.
>> > > > 		 */
>> > > > -		tile->media_gt = xe_gt_alloc(tile);
>> > > > +		tile->media_gt = xe_gt_alloc(tile,
>> > > > +					     xe->info.needs_shared_vf_gt_wq &&
>> > > > +					     IS_SRIOV_VF(xe));
>> > > > 		if (IS_ERR(tile->media_gt))
>> > > > 			return PTR_ERR(tile->media_gt);
>> > > >
>> > > > diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
>> > > > index 9b9766a3baa3..b11bf6abda5b 100644
>> > > > --- a/drivers/gpu/drm/xe/xe_pci_types.h
>> > > > +++ b/drivers/gpu/drm/xe/xe_pci_types.h
>> > > > @@ -48,6 +48,7 @@ struct xe_device_desc {
>> > > > 	u8 skip_guc_pc:1;
>> > > > 	u8 skip_mtcfg:1;
>> > > > 	u8 skip_pcode:1;
>> > > > +	u8 needs_shared_vf_gt_wq:1;
>> > > > };
>> > > >
>> > > > struct xe_graphics_desc {
>> > > > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
>> > > > index 6edb5062c1da..e9bcff2de563 100644
>> > > > --- a/drivers/gpu/drm/xe/xe_tile.c
>> > > > +++ b/drivers/gpu/drm/xe/xe_tile.c
>> > > > @@ -157,7 +157,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
>> > > > 	if (err)
>> > > > 		return err;
>> > > >
>> > > > -	tile->primary_gt = xe_gt_alloc(tile);
>> > > > +	tile->primary_gt = xe_gt_alloc(tile, false);
>> > > > 	if (IS_ERR(tile->primary_gt))
>> > > > 		return PTR_ERR(tile->primary_gt);
>> > > >
>> > > > --
>> > > > 2.34.1
>> > > >


* Re: [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
  2025-10-07 20:36           ` Lucas De Marchi
@ 2025-10-07 21:18             ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2025-10-07 21:18 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: intel-xe, Matt Roper

On Tue, Oct 07, 2025 at 03:36:55PM -0500, Lucas De Marchi wrote:
> On Tue, Oct 07, 2025 at 10:22:39AM -0700, Matthew Brost wrote:
> > On Tue, Oct 07, 2025 at 12:00:32PM -0500, Lucas De Marchi wrote:
> > > On Mon, Oct 06, 2025 at 03:51:12PM -0700, Matthew Brost wrote:
> > > > On Mon, Oct 06, 2025 at 05:24:55PM -0500, Lucas De Marchi wrote:
> > > > > On Mon, Oct 06, 2025 at 04:10:35AM -0700, Matthew Brost wrote:
> > > > > > VF CCS restore is a primary GT operation on which the media GT depends.
> > > > > > Therefore, it doesn't make much sense to run these operations in
> > > > >
> > > > > I'd need to double check the previous patches to see the entire
> > > > > picture, but this seems weird at a first glance. The VF CCS restore is
> > > > > not the single work we queue in gt->ordered_wq.  To me it seems more
> > > > > like "in what ordered queue we are going to queue the VF CCS restore. If
> > > > > it's global per device, why are we not using the device wq rather than
> > > > > making all the GT wq point to the same thing?
> > > > >
> > > >
> > > > This is where things get convoluted. Four mechanisms manipulate the
> > > > scheduling state by starting or stopping the DRM scheduler:
> > > >
> > > > - Job timeouts
> > > > - GT resets
> > > > - VF restore
> > > > - PM resume
> > > >
> > > > All of these paths require mutual exclusion, or the scheduler design
> > > > breaks.
> > > >
> > > > The first three ensure ordering by scheduling on the same ordered work
> > > > queue (GT-ordered WQ). The last one is guaranteed by holding PM
> > > > references in all the right places.
> > > >
> > > > Another issue is that the first three items are all in the reclaim path
> > > > — the GT-ordered work queue is designed to handle this.
> > > >
> > > > This patch [1] explains the entire scheduler design in detail.
> > > >
> > > > Only PTL has the cross-VF/GT restore ordering requirement, so I figured
> > > > the path of least resistance is to just point all GTs to the primary
> > > > work queue.
> > > >
> > > > [1] https://patchwork.freedesktop.org/patch/677980/?series=154627&rev=4
> > > >
> > > > > > parallel. To address this, point the media GT's ordered work queue to
> > > > > > the primary GT's ordered work queue on platforms that require (PTL VFs)
> > > > > > CCS restore as part of VF post-migration recovery.
> > > > > >
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > > drivers/gpu/drm/xe/xe_device_types.h | 2 ++
> > > > > > drivers/gpu/drm/xe/xe_gt.c           | 7 +++++--
> > > > > > drivers/gpu/drm/xe/xe_gt.h           | 2 +-
> > > > > > drivers/gpu/drm/xe/xe_pci.c          | 6 +++++-
> > > > > > drivers/gpu/drm/xe/xe_pci_types.h    | 1 +
> > > > > > drivers/gpu/drm/xe/xe_tile.c         | 2 +-
> > > > > > 6 files changed, 15 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > index c66523bf4bf0..02c04ad7296e 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > @@ -334,6 +334,8 @@ struct xe_device {
> > > > > > 		u8 skip_mtcfg:1;
> > > > > > 		/** @info.skip_pcode: skip access to PCODE uC */
> > > > > > 		u8 skip_pcode:1;
> > > > > > +		/** @info.needs_shared_vf_gt_wq: needs shared GT WQ on VF */
> > > > > > +		u8 needs_shared_vf_gt_wq:1;
> > > > > > 	} info;
> > > > > >
> > > > > > 	/** @wa_active: keep track of active workarounds */
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > > > > index cf484a2da35e..05465f358c96 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > > > @@ -65,7 +65,7 @@
> > > > > > #include "xe_wa.h"
> > > > > > #include "xe_wopcm.h"
> > > > > >
> > > > > > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> > > > > > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq)
> > > > >
> > > > > If using the device wq is not an option (possibly because it would queue
> > > > > with other undesired work going on there), then I'd rather drop this
> > > > > bool passing here and make the decision inside this function:
> > > > >
> > > > > > {
> > > > > > 	struct drm_device *drm = &tile_to_xe(tile)->drm;
> > > > > > 	struct xe_gt *gt;
> > > > > > @@ -75,7 +75,10 @@ struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> > > > > > 		return ERR_PTR(-ENOMEM);
> > > > > >
> > > > > > 	gt->tile = tile;
> > > > >
> > > > > 	if (!xe->info.needs_shared_gt_wq || !tile->primary_gt->ordered_wq)
> > > > > 		ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > > > > 	else
> > > > > 		ordered_wq = tile->primary_gt->ordered_wq;
> > > > > 	if (IS_ERR_OR_NULL(ordered_wq))
> > > > > 		return ordered_wq ? ERR_CAST(ordered_wq) : ERR_PTR(-EINVAL);
> > > > >
> > > > > 	gt->ordered_wq = ordered_wq;
> > > > > 	
> > > > > ... or something like that so you use the xe info to decide it here
> > > > > rather than passing it down as a function arg.
> > > > >
> > > >
> > > > Sure can refactor this as you suggest.
> > > 
> > > 
> > > another option that would avoid a bool arg would be to actually pass the
> > > wq from the caller.
> > > 
> > > 	wq = NULL;
> > > 
> > > 	if (xe->info.needs_shared_gt_wq)
> > > 		wq = tile->primary_gt->ordered_wq;
> > > 
> > > 	xe_gt_alloc(tile, wq);
> > > 
> 
> > I posted a v7/v8 which did it like you suggest.
> 
> good, will take a look
> 
> > 
> > > >
> > > > >
> > > > > > -	gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > > > > > +	if (use_primary_wq)
> > > > > > +		gt->ordered_wq = tile->primary_gt->ordered_wq;
> > > > > > +	else
> > > > > > +		gt->ordered_wq = drmm_alloc_ordered_workqueue(drm, "gt-ordered-wq", WQ_MEM_RECLAIM);
> > > > > > 	if (IS_ERR(gt->ordered_wq))
> > > > > > 		return ERR_CAST(gt->ordered_wq);
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> > > > > > index 5df2ffe3ff83..9545c0c93ab6 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_gt.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > > > > > @@ -28,7 +28,7 @@ static inline bool xe_fault_inject_gt_reset(void)
> > > > > > 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&gt_reset_failure, 1);
> > > > > > }
> > > > > >
> > > > > > -struct xe_gt *xe_gt_alloc(struct xe_tile *tile);
> > > > > > +struct xe_gt *xe_gt_alloc(struct xe_tile *tile, bool use_primary_wq);
> > > > > > int xe_gt_init_early(struct xe_gt *gt);
> > > > > > int xe_gt_init(struct xe_gt *gt);
> > > > > > void xe_gt_mmio_init(struct xe_gt *gt);
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > > > > index 3f42b91efa28..25a1d96a68e7 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > > > > @@ -347,6 +347,7 @@ static const struct xe_device_desc ptl_desc = {
> > > > > > 	.has_sriov = true,
> > > > > > 	.max_gt_per_tile = 2,
> > > > > > 	.needs_scratch = true,
> > > > > > +	.needs_shared_vf_gt_wq = true,
> > > > >
> > > > > as per above... I think this needs to be detached from vf. There may be
> > > > > other reasons the wq needs to be shared.
> > > > >
> > > >
> > > > I’d prefer to share the work queue only when absolutely necessary — such
> > > > as when PTL is on a VF and migration is supported.
> > > 
> > > 
> > > I missed that you left the VF condition out as an && for the allocation:
> > > 
> > 
> > Yes.
> > 
> > > > > > +                                       xe->info.needs_shared_vf_gt_wq &&
> > > > > > +                                       IS_SRIOV_VF(xe));
> > > 
> > > IMO a better way would be to call it `needs_shared_gt_wq` and then
> > > override the condition in the vf-specific function:
> > > 
> > > sriov_update_device_info()
> > > 
> > 
> > I'm trying to follow this one.
> > 
> > This is a combination of a platform (static) condition and a VF
> > (dynamic) condition.
> > 
> > Are you suggesting removing the platform information from xe_pci.c and
> > just having an info bit in the device that the sriov code sets? I think
> > we'd then need a platform check in the sriov code, and I thought that in
> > general we want to avoid inline platform checks.
> 
> yeah, true... for that to be done without hardcoding the vf it would be
> more verbose and there isn't much need if the VF is the only expected
> one.
> 
> 
> > 
> > Maybe I'm misunderstanding here.
> > 
> > Matt
> > 
> > > Lucas De Marchi
> > > 
> > > >
> > > > > If we just make them point to a device wq as suggested above, then
> > > > > there's no extra issue with the ongoing work to disable GTs that Matt
> > > > > Roper is doing (https://patchwork.freedesktop.org/series/154739/).
> > > > > Otherwise we will need to think about how to reconcile them.
> > > > >
> > > >
> > > > I don’t see how it would ever be possible to disable the primary GT and
> > > > still do anything meaningful. You can’t allocate memory (e.g., for
> > > > clears), perform VM binds, or handle page faults without a copy engine.
> 
> it's actually useful for platform bringup as we can enable the
> individual pieces in parallel. We know what should and shouldn't
> work and workloads we can execute in this case.

I believe this, but it would be very, very difficult to even run a user
job on the media GT without the primary GT—unless you performed some
serious hacking of the KMD.

Regardless, I don't think this patch would affect any platform bring-up,
as it is tied to a VF. That is, we are never going to bring up a
platform with only the media GT in VF mode. So no conflict with Matt
Roper's series IMO.

Matt

> 
> Lucas De Marchi
> 
> > > >
> > > > Matt
> > > >
> > > > > Lucas De Marchi
> > > > >
> > > > > > };
> > > > > >
> > > > > > #undef PLATFORM
> > > > > > @@ -598,6 +599,7 @@ static int xe_info_init_early(struct xe_device *xe,
> > > > > > 	xe->info.skip_mtcfg = desc->skip_mtcfg;
> > > > > > 	xe->info.skip_pcode = desc->skip_pcode;
> > > > > > 	xe->info.needs_scratch = desc->needs_scratch;
> > > > > > +	xe->info.needs_shared_vf_gt_wq = desc->needs_shared_vf_gt_wq;
> > > > > >
> > > > > > 	xe->info.probe_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
> > > > > > 				 xe_modparam.probe_display &&
> > > > > > @@ -766,7 +768,9 @@ static int xe_info_init(struct xe_device *xe,
> > > > > > 		 * Allocate and setup media GT for platforms with standalone
> > > > > > 		 * media.
> > > > > > 		 */
> > > > > > -		tile->media_gt = xe_gt_alloc(tile);
> > > > > > +		tile->media_gt = xe_gt_alloc(tile,
> > > > > > +					     xe->info.needs_shared_vf_gt_wq &&
> > > > > > +					     IS_SRIOV_VF(xe));
> > > > > > 		if (IS_ERR(tile->media_gt))
> > > > > > 			return PTR_ERR(tile->media_gt);
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
> > > > > > index 9b9766a3baa3..b11bf6abda5b 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_pci_types.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pci_types.h
> > > > > > @@ -48,6 +48,7 @@ struct xe_device_desc {
> > > > > > 	u8 skip_guc_pc:1;
> > > > > > 	u8 skip_mtcfg:1;
> > > > > > 	u8 skip_pcode:1;
> > > > > > +	u8 needs_shared_vf_gt_wq:1;
> > > > > > };
> > > > > >
> > > > > > struct xe_graphics_desc {
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> > > > > > index 6edb5062c1da..e9bcff2de563 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_tile.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_tile.c
> > > > > > @@ -157,7 +157,7 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> > > > > > 	if (err)
> > > > > > 		return err;
> > > > > >
> > > > > > -	tile->primary_gt = xe_gt_alloc(tile);
> > > > > > +	tile->primary_gt = xe_gt_alloc(tile, false);
> > > > > > 	if (IS_ERR(tile->primary_gt))
> > > > > > 		return PTR_ERR(tile->primary_gt);
> > > > > >
> > > > > > --
> > > > > > 2.34.1
> > > > > >


Thread overview: 58+ messages
2025-10-06 11:10 [PATCH v6 00/30] VF migration redesign Matthew Brost
2025-10-06 11:10 ` [PATCH v6 01/30] drm/xe: Add NULL checks to scratch LRC allocation Matthew Brost
2025-10-06 21:51   ` Lis, Tomasz
2025-10-06 11:10 ` [PATCH v6 02/30] drm/xe: Save off position in ring in which a job was programmed Matthew Brost
2025-10-06 11:10 ` [PATCH v6 03/30] drm/xe/guc: Track pending-enable source in submission state Matthew Brost
2025-10-06 11:10 ` [PATCH v6 04/30] drm/xe: Track LR jobs in DRM scheduler pending list Matthew Brost
2025-10-06 11:10 ` [PATCH v6 05/30] drm/xe: Don't change LRC ring head on job resubmission Matthew Brost
2025-10-06 11:10 ` [PATCH v6 06/30] drm/xe: Make LRC W/A scratch buffer usage consistent Matthew Brost
2025-10-06 11:10 ` [PATCH v6 07/30] drm/xe/vf: Add xe_gt_recovery_pending helper Matthew Brost
2025-10-06 13:10   ` Michal Wajdeczko
2025-10-06 11:10 ` [PATCH v6 08/30] drm/xe/vf: Make VF recovery run on per-GT worker Matthew Brost
2025-10-06 11:10 ` [PATCH v6 09/30] drm/xe/vf: Abort H2G sends during VF post-migration recovery Matthew Brost
2025-10-06 11:10 ` [PATCH v6 10/30] drm/xe/vf: Remove memory allocations from VF post migration recovery Matthew Brost
2025-10-06 11:10 ` [PATCH v6 11/30] drm/xe/vf: Close multi-GT GGTT shift race Matthew Brost
2025-10-06 14:27   ` Michal Wajdeczko
2025-10-06 14:56     ` Matthew Brost
2025-10-06 11:10 ` [PATCH v6 12/30] drm/xe/vf: Teardown VF post migration worker on driver unload Matthew Brost
2025-10-06 11:10 ` [PATCH v6 13/30] drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery Matthew Brost
2025-10-06 11:10 ` [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on " Matthew Brost
2025-10-06 14:35   ` Michal Wajdeczko
2025-10-06 15:54     ` Matthew Brost
2025-10-06 22:27   ` Lis, Tomasz
2025-10-06 23:07     ` Matthew Brost
2025-10-06 11:10 ` [PATCH v6 15/30] drm/xe/vf: Avoid indefinite blocking in preempt rebind worker for VFs supporting migration Matthew Brost
2025-10-06 11:10 ` [PATCH v6 16/30] drm/xe/vf: Use GUC_HXG_TYPE_EVENT for GuC context register Matthew Brost
2025-10-06 14:51   ` Michal Wajdeczko
2025-10-06 16:02     ` Matthew Brost
2025-10-06 22:21   ` Lis, Tomasz
2025-10-06 22:57     ` Matthew Brost
2025-10-06 11:10 ` [PATCH v6 17/30] drm/xe/vf: Flush and stop CTs in VF post migration recovery Matthew Brost
2025-10-06 11:10 ` [PATCH v6 18/30] drm/xe/vf: Reset TLB invalidations during " Matthew Brost
2025-10-06 11:10 ` [PATCH v6 19/30] drm/xe/vf: Kickstart after resfix in " Matthew Brost
2025-10-06 11:10 ` [PATCH v6 20/30] drm/xe/vf: Start CTs before resfix " Matthew Brost
2025-10-06 21:50   ` Lis, Tomasz
2025-10-06 11:10 ` [PATCH v6 21/30] drm/xe/vf: Abort VF post migration recovery on failure Matthew Brost
2025-10-06 11:10 ` [PATCH v6 22/30] drm/xe/vf: Replay GuC submission state on pause / unpause Matthew Brost
2025-10-06 11:10 ` [PATCH v6 23/30] drm/xe: Move queue init before LRC creation Matthew Brost
2025-10-06 15:22   ` Michal Wajdeczko
2025-10-06 21:33   ` Lis, Tomasz
2025-10-06 11:10 ` [PATCH v6 24/30] drm/xe/vf: Add debug prints for GuC replaying state during VF recovery Matthew Brost
2025-10-06 11:10 ` [PATCH v6 25/30] drm/xe/vf: Workaround for race condition in GuC firmware during VF pause Matthew Brost
2025-10-06 11:10 ` [PATCH v6 26/30] drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups Matthew Brost
2025-10-06 11:10 ` [PATCH v6 27/30] drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF Matthew Brost
2025-10-06 22:24   ` Lucas De Marchi
2025-10-06 22:51     ` Matthew Brost
2025-10-07 17:00       ` Lucas De Marchi
2025-10-07 17:22         ` Matthew Brost
2025-10-07 20:36           ` Lucas De Marchi
2025-10-07 21:18             ` Matthew Brost
2025-10-06 11:10 ` [PATCH v6 28/30] drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL Matthew Brost
2025-10-06 11:10 ` [PATCH v6 29/30] drm/xe/vf: Rebase CCS save/restore BB GGTT addresses Matthew Brost
2025-10-06 11:10 ` [PATCH v6 30/30] drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Matthew Brost
2025-10-06 11:17 ` ✗ CI.checkpatch: warning for VF migration redesign (rev6) Patchwork
2025-10-06 11:18 ` ✓ CI.KUnit: success " Patchwork
2025-10-06 12:24 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-06 14:28 ` ✗ Xe.CI.Full: " Patchwork
2025-10-07  0:20 ` [PATCH v6 00/30] VF migration redesign Niranjana Vishwanathapura
2025-10-07  1:11   ` Matthew Brost
