Intel-XE Archive on lore.kernel.org
* [PATCH v5 0/3] drm/xe/migrate: Atomicize CCS copy command setup
@ 2025-10-08 10:11 Satyanarayana K V P
  2025-10-08 10:11 ` [PATCH v5 1/3] " Satyanarayana K V P
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Satyanarayana K V P @ 2025-10-08 10:11 UTC (permalink / raw)
  To: intel-xe; +Cc: Satyanarayana K V P

The CCS copy command is a 5-dword sequence. If the vCPU halts during
save/restore while this sequence is being programmed, partial writes may
trigger page faults when saving iGPU CCS metadata. Use the VMOVDQU
instruction to write the sequence atomically. Since VMOVDQU operates on
256-bit chunks, update EMIT_COPY_CCS_DW to emit 8 dwords instead of 5
dwords. Update emit_flush_invalidate() to use VMOVDQU operating with
128-bit chunks.

The MI_STORE_DATA_IMM instruction header is a quadword (two dwords) in
size. If the vCPU halts during save/restore while this header is being
programmed, partial writes may trigger page faults when saving iGPU CCS
metadata. Update the instruction header atomically.

Clear the contents of the CCS read/write batch buffer, ensuring no page
faults or GPU hangs occur if migration happens midway.
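
For reference, the core of the approach is shown below as a minimal sketch
(the actual helper in patch 1 also handles 128-bit stores and falls back to
memcpy() when AVX is unavailable; the function name here is illustrative):

	#include <asm/fpu/api.h>

	/* Publish 8 dwords (256 bits) with a single AVX store so that a
	 * halted vCPU observes either the old or the new contents of the
	 * batch buffer, never a mix of the two.
	 */
	static void emit_256b_atomic(void *dst, const void *src)
	{
		kernel_fpu_begin();
		asm("vmovdqu (%0), %%ymm0\n"
		    "vmovups %%ymm0, (%1)\n"
		    :: "r" (src), "r" (dst) : "memory");
		kernel_fpu_end();
	}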

---
V4 -> V5:
- Fixed review comments (Matt B)

V3 -> V4:
- Fixed review comments (Wajdeczko)
- Fixed issues reported by Patchwork.

V2 -> V3:
- Added support for 128-bit and 256-bit instructions with memcpy_vmovdqu
- Updated emit_flush_invalidate() to use the vmovdqu instruction.

V1 -> V2:
- Use memcpy_vmovdqu only for x86 arch and for VF; else use memcpy
  (Auld, Matthew)
- Fixed issues reported by Patchwork.

Satyanarayana K V P (3):
  drm/xe/migrate: Atomicize CCS copy command setup
  drm/xe/migrate: Make emit_pte() header write atomic
  drm/xe/vf: Clear CCS read/write buffers in atomic way

 drivers/gpu/drm/xe/xe_migrate.c      | 236 ++++++++++++++++++++++++---
 drivers/gpu/drm/xe/xe_migrate.h      |   3 +
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.c |   5 +-
 3 files changed, 220 insertions(+), 24 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-08 10:11 [PATCH v5 0/3] drm/xe/migrate: Atomicize CCS copy command setup Satyanarayana K V P
@ 2025-10-08 10:11 ` Satyanarayana K V P
  2025-10-08 22:58   ` Matthew Brost
  2025-10-08 10:11 ` [PATCH v5 2/3] drm/xe/migrate: Make emit_pte() header write atomic Satyanarayana K V P
  2025-10-08 10:11 ` [PATCH v5 3/3] drm/xe/vf: Clear CCS read/write buffers in atomic way Satyanarayana K V P
  2 siblings, 1 reply; 16+ messages in thread
From: Satyanarayana K V P @ 2025-10-08 10:11 UTC (permalink / raw)
  To: intel-xe; +Cc: Satyanarayana K V P, Michal Wajdeczko, Matthew Brost,
	Matthew Auld

The CCS copy command is a 5-dword sequence. If the vCPU halts during
save/restore while this sequence is being programmed, partial writes may
trigger page faults when saving iGPU CCS metadata. Use the VMOVDQU
instruction to write the sequence atomically.

Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
8 dwords instead of 5 dwords.

Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
chunks.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>

---
V4 -> V5:
- Fixed review comments. (Matt B)

V3 -> V4:
- Fixed review comments. (Matt B)

V3 -> V4:
- Fixed review comments. (Wajdeczko)
- Fixed issues reported by Patchwork.

V2 -> V3:
- Added support for 128-bit and 256-bit instructions with memcpy_vmovdqu
- Updated emit_flush_invalidate() to use the vmovdqu instruction.

V1 -> V2:
- Use memcpy_vmovdqu only for x86 arch and for VF; else use memcpy
  (Auld, Matthew)
- Fixed issues reported by Patchwork.
---
 drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index c39c3b423d05..b629072956ee 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -5,7 +5,9 @@
 
 #include "xe_migrate.h"
 
+#include <asm/fpu/api.h>
 #include <linux/bitfield.h>
+#include <linux/cpufeature.h>
 #include <linux/sizes.h>
 
 #include <drm/drm_managed.h>
@@ -33,6 +35,7 @@
 #include "xe_res_cursor.h"
 #include "xe_sa.h"
 #include "xe_sched_job.h"
+#include "xe_sriov_vf_ccs.h"
 #include "xe_sync.h"
 #include "xe_trace_bo.h"
 #include "xe_validation.h"
@@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
 	}
 }
 
-#define EMIT_COPY_CCS_DW 5
+static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
+{
+#ifdef CONFIG_X86
+	kernel_fpu_begin();
+	if (size == SZ_128) {
+		asm("vmovdqu (%0), %%xmm0\n"
+		    "vmovups %%xmm0,   (%1)\n"
+		    :: "r" (src), "r" (dst) : "memory");
+	} else if (size == SZ_256) {
+		asm("vmovdqu (%0), %%ymm0\n"
+		    "vmovups %%ymm0,   (%1)\n"
+		    :: "r" (src), "r" (dst) : "memory");
+	}
+	kernel_fpu_end();
+#endif
+}
+
+static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
+{
+	u32 instr_size = size * BITS_PER_BYTE;
+
+	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
+
+	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
+		memcpy_vmovdqu(dst, src, instr_size);
+	else
+		memcpy(dst, src, size);
+}
+
+#define EMIT_COPY_CCS_DW 8
 static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
 			  u64 dst_ofs, bool dst_is_indirect,
 			  u64 src_ofs, bool src_is_indirect,
 			  u32 size)
 {
+	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
 	struct xe_device *xe = gt_to_xe(gt);
 	u32 *cs = bb->cs + bb->len;
 	u32 num_ccs_blks;
 	u32 num_pages;
 	u32 ccs_copy_size;
 	u32 mocs;
+	u32 i = 0;
 
 	if (GRAPHICS_VERx100(xe) >= 2000) {
 		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
@@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
 		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
 	}
 
-	*cs++ = XY_CTRL_SURF_COPY_BLT |
-		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
-		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
-		ccs_copy_size;
-	*cs++ = lower_32_bits(src_ofs);
-	*cs++ = upper_32_bits(src_ofs) | mocs;
-	*cs++ = lower_32_bits(dst_ofs);
-	*cs++ = upper_32_bits(dst_ofs) | mocs;
+	dw[i++] = XY_CTRL_SURF_COPY_BLT |
+		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
+		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
+		  ccs_copy_size;
+	dw[i++] = lower_32_bits(src_ofs);
+	dw[i++] = upper_32_bits(src_ofs) | mocs;
+	dw[i++] = lower_32_bits(dst_ofs);
+	dw[i++] = upper_32_bits(dst_ofs) | mocs;
 
+	/*
+	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
+	 * save/restore while this sequence is being issued, partial writes may trigger
+	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
+	 * write the sequence atomically.
+	 */
+	emit_atomic(gt, cs, dw, sizeof(dw));
+	cs += EMIT_COPY_CCS_DW;
 	bb->len = cs - bb->cs;
 }
 
@@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
 	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
 }
 
-static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
+/*
+ * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
+ * save/restore while this sequence is being issued, partial writes may
+ * trigger page faults when saving iGPU CCS metadata. Use
+ * emit_atomic() to write the sequence atomically.
+ */
+#define EMIT_FLUSH_INVALIDATE_DW 4
+static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
 {
 	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
+	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
+
+	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
+		      MI_FLUSH_IMM_DW | flags;
+	dw[j++] = lower_32_bits(addr);
+	dw[j++] = upper_32_bits(addr);
+	dw[j++] = MI_NOOP;
 
-	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
-		  MI_FLUSH_IMM_DW | flags;
-	dw[i++] = lower_32_bits(addr);
-	dw[i++] = upper_32_bits(addr);
-	dw[i++] = MI_NOOP;
-	dw[i++] = MI_NOOP;
+	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
 
-	return i;
+	return i + j;
 }
 
 /**
@@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
 	/* Calculate Batch buffer size */
 	batch_size = 0;
 	while (size) {
-		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
+		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
 		u64 ccs_ofs, ccs_size;
 		u32 ccs_pt;
 
@@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
 	 * sizes here again before copy command is emitted.
 	 */
 	while (size) {
-		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
+		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
 		u32 flush_flags = 0;
 		u64 ccs_ofs, ccs_size;
 		u32 ccs_pt;
@@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
 
 		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
 
-		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
+		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
 		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
 						  src_L0_ofs, dst_is_pltt,
 						  src_L0, ccs_ofs, true);
-		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
+		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
 
 		size -= src_L0;
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v5 2/3] drm/xe/migrate: Make emit_pte() header write atomic
  2025-10-08 10:11 [PATCH v5 0/3] drm/xe/migrate: Atomicize CCS copy command setup Satyanarayana K V P
  2025-10-08 10:11 ` [PATCH v5 1/3] " Satyanarayana K V P
@ 2025-10-08 10:11 ` Satyanarayana K V P
  2025-10-08 23:27   ` Matthew Brost
  2025-10-08 10:11 ` [PATCH v5 3/3] drm/xe/vf: Clear CCS read/write buffers in atomic way Satyanarayana K V P
  2 siblings, 1 reply; 16+ messages in thread
From: Satyanarayana K V P @ 2025-10-08 10:11 UTC (permalink / raw)
  To: intel-xe; +Cc: Satyanarayana K V P, Michal Wajdeczko, Matthew Brost,
	Matthew Auld

The MI_STORE_DATA_IMM instruction header is a quadword (two dwords) in
size. If the vCPU halts during save/restore while this header is being
programmed, partial writes may trigger page faults when saving iGPU CCS
metadata. Update the instruction header atomically.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>

---
V4 -> V5:
- Fixed review comments (Matt B).

V3 -> V4:
- New commit added.

V2 -> V3:
- None

V1 -> V2:
- None
---
 drivers/gpu/drm/xe/xe_migrate.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index b629072956ee..7097da1555ca 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -607,9 +607,14 @@ static void emit_pte(struct xe_migrate *m,
 
 	while (ptes) {
 		u32 chunk = min(MAX_PTE_PER_SDI, ptes);
+		u32 dw[SZ_2], i = 0;
+
+		dw[i++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
+		dw[i++] = ofs;
+
+		WRITE_ONCE(*(u64 *)&bb->cs[bb->len], *(u64 *)dw);
+		bb->len += i;
 
-		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
-		bb->cs[bb->len++] = ofs;
 		bb->cs[bb->len++] = 0;
 
 		cur_ofs = ofs;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v5 3/3] drm/xe/vf: Clear CCS read/write buffers in atomic way
  2025-10-08 10:11 [PATCH v5 0/3] drm/xe/migrate: Atomicize CCS copy command setup Satyanarayana K V P
  2025-10-08 10:11 ` [PATCH v5 1/3] " Satyanarayana K V P
  2025-10-08 10:11 ` [PATCH v5 2/3] drm/xe/migrate: Make emit_pte() header write atomic Satyanarayana K V P
@ 2025-10-08 10:11 ` Satyanarayana K V P
  2 siblings, 0 replies; 16+ messages in thread
From: Satyanarayana K V P @ 2025-10-08 10:11 UTC (permalink / raw)
  To: intel-xe; +Cc: Satyanarayana K V P, Michal Wajdeczko, Matthew Brost,
	Matthew Auld

Clear the contents of the CCS read/write batch buffer, ensuring no page
faults or GPU hangs occur if migration happens midway.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>

---
V4 -> V5:
- Fixed review comments (Matt B).

V3 -> V4:
- New commit added.

V2 -> V3:
- None

V1 -> V2:
- None
---
 drivers/gpu/drm/xe/xe_migrate.c      | 134 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h      |   3 +
 drivers/gpu/drm/xe/xe_sriov_vf_ccs.c |   5 +-
 3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 7097da1555ca..ca90bfbdff12 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -652,6 +652,43 @@ static void emit_pte(struct xe_migrate *m,
 	}
 }
 
+static void emit_pte_clear(struct xe_gt *gt, struct xe_bb *bb, int start_offset,
+			   int end_offset)
+{
+	u32 dw_nop[SZ_2] = {MI_NOOP};
+	int i = start_offset;
+	int len = end_offset;
+	u32 *cs = bb->cs;
+
+	/* Reverses the operations performed by emit_pte() */
+	while (i < len) {
+		u32 dwords, qwords;
+
+		xe_assert(gt_to_xe(gt), (REG_FIELD_GET(REG_GENMASK(31, 23), cs[i]) == 0x20));
+
+		qwords = REG_FIELD_GET(MI_SDI_LEN_DW, cs[i]);
+		/*
+		 * If Store QW is enabled, then the value of the DW length
+		 * includes the header, address and multiple QW pairs of data,
+		 * which means the values will be limited to odd values starting
+		 * at 3 (3 representing the size of a 5 DW command
+		 * including the header, 2 DW address and 2 DW of data).
+		 */
+		dwords = qwords - 1;
+		/*
+		 * Do not clear header first. Clear PTEs first and then clear the
+		 * header to avoid page faults.
+		 */
+		memset(&cs[i + 3], MI_NOOP, (dwords) * sizeof(u32));
+
+		xe_device_wmb(gt_to_xe(gt));
+		WRITE_ONCE(*(u64 *)&cs[i], *(u64 *)dw_nop);
+
+		cs[i + 2] = MI_NOOP;
+		i += (dwords + 3);
+	}
+}
+
 static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
 {
 #ifdef CONFIG_X86
@@ -732,6 +769,18 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
 	bb->len = cs - bb->cs;
 }
 
+static u32 emit_copy_ccs_clear(struct xe_gt *gt, struct xe_bb *bb, u32 offset)
+{
+	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
+	u32 *cs = bb->cs + offset - EMIT_COPY_CCS_DW;
+
+	xe_assert(gt_to_xe(gt), (REG_FIELD_GET(REG_GENMASK(31, 22), *cs) == 0x148));
+	emit_atomic(gt, cs, dw, sizeof(dw));
+	xe_device_wmb(gt_to_xe(gt));
+
+	return offset - EMIT_COPY_CCS_DW;
+}
+
 #define EMIT_COPY_DW 10
 static void emit_copy(struct xe_gt *gt, struct xe_bb *bb,
 		      u64 src_ofs, u64 dst_ofs, unsigned int size,
@@ -1063,6 +1112,19 @@ static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 fl
 	return i + j;
 }
 
+static u32 emit_flush_invalidate_clear(struct xe_gt *gt, struct xe_bb *bb,
+				       u32 offset)
+{
+	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP};
+	u32 *cs = bb->cs + offset - EMIT_FLUSH_INVALIDATE_DW;
+
+	xe_assert(gt_to_xe(gt), (REG_FIELD_GET(REG_GENMASK(31, 23), *cs) == 0x26));
+
+	emit_atomic(gt, cs, dw, sizeof(dw));
+
+	return offset - EMIT_FLUSH_INVALIDATE_DW;
+}
+
 /**
  * xe_migrate_ccs_rw_copy() - Copy content of TTM resources.
  * @tile: Tile whose migration context to be used.
@@ -1187,6 +1249,78 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
 	return err;
 }
 
+static u32 ccs_rw_pte_size(struct xe_gt *gt, struct xe_bb *bb, u32 offset)
+{
+	int len = bb->len;
+	u32 *cs = bb->cs;
+	u32 i = offset;
+
+	while (i < len) {
+		u32 dwords, qwords;
+
+		xe_assert(gt_to_xe(gt), (REG_FIELD_GET(REG_GENMASK(31, 23), cs[i]) == 0x20));
+
+		qwords = REG_FIELD_GET(MI_SDI_LEN_DW, cs[i]);
+		/*
+		 * If Store QW is enabled, then the value of the DW length
+		 * includes the header, address and multiple QW pairs of data,
+		 * which means the values will be limited to odd values starting
+		 * at 3 (3 representing the size of a 5 DW command
+		 * including the header, 2 DW address and 2 DW of data).
+		 */
+		dwords = qwords - 1;
+		i += dwords + 3;
+
+		/*
+		 * Break if the next dword is for emit_flush_invalidate_clear()
+		 * or emit_copy_ccs_clear()
+		 */
+		if ((REG_FIELD_GET(REG_GENMASK(31, 23), cs[i]) == 0x26) ||
+		    (REG_FIELD_GET(REG_GENMASK(31, 22), cs[i]) == 0x148))
+			break;
+	}
+	return i;
+}
+
+/**
+ * xe_migrate_ccs_rw_copy_clear() - Clear the CCS read/write batch buffer
+ * content.
+ * @tile: Tile whose migration context to be used.
+ * @src_bo: The buffer object @src is currently bound to.
+ * @read_write : Creates BB commands for CCS read/write.
+ *
+ * The CCS copy command has three stages: PTE setup, TLB invalidation, and CCS
+ * copy. Each stage includes a header followed by instructions. When clearing,
+ * remove the instructions first, then the header. For the TLB invalidation and
+ * CCS copy stages, ensure the writes are atomic.
+ *
+ * This reverses the operations performed by xe_migrate_ccs_rw_copy().
+ *
+ * Returns: None.
+ */
+void xe_migrate_ccs_rw_copy_clear(struct xe_tile *tile, struct xe_bo *src_bo,
+				  enum xe_sriov_vf_ccs_rw_ctxs read_write)
+{
+	struct xe_bb *bb = src_bo->bb_ccs[read_write];
+	u32 bb_offset = 0, bb_offset_chunk = 0;
+	struct xe_gt *gt = tile->primary_gt;
+
+	while (bb_offset_chunk >= 0 && bb_offset_chunk < bb->len) {
+		bb_offset = ccs_rw_pte_size(gt, bb, bb_offset_chunk);
+		/*
+		 * After PTE entries, we have one TLB invalidation, CCS copy
+		 * command and another TLB invalidation command.
+		 */
+		bb_offset_chunk = bb_offset + EMIT_FLUSH_INVALIDATE_DW +
+				  EMIT_COPY_CCS_DW + EMIT_FLUSH_INVALIDATE_DW;
+
+		bb_offset = emit_flush_invalidate_clear(gt, bb, bb_offset_chunk);
+		bb_offset = emit_copy_ccs_clear(gt, bb, bb_offset);
+		bb_offset = emit_flush_invalidate_clear(gt, bb, bb_offset);
+		emit_pte_clear(gt, bb, bb_offset_chunk, bb_offset);
+	}
+}
+
 /**
  * xe_get_migrate_exec_queue() - Get the execution queue from migrate context.
  * @migrate: Migrate context.
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 0d8944b1cee6..bd2c0eb3ad94 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -129,6 +129,9 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
 			   struct xe_bo *src_bo,
 			   enum xe_sriov_vf_ccs_rw_ctxs read_write);
 
+void xe_migrate_ccs_rw_copy_clear(struct xe_tile *tile, struct xe_bo *src_bo,
+				  enum xe_sriov_vf_ccs_rw_ctxs read_write);
+
 struct xe_lrc *xe_migrate_lrc(struct xe_migrate *migrate);
 struct xe_exec_queue *xe_migrate_exec_queue(struct xe_migrate *migrate);
 struct dma_fence *xe_migrate_raw_vram_copy(struct xe_bo *vram_bo, u64 vram_offset,
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
index 790249801364..2d3728cb24ca 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
@@ -387,6 +387,7 @@ int xe_sriov_vf_ccs_detach_bo(struct xe_bo *bo)
 {
 	struct xe_device *xe = xe_bo_device(bo);
 	enum xe_sriov_vf_ccs_rw_ctxs ctx_id;
+	struct xe_tile *tile;
 	struct xe_bb *bb;
 
 	xe_assert(xe, IS_VF_CCS_READY(xe));
@@ -394,12 +395,14 @@ int xe_sriov_vf_ccs_detach_bo(struct xe_bo *bo)
 	if (!xe_bo_has_valid_ccs_bb(bo))
 		return 0;
 
+	tile = xe_device_get_root_tile(xe);
+
 	for_each_ccs_rw_ctx(ctx_id) {
 		bb = bo->bb_ccs[ctx_id];
 		if (!bb)
 			continue;
 
-		memset(bb->cs, MI_NOOP, bb->len * sizeof(u32));
+		xe_migrate_ccs_rw_copy_clear(tile, bo, ctx_id);
 		xe_bb_free(bb, NULL);
 		bo->bb_ccs[ctx_id] = NULL;
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-08 10:11 ` [PATCH v5 1/3] " Satyanarayana K V P
@ 2025-10-08 22:58   ` Matthew Brost
  2025-10-09 13:00     ` Rodrigo Vivi
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Brost @ 2025-10-08 22:58 UTC (permalink / raw)
  To: Satyanarayana K V P; +Cc: intel-xe, Michal Wajdeczko, Matthew Auld

On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> The CCS copy command is a 5-dword sequence. If the vCPU halts during
> save/restore while this sequence is being programmed, partial writes may
> trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> instruction to write the sequence atomically.
> 
> Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> 8 dwords instead of 5 dwords.
> 
> Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> chunks.
> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> 
> ---
> V4 -> V5:
> - Fixed review comments. (Matt B)
> 
> V3 -> V4:
> - Fixed review comments. (Wajdeczko)
> - Fix issues reported by patchworks.
> 
> V2 -> V3:
> - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> - Updated emit_flush_invalidate() to use vmovdqu instruction.
> 
> V1 -> V2:
> - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
>   (Auld, Matthew)
> - Fix issues reported by patchworks.
> ---
>  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
>  1 file changed, 72 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index c39c3b423d05..b629072956ee 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -5,7 +5,9 @@
>  
>  #include "xe_migrate.h"
>  
> +#include <asm/fpu/api.h>
>  #include <linux/bitfield.h>
> +#include <linux/cpufeature.h>
>  #include <linux/sizes.h>
>  
>  #include <drm/drm_managed.h>
> @@ -33,6 +35,7 @@
>  #include "xe_res_cursor.h"
>  #include "xe_sa.h"
>  #include "xe_sched_job.h"
> +#include "xe_sriov_vf_ccs.h"
>  #include "xe_sync.h"
>  #include "xe_trace_bo.h"
>  #include "xe_validation.h"
> @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
>  	}
>  }
>  
> -#define EMIT_COPY_CCS_DW 5
> +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> +{
> +#ifdef CONFIG_X86
> +	kernel_fpu_begin();
> +	if (size == SZ_128) {
> +		asm("vmovdqu (%0), %%xmm0\n"
> +		    "vmovups %%xmm0,   (%1)\n"
> +		    :: "r" (src), "r" (dst) : "memory");
> +	} else if (size == SZ_256) {
> +		asm("vmovdqu (%0), %%ymm0\n"
> +		    "vmovups %%ymm0,   (%1)\n"
> +		    :: "r" (src), "r" (dst) : "memory");
> +	}
> +	kernel_fpu_end();
> +#endif

Everything in this patch LGTM, but I think we need maintainer input to
ensure we are not breaking any rules about inline asm code in a driver (no
idea if such rules exist) or whether a better place would be somewhere
common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?

Matt 

> +}
> +
> +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> +{
> +	u32 instr_size = size * BITS_PER_BYTE;
> +
> +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> +
> +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> +		memcpy_vmovdqu(dst, src, instr_size);
> +	else
> +		memcpy(dst, src, size);
> +}
> +
> +#define EMIT_COPY_CCS_DW 8
>  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>  			  u64 dst_ofs, bool dst_is_indirect,
>  			  u64 src_ofs, bool src_is_indirect,
>  			  u32 size)
>  {
> +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
>  	struct xe_device *xe = gt_to_xe(gt);
>  	u32 *cs = bb->cs + bb->len;
>  	u32 num_ccs_blks;
>  	u32 num_pages;
>  	u32 ccs_copy_size;
>  	u32 mocs;
> +	u32 i = 0;
>  
>  	if (GRAPHICS_VERx100(xe) >= 2000) {
>  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
>  	}
>  
> -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> -		ccs_copy_size;
> -	*cs++ = lower_32_bits(src_ofs);
> -	*cs++ = upper_32_bits(src_ofs) | mocs;
> -	*cs++ = lower_32_bits(dst_ofs);
> -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> +		  ccs_copy_size;
> +	dw[i++] = lower_32_bits(src_ofs);
> +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> +	dw[i++] = lower_32_bits(dst_ofs);
> +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
>  
> +	/*
> +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> +	 * save/restore while this sequence is being issued, partial writes may trigger
> +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> +	 * write the sequence atomically.
> +	 */
> +	emit_atomic(gt, cs, dw, sizeof(dw));
> +	cs += EMIT_COPY_CCS_DW;
>  	bb->len = cs - bb->cs;
>  }
>  
> @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
>  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
>  }
>  
> -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> +/*
> + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> + * save/restore while this sequence is being issued, partial writes may
> + * trigger page faults when saving iGPU CCS metadata. Use
> + * emit_atomic() to write the sequence atomically.
> + */
> +#define EMIT_FLUSH_INVALIDATE_DW 4
> +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
>  {
>  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> +
> +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> +		      MI_FLUSH_IMM_DW | flags;
> +	dw[j++] = lower_32_bits(addr);
> +	dw[j++] = upper_32_bits(addr);
> +	dw[j++] = MI_NOOP;
>  
> -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> -		  MI_FLUSH_IMM_DW | flags;
> -	dw[i++] = lower_32_bits(addr);
> -	dw[i++] = upper_32_bits(addr);
> -	dw[i++] = MI_NOOP;
> -	dw[i++] = MI_NOOP;
> +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
>  
> -	return i;
> +	return i + j;
>  }
>  
>  /**
> @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>  	/* Calculate Batch buffer size */
>  	batch_size = 0;
>  	while (size) {
> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>  		u64 ccs_ofs, ccs_size;
>  		u32 ccs_pt;
>  
> @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>  	 * sizes here again before copy command is emitted.
>  	 */
>  	while (size) {
> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>  		u32 flush_flags = 0;
>  		u64 ccs_ofs, ccs_size;
>  		u32 ccs_pt;
> @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>  
>  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
>  
> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
>  						  src_L0_ofs, dst_is_pltt,
>  						  src_L0, ccs_ofs, true);
> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>  
>  		size -= src_L0;
>  	}
> -- 
> 2.51.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/3] drm/xe/migrate: Make emit_pte() header write atomic
  2025-10-08 10:11 ` [PATCH v5 2/3] drm/xe/migrate: Make emit_pte() header write atomic Satyanarayana K V P
@ 2025-10-08 23:27   ` Matthew Brost
  0 siblings, 0 replies; 16+ messages in thread
From: Matthew Brost @ 2025-10-08 23:27 UTC (permalink / raw)
  To: Satyanarayana K V P; +Cc: intel-xe, Michal Wajdeczko, Matthew Auld

On Wed, Oct 08, 2025 at 03:41:48PM +0530, Satyanarayana K V P wrote:
> The MI_STORE_DATA_IMM instruction header is quad dword in size. If the
> vCPU halts during save/restore while this sequence is being programmed,
> partial writes may trigger page faults when saving IGPU CCS metadata.
> Update instruction header atomically.
> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> 
> ---
> V4 -> V5:
> - Fixed review comments (Matt B).
> 
> V3 -> V4:
> - New commit added.
> 
> V2 -> V3:
> - None
> 
> V1 -> V2:
> - None
> ---
>  drivers/gpu/drm/xe/xe_migrate.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index b629072956ee..7097da1555ca 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -607,9 +607,14 @@ static void emit_pte(struct xe_migrate *m,
>  
>  	while (ptes) {
>  		u32 chunk = min(MAX_PTE_PER_SDI, ptes);
> +		u32 dw[SZ_2], i = 0;
> +
> +		dw[i++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
> +		dw[i++] = ofs;
> +
> +		WRITE_ONCE(*(u64 *)&bb->cs[bb->len], *(u64 *)dw);
> +		bb->len += i;
>  
> -		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
> -		bb->cs[bb->len++] = ofs;
>  		bb->cs[bb->len++] = 0;

I think you actually need an emit_atomic here. I just read the bspec and
the last dw enforces some msb restrictions, so the 3 dw must be written
atomically. I believe it is safe to write out 4 dw in the atomic
instruction but only increment bb->len by 3.
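
i.e., something like this (untested sketch; assumes emit_atomic() from
patch 1 is in scope and that a gt pointer is reachable here, e.g. via the
migrate context):

	u32 dw[SZ_4] = { MI_NOOP }, i = 0;

	dw[i++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
	dw[i++] = ofs;
	dw[i++] = 0;

	/* store all 4 dwords (128 bits) atomically, but consume only 3 */
	emit_atomic(gt, &bb->cs[bb->len], dw, sizeof(dw));
	bb->len += 3;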

Matt 

>  
>  		cur_ofs = ofs;
> -- 
> 2.51.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-08 22:58   ` Matthew Brost
@ 2025-10-09 13:00     ` Rodrigo Vivi
  2025-10-09 16:11       ` Matthew Brost
  0 siblings, 1 reply; 16+ messages in thread
From: Rodrigo Vivi @ 2025-10-09 13:00 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Satyanarayana K V P, intel-xe, Michal Wajdeczko, Matthew Auld

On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > save/restore while this sequence is being programmed, partial writes may
> > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > instruction to write the sequence atomically.
> > 
> > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > 8 dwords instead of 5 dwords.
> > 
> > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > chunks.
> > 
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > 
> > ---
> > V4 -> V5:
> > - Fixed review comments. (Matt B)
> > 
> > V3 -> V4:
> > - Fixed review comments. (Wajdeczko)
> > - Fix issues reported by patchworks.
> > 
> > V2 -> V3:
> > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > 
> > V1 -> V2:
> > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> >   (Auld, Matthew)
> > - Fix issues reported by patchworks.
> > ---
> >  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> >  1 file changed, 72 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > index c39c3b423d05..b629072956ee 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -5,7 +5,9 @@
> >  
> >  #include "xe_migrate.h"
> >  
> > +#include <asm/fpu/api.h>
> >  #include <linux/bitfield.h>
> > +#include <linux/cpufeature.h>
> >  #include <linux/sizes.h>
> >  
> >  #include <drm/drm_managed.h>
> > @@ -33,6 +35,7 @@
> >  #include "xe_res_cursor.h"
> >  #include "xe_sa.h"
> >  #include "xe_sched_job.h"
> > +#include "xe_sriov_vf_ccs.h"
> >  #include "xe_sync.h"
> >  #include "xe_trace_bo.h"
> >  #include "xe_validation.h"
> > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> >  	}
> >  }
> >  
> > -#define EMIT_COPY_CCS_DW 5
> > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > +{
> > +#ifdef CONFIG_X86
> > +	kernel_fpu_begin();
> > +	if (size == SZ_128) {
> > +		asm("vmovdqu (%0), %%xmm0\n"
> > +		    "vmovups %%xmm0,   (%1)\n"
> > +		    :: "r" (src), "r" (dst) : "memory");
> > +	} else if (size == SZ_256) {
> > +		asm("vmovdqu (%0), %%ymm0\n"
> > +		    "vmovups %%ymm0,   (%1)\n"
> > +		    :: "r" (src), "r" (dst) : "memory");
> > +	}
> > +	kernel_fpu_end();
> > +#endif
> 
> Everything in this patch LGTM, but I think we need maintainer input to
> ensure we are not breaking any rules about inline asm code in a driver (no
> idea if such rules exist) or whether a better place would be somewhere
> common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?

Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)

But the rule does exist:
https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly

"don’t use inline assembly gratuitously when C can do the job. You can and should
poke hardware from C when possible"

In this case, please explain exactly why memcpy() combined with smp_wmb()
barriers and/or WRITE_ONCE() couldn't solve it.
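
(For context, WRITE_ONCE() does cover a two-dword case: patch 2 in this
series publishes the MI_STORE_DATA_IMM header exactly that way. A minimal
sketch, reusing the names from patch 2:

	u32 dw[SZ_2], i = 0;

	dw[i++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
	dw[i++] = ofs;

	/* one aligned 64-bit store: header and offset land together */
	WRITE_ONCE(*(u64 *)&bb->cs[bb->len], *(u64 *)dw);

On x86-64 a plain atomic store tops out at 64 bits, though, while the
sequences in this patch are 4 and 8 dwords wide.)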

Also, please explain how exactly vmovdqu guarantees the atomicity promised
by the commit message. From a quick search, my take is that for these 128-
or 256-bit stores, atomicity is not guaranteed.

So, imho this patch is introducing unmaintainable, complex, and fragile
code that is not even doing what it claims to do. But I will be glad if
someone can challenge this and prove me wrong.

Thanks,
Rodrigo.

> 
> Matt 
> 
> > +}
> > +
> > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > +{
> > +	u32 instr_size = size * BITS_PER_BYTE;
> > +
> > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > +
> > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > +		memcpy_vmovdqu(dst, src, instr_size);
> > +	else
> > +		memcpy(dst, src, size);
> > +}
> > +
> > +#define EMIT_COPY_CCS_DW 8
> >  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> >  			  u64 dst_ofs, bool dst_is_indirect,
> >  			  u64 src_ofs, bool src_is_indirect,
> >  			  u32 size)
> >  {
> > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> >  	struct xe_device *xe = gt_to_xe(gt);
> >  	u32 *cs = bb->cs + bb->len;
> >  	u32 num_ccs_blks;
> >  	u32 num_pages;
> >  	u32 ccs_copy_size;
> >  	u32 mocs;
> > +	u32 i = 0;
> >  
> >  	if (GRAPHICS_VERx100(xe) >= 2000) {
> >  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> >  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> >  	}
> >  
> > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > -		ccs_copy_size;
> > -	*cs++ = lower_32_bits(src_ofs);
> > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > -	*cs++ = lower_32_bits(dst_ofs);
> > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > +		  ccs_copy_size;
> > +	dw[i++] = lower_32_bits(src_ofs);
> > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > +	dw[i++] = lower_32_bits(dst_ofs);
> > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> >  
> > +	/*
> > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > +	 * write the sequence atomically.
> > +	 */
> > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > +	cs += EMIT_COPY_CCS_DW;
> >  	bb->len = cs - bb->cs;
> >  }
> >  
> > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> >  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> >  }
> >  
> > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > +/*
> > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > + * save/restore while this sequence is being issued, partial writes may
> > + * trigger page faults when saving iGPU CCS metadata. Use
> > + * emit_atomic() to write the sequence atomically.
> > + */
> > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> >  {
> >  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > +
> > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > +		      MI_FLUSH_IMM_DW | flags;
> > +	dw[j++] = lower_32_bits(addr);
> > +	dw[j++] = upper_32_bits(addr);
> > +	dw[j++] = MI_NOOP;
> >  
> > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > -		  MI_FLUSH_IMM_DW | flags;
> > -	dw[i++] = lower_32_bits(addr);
> > -	dw[i++] = upper_32_bits(addr);
> > -	dw[i++] = MI_NOOP;
> > -	dw[i++] = MI_NOOP;
> > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> >  
> > -	return i;
> > +	return i + j;
> >  }
> >  
> >  /**
> > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> >  	/* Calculate Batch buffer size */
> >  	batch_size = 0;
> >  	while (size) {
> > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> >  		u64 ccs_ofs, ccs_size;
> >  		u32 ccs_pt;
> >  
> > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> >  	 * sizes here again before copy command is emitted.
> >  	 */
> >  	while (size) {
> > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> >  		u32 flush_flags = 0;
> >  		u64 ccs_ofs, ccs_size;
> >  		u32 ccs_pt;
> > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> >  
> >  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> >  
> > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> >  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> >  						  src_L0_ofs, dst_is_pltt,
> >  						  src_L0, ccs_ofs, true);
> > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> >  
> >  		size -= src_L0;
> >  	}
> > -- 
> > 2.51.0
> > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-09 13:00     ` Rodrigo Vivi
@ 2025-10-09 16:11       ` Matthew Brost
  2025-10-09 18:35         ` Rodrigo Vivi
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Brost @ 2025-10-09 16:11 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Satyanarayana K V P, intel-xe, Michal Wajdeczko, Matthew Auld

On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
> On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> > On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > save/restore while this sequence is being programmed, partial writes may
> > > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > > instruction to write the sequence atomically.
> > > 
> > > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > > 8 dwords instead of 5 dwords.
> > > 
> > > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > > chunks.
> > > 
> > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > 
> > > ---
> > > V4 -> V5:
> > > - Fixed review comments. (Matt B)
> > > 
> > > V3 -> V4:
> > > - Fixed review comments. (Wajdeczko)
> > > - Fix issues reported by patchworks.
> > > 
> > > V2 -> V3:
> > > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > > 
> > > V1 -> V2:
> > > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> > >   (Auld, Matthew)
> > > - Fix issues reported by patchworks.
> > > ---
> > >  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> > >  1 file changed, 72 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > index c39c3b423d05..b629072956ee 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > @@ -5,7 +5,9 @@
> > >  
> > >  #include "xe_migrate.h"
> > >  
> > > +#include <asm/fpu/api.h>
> > >  #include <linux/bitfield.h>
> > > +#include <linux/cpufeature.h>
> > >  #include <linux/sizes.h>
> > >  
> > >  #include <drm/drm_managed.h>
> > > @@ -33,6 +35,7 @@
> > >  #include "xe_res_cursor.h"
> > >  #include "xe_sa.h"
> > >  #include "xe_sched_job.h"
> > > +#include "xe_sriov_vf_ccs.h"
> > >  #include "xe_sync.h"
> > >  #include "xe_trace_bo.h"
> > >  #include "xe_validation.h"
> > > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> > >  	}
> > >  }
> > >  
> > > -#define EMIT_COPY_CCS_DW 5
> > > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > > +{
> > > +#ifdef CONFIG_X86
> > > +	kernel_fpu_begin();
> > > +	if (size == SZ_128) {
> > > +		asm("vmovdqu (%0), %%xmm0\n"
> > > +		    "vmovups %%xmm0,   (%1)\n"
> > > +		    :: "r" (src), "r" (dst) : "memory");
> > > +	} else if (size == SZ_256) {
> > > +		asm("vmovdqu (%0), %%ymm0\n"
> > > +		    "vmovups %%ymm0,   (%1)\n"
> > > +		    :: "r" (src), "r" (dst) : "memory");
> > > +	}
> > > +	kernel_fpu_end();
> > > +#endif
> > 
> > Everything in this patch LGTM, but I think we need maintainer input to
> > ensure we are not breaking any rules about inline asm code in a driver (no
> > idea if such rules exist) or whether a better place would be somewhere
> > common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?
> 
> Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
> 
> But the rule does exist:
> https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
> 
> "don’t use inline assembly gratuitously when C can do the job. You can and should
> poke hardware from C when possible"
> 
> > In this case, please explain exactly why memcpy() combined with smp_wmb()
> > barriers and/or WRITE_ONCE() couldn't solve it.
> 
> > Also, please explain how exactly vmovdqu guarantees the atomicity promised
> > by the commit message. From a quick search, my take is that for these 128-
> > or 256-bit stores, atomicity is not guaranteed.

I don't think cache atomicity is what we're after here—rather, it's vCPU
halting atomicity.

Consider the following case:
*b++ = XY_CTRL_SURF_COPY_BLT;
*b++ = addr;

If the vCPU is halted during the instruction that stores
XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
batch buffer (BB) that is being programmed as part of the VF save. This
will clearly cause the BB to hang due to a page fault on the copy
command.

If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
then either the entire GPU instruction is written or none of it is. I
believe vCPU halting guarantees that a CPU instruction is either fully
executed or not at all—regardless of how many micro-operations (uOPs) it
decodes into. If this guarantee does not hold, then the entire
architecture of CCS save/restore on PTL is fundamentally broken which is
always possible.
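
With the staged approach in this patch, the header and the address land in
a single store; a sketch using the names from this patch:

	u32 dw[EMIT_COPY_CCS_DW] = { MI_NOOP };

	dw[0] = XY_CTRL_SURF_COPY_BLT;
	dw[1] = lower_32_bits(addr);

	/* all 8 dwords become visible together, or not at all */
	emit_atomic(gt, cs, dw, sizeof(dw));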

> 
> > So, imho this patch is introducing unmaintainable, complex, and fragile
> > code that is not even doing what it claims to do. But I will be glad if
> > someone can challenge this and prove me wrong.
> 

Let me know if the above makes any sense.

Matt

> Thanks,
> Rodrigo.
> 
> > 
> > Matt 
> > 
> > > +}
> > > +
> > > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > > +{
> > > +	u32 instr_size = size * BITS_PER_BYTE;
> > > +
> > > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > > +
> > > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > > +		memcpy_vmovdqu(dst, src, instr_size);
> > > +	else
> > > +		memcpy(dst, src, size);
> > > +}
> > > +
> > > +#define EMIT_COPY_CCS_DW 8
> > >  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > >  			  u64 dst_ofs, bool dst_is_indirect,
> > >  			  u64 src_ofs, bool src_is_indirect,
> > >  			  u32 size)
> > >  {
> > > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> > >  	struct xe_device *xe = gt_to_xe(gt);
> > >  	u32 *cs = bb->cs + bb->len;
> > >  	u32 num_ccs_blks;
> > >  	u32 num_pages;
> > >  	u32 ccs_copy_size;
> > >  	u32 mocs;
> > > +	u32 i = 0;
> > >  
> > >  	if (GRAPHICS_VERx100(xe) >= 2000) {
> > >  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > >  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> > >  	}
> > >  
> > > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > -		ccs_copy_size;
> > > -	*cs++ = lower_32_bits(src_ofs);
> > > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > > -	*cs++ = lower_32_bits(dst_ofs);
> > > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > +		  ccs_copy_size;
> > > +	dw[i++] = lower_32_bits(src_ofs);
> > > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > > +	dw[i++] = lower_32_bits(dst_ofs);
> > > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> > >  
> > > +	/*
> > > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > > +	 * write the sequence atomically.
> > > +	 */
> > > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > > +	cs += EMIT_COPY_CCS_DW;
> > >  	bb->len = cs - bb->cs;
> > >  }
> > >  
> > > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> > >  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> > >  }
> > >  
> > > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > > +/*
> > > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > > + * save/restore while this sequence is being issued, partial writes may
> > > + * trigger page faults when saving iGPU CCS metadata. Use
> > > + * emit_atomic() to write the sequence atomically.
> > > + */
> > > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> > >  {
> > >  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > > +
> > > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > +		      MI_FLUSH_IMM_DW | flags;
> > > +	dw[j++] = lower_32_bits(addr);
> > > +	dw[j++] = upper_32_bits(addr);
> > > +	dw[j++] = MI_NOOP;
> > >  
> > > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > -		  MI_FLUSH_IMM_DW | flags;
> > > -	dw[i++] = lower_32_bits(addr);
> > > -	dw[i++] = upper_32_bits(addr);
> > > -	dw[i++] = MI_NOOP;
> > > -	dw[i++] = MI_NOOP;
> > > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> > >  
> > > -	return i;
> > > +	return i + j;
> > >  }
> > >  
> > >  /**
> > > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > >  	/* Calculate Batch buffer size */
> > >  	batch_size = 0;
> > >  	while (size) {
> > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > >  		u64 ccs_ofs, ccs_size;
> > >  		u32 ccs_pt;
> > >  
> > > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > >  	 * sizes here again before copy command is emitted.
> > >  	 */
> > >  	while (size) {
> > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > >  		u32 flush_flags = 0;
> > >  		u64 ccs_ofs, ccs_size;
> > >  		u32 ccs_pt;
> > > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > >  
> > >  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> > >  
> > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > >  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> > >  						  src_L0_ofs, dst_is_pltt,
> > >  						  src_L0, ccs_ofs, true);
> > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > >  
> > >  		size -= src_L0;
> > >  	}
> > > -- 
> > > 2.51.0
> > > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-09 16:11       ` Matthew Brost
@ 2025-10-09 18:35         ` Rodrigo Vivi
  2025-10-09 18:49           ` Matthew Brost
  0 siblings, 1 reply; 16+ messages in thread
From: Rodrigo Vivi @ 2025-10-09 18:35 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Satyanarayana K V P, intel-xe, Michal Wajdeczko, Matthew Auld

On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
> On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
> > On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> > > On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > > > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > save/restore while this sequence is being programmed, partial writes may
> > > > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > > > instruction to write the sequence atomically.
> > > > 
> > > > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > > > 8 dwords instead of 5 dwords.
> > > > 
> > > > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > > > chunks.
> > > > 
> > > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > > > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > 
> > > > ---
> > > > V4 -> V5:
> > > > - Fixed review comments. (Matt B)
> > > > 
> > > > V3 -> V4:
> > > > - Fixed review comments. (Wajdeczko)
> > > > - Fix issues reported by patchworks.
> > > > 
> > > > V2 -> V3:
> > > > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > > > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > > > 
> > > > V1 -> V2:
> > > > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> > > >   (Auld, Matthew)
> > > > - Fix issues reported by patchworks.
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> > > >  1 file changed, 72 insertions(+), 21 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > index c39c3b423d05..b629072956ee 100644
> > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > @@ -5,7 +5,9 @@
> > > >  
> > > >  #include "xe_migrate.h"
> > > >  
> > > > +#include <asm/fpu/api.h>
> > > >  #include <linux/bitfield.h>
> > > > +#include <linux/cpufeature.h>
> > > >  #include <linux/sizes.h>
> > > >  
> > > >  #include <drm/drm_managed.h>
> > > > @@ -33,6 +35,7 @@
> > > >  #include "xe_res_cursor.h"
> > > >  #include "xe_sa.h"
> > > >  #include "xe_sched_job.h"
> > > > +#include "xe_sriov_vf_ccs.h"
> > > >  #include "xe_sync.h"
> > > >  #include "xe_trace_bo.h"
> > > >  #include "xe_validation.h"
> > > > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> > > >  	}
> > > >  }
> > > >  
> > > > -#define EMIT_COPY_CCS_DW 5
> > > > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > > > +{
> > > > +#ifdef CONFIG_X86
> > > > +	kernel_fpu_begin();
> > > > +	if (size == SZ_128) {
> > > > +		asm("vmovdqu (%0), %%xmm0\n"
> > > > +		    "vmovups %%xmm0,   (%1)\n"
> > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > +	} else if (size == SZ_256) {
> > > > +		asm("vmovdqu (%0), %%ymm0\n"
> > > > +		    "vmovups %%ymm0,   (%1)\n"
> > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > +	}
> > > > +	kernel_fpu_end();
> > > > +#endif
> > > 
> > > Everything in this patch LGTM, but I think we need maintainer input to
> > > ensure we are not breaking any rules about inline asm code in a driver (no
> > > idea if such rules exist) or whether a better place would be somewhere
> > > common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?
> > 
> > Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
> > 
> > But the rule does exist:
> > https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
> > 
> > "don’t use inline assembly gratuitously when C can do the job. You can and should
> > poke hardware from C when possible"
> > 
> > In this case, please explain exactly why memcpy() combined with smp_wmb()
> > barriers and/or WRITE_ONCE() couldn't solve it.
> > 
> > Also, please explain how exactly vmovdqu guarantees the atomicity promised
> > by the commit message. From a quick search, my take is that for these 128-
> > or 256-bit stores, atomicity is not guaranteed.
> 
> I don't think cache atomicity is what we're after here—rather, it's vCPU
> halting atomicity.
> 
> Consider the following case:
> *b++ = XY_CTRL_SURF_COPY_BLT;
> *b++ = addr;
> 
> If the vCPU is halted during the instruction that stores
> XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
> batch buffer (BB) that is being programmed as part of the VF save. This
> will clearly cause the BB to hang due to a page fault on the copy
> command.

Okay, perhaps this is what is confusing me the most.
What I don't understand in the flow is: why is the GuC already
executing it, or going to execute it, while the vCPU can be halted when
writing the command to the buffer? And why not write to the buffer first
and then send it to the exec queue?

> 
> If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
> then either the entire GPU instruction is written or none of it is. I
> believe vCPU halting guarantees that a CPU instruction is either fully
> executed or not at all—regardless of how many micro-operations (uOPs) it
> decodes into. If this guarantee does not hold, then the entire
> architecture of CCS save/restore on PTL is fundamentally broken which is
> always possible.

Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
of the vmovdqu or the vmovups, only before, between, or after them.

But is this uncached and/or coherent? Isn't there really any possibility
that the command finished, but the GuC, executing things mid-flight, is
still seeing different cachelines?

> 
> > 
> > So, imho this patch is introducing a unmaintainable, complex, and fragile code
> > that is not even doing what it is claiming to do. But I will be glad if someone
> > can challenge this and prove me wrong.
> > 
> 
> Let me know if the above makes any sense.

Okay. But how do we handle cases where AVX might not be available? Is it really not needed?

> 
> Matt
> 
> > Thanks,
> > Rodrigo.
> > 
> > > 
> > > Matt 
> > > 
> > > > +}
> > > > +
> > > > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > > > +{
> > > > +	u32 instr_size = size * BITS_PER_BYTE;
> > > > +
> > > > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > > > +
> > > > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > > > +		memcpy_vmovdqu(dst, src, instr_size);
> > > > +	else
> > > > +		memcpy(dst, src, size);
> > > > +}
> > > > +
> > > > +#define EMIT_COPY_CCS_DW 8
> > > >  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > >  			  u64 dst_ofs, bool dst_is_indirect,
> > > >  			  u64 src_ofs, bool src_is_indirect,
> > > >  			  u32 size)
> > > >  {
> > > > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> > > >  	struct xe_device *xe = gt_to_xe(gt);
> > > >  	u32 *cs = bb->cs + bb->len;
> > > >  	u32 num_ccs_blks;
> > > >  	u32 num_pages;
> > > >  	u32 ccs_copy_size;
> > > >  	u32 mocs;
> > > > +	u32 i = 0;
> > > >  
> > > >  	if (GRAPHICS_VERx100(xe) >= 2000) {
> > > >  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > >  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> > > >  	}
> > > >  
> > > > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > > > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > -		ccs_copy_size;
> > > > -	*cs++ = lower_32_bits(src_ofs);
> > > > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > > > -	*cs++ = lower_32_bits(dst_ofs);
> > > > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > > > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > > > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > +		  ccs_copy_size;
> > > > +	dw[i++] = lower_32_bits(src_ofs);
> > > > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > > > +	dw[i++] = lower_32_bits(dst_ofs);
> > > > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> > > >  
> > > > +	/*
> > > > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > > > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > > > +	 * write the sequence atomically.
> > > > +	 */
> > > > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > > > +	cs += EMIT_COPY_CCS_DW;
> > > >  	bb->len = cs - bb->cs;
> > > >  }
> > > >  
> > > > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> > > >  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> > > >  }
> > > >  
> > > > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > > > +/*
> > > > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > > > + * save/restore while this sequence is being issued, partial writes may
> > > > + * trigger page faults when saving iGPU CCS metadata. Use
> > > > + * emit_atomic() to write the sequence atomically.
> > > > + */
> > > > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > > > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> > > >  {
> > > >  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > > > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > > > +
> > > > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > +		      MI_FLUSH_IMM_DW | flags;
> > > > +	dw[j++] = lower_32_bits(addr);
> > > > +	dw[j++] = upper_32_bits(addr);
> > > > +	dw[j++] = MI_NOOP;
> > > >  
> > > > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > -		  MI_FLUSH_IMM_DW | flags;
> > > > -	dw[i++] = lower_32_bits(addr);
> > > > -	dw[i++] = upper_32_bits(addr);
> > > > -	dw[i++] = MI_NOOP;
> > > > -	dw[i++] = MI_NOOP;
> > > > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> > > >  
> > > > -	return i;
> > > > +	return i + j;
> > > >  }
> > > >  
> > > >  /**
> > > > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > >  	/* Calculate Batch buffer size */
> > > >  	batch_size = 0;
> > > >  	while (size) {
> > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > >  		u64 ccs_ofs, ccs_size;
> > > >  		u32 ccs_pt;
> > > >  
> > > > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > >  	 * sizes here again before copy command is emitted.
> > > >  	 */
> > > >  	while (size) {
> > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > >  		u32 flush_flags = 0;
> > > >  		u64 ccs_ofs, ccs_size;
> > > >  		u32 ccs_pt;
> > > > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > >  
> > > >  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> > > >  
> > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > >  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> > > >  						  src_L0_ofs, dst_is_pltt,
> > > >  						  src_L0, ccs_ofs, true);
> > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > >  
> > > >  		size -= src_L0;
> > > >  	}
> > > > -- 
> > > > 2.51.0
> > > > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-09 18:35         ` Rodrigo Vivi
@ 2025-10-09 18:49           ` Matthew Brost
  2025-10-09 19:49             ` Rodrigo Vivi
  2025-10-09 23:06             ` Matt Roper
  0 siblings, 2 replies; 16+ messages in thread
From: Matthew Brost @ 2025-10-09 18:49 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Satyanarayana K V P, intel-xe, Michal Wajdeczko, Matthew Auld

On Thu, Oct 09, 2025 at 02:35:10PM -0400, Rodrigo Vivi wrote:
> On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
> > On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
> > > On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> > > > On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > > > > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > save/restore while this sequence is being programmed, partial writes may
> > > > > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > > > > instruction to write the sequence atomically.
> > > > > 
> > > > > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > > > > 8 dwords instead of 5 dwords.
> > > > > 
> > > > > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > > > > chunks.
> > > > > 
> > > > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > > > > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > 
> > > > > ---
> > > > > V4 -> V5:
> > > > > - Fixed review comments. (Matt B)
> > > > > 
> > > > > V3 -> V4:
> > > > > - Fixed review comments. (Wajdeczko)
> > > > > - Fix issues reported by patchworks.
> > > > > 
> > > > > V2 -> V3:
> > > > > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > > > > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > > > > 
> > > > > V1 -> V2:
> > > > > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> > > > >   (Auld, Matthew)
> > > > > - Fix issues reported by patchworks.
> > > > > ---
> > > > >  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> > > > >  1 file changed, 72 insertions(+), 21 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > index c39c3b423d05..b629072956ee 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > @@ -5,7 +5,9 @@
> > > > >  
> > > > >  #include "xe_migrate.h"
> > > > >  
> > > > > +#include <asm/fpu/api.h>
> > > > >  #include <linux/bitfield.h>
> > > > > +#include <linux/cpufeature.h>
> > > > >  #include <linux/sizes.h>
> > > > >  
> > > > >  #include <drm/drm_managed.h>
> > > > > @@ -33,6 +35,7 @@
> > > > >  #include "xe_res_cursor.h"
> > > > >  #include "xe_sa.h"
> > > > >  #include "xe_sched_job.h"
> > > > > +#include "xe_sriov_vf_ccs.h"
> > > > >  #include "xe_sync.h"
> > > > >  #include "xe_trace_bo.h"
> > > > >  #include "xe_validation.h"
> > > > > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> > > > >  	}
> > > > >  }
> > > > >  
> > > > > -#define EMIT_COPY_CCS_DW 5
> > > > > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > > > > +{
> > > > > +#ifdef CONFIG_X86
> > > > > +	kernel_fpu_begin();
> > > > > +	if (size == SZ_128) {
> > > > > +		asm("vmovdqu (%0), %%xmm0\n"
> > > > > +		    "vmovups %%xmm0,   (%1)\n"
> > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > +	} else if (size == SZ_256) {
> > > > > +		asm("vmovdqu (%0), %%ymm0\n"
> > > > > +		    "vmovups %%ymm0,   (%1)\n"
> > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > +	}
> > > > > +	kernel_fpu_end();
> > > > > +#endif
> > > > 
> > > > > Everything in this patch LGTM, but I think we need maintainer input to
> > > > > ensure we are not breaking some rules about inlined asm code in a driver
> > > > > (no idea if such rules exist) or if a better place would be somewhere
> > > > > common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?
> > > 
> > > Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
> > > 
> > > But the rule does exist:
> > > https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
> > > 
> > > "don’t use inline assembly gratuitously when C can do the job. You can and should
> > > poke hardware from C when possible"
> > > 
> > > In this case here, please explain why exactly memcpy combined with smp_wmb
> > > barriers and/or WRITE_ONCE couldn't solve it.
> > > 
> > > Also, please explain how exactly vmovdqu guarantees the atomicity promised by
> > > the commit message. On a quick search here, my take is that for these 128 or
> > > 256 bit stores, atomicity is not guaranteed.
> > 
> > I don't think cache atomicity is what we're after here—rather, it's vCPU
> > halting atomicity.
> > 
> > Consider the following case:
> > *b++ = XY_CTRL_SURF_COPY_BLT;
> > *b++ = addr;
> > 
> > If the vCPU is halted during the instruction that stores
> > XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
> > batch buffer (BB) that is being programmed as part of the VF save. This
> > will clearly cause the BB to hang due to a page fault on the copy
> > command.
> 
> > Okay, perhaps this is what is getting me confused most.
> > What I don't understand in the flow is: why is the GuC already
> > executing it, or going to execute it, while you get halted when
> > writing the command to the buffer?  Why not write to the buffer first
> > and then send it to the exec queue?
> 

It's how this feature was architected; I will send the SaS link over off the list.

> > 
> > If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
> > then either the entire GPU instruction is written or none of it is. I
> > believe vCPU halting guarantees that a CPU instruction is either fully
> > executed or not at all, regardless of how many micro-operations (uOPs) it
> > decodes into. If this guarantee does not hold, then the entire
> > architecture of CCS save/restore on PTL is fundamentally broken, which is
> > always possible.
> 
> Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
> of the vmovdqu or the vmovups, only before, between, or after them.
> 
> But is this uncached and/or coherent? Isn't there really any possibility that
> the command write finished, but the GuC, executing things mid-flight, is
> still seeing stale cachelines?
> 

The GuC won't start executing until the vCPUs unpause on the save flow.
The restore flow is a bit more tricky, as the vCPUs are live when this
happens, but I think we can work around that race in software. That part
is not in this series.

> > 
> > > 
> > > So, imho this patch is introducing unmaintainable, complex, and fragile code
> > > that is not even doing what it is claiming to do. But I will be glad if someone
> > > can challenge this and prove me wrong.
> > > 
> > 
> > Let me know if the above makes any sense.
> 
> Okay. But how do we handle cases where AVX might not be available? Is it really not needed?
> 

This is an iGPU feature for PTL, so it shouldn't be an issue as PTL has
AVX instructions.

Matt

> > 
> > Matt
> > 
> > > Thanks,
> > > Rodrigo.
> > > 
> > > > 
> > > > Matt 
> > > > 
> > > > > +}
> > > > > +
> > > > > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > > > > +{
> > > > > +	u32 instr_size = size * BITS_PER_BYTE;
> > > > > +
> > > > > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > > > > +
> > > > > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > > > > +		memcpy_vmovdqu(dst, src, instr_size);
> > > > > +	else
> > > > > +		memcpy(dst, src, size);
> > > > > +}
> > > > > +
> > > > > +#define EMIT_COPY_CCS_DW 8
> > > > >  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > >  			  u64 dst_ofs, bool dst_is_indirect,
> > > > >  			  u64 src_ofs, bool src_is_indirect,
> > > > >  			  u32 size)
> > > > >  {
> > > > > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> > > > >  	struct xe_device *xe = gt_to_xe(gt);
> > > > >  	u32 *cs = bb->cs + bb->len;
> > > > >  	u32 num_ccs_blks;
> > > > >  	u32 num_pages;
> > > > >  	u32 ccs_copy_size;
> > > > >  	u32 mocs;
> > > > > +	u32 i = 0;
> > > > >  
> > > > >  	if (GRAPHICS_VERx100(xe) >= 2000) {
> > > > >  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > > > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > >  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> > > > >  	}
> > > > >  
> > > > > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > > > > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > -		ccs_copy_size;
> > > > > -	*cs++ = lower_32_bits(src_ofs);
> > > > > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > > > > -	*cs++ = lower_32_bits(dst_ofs);
> > > > > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > > > > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > > > > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > +		  ccs_copy_size;
> > > > > +	dw[i++] = lower_32_bits(src_ofs);
> > > > > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > > > > +	dw[i++] = lower_32_bits(dst_ofs);
> > > > > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> > > > >  
> > > > > +	/*
> > > > > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > > > > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > > > > +	 * write the sequence atomically.
> > > > > +	 */
> > > > > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > > > > +	cs += EMIT_COPY_CCS_DW;
> > > > >  	bb->len = cs - bb->cs;
> > > > >  }
> > > > >  
> > > > > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> > > > >  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> > > > >  }
> > > > >  
> > > > > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > > > > +/*
> > > > > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > > > > + * save/restore while this sequence is being issued, partial writes may
> > > > > + * trigger page faults when saving iGPU CCS metadata. Use
> > > > > + * emit_atomic() to write the sequence atomically.
> > > > > + */
> > > > > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > > > > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> > > > >  {
> > > > >  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > > > > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > > > > +
> > > > > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > +		      MI_FLUSH_IMM_DW | flags;
> > > > > +	dw[j++] = lower_32_bits(addr);
> > > > > +	dw[j++] = upper_32_bits(addr);
> > > > > +	dw[j++] = MI_NOOP;
> > > > >  
> > > > > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > -		  MI_FLUSH_IMM_DW | flags;
> > > > > -	dw[i++] = lower_32_bits(addr);
> > > > > -	dw[i++] = upper_32_bits(addr);
> > > > > -	dw[i++] = MI_NOOP;
> > > > > -	dw[i++] = MI_NOOP;
> > > > > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> > > > >  
> > > > > -	return i;
> > > > > +	return i + j;
> > > > >  }
> > > > >  
> > > > >  /**
> > > > > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > >  	/* Calculate Batch buffer size */
> > > > >  	batch_size = 0;
> > > > >  	while (size) {
> > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > >  		u64 ccs_ofs, ccs_size;
> > > > >  		u32 ccs_pt;
> > > > >  
> > > > > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > >  	 * sizes here again before copy command is emitted.
> > > > >  	 */
> > > > >  	while (size) {
> > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > >  		u32 flush_flags = 0;
> > > > >  		u64 ccs_ofs, ccs_size;
> > > > >  		u32 ccs_pt;
> > > > > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > >  
> > > > >  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> > > > >  
> > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > >  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> > > > >  						  src_L0_ofs, dst_is_pltt,
> > > > >  						  src_L0, ccs_ofs, true);
> > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > >  
> > > > >  		size -= src_L0;
> > > > >  	}
> > > > > -- 
> > > > > 2.51.0
> > > > > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-09 18:49           ` Matthew Brost
@ 2025-10-09 19:49             ` Rodrigo Vivi
  2025-10-09 20:34               ` Michal Wajdeczko
  2025-10-09 23:06             ` Matt Roper
  1 sibling, 1 reply; 16+ messages in thread
From: Rodrigo Vivi @ 2025-10-09 19:49 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Satyanarayana K V P, intel-xe, Michal Wajdeczko, Matthew Auld

On Thu, Oct 09, 2025 at 11:49:16AM -0700, Matthew Brost wrote:
> On Thu, Oct 09, 2025 at 02:35:10PM -0400, Rodrigo Vivi wrote:
> > On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
> > > On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
> > > > On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> > > > > On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > > > > > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > > save/restore while this sequence is being programmed, partial writes may
> > > > > > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > > > > > instruction to write the sequence atomically.
> > > > > > 
> > > > > > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > > > > > 8 dwords instead of 5 dwords.
> > > > > > 
> > > > > > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > > > > > chunks.
> > > > > > 
> > > > > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > > > > > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > > 
> > > > > > ---
> > > > > > V4 -> V5:
> > > > > > - Fixed review comments. (Matt B)
> > > > > > 
> > > > > > V3 -> V4:
> > > > > > - Fixed review comments. (Wajdeczko)
> > > > > > - Fix issues reported by patchworks.
> > > > > > 
> > > > > > V2 -> V3:
> > > > > > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > > > > > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > > > > > 
> > > > > > V1 -> V2:
> > > > > > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> > > > > >   (Auld, Matthew)
> > > > > > - Fix issues reported by patchworks.
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> > > > > >  1 file changed, 72 insertions(+), 21 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > index c39c3b423d05..b629072956ee 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > @@ -5,7 +5,9 @@
> > > > > >  
> > > > > >  #include "xe_migrate.h"
> > > > > >  
> > > > > > +#include <asm/fpu/api.h>
> > > > > >  #include <linux/bitfield.h>
> > > > > > +#include <linux/cpufeature.h>
> > > > > >  #include <linux/sizes.h>
> > > > > >  
> > > > > >  #include <drm/drm_managed.h>
> > > > > > @@ -33,6 +35,7 @@
> > > > > >  #include "xe_res_cursor.h"
> > > > > >  #include "xe_sa.h"
> > > > > >  #include "xe_sched_job.h"
> > > > > > +#include "xe_sriov_vf_ccs.h"
> > > > > >  #include "xe_sync.h"
> > > > > >  #include "xe_trace_bo.h"
> > > > > >  #include "xe_validation.h"
> > > > > > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> > > > > >  	}
> > > > > >  }
> > > > > >  
> > > > > > -#define EMIT_COPY_CCS_DW 5
> > > > > > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > > > > > +{
> > > > > > +#ifdef CONFIG_X86
> > > > > > +	kernel_fpu_begin();
> > > > > > +	if (size == SZ_128) {
> > > > > > +		asm("vmovdqu (%0), %%xmm0\n"
> > > > > > +		    "vmovups %%xmm0,   (%1)\n"
> > > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > > +	} else if (size == SZ_256) {
> > > > > > +		asm("vmovdqu (%0), %%ymm0\n"
> > > > > > +		    "vmovups %%ymm0,   (%1)\n"
> > > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > > +	}
> > > > > > +	kernel_fpu_end();
> > > > > > +#endif
> > > > > 
> > > > > Everything in this patch LGTM, but I think we need maintainer input to
> > > > > ensure we are not breaking some rules about inlined asm code in a driver
> > > > > (no idea if such rules exist) or if a better place would be somewhere
> > > > > common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?
> > > > 
> > > > Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
> > > > 
> > > > But the rule does exist:
> > > > https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
> > > > 
> > > > "don’t use inline assembly gratuitously when C can do the job. You can and should
> > > > poke hardware from C when possible"
> > > > 
> > > > In this case here, please explain why exactly memcpy combined with smp_wmb
> > > > barriers and/or WRITE_ONCE couldn't solve it.
> > > > 
> > > > Also, please explain how exactly vmovdqu guarantees the atomicity promised by
> > > > the commit message. On a quick search here, my take is that for these 128 or
> > > > 256 bit stores, atomicity is not guaranteed.
> > > 
> > > I don't think cache atomicity is what we're after here—rather, it's vCPU
> > > halting atomicity.
> > > 
> > > Consider the following case:
> > > *b++ = XY_CTRL_SURF_COPY_BLT;
> > > *b++ = addr;
> > > 
> > > If the vCPU is halted during the instruction that stores
> > > XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
> > > batch buffer (BB) that is being programmed as part of the VF save. This
> > > will clearly cause the BB to hang due to a page fault on the copy
> > > command.
> > 
> > Okay, perhaps this is what is getting me confused most.
> > What I don't understand in the flow is: why is the GuC already
> > executing it, or going to execute it, while you get halted when
> > writing the command to the buffer?  Why not write to the buffer first
> > and then send it to the exec queue?
> > 
> 
> It's how this feature was architected; I will send the SaS link over off the list.

It probably deserves some comments around the code on how that works and
why we are doing that.
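
Something along these lines perhaps (wording is mine, just to show the
level of detail I'd expect, based on what was explained in this thread):

	/*
	 * The CCS save/restore batch buffers are written by the VF driver
	 * while the vCPUs may get halted at any instruction boundary by the
	 * hypervisor, and the GuC consumes them during the VF save flow
	 * once the vCPUs are paused.  A partially written command must
	 * therefore never be observable, so each command is emitted with a
	 * single AVX store: either the whole command lands or none of it.
	 */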

> 
> > > 
> > > If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
> > > then either the entire GPU instruction is written or none of it is. I
> > > believe vCPU halting guarantees that a CPU instruction is either fully
> > > executed or not at all, regardless of how many micro-operations (uOPs) it
> > > decodes into. If this guarantee does not hold, then the entire
> > > architecture of CCS save/restore on PTL is fundamentally broken, which is
> > > always possible.
> > 
> > Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
> > of the vmovdqu or the vmovups, only before, between, or after them.
> > 
> > But is this uncached and/or coherent? Isn't there really any possibility that
> > the command write finished, but the GuC, executing things mid-flight, is
> > still seeing stale cachelines?
> > 
> 
> The GuC won't start executing until the vCPUs unpause on the save flow.
> The restore flow is a bit more tricky, as the vCPUs are live when this
> happens, but I think we can work around that race in software. That part
> is not in this series.
> 
> > > 
> > > > 
> > > > So, imho this patch is introducing unmaintainable, complex, and fragile code
> > > > that is not even doing what it is claiming to do. But I will be glad if someone
> > > > can challenge this and prove me wrong.
> > > > 
> > > 
> > > Let me know if the above makes any sense.
> > 
> > Okay. But how do we handle cases where AVX might not be available? Is it really not needed?
> > 
> 
> This is an iGPU feature for PTL, so it shouldn't be an issue as PTL has
> AVX instructions.

Some comments around the code about that, to make clear that we don't try
to reuse this later on any discrete part.
And perhaps an assert on !dgfx?!
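
Something like this in emit_atomic() maybe (untested sketch; the gt is
already at hand there in this patch):

	xe_gt_assert(gt, !IS_DGFX(gt_to_xe(gt)));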

> 
> Matt
> 
> > > 
> > > Matt
> > > 
> > > > Thanks,
> > > > Rodrigo.
> > > > 
> > > > > 
> > > > > Matt 
> > > > > 
> > > > > > +}
> > > > > > +
> > > > > > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > > > > > +{
> > > > > > +	u32 instr_size = size * BITS_PER_BYTE;
> > > > > > +
> > > > > > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > > > > > +
> > > > > > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > > > > > +		memcpy_vmovdqu(dst, src, instr_size);
> > > > > > +	else
> > > > > > +		memcpy(dst, src, size);
> > > > > > +}
> > > > > > +
> > > > > > +#define EMIT_COPY_CCS_DW 8
> > > > > >  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > > >  			  u64 dst_ofs, bool dst_is_indirect,
> > > > > >  			  u64 src_ofs, bool src_is_indirect,
> > > > > >  			  u32 size)
> > > > > >  {
> > > > > > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> > > > > >  	struct xe_device *xe = gt_to_xe(gt);
> > > > > >  	u32 *cs = bb->cs + bb->len;
> > > > > >  	u32 num_ccs_blks;
> > > > > >  	u32 num_pages;
> > > > > >  	u32 ccs_copy_size;
> > > > > >  	u32 mocs;
> > > > > > +	u32 i = 0;
> > > > > >  
> > > > > >  	if (GRAPHICS_VERx100(xe) >= 2000) {
> > > > > >  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > > > > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > > >  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> > > > > >  	}
> > > > > >  
> > > > > > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > > > > > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > > -		ccs_copy_size;
> > > > > > -	*cs++ = lower_32_bits(src_ofs);
> > > > > > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > > > > > -	*cs++ = lower_32_bits(dst_ofs);
> > > > > > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > > > > > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > > > > > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > > +		  ccs_copy_size;
> > > > > > +	dw[i++] = lower_32_bits(src_ofs);
> > > > > > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > > > > > +	dw[i++] = lower_32_bits(dst_ofs);
> > > > > > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > > > > > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > > > > > +	 * write the sequence atomically.
> > > > > > +	 */
> > > > > > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > > > > > +	cs += EMIT_COPY_CCS_DW;
> > > > > >  	bb->len = cs - bb->cs;
> > > > > >  }
> > > > > >  
> > > > > > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> > > > > >  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> > > > > >  }
> > > > > >  
> > > > > > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > > > > > +/*
> > > > > > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > > > > > + * save/restore while this sequence is being issued, partial writes may
> > > > > > + * trigger page faults when saving iGPU CCS metadata. Use
> > > > > > + * emit_atomic() to write the sequence atomically.
> > > > > > + */
> > > > > > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > > > > > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> > > > > >  {
> > > > > >  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > > > > > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > > > > > +
> > > > > > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > > +		      MI_FLUSH_IMM_DW | flags;
> > > > > > +	dw[j++] = lower_32_bits(addr);
> > > > > > +	dw[j++] = upper_32_bits(addr);
> > > > > > +	dw[j++] = MI_NOOP;
> > > > > >  
> > > > > > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > > -		  MI_FLUSH_IMM_DW | flags;
> > > > > > -	dw[i++] = lower_32_bits(addr);
> > > > > > -	dw[i++] = upper_32_bits(addr);
> > > > > > -	dw[i++] = MI_NOOP;
> > > > > > -	dw[i++] = MI_NOOP;
> > > > > > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> > > > > >  
> > > > > > -	return i;
> > > > > > +	return i + j;
> > > > > >  }
> > > > > >  
> > > > > >  /**
> > > > > > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > >  	/* Calculate Batch buffer size */
> > > > > >  	batch_size = 0;
> > > > > >  	while (size) {
> > > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > > >  		u64 ccs_ofs, ccs_size;
> > > > > >  		u32 ccs_pt;
> > > > > >  
> > > > > > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > >  	 * sizes here again before copy command is emitted.
> > > > > >  	 */
> > > > > >  	while (size) {
> > > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > > >  		u32 flush_flags = 0;
> > > > > >  		u64 ccs_ofs, ccs_size;
> > > > > >  		u32 ccs_pt;
> > > > > > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > >  
> > > > > >  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> > > > > >  
> > > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > > >  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> > > > > >  						  src_L0_ofs, dst_is_pltt,
> > > > > >  						  src_L0, ccs_ofs, true);
> > > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > > >  
> > > > > >  		size -= src_L0;
> > > > > >  	}
> > > > > > -- 
> > > > > > 2.51.0
> > > > > > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-09 19:49             ` Rodrigo Vivi
@ 2025-10-09 20:34               ` Michal Wajdeczko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Wajdeczko @ 2025-10-09 20:34 UTC (permalink / raw)
  To: Rodrigo Vivi, Matthew Brost; +Cc: Satyanarayana K V P, intel-xe, Matthew Auld



On 10/9/2025 9:49 PM, Rodrigo Vivi wrote:
> On Thu, Oct 09, 2025 at 11:49:16AM -0700, Matthew Brost wrote:
>> On Thu, Oct 09, 2025 at 02:35:10PM -0400, Rodrigo Vivi wrote:
>>> On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
>>>> On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
>>>>> On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
>>>>>> On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
>>>>>>> The CCS copy command is a 5-dword sequence. If the vCPU halts during
>>>>>>> save/restore while this sequence is being programmed, partial writes may
>>>>>>> trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
>>>>>>> instruction to write the sequence atomically.
>>>>>>>
>>>>>>> Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
>>>>>>> 8 dwords instead of 5 dwords.
>>>>>>>
>>>>>>> Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
>>>>>>> chunks.
>>>>>>>
>>>>>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>>>>>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>>>>>>
>>>>>>> ---
>>>>>>> V4 -> V5:
>>>>>>> - Fixed review comments. (Matt B)
>>>>>>>
>>>>>>> V3 -> V4:
>>>>>>> - Fixed review comments. (Wajdeczko)
>>>>>>> - Fix issues reported by patchworks.
>>>>>>>
>>>>>>> V2 -> V3:
>>>>>>> - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
>>>>>>> - Updated emit_flush_invalidate() to use vmovdqu instruction.
>>>>>>>
>>>>>>> V1 -> V2:
>>>>>>> - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
>>>>>>>   (Auld, Matthew)
>>>>>>> - Fix issues reported by patchworks.
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
>>>>>>>  1 file changed, 72 insertions(+), 21 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>> index c39c3b423d05..b629072956ee 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>> @@ -5,7 +5,9 @@
>>>>>>>  
>>>>>>>  #include "xe_migrate.h"
>>>>>>>  
>>>>>>> +#include <asm/fpu/api.h>
>>>>>>>  #include <linux/bitfield.h>
>>>>>>> +#include <linux/cpufeature.h>
>>>>>>>  #include <linux/sizes.h>
>>>>>>>  
>>>>>>>  #include <drm/drm_managed.h>
>>>>>>> @@ -33,6 +35,7 @@
>>>>>>>  #include "xe_res_cursor.h"
>>>>>>>  #include "xe_sa.h"
>>>>>>>  #include "xe_sched_job.h"
>>>>>>> +#include "xe_sriov_vf_ccs.h"
>>>>>>>  #include "xe_sync.h"
>>>>>>>  #include "xe_trace_bo.h"
>>>>>>>  #include "xe_validation.h"
>>>>>>> @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
>>>>>>>  	}
>>>>>>>  }
>>>>>>>  
>>>>>>> -#define EMIT_COPY_CCS_DW 5
>>>>>>> +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_X86
>>>>>>> +	kernel_fpu_begin();
>>>>>>> +	if (size == SZ_128) {
>>>>>>> +		asm("vmovdqu (%0), %%xmm0\n"
>>>>>>> +		    "vmovups %%xmm0,   (%1)\n"
>>>>>>> +		    :: "r" (src), "r" (dst) : "memory");
>>>>>>> +	} else if (size == SZ_256) {
>>>>>>> +		asm("vmovdqu (%0), %%ymm0\n"
>>>>>>> +		    "vmovups %%ymm0,   (%1)\n"
>>>>>>> +		    :: "r" (src), "r" (dst) : "memory");
>>>>>>> +	}
>>>>>>> +	kernel_fpu_end();
>>>>>>> +#endif
>>>>>>
>>>>>> Everything in this patch LGTM, but I think we need maintainer input to
>>>>>> ensure we are not breaking some rules about inlined asm code in a driver
>>>>>> (no idea if such rules exist) or if a better place would be somewhere
>>>>>> common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?
>>>>>
>>>>> Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
>>>>>
>>>>> But the rule does exist:
>>>>> https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
>>>>>
>>>>> "don’t use inline assembly gratuitously when C can do the job. You can and should
>>>>> poke hardware from C when possible"
>>>>>
>>>>> In this case here, please explain why exactly memcpy combined with smp_wmb
>>>>> barriers and/or WRITE_ONCE couldn't solve it.
>>>>>
>>>>> Also, please explain how exactly vmovdqu guarantees the atomicity promised by
>>>>> the commit message. On a quick search here, my take is that for these 128 or
>>>>> 256 bit stores, atomicity is not guaranteed.
>>>>
>>>> I don't think cache atomicity is what we're after here—rather, it's vCPU
>>>> halting atomicity.
>>>>
>>>> Consider the following case:
>>>> *b++ = XY_CTRL_SURF_COPY_BLT;
>>>> *b++ = addr;
>>>>
>>>> If the vCPU is halted during the instruction that stores
>>>> XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
>>>> batch buffer (BB) that is being programmed as part of the VF save. This
>>>> will clearly cause the BB to hang due to a page fault on the copy
>>>> command.
>>>
>>> Okay, perhaps this is what is getting me confused most.
>>> What I don't understand in the flow is: why is the GuC already
>>> executing it, or going to execute it, while you get halted when
>>> writing the command to the buffer?  Why not write to the buffer first
>>> and then send it to the exec queue?
>>>
>>
>> It's how this feature was architected; I will send the SaS link over off the list.
> 
> It probably deserves some comments around the code on how that works and
> why we are doing that.
> 
>>
>>>>
>>>> If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
>>>> then either the entire GPU instruction is written or none of it is. I
>>>> believe vCPU halting guarantees that a CPU instruction is either fully
>>>> executed or not at all, regardless of how many micro-operations (uOPs) it
>>>> decodes into. If this guarantee does not hold, then the entire
>>>> architecture of CCS save/restore on PTL is fundamentally broken, which is
>>>> always possible.
>>>
>>> Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
>>> of the vmovdqu or the vmovups, only before, between, or after them.
>>>
>>> But is this uncached and/or coherent? Isn't there really any possibility that
>>> the command write finished, but the GuC, executing things mid-flight, is
>>> still seeing stale cachelines?
>>>
>>
>> The GuC won't start executing until the vCPUs unpause on the save flow.
>> The restore flow is a bit more tricky, as the vCPUs are live when this
>> happens, but I think we can work around that race in software. That part
>> is not in this series.
>>
>>>>
>>>>>
>>>>> So, imho this patch is introducing unmaintainable, complex, and fragile code
>>>>> that is not even doing what it is claiming to do. But I will be glad if someone
>>>>> can challenge this and prove me wrong.
>>>>>
>>>>
>>>> Let me know if the above makes any sense.
>>>
>>> Okay. But how do we handle cases where AVX might not be available? Is it really not needed?
>>>
>>
>> This is an iGPU feature for PTL, so it shouldn't be an issue as PTL has
>> AVX instructions.
> 
> Some comments around the code about that, to make clear that we don't try
> to reuse this later on any discrete part.
> And perhaps an assert on !dgfx?!

This was already suggested offline: make checks for all prerequisites when
declaring readiness for the full migration support, and then just add
xe_asserts here, without any misleading runtime checks.
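
e.g. something like this (just a sketch; where exactly the readiness
check lives is up to the CCS migration code):

	/* when deciding whether to declare VF CCS migration readiness */
	if (!static_cpu_has(X86_FEATURE_AVX) || IS_DGFX(xe))
		return false;	/* never mark IS_VF_CCS_READY() */

	/* and then in emit_atomic(), no runtime fallback */
	xe_gt_assert(gt, static_cpu_has(X86_FEATURE_AVX));
	xe_gt_assert(gt, !IS_DGFX(gt_to_xe(gt)));
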
> 
>>
>> Matt
>>
>>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Rodrigo.
>>>>>
>>>>>>
>>>>>> Matt 
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
>>>>>>> +{
>>>>>>> +	u32 instr_size = size * BITS_PER_BYTE;
>>>>>>> +
>>>>>>> +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
>>>>>>> +
>>>>>>> +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
>>>>>>> +		memcpy_vmovdqu(dst, src, instr_size);
>>>>>>> +	else
>>>>>>> +		memcpy(dst, src, size);
>>>>>>> +}
>>>>>>> +
>>>>>>> +#define EMIT_COPY_CCS_DW 8
>>>>>>>  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>>>>>>>  			  u64 dst_ofs, bool dst_is_indirect,
>>>>>>>  			  u64 src_ofs, bool src_is_indirect,
>>>>>>>  			  u32 size)
>>>>>>>  {
>>>>>>> +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
>>>>>>>  	struct xe_device *xe = gt_to_xe(gt);
>>>>>>>  	u32 *cs = bb->cs + bb->len;
>>>>>>>  	u32 num_ccs_blks;
>>>>>>>  	u32 num_pages;
>>>>>>>  	u32 ccs_copy_size;
>>>>>>>  	u32 mocs;
>>>>>>> +	u32 i = 0;
>>>>>>>  
>>>>>>>  	if (GRAPHICS_VERx100(xe) >= 2000) {
>>>>>>>  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
>>>>>>> @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>>>>>>>  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
>>>>>>>  	}
>>>>>>>  
>>>>>>> -	*cs++ = XY_CTRL_SURF_COPY_BLT |
>>>>>>> -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
>>>>>>> -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
>>>>>>> -		ccs_copy_size;
>>>>>>> -	*cs++ = lower_32_bits(src_ofs);
>>>>>>> -	*cs++ = upper_32_bits(src_ofs) | mocs;
>>>>>>> -	*cs++ = lower_32_bits(dst_ofs);
>>>>>>> -	*cs++ = upper_32_bits(dst_ofs) | mocs;
>>>>>>> +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
>>>>>>> +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
>>>>>>> +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
>>>>>>> +		  ccs_copy_size;
>>>>>>> +	dw[i++] = lower_32_bits(src_ofs);
>>>>>>> +	dw[i++] = upper_32_bits(src_ofs) | mocs;
>>>>>>> +	dw[i++] = lower_32_bits(dst_ofs);
>>>>>>> +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
>>>>>>>  
>>>>>>> +	/*
>>>>>>> +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
>>>>>>> +	 * save/restore while this sequence is being issued, partial writes may trigger
>>>>>>> +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
>>>>>>> +	 * write the sequence atomically.
>>>>>>> +	 */
>>>>>>> +	emit_atomic(gt, cs, dw, sizeof(dw));
>>>>>>> +	cs += EMIT_COPY_CCS_DW;
>>>>>>>  	bb->len = cs - bb->cs;
>>>>>>>  }
>>>>>>>  
>>>>>>> @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
>>>>>>>  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
>>>>>>>  }
>>>>>>>  
>>>>>>> -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
>>>>>>> +/*
>>>>>>> + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
>>>>>>> + * save/restore while this sequence is being issued, partial writes may
>>>>>>> + * trigger page faults when saving iGPU CCS metadata. Use
>>>>>>> + * emit_atomic() to write the sequence atomically.
>>>>>>> + */
>>>>>>> +#define EMIT_FLUSH_INVALIDATE_DW 4
>>>>>>> +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
>>>>>>>  {
>>>>>>>  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
>>>>>>> +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
>>>>>>> +
>>>>>>> +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
>>>>>>> +		      MI_FLUSH_IMM_DW | flags;
>>>>>>> +	dw[j++] = lower_32_bits(addr);
>>>>>>> +	dw[j++] = upper_32_bits(addr);
>>>>>>> +	dw[j++] = MI_NOOP;
>>>>>>>  
>>>>>>> -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
>>>>>>> -		  MI_FLUSH_IMM_DW | flags;
>>>>>>> -	dw[i++] = lower_32_bits(addr);
>>>>>>> -	dw[i++] = upper_32_bits(addr);
>>>>>>> -	dw[i++] = MI_NOOP;
>>>>>>> -	dw[i++] = MI_NOOP;
>>>>>>> +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
>>>>>>>  
>>>>>>> -	return i;
>>>>>>> +	return i + j;
>>>>>>>  }
>>>>>>>  
>>>>>>>  /**
>>>>>>> @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>  	/* Calculate Batch buffer size */
>>>>>>>  	batch_size = 0;
>>>>>>>  	while (size) {
>>>>>>> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
>>>>>>> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>>>>>>>  		u64 ccs_ofs, ccs_size;
>>>>>>>  		u32 ccs_pt;
>>>>>>>  
>>>>>>> @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>  	 * sizes here again before copy command is emitted.
>>>>>>>  	 */
>>>>>>>  	while (size) {
>>>>>>> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
>>>>>>> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>>>>>>>  		u32 flush_flags = 0;
>>>>>>>  		u64 ccs_ofs, ccs_size;
>>>>>>>  		u32 ccs_pt;
>>>>>>> @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>  
>>>>>>>  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
>>>>>>>  
>>>>>>> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>>>>>>> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>>>>>>>  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
>>>>>>>  						  src_L0_ofs, dst_is_pltt,
>>>>>>>  						  src_L0, ccs_ofs, true);
>>>>>>> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>>>>>>> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>>>>>>>  
>>>>>>>  		size -= src_L0;
>>>>>>>  	}
>>>>>>> -- 
>>>>>>> 2.51.0
>>>>>>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-09 18:49           ` Matthew Brost
  2025-10-09 19:49             ` Rodrigo Vivi
@ 2025-10-09 23:06             ` Matt Roper
  2025-10-10  8:41               ` K V P, Satyanarayana
  1 sibling, 1 reply; 16+ messages in thread
From: Matt Roper @ 2025-10-09 23:06 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Rodrigo Vivi, Satyanarayana K V P, intel-xe, Michal Wajdeczko,
	Matthew Auld

On Thu, Oct 09, 2025 at 11:49:16AM -0700, Matthew Brost wrote:
> On Thu, Oct 09, 2025 at 02:35:10PM -0400, Rodrigo Vivi wrote:
> > On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
> > > On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
> > > > On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> > > > > On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > > > > > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > > save/restore while this sequence is being programmed, partial writes may
> > > > > > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > > > > > instruction to write the sequence atomically.
> > > > > > 
> > > > > > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > > > > > 8 dwords instead of 5 dwords.
> > > > > > 
> > > > > > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > > > > > chunks.
> > > > > > 
> > > > > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > > > > > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > > 
> > > > > > ---
> > > > > > V4 -> V5:
> > > > > > - Fixed review comments. (Matt B)
> > > > > > 
> > > > > > V3 -> V4:
> > > > > > - Fixed review comments. (Wajdeczko)
> > > > > > - Fix issues reported by patchworks.
> > > > > > 
> > > > > > V2 -> V3:
> > > > > > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > > > > > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > > > > > 
> > > > > > V1 -> V2:
> > > > > > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> > > > > >   (Auld, Matthew)
> > > > > > - Fix issues reported by patchworks.
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> > > > > >  1 file changed, 72 insertions(+), 21 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > index c39c3b423d05..b629072956ee 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > @@ -5,7 +5,9 @@
> > > > > >  
> > > > > >  #include "xe_migrate.h"
> > > > > >  
> > > > > > +#include <asm/fpu/api.h>
> > > > > >  #include <linux/bitfield.h>
> > > > > > +#include <linux/cpufeature.h>
> > > > > >  #include <linux/sizes.h>
> > > > > >  
> > > > > >  #include <drm/drm_managed.h>
> > > > > > @@ -33,6 +35,7 @@
> > > > > >  #include "xe_res_cursor.h"
> > > > > >  #include "xe_sa.h"
> > > > > >  #include "xe_sched_job.h"
> > > > > > +#include "xe_sriov_vf_ccs.h"
> > > > > >  #include "xe_sync.h"
> > > > > >  #include "xe_trace_bo.h"
> > > > > >  #include "xe_validation.h"
> > > > > > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> > > > > >  	}
> > > > > >  }
> > > > > >  
> > > > > > -#define EMIT_COPY_CCS_DW 5
> > > > > > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > > > > > +{
> > > > > > +#ifdef CONFIG_X86
> > > > > > +	kernel_fpu_begin();
> > > > > > +	if (size == SZ_128) {
> > > > > > +		asm("vmovdqu (%0), %%xmm0\n"
> > > > > > +		    "vmovups %%xmm0,   (%1)\n"
> > > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > > +	} else if (size == SZ_256) {
> > > > > > +		asm("vmovdqu (%0), %%ymm0\n"
> > > > > > +		    "vmovups %%ymm0,   (%1)\n"
> > > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > > +	}
> > > > > > +	kernel_fpu_end();
> > > > > > +#endif
> > > > > 
> > > > > Everything in this patch LGTM, but I think we need maintainer input to
> > > > > ensure we are not breaking some rules about inlined asm code in a driver
> > > > > (no idea if such rules exist) or if a better place would be somewhere
> > > > > common. Can you ping Lucas, Thomas, or Rodrigo and ask them about this?
> > > > 
> > > > Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
> > > > 
> > > > But the rule does exist:
> > > > https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
> > > > 
> > > > "don’t use inline assembly gratuitously when C can do the job. You can and should
> > > > poke hardware from C when possible"
> > > > 
> > > > In this case here, please explain why exactly memcpy combined with smp_wmb
> > > > barriers and/or WRITE_ONCE couldn't solve it.
> > > > 
> > > > Also, please explain how exactly vmovdqu guarantees the atomicity promised by
> > > > the commit message. On a quick search here, my take is that for these 128 or
> > > > 256 bit stores, atomicity is not guaranteed.
> > > 
> > > I don't think cache atomicity is what we're after here—rather, it's vCPU
> > > halting atomicity.
> > > 
> > > Consider the following case:
> > > *b++ = XY_CTRL_SURF_COPY_BLT;
> > > *b++ = addr;
> > > 
> > > If the vCPU is halted during the instruction that stores
> > > XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
> > > batch buffer (BB) that is being programmed as part of the VF save. This
> > > will clearly cause the BB to hang due to a page fault on the copy
> > > command.
> > 
> > Okay, perhaps this is what is getting me confused most.
> > What I don't understand in the flow is: why is the GuC already
> > executing it, or going to execute it, while you get halted when
> > writing the command to the buffer?  Why not write to the buffer first
> > and then send it to the exec queue?
> > 
> 
> It's how this feature was architected; I will send the SaS link over off the list.

I'm confused by this too.  At the point we're filling in the
batchbuffer, the GuC isn't aware of the batch at all yet as far as I can
see.  In xe_migrate_copy(), we've called xe_bb_new() to allocate a new
batchbuffer, and then we start calling emit_* functions to poke
instructions into that buffer.  At the point we call
xe_migrate_ccs_copy(), the hardware still isn't aware that this buffer
exists, so it shouldn't be possible for it to start executing.  Only
later on when we eventually create a job for the batchbuffer (after
we've finished emitting all of the commands) should it be possible for
the hardware to start executing this.
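
I.e., as far as I can tell, the flow today is roughly (simplified
pseudocode, arguments elided):

	bb = xe_bb_new(gt, batch_size, usm);	/* HW/GuC unaware of bb */
	emit_pte(m, bb, ...);			/* plain CPU writes */
	bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
	flush_flags = xe_migrate_ccs_copy(m, bb, ...);
	bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
	job = xe_bb_create_migration_job(q, bb, ...);
	xe_sched_job_push(job);			/* only now can HW run it */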

If there are some other *future* changes (not present in the driver today)
that change the design such that we allocate a batchbuffer and tell the
GuC it's free to start executing it, but only fill in the contents after
that point, then that needs to be clearly explained in the commit
message.  But that also sounds like a fundamentally racy design, so I'm
not sure why vCPU halts would be the only situation where we'd be running
into problems.


Matt

> 
> > > 
> > If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
> > then either the entire GPU instruction is written or none of it is. I
> > believe vCPU halting guarantees that a CPU instruction is either fully
> > executed or not at all, regardless of how many micro-operations (uOPs) it
> > decodes into. If this guarantee does not hold, then the entire
> > architecture of CCS save/restore on PTL is fundamentally broken, which is
> > always possible.
> > 
> Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
> of the vmovdqu or the vmovups, only before, between, or after them.
> > 
> But is this uncached and/or coherent? Isn't there really any possibility that
> the command write finished, but the GuC, executing things mid-flight, is
> still seeing stale cachelines?
> > 
> 
> The GuC won't start executing until the vCPUs unpause on the save flow.
> The restore flow is a bit more tricky, as the vCPUs are live when this
> happens, but I think we can work around that race in software. That part
> is not in this series.
> 
> > > 
> > > > 
> > > So, imho this patch is introducing unmaintainable, complex, and fragile code
> > > that is not even doing what it is claiming to do. But I will be glad if someone
> > > can challenge this and prove me wrong.
> > > > 
> > > 
> > > Let me know if the above makes any sense.
> > 
> > Okay. But how do we handle cases where AVX might not be available? Is it really not needed?
> > 
> 
> This is an iGPU feature for PTL, so it shouldn't be an issue as PTL has
> AVX instructions.
> 
> Matt
> 
> > > 
> > > Matt
> > > 
> > > > Thanks,
> > > > Rodrigo.
> > > > 
> > > > > 
> > > > > Matt 
> > > > > 
> > > > > > +}
> > > > > > +
> > > > > > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > > > > > +{
> > > > > > +	u32 instr_size = size * BITS_PER_BYTE;
> > > > > > +
> > > > > > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > > > > > +
> > > > > > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > > > > > +		memcpy_vmovdqu(dst, src, instr_size);
> > > > > > +	else
> > > > > > +		memcpy(dst, src, size);
> > > > > > +}
> > > > > > +
> > > > > > +#define EMIT_COPY_CCS_DW 8
> > > > > >  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > > >  			  u64 dst_ofs, bool dst_is_indirect,
> > > > > >  			  u64 src_ofs, bool src_is_indirect,
> > > > > >  			  u32 size)
> > > > > >  {
> > > > > > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> > > > > >  	struct xe_device *xe = gt_to_xe(gt);
> > > > > >  	u32 *cs = bb->cs + bb->len;
> > > > > >  	u32 num_ccs_blks;
> > > > > >  	u32 num_pages;
> > > > > >  	u32 ccs_copy_size;
> > > > > >  	u32 mocs;
> > > > > > +	u32 i = 0;
> > > > > >  
> > > > > >  	if (GRAPHICS_VERx100(xe) >= 2000) {
> > > > > >  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > > > > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > > >  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> > > > > >  	}
> > > > > >  
> > > > > > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > > > > > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > > -		ccs_copy_size;
> > > > > > -	*cs++ = lower_32_bits(src_ofs);
> > > > > > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > > > > > -	*cs++ = lower_32_bits(dst_ofs);
> > > > > > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > > > > > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > > > > > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > > +		  ccs_copy_size;
> > > > > > +	dw[i++] = lower_32_bits(src_ofs);
> > > > > > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > > > > > +	dw[i++] = lower_32_bits(dst_ofs);
> > > > > > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > > > > > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > > > > > +	 * write the sequence atomically.
> > > > > > +	 */
> > > > > > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > > > > > +	cs += EMIT_COPY_CCS_DW;
> > > > > >  	bb->len = cs - bb->cs;
> > > > > >  }
> > > > > >  
> > > > > > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> > > > > >  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> > > > > >  }
> > > > > >  
> > > > > > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > > > > > +/*
> > > > > > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > > > > > + * save/restore while this sequence is being issued, partial writes may
> > > > > > + * trigger page faults when saving iGPU CCS metadata. Use
> > > > > > + * emit_atomic() to write the sequence atomically.
> > > > > > + */
> > > > > > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > > > > > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> > > > > >  {
> > > > > >  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > > > > > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > > > > > +
> > > > > > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > > +		      MI_FLUSH_IMM_DW | flags;
> > > > > > +	dw[j++] = lower_32_bits(addr);
> > > > > > +	dw[j++] = upper_32_bits(addr);
> > > > > > +	dw[j++] = MI_NOOP;
> > > > > >  
> > > > > > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > > -		  MI_FLUSH_IMM_DW | flags;
> > > > > > -	dw[i++] = lower_32_bits(addr);
> > > > > > -	dw[i++] = upper_32_bits(addr);
> > > > > > -	dw[i++] = MI_NOOP;
> > > > > > -	dw[i++] = MI_NOOP;
> > > > > > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> > > > > >  
> > > > > > -	return i;
> > > > > > +	return i + j;
> > > > > >  }
> > > > > >  
> > > > > >  /**
> > > > > > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > >  	/* Calculate Batch buffer size */
> > > > > >  	batch_size = 0;
> > > > > >  	while (size) {
> > > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > > >  		u64 ccs_ofs, ccs_size;
> > > > > >  		u32 ccs_pt;
> > > > > >  
> > > > > > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > >  	 * sizes here again before copy command is emitted.
> > > > > >  	 */
> > > > > >  	while (size) {
> > > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > > >  		u32 flush_flags = 0;
> > > > > >  		u64 ccs_ofs, ccs_size;
> > > > > >  		u32 ccs_pt;
> > > > > > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > >  
> > > > > >  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> > > > > >  
> > > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > > >  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> > > > > >  						  src_L0_ofs, dst_is_pltt,
> > > > > >  						  src_L0, ccs_ofs, true);
> > > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > > >  
> > > > > >  		size -= src_L0;
> > > > > >  	}
> > > > > > -- 
> > > > > > 2.51.0
> > > > > > 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-09 23:06             ` Matt Roper
@ 2025-10-10  8:41               ` K V P, Satyanarayana
  2025-10-10 19:13                 ` Rodrigo Vivi
  0 siblings, 1 reply; 16+ messages in thread
From: K V P, Satyanarayana @ 2025-10-10  8:41 UTC (permalink / raw)
  To: Matt Roper, Matthew Brost
  Cc: Rodrigo Vivi, intel-xe, Michal Wajdeczko, Matthew Auld



On 10-10-2025 04:36, Matt Roper wrote:
> On Thu, Oct 09, 2025 at 11:49:16AM -0700, Matthew Brost wrote:
>> On Thu, Oct 09, 2025 at 02:35:10PM -0400, Rodrigo Vivi wrote:
>>> On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
>>>> On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
>>>>> On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
>>>>>> On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
>>>>>>> The CCS copy command is a 5-dword sequence. If the vCPU halts during
>>>>>>> save/restore while this sequence is being programmed, partial writes may
>>>>>>> trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
>>>>>>> instruction to write the sequence atomically.
>>>>>>>
>>>>>>> Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
>>>>>>> 8 dwords instead of 5 dwords.
>>>>>>>
>>>>>>> Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
>>>>>>> chunks.
>>>>>>>
>>>>>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>>>>>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>>>>>>
>>>>>>> ---
>>>>>>> V4 -> V5:
>>>>>>> - Fixed review comments. (Matt B)
>>>>>>>
>>>>>>> V3 -> V4:
>>>>>>> - Fixed review comments. (Wajdeczko)
>>>>>>> - Fix issues reported by patchworks.
>>>>>>>
>>>>>>> V2 -> V3:
>>>>>>> - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
>>>>>>> - Updated emit_flush_invalidate() to use vmovdqu instruction.
>>>>>>>
>>>>>>> V1 -> V2:
>>>>>>> - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
>>>>>>>    (Auld, Matthew)
>>>>>>> - Fix issues reported by patchworks.
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
>>>>>>>   1 file changed, 72 insertions(+), 21 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>> index c39c3b423d05..b629072956ee 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>> @@ -5,7 +5,9 @@
>>>>>>>   
>>>>>>>   #include "xe_migrate.h"
>>>>>>>   
>>>>>>> +#include <asm/fpu/api.h>
>>>>>>>   #include <linux/bitfield.h>
>>>>>>> +#include <linux/cpufeature.h>
>>>>>>>   #include <linux/sizes.h>
>>>>>>>   
>>>>>>>   #include <drm/drm_managed.h>
>>>>>>> @@ -33,6 +35,7 @@
>>>>>>>   #include "xe_res_cursor.h"
>>>>>>>   #include "xe_sa.h"
>>>>>>>   #include "xe_sched_job.h"
>>>>>>> +#include "xe_sriov_vf_ccs.h"
>>>>>>>   #include "xe_sync.h"
>>>>>>>   #include "xe_trace_bo.h"
>>>>>>>   #include "xe_validation.h"
>>>>>>> @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
>>>>>>>   	}
>>>>>>>   }
>>>>>>>   
>>>>>>> -#define EMIT_COPY_CCS_DW 5
>>>>>>> +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_X86
>>>>>>> +	kernel_fpu_begin();
>>>>>>> +	if (size == SZ_128) {
>>>>>>> +		asm("vmovdqu (%0), %%xmm0\n"
>>>>>>> +		    "vmovups %%xmm0,   (%1)\n"
>>>>>>> +		    :: "r" (src), "r" (dst) : "memory");
>>>>>>> +	} else if (size == SZ_256) {
>>>>>>> +		asm("vmovdqu (%0), %%ymm0\n"
>>>>>>> +		    "vmovups %%ymm0,   (%1)\n"
>>>>>>> +		    :: "r" (src), "r" (dst) : "memory");
>>>>>>> +	}
>>>>>>> +	kernel_fpu_end();
>>>>>>> +#endif
>>>>>>
>>>>>> Everything in this patch LGTM but I think we need maintainer input to
>>>>>> ensure we are not breaking some rules about inline asm code in a driver
>>>>>> (no idea if such a rule exists) or if a better place would be somewhere
>>>>>> common. Can you
>>>>>> ping Lucas, Thomas, or Rodrigo and ask them about this?
>>>>>
>>>>> Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
>>>>>
>>>>> But the rule does exist:
>>>>> https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
>>>>>
>>>>> "don’t use inline assembly gratuitously when C can do the job. You can and should
>>>>> poke hardware from C when possible"
>>>>>
>>>>> In this case here, please explain why exactly memcpy with smp_wmb barriers
>>>>> and/or WRITE_ONCE code combined couldn't solve it.
>>>>>
>>>>> Also, please explain how exactly vmovdqu guarantees the atomicity promised by
>>>>> the commit message. On a quick search here my take is that for this 128 or 256
>>>>> bits, atomicity is not guaranteed.
>>>>
>>>> I don't think cache atomicity is what we're after here—rather, it's vCPU
>>>> halting atomicity.
>>>>
>>>> Consider the following case:
>>>> *b++ = XY_CTRL_SURF_COPY_BLT;
>>>> *b++ = addr;
>>>>
>>>> If the vCPU is halted during the instruction that stores
>>>> XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
>>>> batch buffer (BB) that is being programmed as part of the VF save. This
>>>> will clearly cause the BB to hang due to a page fault on the copy
>>>> command.
>>>
>>> Okay, perhaps this is what is getting me confused most.
>>> What I don't understand in the flow is: why is the GuC already
>>> executing it, or going to execute it, while you are going to a halt when
>>> writing the command to the buffer? And why not write to the buffer first
>>> and then send it to the exec queue?
>>>
>>
>> It's how this feature was architected; will send the SaS link off the list.
> 
> I'm confused by this too.  At the point we're filling in the
> batchbuffer, the GuC isn't aware of the batch at all yet as far as I can
> see.  In xe_migrate_copy(), we've called xe_bb_new() to allocate a new
> batchbuffer, and then we start calling emit_* functions to poke
> instructions into that buffer.  At the point we call
> xe_migrate_ccs_copy(), the hardware still isn't aware that this buffer
> exists, so it shouldn't be possible for it to start executing.  Only
> later on when we eventually create a job for the batchbuffer (after
> we've finished emitting all of the commands) should it be possible for
> the hardware to start executing this.
> 
> If there's some other *future* changes (not present in the driver today)
> that change the design such that we allocate a batchbuffer and tell the
> GuC it's free to start executing it, but only fill in the contents after
> that point, then that needs to be clearly explained in the commit
> message.  But that also sounds like a fundamentally racy design, so I'm
> not sure why vCPU halting would be the only situation where we'd run
> into problems.
> 
> 
> Matt
> 
Hi Matt,
  Please refer to the xe_migrate_ccs_rw_copy() function, which just
creates the BB and does not submit a job. The idea here is that we have
a sub-allocator which is already registered with the GuC, and the
function xe_migrate_ccs_rw_copy() allocates BBs from that sub-allocator.
When the vCPU is paused, the GuC automatically submits these BBs to the
HW. So, we are making sure that the BBs always contain valid GPU
instructions so that the HW will not report any page faults while
executing.
I will share the SAS for this.
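
As a rough sketch of the invariant described above (illustrative only;
ccs_bb_prefill() is a made-up helper name, and the real setup lives in
xe_sriov_vf_ccs.c):

	/*
	 * Keep the sub-allocated batch executable at all times: pre-fill
	 * it with MI_NOOP so that, if the GuC submits it while the vCPU
	 * is paused mid-update, the hardware only ever sees valid
	 * commands.
	 */
	static void ccs_bb_prefill(u32 *cs, u32 num_dwords)
	{
		u32 i;

		for (i = 0; i < num_dwords; i++)
			cs[i] = MI_NOOP;
	}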

-Satya.
>>
>>>>
>>>> If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
>>>> then either the entire GPU instruction is written or none of it is. I
>>>> believe vCPU halting guarantees that a CPU instruction is either fully
>>>> executed or not at all—regardless of how many micro-operations (uOPs) it
>>>> decodes into. If this guarantee does not hold, then the entire
>>>> architecture of CCS save/restore on PTL is fundamentally broken, which is
>>>> always possible.
>>>
>>> Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
>>> of the vmovdqu or vmovups, only before, between, or after them.
>>>
>>> But is this uncached and/or coherent? Isn't there really any possibility
>>> that the command finished, but the GuC, executing things mid-flight, is
>>> still seeing stale cachelines?
>>>
>>
>> The GuC won't start executing until the vCPUs unpause on the save flow.
>> The restore flow is a bit more tricky as the vCPUs are live when this
>> happens, but we can W/A that race in software, I think. That part is not in this
>> series.
>>
>>>>
>>>>>
>>>>> So, imho this patch is introducing unmaintainable, complex, and fragile code
>>>>> that is not even doing what it is claiming to do. But I will be glad if someone
>>>>> can challenge this and prove me wrong.
>>>>>
>>>>
>>>> Let me know if the above makes any sense.
>>>
>>> Okay. But how to handle cases where AVX might not be available? Really not needed?
>>>
>>
>> This iGPU feature is for PTL, so it shouldn't be an issue as PTL has AVX
>> instructions.
>>
>> Matt
>>
>>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Rodrigo.
>>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
>>>>>>> +{
>>>>>>> +	u32 instr_size = size * BITS_PER_BYTE;
>>>>>>> +
>>>>>>> +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
>>>>>>> +
>>>>>>> +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
>>>>>>> +		memcpy_vmovdqu(dst, src, instr_size);
>>>>>>> +	else
>>>>>>> +		memcpy(dst, src, size);
>>>>>>> +}
>>>>>>> +
>>>>>>> +#define EMIT_COPY_CCS_DW 8
>>>>>>>   static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>>>>>>>   			  u64 dst_ofs, bool dst_is_indirect,
>>>>>>>   			  u64 src_ofs, bool src_is_indirect,
>>>>>>>   			  u32 size)
>>>>>>>   {
>>>>>>> +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
>>>>>>>   	struct xe_device *xe = gt_to_xe(gt);
>>>>>>>   	u32 *cs = bb->cs + bb->len;
>>>>>>>   	u32 num_ccs_blks;
>>>>>>>   	u32 num_pages;
>>>>>>>   	u32 ccs_copy_size;
>>>>>>>   	u32 mocs;
>>>>>>> +	u32 i = 0;
>>>>>>>   
>>>>>>>   	if (GRAPHICS_VERx100(xe) >= 2000) {
>>>>>>>   		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
>>>>>>> @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>>>>>>>   		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
>>>>>>>   	}
>>>>>>>   
>>>>>>> -	*cs++ = XY_CTRL_SURF_COPY_BLT |
>>>>>>> -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
>>>>>>> -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
>>>>>>> -		ccs_copy_size;
>>>>>>> -	*cs++ = lower_32_bits(src_ofs);
>>>>>>> -	*cs++ = upper_32_bits(src_ofs) | mocs;
>>>>>>> -	*cs++ = lower_32_bits(dst_ofs);
>>>>>>> -	*cs++ = upper_32_bits(dst_ofs) | mocs;
>>>>>>> +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
>>>>>>> +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
>>>>>>> +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
>>>>>>> +		  ccs_copy_size;
>>>>>>> +	dw[i++] = lower_32_bits(src_ofs);
>>>>>>> +	dw[i++] = upper_32_bits(src_ofs) | mocs;
>>>>>>> +	dw[i++] = lower_32_bits(dst_ofs);
>>>>>>> +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
>>>>>>>   
>>>>>>> +	/*
>>>>>>> +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
>>>>>>> +	 * save/restore while this sequence is being issued, partial writes may trigger
>>>>>>> +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
>>>>>>> +	 * write the sequence atomically.
>>>>>>> +	 */
>>>>>>> +	emit_atomic(gt, cs, dw, sizeof(dw));
>>>>>>> +	cs += EMIT_COPY_CCS_DW;
>>>>>>>   	bb->len = cs - bb->cs;
>>>>>>>   }
>>>>>>>   
>>>>>>> @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
>>>>>>>   	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
>>>>>>>   }
>>>>>>>   
>>>>>>> -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
>>>>>>> +/*
>>>>>>> + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
>>>>>>> + * save/restore while this sequence is being issued, partial writes may
>>>>>>> + * trigger page faults when saving iGPU CCS metadata. Use
>>>>>>> + * emit_atomic() to write the sequence atomically.
>>>>>>> + */
>>>>>>> +#define EMIT_FLUSH_INVALIDATE_DW 4
>>>>>>> +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
>>>>>>>   {
>>>>>>>   	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
>>>>>>> +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
>>>>>>> +
>>>>>>> +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
>>>>>>> +		      MI_FLUSH_IMM_DW | flags;
>>>>>>> +	dw[j++] = lower_32_bits(addr);
>>>>>>> +	dw[j++] = upper_32_bits(addr);
>>>>>>> +	dw[j++] = MI_NOOP;
>>>>>>>   
>>>>>>> -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
>>>>>>> -		  MI_FLUSH_IMM_DW | flags;
>>>>>>> -	dw[i++] = lower_32_bits(addr);
>>>>>>> -	dw[i++] = upper_32_bits(addr);
>>>>>>> -	dw[i++] = MI_NOOP;
>>>>>>> -	dw[i++] = MI_NOOP;
>>>>>>> +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
>>>>>>>   
>>>>>>> -	return i;
>>>>>>> +	return i + j;
>>>>>>>   }
>>>>>>>   
>>>>>>>   /**
>>>>>>> @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>   	/* Calculate Batch buffer size */
>>>>>>>   	batch_size = 0;
>>>>>>>   	while (size) {
>>>>>>> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
>>>>>>> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>>>>>>>   		u64 ccs_ofs, ccs_size;
>>>>>>>   		u32 ccs_pt;
>>>>>>>   
>>>>>>> @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>   	 * sizes here again before copy command is emitted.
>>>>>>>   	 */
>>>>>>>   	while (size) {
>>>>>>> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
>>>>>>> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>>>>>>>   		u32 flush_flags = 0;
>>>>>>>   		u64 ccs_ofs, ccs_size;
>>>>>>>   		u32 ccs_pt;
>>>>>>> @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>   
>>>>>>>   		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
>>>>>>>   
>>>>>>> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>>>>>>> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>>>>>>>   		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
>>>>>>>   						  src_L0_ofs, dst_is_pltt,
>>>>>>>   						  src_L0, ccs_ofs, true);
>>>>>>> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>>>>>>> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>>>>>>>   
>>>>>>>   		size -= src_L0;
>>>>>>>   	}
>>>>>>> -- 
>>>>>>> 2.51.0
>>>>>>>
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-10  8:41               ` K V P, Satyanarayana
@ 2025-10-10 19:13                 ` Rodrigo Vivi
  2025-10-13  4:42                   ` K V P, Satyanarayana
  0 siblings, 1 reply; 16+ messages in thread
From: Rodrigo Vivi @ 2025-10-10 19:13 UTC (permalink / raw)
  To: K V P, Satyanarayana
  Cc: Matt Roper, Matthew Brost, intel-xe, Michal Wajdeczko,
	Matthew Auld

On Fri, Oct 10, 2025 at 02:11:52PM +0530, K V P, Satyanarayana wrote:
> 
> 
> On 10-10-2025 04:36, Matt Roper wrote:
> > On Thu, Oct 09, 2025 at 11:49:16AM -0700, Matthew Brost wrote:
> > > On Thu, Oct 09, 2025 at 02:35:10PM -0400, Rodrigo Vivi wrote:
> > > > On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
> > > > > On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
> > > > > > On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> > > > > > > On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > > > > > > > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > > > > save/restore while this sequence is being programmed, partial writes may
> > > > > > > > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > > > > > > > instruction to write the sequence atomically.
> > > > > > > > 
> > > > > > > > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > > > > > > > 8 dwords instead of 5 dwords.
> > > > > > > > 
> > > > > > > > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > > > > > > > chunks.
> > > > > > > > 
> > > > > > > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > > > > > > > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > > > > 
> > > > > > > > ---
> > > > > > > > V4 -> V5:
> > > > > > > > - Fixed review comments. (Matt B)
> > > > > > > > 
> > > > > > > > V3 -> V4:
> > > > > > > > - Fixed review comments. (Wajdeczko)
> > > > > > > > - Fix issues reported by patchworks.
> > > > > > > > 
> > > > > > > > V2 -> V3:
> > > > > > > > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > > > > > > > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > > > > > > > 
> > > > > > > > V1 -> V2:
> > > > > > > > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> > > > > > > >    (Auld, Matthew)
> > > > > > > > - Fix issues reported by patchworks.
> > > > > > > > ---
> > > > > > > >   drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> > > > > > > >   1 file changed, 72 insertions(+), 21 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > index c39c3b423d05..b629072956ee 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > @@ -5,7 +5,9 @@
> > > > > > > >   #include "xe_migrate.h"
> > > > > > > > +#include <asm/fpu/api.h>
> > > > > > > >   #include <linux/bitfield.h>
> > > > > > > > +#include <linux/cpufeature.h>
> > > > > > > >   #include <linux/sizes.h>
> > > > > > > >   #include <drm/drm_managed.h>
> > > > > > > > @@ -33,6 +35,7 @@
> > > > > > > >   #include "xe_res_cursor.h"
> > > > > > > >   #include "xe_sa.h"
> > > > > > > >   #include "xe_sched_job.h"
> > > > > > > > +#include "xe_sriov_vf_ccs.h"
> > > > > > > >   #include "xe_sync.h"
> > > > > > > >   #include "xe_trace_bo.h"
> > > > > > > >   #include "xe_validation.h"
> > > > > > > > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> > > > > > > >   	}
> > > > > > > >   }
> > > > > > > > -#define EMIT_COPY_CCS_DW 5
> > > > > > > > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > > > > > > > +{
> > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > +	kernel_fpu_begin();
> > > > > > > > +	if (size == SZ_128) {
> > > > > > > > +		asm("vmovdqu (%0), %%xmm0\n"
> > > > > > > > +		    "vmovups %%xmm0,   (%1)\n"
> > > > > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > > > > +	} else if (size == SZ_256) {
> > > > > > > > +		asm("vmovdqu (%0), %%ymm0\n"
> > > > > > > > +		    "vmovups %%ymm0,   (%1)\n"
> > > > > > > > +		    :: "r" (src), "r" (dst) : "memory");
> > > > > > > > +	}
> > > > > > > > +	kernel_fpu_end();
> > > > > > > > +#endif
> > > > > > > 
> > > > > > > Everything in this patch LGTM but I think we need maintainer input
> > > > > > > to ensure we are not breaking some rules about inline asm code in a
> > > > > > > driver (no idea if such a rule exists) or if a better place would be
> > > > > > > somewhere common. Can you
> > > > > > > ping Lucas, Thomas, or Rodrigo and ask them about this?
> > > > > > 
> > > > > > Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
> > > > > > 
> > > > > > But the rule does exist:
> > > > > > https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
> > > > > > 
> > > > > > "don’t use inline assembly gratuitously when C can do the job. You can and should
> > > > > > poke hardware from C when possible"
> > > > > > 
> > > > > > In this case here, please explain why exactly memcpy with smp_wmb
> > > > > > barriers and/or WRITE_ONCE code combined couldn't solve it.
> > > > > > 
> > > > > > Also, please explain how exactly vmovdqu guarantees the atomicity promised by
> > > > > > the commit message. On a quick search here my take is that for this 128 or 256
> > > > > > bits, atomicity is not guaranteed.
> > > > > 
> > > > > I don't think cache atomicity is what we're after here—rather, it's vCPU
> > > > > halting atomicity.
> > > > > 
> > > > > Consider the following case:
> > > > > *b++ = XY_CTRL_SURF_COPY_BLT;
> > > > > *b++ = addr;
> > > > > 
> > > > > If the vCPU is halted during the instruction that stores
> > > > > XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
> > > > > batch buffer (BB) that is being programmed as part of the VF save. This
> > > > > will clearly cause the BB to hang due to a page fault on the copy
> > > > > command.
> > > > 
> > > > Okay, perhaps this is what is getting me confused most.
> > > > What I don't understand in the flow is: why is the GuC already
> > > > executing it, or going to execute it, while you are going to a halt when
> > > > writing the command to the buffer? And why not write to the buffer first
> > > > and then send it to the exec queue?
> > > > 
> > > 
> > > It's how this feature was architected; will send the SaS link off the list.
> > 
> > I'm confused by this too.  At the point we're filling in the
> > batchbuffer, the GuC isn't aware of the batch at all yet as far as I can
> > see.  In xe_migrate_copy(), we've called xe_bb_new() to allocate a new
> > batchbuffer, and then we start calling emit_* functions to poke
> > instructions into that buffer.  At the point we call
> > xe_migrate_ccs_copy(), the hardware still isn't aware that this buffer
> > exists, so it shouldn't be possible for it to start executing.  Only
> > later on when we eventually create a job for the batchbuffer (after
> > we've finished emitting all of the commands) should it be possible for
> > the hardware to start executing this.
> > 
> > If there's some other *future* changes (not present in the driver today)
> > that change the design such that we allocate a batchbuffer and tell the
> > GuC it's free to start executing it, but only fill in the contents after
> > that point, then that needs to be clearly explained in the commit
> > message.  But that also sounds like a fundamentally racy design, so I'm
> > not sure why vCPU halting would be the only situation where we'd run
> > into problems.
> > 
> > 
> > Matt
> > 
> Hi Matt,
>  Please refer to the xe_migrate_ccs_rw_copy() function, which just creates
> the BB and does not submit a job. The idea here is that we have a
> sub-allocator which is already registered with the GuC, and the function
> xe_migrate_ccs_rw_copy() allocates BBs from that sub-allocator.
> When the vCPU is paused, the GuC automatically submits these BBs to the HW.
> So, we are making sure that the BBs always contain valid GPU instructions so
> that the HW will not report any page faults while executing.
> I will share the SAS for this.

The SAS sharing doesn't help. Please ensure that this flow is documented
in the patch itself with some comments. I didn't see this in the last
version. Also ensure kunit is passing.

Thanks,
Rodrigo.


> 
> -Satya.
> > > 
> > > > > 
> > > > > If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
> > > > > then either the entire GPU instruction is written or none of it is. I
> > > > > believe vCPU halting guarantees that a CPU instruction is either fully
> > > > > executed or not at all—regardless of how many micro-operations (uOPs) it
> > > > > decodes into. If this guarantee does not hold, then the entire
> > > > > architecture of CCS save/restore on PTL is fundamentally broken, which is
> > > > > always possible.
> > > > 
> > > > Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
> > > > of the vmovdqu or vmovups, only before, between, or after them.
> > > > 
> > > > But is this uncached and/or coherent? Isn't there really any possibility
> > > > that the command finished, but the GuC, executing things mid-flight, is
> > > > still seeing stale cachelines?
> > > > 
> > > 
> > > The GuC won't start executing until the vCPUs unpause on the save flow.
> > > The restore flow is a bit more tricky as the vCPUs are live when this
> > > happens, but we can W/A that race in software, I think. That part is not in this
> > > series.
> > > 
> > > > > 
> > > > > > 
> > > > > > So, imho this patch is introducing unmaintainable, complex, and fragile code
> > > > > > that is not even doing what it is claiming to do. But I will be glad if someone
> > > > > > can challenge this and prove me wrong.
> > > > > > 
> > > > > 
> > > > > Let me know if the above makes any sense.
> > > > 
> > > > Okay. But how to handle cases where AVX might not be available? Really not needed?
> > > > 
> > > 
> > > This iGPU feature is for PTL, so it shouldn't be an issue as PTL has AVX
> > > instructions.
> > > 
> > > Matt
> > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > Thanks,
> > > > > > Rodrigo.
> > > > > > 
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > > > > > > > +{
> > > > > > > > +	u32 instr_size = size * BITS_PER_BYTE;
> > > > > > > > +
> > > > > > > > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > > > > > > > +
> > > > > > > > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > > > > > > > +		memcpy_vmovdqu(dst, src, instr_size);
> > > > > > > > +	else
> > > > > > > > +		memcpy(dst, src, size);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +#define EMIT_COPY_CCS_DW 8
> > > > > > > >   static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > > > > >   			  u64 dst_ofs, bool dst_is_indirect,
> > > > > > > >   			  u64 src_ofs, bool src_is_indirect,
> > > > > > > >   			  u32 size)
> > > > > > > >   {
> > > > > > > > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> > > > > > > >   	struct xe_device *xe = gt_to_xe(gt);
> > > > > > > >   	u32 *cs = bb->cs + bb->len;
> > > > > > > >   	u32 num_ccs_blks;
> > > > > > > >   	u32 num_pages;
> > > > > > > >   	u32 ccs_copy_size;
> > > > > > > >   	u32 mocs;
> > > > > > > > +	u32 i = 0;
> > > > > > > >   	if (GRAPHICS_VERx100(xe) >= 2000) {
> > > > > > > >   		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > > > > > > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > > > > > >   		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> > > > > > > >   	}
> > > > > > > > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > > > > > > > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > > > > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > > > > -		ccs_copy_size;
> > > > > > > > -	*cs++ = lower_32_bits(src_ofs);
> > > > > > > > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > > > > > > > -	*cs++ = lower_32_bits(dst_ofs);
> > > > > > > > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > > > > > > > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > > > > > > > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > > > > > > > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > > > > > > > +		  ccs_copy_size;
> > > > > > > > +	dw[i++] = lower_32_bits(src_ofs);
> > > > > > > > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > > > > > > > +	dw[i++] = lower_32_bits(dst_ofs);
> > > > > > > > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> > > > > > > > +	/*
> > > > > > > > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > > > > > > > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > > > > > > > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > > > > > > > +	 * write the sequence atomically.
> > > > > > > > +	 */
> > > > > > > > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > > > > > > > +	cs += EMIT_COPY_CCS_DW;
> > > > > > > >   	bb->len = cs - bb->cs;
> > > > > > > >   }
> > > > > > > > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> > > > > > > >   	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> > > > > > > >   }
> > > > > > > > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > > > > > > > +/*
> > > > > > > > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > > > > > > > + * save/restore while this sequence is being issued, partial writes may
> > > > > > > > + * trigger page faults when saving iGPU CCS metadata. Use
> > > > > > > > + * emit_atomic() to write the sequence atomically.
> > > > > > > > + */
> > > > > > > > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > > > > > > > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> > > > > > > >   {
> > > > > > > >   	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > > > > > > > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > > > > > > > +
> > > > > > > > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > > > > +		      MI_FLUSH_IMM_DW | flags;
> > > > > > > > +	dw[j++] = lower_32_bits(addr);
> > > > > > > > +	dw[j++] = upper_32_bits(addr);
> > > > > > > > +	dw[j++] = MI_NOOP;
> > > > > > > > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > > > > > > > -		  MI_FLUSH_IMM_DW | flags;
> > > > > > > > -	dw[i++] = lower_32_bits(addr);
> > > > > > > > -	dw[i++] = upper_32_bits(addr);
> > > > > > > > -	dw[i++] = MI_NOOP;
> > > > > > > > -	dw[i++] = MI_NOOP;
> > > > > > > > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> > > > > > > > -	return i;
> > > > > > > > +	return i + j;
> > > > > > > >   }
> > > > > > > >   /**
> > > > > > > > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > > > >   	/* Calculate Batch buffer size */
> > > > > > > >   	batch_size = 0;
> > > > > > > >   	while (size) {
> > > > > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > > > > >   		u64 ccs_ofs, ccs_size;
> > > > > > > >   		u32 ccs_pt;
> > > > > > > > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > > > >   	 * sizes here again before copy command is emitted.
> > > > > > > >   	 */
> > > > > > > >   	while (size) {
> > > > > > > > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > > > > > > > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> > > > > > > >   		u32 flush_flags = 0;
> > > > > > > >   		u64 ccs_ofs, ccs_size;
> > > > > > > >   		u32 ccs_pt;
> > > > > > > > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> > > > > > > >   		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> > > > > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > > > > >   		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> > > > > > > >   						  src_L0_ofs, dst_is_pltt,
> > > > > > > >   						  src_L0, ccs_ofs, true);
> > > > > > > > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > > > > > > > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> > > > > > > >   		size -= src_L0;
> > > > > > > >   	}
> > > > > > > > -- 
> > > > > > > > 2.51.0
> > > > > > > > 
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
  2025-10-10 19:13                 ` Rodrigo Vivi
@ 2025-10-13  4:42                   ` K V P, Satyanarayana
  0 siblings, 0 replies; 16+ messages in thread
From: K V P, Satyanarayana @ 2025-10-13  4:42 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Matt Roper, Matthew Brost, intel-xe, Michal Wajdeczko,
	Matthew Auld



On 11-10-2025 00:43, Rodrigo Vivi wrote:
> On Fri, Oct 10, 2025 at 02:11:52PM +0530, K V P, Satyanarayana wrote:
>>
>>
>> On 10-10-2025 04:36, Matt Roper wrote:
>>> On Thu, Oct 09, 2025 at 11:49:16AM -0700, Matthew Brost wrote:
>>>> On Thu, Oct 09, 2025 at 02:35:10PM -0400, Rodrigo Vivi wrote:
>>>>> On Thu, Oct 09, 2025 at 09:11:13AM -0700, Matthew Brost wrote:
>>>>>> On Thu, Oct 09, 2025 at 09:00:43AM -0400, Rodrigo Vivi wrote:
>>>>>>> On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
>>>>>>>> On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
>>>>>>>>> The CCS copy command is a 5-dword sequence. If the vCPU halts during
>>>>>>>>> save/restore while this sequence is being programmed, partial writes may
>>>>>>>>> trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
>>>>>>>>> instruction to write the sequence atomically.
>>>>>>>>>
>>>>>>>>> Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
>>>>>>>>> 8 dwords instead of 5 dwords.
>>>>>>>>>
>>>>>>>>> Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
>>>>>>>>> chunks.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>>>>>>>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> V4 -> V5:
>>>>>>>>> - Fixed review comments. (Matt B)
>>>>>>>>>
>>>>>>>>> V3 -> V4:
>>>>>>>>> - Fixed review comments. (Wajdeczko)
>>>>>>>>> - Fix issues reported by patchworks.
>>>>>>>>>
>>>>>>>>> V2 -> V3:
>>>>>>>>> - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
>>>>>>>>> - Updated emit_flush_invalidate() to use vmovdqu instruction.
>>>>>>>>>
>>>>>>>>> V1 -> V2:
>>>>>>>>> - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
>>>>>>>>>     (Auld, Matthew)
>>>>>>>>> - Fix issues reported by patchworks.
>>>>>>>>> ---
>>>>>>>>>    drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
>>>>>>>>>    1 file changed, 72 insertions(+), 21 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>> index c39c3b423d05..b629072956ee 100644
>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>> @@ -5,7 +5,9 @@
>>>>>>>>>    #include "xe_migrate.h"
>>>>>>>>> +#include <asm/fpu/api.h>
>>>>>>>>>    #include <linux/bitfield.h>
>>>>>>>>> +#include <linux/cpufeature.h>
>>>>>>>>>    #include <linux/sizes.h>
>>>>>>>>>    #include <drm/drm_managed.h>
>>>>>>>>> @@ -33,6 +35,7 @@
>>>>>>>>>    #include "xe_res_cursor.h"
>>>>>>>>>    #include "xe_sa.h"
>>>>>>>>>    #include "xe_sched_job.h"
>>>>>>>>> +#include "xe_sriov_vf_ccs.h"
>>>>>>>>>    #include "xe_sync.h"
>>>>>>>>>    #include "xe_trace_bo.h"
>>>>>>>>>    #include "xe_validation.h"
>>>>>>>>> @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
>>>>>>>>>    	}
>>>>>>>>>    }
>>>>>>>>> -#define EMIT_COPY_CCS_DW 5
>>>>>>>>> +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
>>>>>>>>> +{
>>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>>> +	kernel_fpu_begin();
>>>>>>>>> +	if (size == SZ_128) {
>>>>>>>>> +		asm("vmovdqu (%0), %%xmm0\n"
>>>>>>>>> +		    "vmovups %%xmm0,   (%1)\n"
>>>>>>>>> +		    :: "r" (src), "r" (dst) : "memory");
>>>>>>>>> +	} else if (size == SZ_256) {
>>>>>>>>> +		asm("vmovdqu (%0), %%ymm0\n"
>>>>>>>>> +		    "vmovups %%ymm0,   (%1)\n"
>>>>>>>>> +		    :: "r" (src), "r" (dst) : "memory");
>>>>>>>>> +	}
>>>>>>>>> +	kernel_fpu_end();
>>>>>>>>> +#endif
>>>>>>>>
>>>>>>>> Everything in this patch LGTM but I think we need maintainer input to
>>>>>>>> ensure we are not breaking some rules about inline asm code in a driver
>>>>>>>> (no idea if such a rule exists) or if a better place would be somewhere
>>>>>>>> common. Can you
>>>>>>>> ping Lucas, Thomas, or Rodrigo and ask them about this?
>>>>>>>
>>>>>>> Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)
>>>>>>>
>>>>>>> But the rule does exist:
>>>>>>> https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly
>>>>>>>
>>>>>>> "don’t use inline assembly gratuitously when C can do the job. You can and should
>>>>>>> poke hardware from C when possible"
>>>>>>>
>>>>>>> In this case here, please explain why exactly memcpy with smp_wmb
>>>>>>> barriers and/or WRITE_ONCE code combined couldn't solve it.
>>>>>>>
>>>>>>> Also, please explain how exactly vmovdqu guarantees the atomicity promised by
>>>>>>> the commit message. On a quick search here my take is that for this 128 or 256
>>>>>>> bits, atomicity is not guaranteed.
>>>>>>
>>>>>> I don't think cache atomicity is what we're after here—rather, it's vCPU
>>>>>> halting atomicity.
>>>>>>
>>>>>> Consider the following case:
>>>>>> *b++ = XY_CTRL_SURF_COPY_BLT;
>>>>>> *b++ = addr;
>>>>>>
>>>>>> If the vCPU is halted during the instruction that stores
>>>>>> XY_CTRL_SURF_COPY_BLT, the address will be invalid. The GuC executes the
>>>>>> batch buffer (BB) that is being programmed as part of the VF save. This
>>>>>> will clearly cause the BB to hang due to a page fault on the copy
>>>>>> command.
>>>>>
>>>>> Okay, perhaps this is what is getting me confused most.
>>>>> What I don't understand in the flow is: why is the GuC already
>>>>> executing it, or going to execute it, while you are going to a halt when
>>>>> writing the command to the buffer? And why not write to the buffer first
>>>>> and then send it to the exec queue?
>>>>>
>>>>
>>>> It's how this feature was architected; will send the SaS link off the list.
>>>
>>> I'm confused by this too.  At the point we're filling in the
>>> batchbuffer, the GuC isn't aware of the batch at all yet as far as I can
>>> see.  In xe_migrate_copy(), we've called xe_bb_new() to allocate a new
>>> batchbuffer, and then we start calling emit_* functions to poke
>>> instructions into that buffer.  At the point we call
>>> xe_migrate_ccs_copy(), the hardware still isn't aware that this buffer
>>> exists, so it shouldn't be possible for it to start executing.  Only
>>> later on when we eventually create a job for the batchbuffer (after
>>> we've finished emitting all of the commands) should it be possible for
>>> the hardware to start executing this.
>>>
>>> If there's some other *future* changes (not present in the driver today)
>>> that change the design such that we allocate a batchbuffer and tell the
>>> GuC it's free to start executing it, but only fill in the contents after
>>> that point, then that needs to be clearly explained in the commit
>>> message.  But that also sounds like a fundamentally racy design, so I'm
>>> not sure why vCPU halting would be the only situation where we'd run
>>> into problems.
>>>
>>>
>>> Matt
>>>
>> Hi Matt,
>>   Please refer to the xe_migrate_ccs_rw_copy() function, which just creates
>> the BB and does not submit a job. The idea here is that we have a
>> sub-allocator which is already registered with the GuC, and the function
>> xe_migrate_ccs_rw_copy() allocates BBs from that sub-allocator.
>> When the vCPU is paused, the GuC automatically submits these BBs to the HW.
>> So, we are making sure that the BBs always contain valid GPU instructions so
>> that the HW will not report any page faults while executing.
>> I will share the SAS for this.
> 
> The SAS sharing doesn't help. Please ensure that this flow is documented
> in the patch itself with some comments. I didn't see this in the last
> version. Also ensure kunit is passing.
I will fix these in the new revision.
> Thanks,
> Rodrigo.
> 
The complete workflow is documented in 
drivers/gpu/drm/xe/xe_sriov_vf_ccs.c within the source tree. This patch 
covers corner cases identified during the review of the SRIOV 
save/restore feature.
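
To make the halting-atomicity distinction concrete, a minimal sketch
(assuming AVX as the patch does; variable names are taken from the patch
context, and the WRITE_ONCE variant is shown only to illustrate why
ordering barriers alone do not help):

	/*
	 * Still five separate stores: a vCPU halt can land between any
	 * two of them. WRITE_ONCE()/smp_wmb() only order the writes,
	 * they do not make the 5-dword sequence indivisible.
	 */
	WRITE_ONCE(cs[0], XY_CTRL_SURF_COPY_BLT | ccs_copy_size);
	smp_wmb();
	WRITE_ONCE(cs[1], lower_32_bits(src_ofs));

	/*
	 * One 256-bit store instruction for the whole 8-dword buffer:
	 * the vCPU can only halt before or after it, never in the middle.
	 */
	kernel_fpu_begin();
	asm volatile("vmovdqu (%0), %%ymm0\n"
		     "vmovups %%ymm0, (%1)\n"
		     :: "r" (dw), "r" (cs) : "memory");
	kernel_fpu_end();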

-Satya.
>>
>> -Satya.
>>>>
>>>>>>
>>>>>> If the entire XY_CTRL_SURF_COPY_BLT is stored via an AVX instruction,
>>>>>> then either the entire GPU instruction is written or none of it is. I
>>>>>> believe vCPU halting guarantees that a CPU instruction is either fully
>>>>>> executed or not at all—regardless of how many micro-operations (uOPs) it
>>>>>> decodes into. If this guarantee does not hold, then the entire
>>>>>> architecture of CCS save/restore on PTL is fundamentally broken, which is
>>>>>> always possible.
>>>>>
>>>>> Okay, this is guaranteed. I mean, the vCPU won't get halted in the middle
>>>>> of the vmovdqu or vmovups, only before, between, or after them.
>>>>>
>>>>> But is this uncached and/or coherent? Isn't there really any possibility
>>>>> that the command finished, but the GuC, executing things mid-flight, is
>>>>> still seeing stale cachelines?
>>>>>
>>>>
>>>> The GuC won't start executing until the vCPUs unpause on the save flow.
>>>> The restore flow is a bit more tricky as the vCPUs are live when this
>>>> happens, but we can W/A that race in software, I think. That part is not in this
>>>> series.
>>>>
>>>>>>
>>>>>>>
>>>>>>> So, imho this patch is introducing unmaintainable, complex, and fragile code
>>>>>>> that is not even doing what it is claiming to do. But I will be glad if someone
>>>>>>> can challenge this and prove me wrong.
>>>>>>>
>>>>>>
>>>>>> Let me know if the above makes any sense.
>>>>>
>>>>> Okay. But how to handle cases where AVX might not be available? Really not needed?
>>>>>
>>>>
>>>> This iGPU feature is for PTL, so it shouldn't be an issue as PTL has AVX
>>>> instructions.
>>>>
>>>> Matt
>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>> Thanks,
>>>>>>> Rodrigo.
>>>>>>>
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
>>>>>>>>> +{
>>>>>>>>> +	u32 instr_size = size * BITS_PER_BYTE;
>>>>>>>>> +
>>>>>>>>> +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
>>>>>>>>> +
>>>>>>>>> +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
>>>>>>>>> +		memcpy_vmovdqu(dst, src, instr_size);
>>>>>>>>> +	else
>>>>>>>>> +		memcpy(dst, src, size);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +#define EMIT_COPY_CCS_DW 8
>>>>>>>>>    static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>>>>>>>>>    			  u64 dst_ofs, bool dst_is_indirect,
>>>>>>>>>    			  u64 src_ofs, bool src_is_indirect,
>>>>>>>>>    			  u32 size)
>>>>>>>>>    {
>>>>>>>>> +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
>>>>>>>>>    	struct xe_device *xe = gt_to_xe(gt);
>>>>>>>>>    	u32 *cs = bb->cs + bb->len;
>>>>>>>>>    	u32 num_ccs_blks;
>>>>>>>>>    	u32 num_pages;
>>>>>>>>>    	u32 ccs_copy_size;
>>>>>>>>>    	u32 mocs;
>>>>>>>>> +	u32 i = 0;
>>>>>>>>>    	if (GRAPHICS_VERx100(xe) >= 2000) {
>>>>>>>>>    		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
>>>>>>>>> @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>>>>>>>>>    		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
>>>>>>>>>    	}
>>>>>>>>> -	*cs++ = XY_CTRL_SURF_COPY_BLT |
>>>>>>>>> -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
>>>>>>>>> -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
>>>>>>>>> -		ccs_copy_size;
>>>>>>>>> -	*cs++ = lower_32_bits(src_ofs);
>>>>>>>>> -	*cs++ = upper_32_bits(src_ofs) | mocs;
>>>>>>>>> -	*cs++ = lower_32_bits(dst_ofs);
>>>>>>>>> -	*cs++ = upper_32_bits(dst_ofs) | mocs;
>>>>>>>>> +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
>>>>>>>>> +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
>>>>>>>>> +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
>>>>>>>>> +		  ccs_copy_size;
>>>>>>>>> +	dw[i++] = lower_32_bits(src_ofs);
>>>>>>>>> +	dw[i++] = upper_32_bits(src_ofs) | mocs;
>>>>>>>>> +	dw[i++] = lower_32_bits(dst_ofs);
>>>>>>>>> +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
>>>>>>>>> +	/*
>>>>>>>>> +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
>>>>>>>>> +	 * save/restore while this sequence is being issued, partial writes may trigger
>>>>>>>>> +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
>>>>>>>>> +	 * write the sequence atomically.
>>>>>>>>> +	 */
>>>>>>>>> +	emit_atomic(gt, cs, dw, sizeof(dw));
>>>>>>>>> +	cs += EMIT_COPY_CCS_DW;
>>>>>>>>>    	bb->len = cs - bb->cs;
>>>>>>>>>    }
>>>>>>>>> @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
>>>>>>>>>    	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
>>>>>>>>>    }
>>>>>>>>> -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
>>>>>>>>> +/*
>>>>>>>>> + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
>>>>>>>>> + * save/restore while this sequence is being issued, partial writes may
>>>>>>>>> + * trigger page faults when saving iGPU CCS metadata. Use
>>>>>>>>> + * emit_atomic() to write the sequence atomically.
>>>>>>>>> + */
>>>>>>>>> +#define EMIT_FLUSH_INVALIDATE_DW 4
>>>>>>>>> +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
>>>>>>>>>    {
>>>>>>>>>    	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
>>>>>>>>> +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
>>>>>>>>> +
>>>>>>>>> +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
>>>>>>>>> +		      MI_FLUSH_IMM_DW | flags;
>>>>>>>>> +	dw[j++] = lower_32_bits(addr);
>>>>>>>>> +	dw[j++] = upper_32_bits(addr);
>>>>>>>>> +	dw[j++] = MI_NOOP;
>>>>>>>>> -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
>>>>>>>>> -		  MI_FLUSH_IMM_DW | flags;
>>>>>>>>> -	dw[i++] = lower_32_bits(addr);
>>>>>>>>> -	dw[i++] = upper_32_bits(addr);
>>>>>>>>> -	dw[i++] = MI_NOOP;
>>>>>>>>> -	dw[i++] = MI_NOOP;
>>>>>>>>> +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
>>>>>>>>> -	return i;
>>>>>>>>> +	return i + j;
>>>>>>>>>    }
>>>>>>>>>    /**
>>>>>>>>> @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>>>    	/* Calculate Batch buffer size */
>>>>>>>>>    	batch_size = 0;
>>>>>>>>>    	while (size) {
>>>>>>>>> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
>>>>>>>>> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>>>>>>>>>    		u64 ccs_ofs, ccs_size;
>>>>>>>>>    		u32 ccs_pt;
>>>>>>>>> @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>>>    	 * sizes here again before copy command is emitted.
>>>>>>>>>    	 */
>>>>>>>>>    	while (size) {
>>>>>>>>> -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
>>>>>>>>> +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
>>>>>>>>>    		u32 flush_flags = 0;
>>>>>>>>>    		u64 ccs_ofs, ccs_size;
>>>>>>>>>    		u32 ccs_pt;
>>>>>>>>> @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
>>>>>>>>>    		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
>>>>>>>>> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>>>>>>>>> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>>>>>>>>>    		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
>>>>>>>>>    						  src_L0_ofs, dst_is_pltt,
>>>>>>>>>    						  src_L0, ccs_ofs, true);
>>>>>>>>> -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
>>>>>>>>> +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
>>>>>>>>>    		size -= src_L0;
>>>>>>>>>    	}
>>>>>>>>> -- 
>>>>>>>>> 2.51.0
>>>>>>>>>
>>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-10-13  4:42 UTC | newest]

Thread overview: 16+ messages
2025-10-08 10:11 [PATCH v5 0/3] drm/xe/migrate: Atomicize CCS copy command setup Satyanarayana K V P
2025-10-08 10:11 ` [PATCH v5 1/3] " Satyanarayana K V P
2025-10-08 22:58   ` Matthew Brost
2025-10-09 13:00     ` Rodrigo Vivi
2025-10-09 16:11       ` Matthew Brost
2025-10-09 18:35         ` Rodrigo Vivi
2025-10-09 18:49           ` Matthew Brost
2025-10-09 19:49             ` Rodrigo Vivi
2025-10-09 20:34               ` Michal Wajdeczko
2025-10-09 23:06             ` Matt Roper
2025-10-10  8:41               ` K V P, Satyanarayana
2025-10-10 19:13                 ` Rodrigo Vivi
2025-10-13  4:42                   ` K V P, Satyanarayana
2025-10-08 10:11 ` [PATCH v5 2/3] drm/xe/migrate: Make emit_pte() header write atomic Satyanarayana K V P
2025-10-08 23:27   ` Matthew Brost
2025-10-08 10:11 ` [PATCH v5 3/3] drm/xe/vf: Clear CCS read/write buffers in atomic way Satyanarayana K V P
