Intel-XE Archive on lore.kernel.org
* [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization
@ 2025-11-25  9:43 Tejas Upadhyay
  2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Tejas Upadhyay @ 2025-11-25  9:43 UTC (permalink / raw)
  To: intel-xe; +Cc: Tejas Upadhyay

The optimization involves two key changes:

Hardware-assisted Transient Display Flush:
The new hardware automatically flushes "transient" display data
from the L2 cache. This removes the need for manual
(software-driven) transient display (TD) flushes in the driver,
simplifying the code and likely improving efficiency.

Transient Application (App) Cacheline Management:
The hardware gains the ability to flush transient application
cachelines more efficiently. The series integrates this new
functionality and performs manual flushing where it is still
required, ensuring data coherency while optimizing performance.
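The gating the series introduces can be sketched as a small standalone model. This is illustrative only, not the driver code: `graphics_ver` and `is_dgfx` are stand-ins for the driver's GRAPHICS_VER(xe) and IS_DGFX(xe) queries.

```c
#include <stdbool.h>

/* From Xe3p (graphics version 35) onward the HW flushes transient
 * display (TD) entries itself, so the driver skips the software TD
 * flush; before that, the SW flush applies to Xe2+ discrete parts. */
static bool needs_sw_td_flush(int graphics_ver, bool is_dgfx)
{
	if (graphics_ver >= 35)
		return false;	/* HW handles TD entries */
	return is_dgfx && graphics_ver >= 20;
}

/* On Xe3p integrated parts the HW only flushes XA-tagged cachelines
 * when media is off, so non-XA lines need a manual driver flush. */
static bool needs_manual_cache_flush(int graphics_ver, bool is_dgfx)
{
	return graphics_ver >= 35 && !is_dgfx;
}
```

On an Xe3p integrated part (version 35, not discrete) the model skips the software TD flush but requests manual cacheline flushes, matching patches 1 and 3 below.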

Tejas Upadhyay (3):
  drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  drm/xe/xe3p_lpg: Enable L2 flush optimization feature
  drm/xe/xe3p: Skip TD flush

 drivers/gpu/drm/xe/xe_bo.c       |  3 ++-
 drivers/gpu/drm/xe/xe_device.c   | 28 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_device.h   |  1 +
 drivers/gpu/drm/xe/xe_guc.c      |  3 +++
 drivers/gpu/drm/xe/xe_guc_fwif.h |  1 +
 drivers/gpu/drm/xe/xe_userptr.c  |  3 ++-
 6 files changed, 37 insertions(+), 2 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
@ 2025-11-25  9:43 ` Tejas Upadhyay
  2025-11-25 10:17   ` Matthew Auld
  2025-11-25 15:06   ` Thomas Hellström
  2025-11-25  9:43 ` [PATCH 2/3] drm/xe/xe3p_lpg: Enable L2 flush optimization feature Tejas Upadhyay
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 39+ messages in thread
From: Tejas Upadhyay @ 2025-11-25  9:43 UTC (permalink / raw)
  To: intel-xe; +Cc: Tejas Upadhyay

Starting with NVL, HW will flush cachelines marked with XA only
when media is off. We have a few cases where the kernel will have
non-XA cachelines which need a manual flush, as we postpone the
invalidation. Flush as soon as possible: from a correctness POV,
to ensure a non-accelerated CPU copy to a swap/shmem file sees a
coherent view of memory, but also from a security POV, so that a
later flush can't corrupt the next user of those pages.
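The condition this patch adds in __vma_userptr_invalidate() can be expressed as a tiny predicate. A sketch only, under the assumption that the three booleans mirror xe_vm_in_fault_mode(), xe_device_needs_cache_flush() and userptr->initial_bind:

```c
#include <stdbool.h>

/* Invalidate the VMA immediately when the VM is in fault mode, or
 * when the device needs a manual cacheline flush (Xe3p+ integrated),
 * but only once an initial bind exists. */
static bool invalidate_now(bool fault_mode, bool needs_cache_flush,
			   bool initial_bind)
{
	return (fault_mode || needs_cache_flush) && initial_bind;
}
```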

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
 drivers/gpu/drm/xe/xe_device.c  | 20 ++++++++++++++++++++
 drivers/gpu/drm/xe/xe_device.h  |  1 +
 drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 465cf9fc7ce9..97e1e9d40e96 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
 
 		if (!xe_vm_in_fault_mode(vm)) {
 			drm_gpuvm_bo_evict(vm_bo, true);
-			continue;
+			if (!xe_device_needs_cache_flush(xe))
+				continue;
 		}
 
 		if (!idle) {
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 92f883dd8877..6e8335b493e8 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1079,6 +1079,26 @@ void xe_device_l2_flush(struct xe_device *xe)
 	spin_unlock(&gt->global_invl_lock);
 }
 
+/**
+ * xe_device_needs_cache_flush - Whether the cache needs to be flushed
+ * @xe: The device to check.
+ *
+ * Return: true if the device needs cache flush, false otherwise.
+ */
+bool xe_device_needs_cache_flush(struct xe_device *xe)
+{
+	/*
+	 * Starting NVL, HW will flush cachelines marked with XA only when media is off. We have
+	 * few cases where kernel will have non-XA cachelines which needs manual flush and this is
+	 * one of them as we postpone the invalidation. Flush asap from correctness POV to ensure
+	 * non accelerated CPU copy to swap/shmem file will see coherent view of memory, but also
+	 * from security POV where later flush can't corrupt the next user of those pages.
+	 */
+	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
+		return true;
+	return false;
+}
+
 /**
  * xe_device_td_flush() - Flush transient L3 cache entries
  * @xe: The device
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 32cc6323b7f6..15e67db44b56 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -179,6 +179,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
 u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
 u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
 
+bool xe_device_needs_cache_flush(struct xe_device *xe);
 void xe_device_td_flush(struct xe_device *xe);
 void xe_device_l2_flush(struct xe_device *xe);
 
diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
index 0d9130b1958a..a93c7e887cca 100644
--- a/drivers/gpu/drm/xe/xe_userptr.c
+++ b/drivers/gpu/drm/xe/xe_userptr.c
@@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
 				    false, MAX_SCHEDULE_TIMEOUT);
 	XE_WARN_ON(err <= 0);
 
-	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
+	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
+	    userptr->initial_bind) {
 		err = xe_vm_invalidate_vma(vma);
 		XE_WARN_ON(err);
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/3] drm/xe/xe3p_lpg: Enable L2 flush optimization feature
  2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
  2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
@ 2025-11-25  9:43 ` Tejas Upadhyay
  2025-11-25  9:43 ` [PATCH 3/3] drm/xe/xe3p: Skip TD flush Tejas Upadhyay
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Tejas Upadhyay @ 2025-11-25  9:43 UTC (permalink / raw)
  To: intel-xe; +Cc: Tejas Upadhyay

When set, the L2 flush optimization feature controls whether L2
is in Persistent or Transient mode by monitoring media activity.

To enable the L2 flush optimization, include the new feature flag
GUC_CTL_ENABLE_L2FLUSH_OPT for Novalake platforms when a media GT
is detected.
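How the new bit joins the feature-flag word can be sketched as follows. This is a simplified stand-in for guc_ctl_feature_flags(), modeling only the new flag; BIT() mirrors the kernel macro:

```c
#include <stdint.h>
#include <stdbool.h>

#define BIT(n)				(1u << (n))
#define GUC_CTL_ENABLE_L2FLUSH_OPT	BIT(15)

/* Only Novalake-class (graphics version >= 35) integrated media GTs
 * opt in to the L2 flush optimization. */
static uint32_t feature_flags(int graphics_ver, bool is_dgfx,
			      bool is_media_gt)
{
	uint32_t flags = 0;

	if (graphics_ver >= 35 && !is_dgfx && is_media_gt)
		flags |= GUC_CTL_ENABLE_L2FLUSH_OPT;

	return flags;
}
```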

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_guc.c      | 3 +++
 drivers/gpu/drm/xe/xe_guc_fwif.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index d6672cf30d3e..b3bf8d168f0a 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -96,6 +96,9 @@ static u32 guc_ctl_feature_flags(struct xe_guc *guc)
 	if (xe_guc_using_main_gamctrl_queues(guc))
 		flags |= GUC_CTL_MAIN_GAMCTRL_QUEUES;
 
+	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe) && xe_gt_is_media_type(guc_to_gt(guc)))
+		flags |= GUC_CTL_ENABLE_L2FLUSH_OPT;
+
 	return flags;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_guc_fwif.h b/drivers/gpu/drm/xe/xe_guc_fwif.h
index 7d93c2749485..9aff0e454c7f 100644
--- a/drivers/gpu/drm/xe/xe_guc_fwif.h
+++ b/drivers/gpu/drm/xe/xe_guc_fwif.h
@@ -115,6 +115,7 @@ struct guc_update_exec_queue_policy {
 #define   GUC_CTL_ENABLE_PSMI_LOGGING	BIT(7)
 #define   GUC_CTL_MAIN_GAMCTRL_QUEUES	BIT(9)
 #define   GUC_CTL_DISABLE_SCHEDULER	BIT(14)
+#define   GUC_CTL_ENABLE_L2FLUSH_OPT	BIT(15)
 
 #define GUC_CTL_DEBUG			3
 #define   GUC_LOG_VERBOSITY		REG_GENMASK(1, 0)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/3] drm/xe/xe3p: Skip TD flush
  2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
  2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
  2025-11-25  9:43 ` [PATCH 2/3] drm/xe/xe3p_lpg: Enable L2 flush optimization feature Tejas Upadhyay
@ 2025-11-25  9:43 ` Tejas Upadhyay
  2025-11-25 13:20 ` ✓ CI.KUnit: success for drm/xe/xe3p_lpg: L2 flush optimization Patchwork
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Tejas Upadhyay @ 2025-11-25  9:43 UTC (permalink / raw)
  To: intel-xe; +Cc: Tejas Upadhyay

Xe3p has the HW ability to do a transient display flush, so the xe
driver can enable this HW feature by default and skip the software
TD flush.

Bspec: 60002
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 6e8335b493e8..6885cdbc7dfd 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1121,6 +1121,14 @@ void xe_device_td_flush(struct xe_device *xe)
 {
 	struct xe_gt *root_gt;
 
+	/*
+	 * From Xe3p onward the HW takes care of flush of TD entries also along
+	 * with flushing XA entries, which will be at the usual sync points,
+	 * like at the end of submission, so no manual flush is needed here.
+	 */
+	if (GRAPHICS_VER(xe) >= 35)
+		return;
+
 	if (!IS_DGFX(xe) || GRAPHICS_VER(xe) < 20)
 		return;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
@ 2025-11-25 10:17   ` Matthew Auld
  2025-11-25 13:39     ` Souza, Jose
  2025-11-25 15:06   ` Thomas Hellström
  1 sibling, 1 reply; 39+ messages in thread
From: Matthew Auld @ 2025-11-25 10:17 UTC (permalink / raw)
  To: Tejas Upadhyay, intel-xe, Souza, Jose, Thomas Hellström

On 25/11/2025 09:43, Tejas Upadhyay wrote:
> Starting NVL, HW will flush cachelines marked with XA only

I think would be good to give basic overview of what XA is?

> when media is off. We have few cases where kernel will have
> non-XA cachelines which needs manual flush as we postpone
> the invalidation. Flush asap from correctness POV to ensure
> non accelerated CPU copy to swap/shmem file will see coherent
> view of memory, but also from security POV where later flush
> can't corrupt the next user of those pages.
> 
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>   drivers/gpu/drm/xe/xe_device.c  | 20 ++++++++++++++++++++
>   drivers/gpu/drm/xe/xe_device.h  |  1 +
>   drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>   4 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 465cf9fc7ce9..97e1e9d40e96 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>   
>   		if (!xe_vm_in_fault_mode(vm)) {
>   			drm_gpuvm_bo_evict(vm_bo, true);
> -			continue;
> +			if (!xe_device_needs_cache_flush(xe))
> +				continue;
>   		}
>   
>   		if (!idle) {
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 92f883dd8877..6e8335b493e8 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1079,6 +1079,26 @@ void xe_device_l2_flush(struct xe_device *xe)
>   	spin_unlock(&gt->global_invl_lock);
>   }
>   
> +/**
> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> + * @xe: The device to check.
> + *
> + * Return: true if the device needs cache flush, false otherwise.
> + */
> +bool xe_device_needs_cache_flush(struct xe_device *xe)
> +{
> +	/*
> +	 * Starting NVL, HW will flush cachelines marked with XA only when media is off. We have

I think the wording could be improved here (same for the commit
message). XA is *always* flushed, e.g. at end-of-submission (and
maybe other places). It's just that, internally, as an optimisation
the hw doesn't need to make that a full flush (which would also
include XA) when Media is off/powergated: it then only has to worry
about CPU vs GPU coherency rather than GT caches vs Media coherency,
so it can make it a targeted XA flush, since stuff tagged with XA
means it's shared with the CPU.
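Roughly, the behaviour described above can be modeled as follows (illustrative only, not Bspec-derived):

```c
#include <stdbool.h>

enum flush_kind { FLUSH_FULL, FLUSH_XA_ONLY };

/* The end-of-submission flush always covers XA-tagged lines; with
 * media powered off the HW can narrow it to a targeted XA-only
 * flush, since only CPU vs GPU coherency remains a concern. */
static enum flush_kind pick_flush(bool media_powered)
{
	return media_powered ? FLUSH_FULL : FLUSH_XA_ONLY;
}
```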

> +	 * few cases where kernel will have non-XA cachelines which needs manual flush and this is
> +	 * one of them as we postpone the invalidation. Flush asap from correctness POV to ensure
> +	 * non accelerated CPU copy to swap/shmem file will see coherent view of memory, but also
> +	 * from security POV where later flush can't corrupt the next user of those pages.
> +	 */
> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> +		return true;
> +	return false;
> +}
> +
>   /**
>    * xe_device_td_flush() - Flush transient L3 cache entries
>    * @xe: The device
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 32cc6323b7f6..15e67db44b56 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -179,6 +179,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>   
> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>   void xe_device_td_flush(struct xe_device *xe);
>   void xe_device_l2_flush(struct xe_device *xe);
>   
> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> index 0d9130b1958a..a93c7e887cca 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.c
> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>   				    false, MAX_SCHEDULE_TIMEOUT);
>   	XE_WARN_ON(err <= 0);
>   
> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&

Another option is to ban non-XA or non-2WAY at the uAPI level on such 
platforms, but I guess it also depends on what the UMD wants here?

Jose, I assume Mesa is just going to use XA or 2WAY for userptr on such 
hw? Or do you see a usecase for being more flexible?

> +	    userptr->initial_bind) {
>   		err = xe_vm_invalidate_vma(vma);
>   		XE_WARN_ON(err);
>   	}


^ permalink raw reply	[flat|nested] 39+ messages in thread

* ✓ CI.KUnit: success for drm/xe/xe3p_lpg: L2 flush optimization
  2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
                   ` (2 preceding siblings ...)
  2025-11-25  9:43 ` [PATCH 3/3] drm/xe/xe3p: Skip TD flush Tejas Upadhyay
@ 2025-11-25 13:20 ` Patchwork
  2025-11-25 14:47 ` ✓ Xe.CI.BAT: " Patchwork
  2025-11-25 17:42 ` ✓ Xe.CI.Full: " Patchwork
  5 siblings, 0 replies; 39+ messages in thread
From: Patchwork @ 2025-11-25 13:20 UTC (permalink / raw)
  To: Upadhyay, Tejas; +Cc: intel-xe

== Series Details ==

Series: drm/xe/xe3p_lpg: L2 flush optimization
URL   : https://patchwork.freedesktop.org/series/158017/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[13:18:33] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[13:18:37] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=25
[13:19:15] Starting KUnit Kernel (1/1)...
[13:19:15] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[13:19:16] ================== guc_buf (11 subtests) ===================
[13:19:16] [PASSED] test_smallest
[13:19:16] [PASSED] test_largest
[13:19:16] [PASSED] test_granular
[13:19:16] [PASSED] test_unique
[13:19:16] [PASSED] test_overlap
[13:19:16] [PASSED] test_reusable
[13:19:16] [PASSED] test_too_big
[13:19:16] [PASSED] test_flush
[13:19:16] [PASSED] test_lookup
[13:19:16] [PASSED] test_data
[13:19:16] [PASSED] test_class
[13:19:16] ===================== [PASSED] guc_buf =====================
[13:19:16] =================== guc_dbm (7 subtests) ===================
[13:19:16] [PASSED] test_empty
[13:19:16] [PASSED] test_default
[13:19:16] ======================== test_size  ========================
[13:19:16] [PASSED] 4
[13:19:16] [PASSED] 8
[13:19:16] [PASSED] 32
[13:19:16] [PASSED] 256
[13:19:16] ==================== [PASSED] test_size ====================
[13:19:16] ======================= test_reuse  ========================
[13:19:16] [PASSED] 4
[13:19:16] [PASSED] 8
[13:19:16] [PASSED] 32
[13:19:16] [PASSED] 256
[13:19:16] =================== [PASSED] test_reuse ====================
[13:19:16] =================== test_range_overlap  ====================
[13:19:16] [PASSED] 4
[13:19:16] [PASSED] 8
[13:19:16] [PASSED] 32
[13:19:16] [PASSED] 256
[13:19:16] =============== [PASSED] test_range_overlap ================
[13:19:16] =================== test_range_compact  ====================
[13:19:16] [PASSED] 4
[13:19:16] [PASSED] 8
[13:19:16] [PASSED] 32
[13:19:16] [PASSED] 256
[13:19:16] =============== [PASSED] test_range_compact ================
[13:19:16] ==================== test_range_spare  =====================
[13:19:16] [PASSED] 4
[13:19:16] [PASSED] 8
[13:19:16] [PASSED] 32
[13:19:16] [PASSED] 256
[13:19:16] ================ [PASSED] test_range_spare =================
[13:19:16] ===================== [PASSED] guc_dbm =====================
[13:19:16] =================== guc_idm (6 subtests) ===================
[13:19:16] [PASSED] bad_init
[13:19:16] [PASSED] no_init
[13:19:16] [PASSED] init_fini
[13:19:16] [PASSED] check_used
[13:19:16] [PASSED] check_quota
[13:19:16] [PASSED] check_all
[13:19:16] ===================== [PASSED] guc_idm =====================
[13:19:16] ================== no_relay (3 subtests) ===================
[13:19:16] [PASSED] xe_drops_guc2pf_if_not_ready
[13:19:16] [PASSED] xe_drops_guc2vf_if_not_ready
[13:19:16] [PASSED] xe_rejects_send_if_not_ready
[13:19:16] ==================== [PASSED] no_relay =====================
[13:19:16] ================== pf_relay (14 subtests) ==================
[13:19:16] [PASSED] pf_rejects_guc2pf_too_short
[13:19:16] [PASSED] pf_rejects_guc2pf_too_long
[13:19:16] [PASSED] pf_rejects_guc2pf_no_payload
[13:19:16] [PASSED] pf_fails_no_payload
[13:19:16] [PASSED] pf_fails_bad_origin
[13:19:16] [PASSED] pf_fails_bad_type
[13:19:16] [PASSED] pf_txn_reports_error
[13:19:16] [PASSED] pf_txn_sends_pf2guc
[13:19:16] [PASSED] pf_sends_pf2guc
[13:19:16] [SKIPPED] pf_loopback_nop
[13:19:16] [SKIPPED] pf_loopback_echo
[13:19:16] [SKIPPED] pf_loopback_fail
[13:19:16] [SKIPPED] pf_loopback_busy
[13:19:16] [SKIPPED] pf_loopback_retry
[13:19:16] ==================== [PASSED] pf_relay =====================
[13:19:16] ================== vf_relay (3 subtests) ===================
[13:19:16] [PASSED] vf_rejects_guc2vf_too_short
[13:19:16] [PASSED] vf_rejects_guc2vf_too_long
[13:19:16] [PASSED] vf_rejects_guc2vf_no_payload
[13:19:16] ==================== [PASSED] vf_relay =====================
[13:19:16] ================ pf_gt_config (6 subtests) =================
[13:19:16] [PASSED] fair_contexts_1vf
[13:19:16] [PASSED] fair_doorbells_1vf
[13:19:16] [PASSED] fair_ggtt_1vf
[13:19:16] ====================== fair_contexts  ======================
[13:19:16] [PASSED] 1 VF
[13:19:16] [PASSED] 2 VFs
[13:19:16] [PASSED] 3 VFs
[13:19:16] [PASSED] 4 VFs
[13:19:16] [PASSED] 5 VFs
[13:19:16] [PASSED] 6 VFs
[13:19:16] [PASSED] 7 VFs
[13:19:16] [PASSED] 8 VFs
[13:19:16] [PASSED] 9 VFs
[13:19:16] [PASSED] 10 VFs
[13:19:16] [PASSED] 11 VFs
[13:19:16] [PASSED] 12 VFs
[13:19:16] [PASSED] 13 VFs
[13:19:16] [PASSED] 14 VFs
[13:19:16] [PASSED] 15 VFs
[13:19:16] [PASSED] 16 VFs
[13:19:16] [PASSED] 17 VFs
[13:19:16] [PASSED] 18 VFs
[13:19:16] [PASSED] 19 VFs
[13:19:16] [PASSED] 20 VFs
[13:19:16] [PASSED] 21 VFs
[13:19:16] [PASSED] 22 VFs
[13:19:16] [PASSED] 23 VFs
[13:19:16] [PASSED] 24 VFs
[13:19:16] [PASSED] 25 VFs
[13:19:16] [PASSED] 26 VFs
[13:19:16] [PASSED] 27 VFs
[13:19:16] [PASSED] 28 VFs
[13:19:16] [PASSED] 29 VFs
[13:19:16] [PASSED] 30 VFs
[13:19:16] [PASSED] 31 VFs
[13:19:16] [PASSED] 32 VFs
[13:19:16] [PASSED] 33 VFs
[13:19:16] [PASSED] 34 VFs
[13:19:16] [PASSED] 35 VFs
[13:19:16] [PASSED] 36 VFs
[13:19:16] [PASSED] 37 VFs
[13:19:16] [PASSED] 38 VFs
[13:19:16] [PASSED] 39 VFs
[13:19:16] [PASSED] 40 VFs
[13:19:16] [PASSED] 41 VFs
[13:19:16] [PASSED] 42 VFs
[13:19:16] [PASSED] 43 VFs
[13:19:16] [PASSED] 44 VFs
[13:19:16] [PASSED] 45 VFs
[13:19:16] [PASSED] 46 VFs
[13:19:16] [PASSED] 47 VFs
[13:19:16] [PASSED] 48 VFs
[13:19:16] [PASSED] 49 VFs
[13:19:16] [PASSED] 50 VFs
[13:19:16] [PASSED] 51 VFs
[13:19:16] [PASSED] 52 VFs
[13:19:16] [PASSED] 53 VFs
[13:19:16] [PASSED] 54 VFs
[13:19:16] [PASSED] 55 VFs
[13:19:16] [PASSED] 56 VFs
[13:19:16] [PASSED] 57 VFs
[13:19:16] [PASSED] 58 VFs
[13:19:16] [PASSED] 59 VFs
[13:19:16] [PASSED] 60 VFs
[13:19:16] [PASSED] 61 VFs
[13:19:16] [PASSED] 62 VFs
[13:19:16] [PASSED] 63 VFs
[13:19:16] ================== [PASSED] fair_contexts ==================
[13:19:16] ===================== fair_doorbells  ======================
[13:19:16] [PASSED] 1 VF
[13:19:16] [PASSED] 2 VFs
[13:19:16] [PASSED] 3 VFs
[13:19:16] [PASSED] 4 VFs
[13:19:16] [PASSED] 5 VFs
[13:19:16] [PASSED] 6 VFs
[13:19:16] [PASSED] 7 VFs
[13:19:16] [PASSED] 8 VFs
[13:19:16] [PASSED] 9 VFs
[13:19:16] [PASSED] 10 VFs
[13:19:16] [PASSED] 11 VFs
[13:19:16] [PASSED] 12 VFs
[13:19:16] [PASSED] 13 VFs
[13:19:16] [PASSED] 14 VFs
[13:19:16] [PASSED] 15 VFs
[13:19:16] [PASSED] 16 VFs
[13:19:16] [PASSED] 17 VFs
[13:19:16] [PASSED] 18 VFs
[13:19:16] [PASSED] 19 VFs
[13:19:16] [PASSED] 20 VFs
[13:19:16] [PASSED] 21 VFs
[13:19:16] [PASSED] 22 VFs
[13:19:16] [PASSED] 23 VFs
[13:19:16] [PASSED] 24 VFs
[13:19:16] [PASSED] 25 VFs
[13:19:16] [PASSED] 26 VFs
[13:19:16] [PASSED] 27 VFs
[13:19:16] [PASSED] 28 VFs
[13:19:16] [PASSED] 29 VFs
[13:19:16] [PASSED] 30 VFs
[13:19:16] [PASSED] 31 VFs
[13:19:16] [PASSED] 32 VFs
[13:19:16] [PASSED] 33 VFs
[13:19:16] [PASSED] 34 VFs
[13:19:16] [PASSED] 35 VFs
[13:19:16] [PASSED] 36 VFs
[13:19:16] [PASSED] 37 VFs
[13:19:16] [PASSED] 38 VFs
[13:19:16] [PASSED] 39 VFs
[13:19:16] [PASSED] 40 VFs
[13:19:16] [PASSED] 41 VFs
[13:19:16] [PASSED] 42 VFs
[13:19:16] [PASSED] 43 VFs
[13:19:16] [PASSED] 44 VFs
[13:19:16] [PASSED] 45 VFs
[13:19:16] [PASSED] 46 VFs
[13:19:16] [PASSED] 47 VFs
[13:19:16] [PASSED] 48 VFs
[13:19:16] [PASSED] 49 VFs
[13:19:16] [PASSED] 50 VFs
[13:19:16] [PASSED] 51 VFs
[13:19:16] [PASSED] 52 VFs
[13:19:16] [PASSED] 53 VFs
[13:19:16] [PASSED] 54 VFs
[13:19:16] [PASSED] 55 VFs
[13:19:16] [PASSED] 56 VFs
[13:19:16] [PASSED] 57 VFs
[13:19:16] [PASSED] 58 VFs
[13:19:16] [PASSED] 59 VFs
[13:19:16] [PASSED] 60 VFs
[13:19:16] [PASSED] 61 VFs
[13:19:16] [PASSED] 62 VFs
[13:19:16] [PASSED] 63 VFs
[13:19:16] ================= [PASSED] fair_doorbells ==================
[13:19:16] ======================== fair_ggtt  ========================
[13:19:16] [PASSED] 1 VF
[13:19:16] [PASSED] 2 VFs
[13:19:16] [PASSED] 3 VFs
[13:19:16] [PASSED] 4 VFs
[13:19:16] [PASSED] 5 VFs
[13:19:16] [PASSED] 6 VFs
[13:19:16] [PASSED] 7 VFs
[13:19:16] [PASSED] 8 VFs
[13:19:16] [PASSED] 9 VFs
[13:19:16] [PASSED] 10 VFs
[13:19:16] [PASSED] 11 VFs
[13:19:16] [PASSED] 12 VFs
[13:19:16] [PASSED] 13 VFs
[13:19:16] [PASSED] 14 VFs
[13:19:16] [PASSED] 15 VFs
[13:19:16] [PASSED] 16 VFs
[13:19:16] [PASSED] 17 VFs
[13:19:16] [PASSED] 18 VFs
[13:19:16] [PASSED] 19 VFs
[13:19:16] [PASSED] 20 VFs
[13:19:16] [PASSED] 21 VFs
[13:19:16] [PASSED] 22 VFs
[13:19:16] [PASSED] 23 VFs
[13:19:16] [PASSED] 24 VFs
[13:19:16] [PASSED] 25 VFs
[13:19:16] [PASSED] 26 VFs
[13:19:16] [PASSED] 27 VFs
[13:19:16] [PASSED] 28 VFs
[13:19:16] [PASSED] 29 VFs
[13:19:16] [PASSED] 30 VFs
[13:19:16] [PASSED] 31 VFs
[13:19:16] [PASSED] 32 VFs
[13:19:16] [PASSED] 33 VFs
[13:19:16] [PASSED] 34 VFs
[13:19:16] [PASSED] 35 VFs
[13:19:16] [PASSED] 36 VFs
[13:19:16] [PASSED] 37 VFs
[13:19:16] [PASSED] 38 VFs
[13:19:16] [PASSED] 39 VFs
[13:19:16] [PASSED] 40 VFs
[13:19:16] [PASSED] 41 VFs
[13:19:16] [PASSED] 42 VFs
[13:19:16] [PASSED] 43 VFs
[13:19:16] [PASSED] 44 VFs
[13:19:16] [PASSED] 45 VFs
[13:19:16] [PASSED] 46 VFs
[13:19:16] [PASSED] 47 VFs
[13:19:16] [PASSED] 48 VFs
[13:19:16] [PASSED] 49 VFs
[13:19:16] [PASSED] 50 VFs
[13:19:16] [PASSED] 51 VFs
[13:19:16] [PASSED] 52 VFs
[13:19:16] [PASSED] 53 VFs
[13:19:16] [PASSED] 54 VFs
[13:19:16] [PASSED] 55 VFs
[13:19:16] [PASSED] 56 VFs
[13:19:16] [PASSED] 57 VFs
[13:19:16] [PASSED] 58 VFs
[13:19:16] [PASSED] 59 VFs
[13:19:16] [PASSED] 60 VFs
[13:19:16] [PASSED] 61 VFs
[13:19:16] [PASSED] 62 VFs
[13:19:16] [PASSED] 63 VFs
[13:19:16] ==================== [PASSED] fair_ggtt ====================
[13:19:16] ================== [PASSED] pf_gt_config ===================
[13:19:16] ===================== lmtt (1 subtest) =====================
[13:19:16] ======================== test_ops  =========================
[13:19:16] [PASSED] 2-level
[13:19:16] [PASSED] multi-level
[13:19:16] ==================== [PASSED] test_ops =====================
[13:19:16] ====================== [PASSED] lmtt =======================
[13:19:16] ================= pf_service (11 subtests) =================
[13:19:16] [PASSED] pf_negotiate_any
[13:19:16] [PASSED] pf_negotiate_base_match
[13:19:16] [PASSED] pf_negotiate_base_newer
[13:19:16] [PASSED] pf_negotiate_base_next
[13:19:16] [SKIPPED] pf_negotiate_base_older
[13:19:16] [PASSED] pf_negotiate_base_prev
[13:19:16] [PASSED] pf_negotiate_latest_match
[13:19:16] [PASSED] pf_negotiate_latest_newer
[13:19:16] [PASSED] pf_negotiate_latest_next
[13:19:16] [SKIPPED] pf_negotiate_latest_older
[13:19:16] [SKIPPED] pf_negotiate_latest_prev
[13:19:16] =================== [PASSED] pf_service ====================
[13:19:16] ================= xe_guc_g2g (2 subtests) ==================
[13:19:16] ============== xe_live_guc_g2g_kunit_default  ==============
[13:19:16] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[13:19:16] ============== xe_live_guc_g2g_kunit_allmem  ===============
[13:19:16] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[13:19:16] =================== [SKIPPED] xe_guc_g2g ===================
[13:19:16] =================== xe_mocs (2 subtests) ===================
[13:19:16] ================ xe_live_mocs_kernel_kunit  ================
[13:19:16] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[13:19:16] ================ xe_live_mocs_reset_kunit  =================
[13:19:16] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[13:19:16] ==================== [SKIPPED] xe_mocs =====================
[13:19:16] ================= xe_migrate (2 subtests) ==================
[13:19:16] ================= xe_migrate_sanity_kunit  =================
[13:19:16] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[13:19:16] ================== xe_validate_ccs_kunit  ==================
[13:19:16] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[13:19:16] =================== [SKIPPED] xe_migrate ===================
[13:19:16] ================== xe_dma_buf (1 subtest) ==================
[13:19:16] ==================== xe_dma_buf_kunit  =====================
[13:19:16] ================ [SKIPPED] xe_dma_buf_kunit ================
[13:19:16] =================== [SKIPPED] xe_dma_buf ===================
[13:19:16] ================= xe_bo_shrink (1 subtest) =================
[13:19:16] =================== xe_bo_shrink_kunit  ====================
[13:19:16] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[13:19:16] ================== [SKIPPED] xe_bo_shrink ==================
[13:19:16] ==================== xe_bo (2 subtests) ====================
[13:19:16] ================== xe_ccs_migrate_kunit  ===================
[13:19:16] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[13:19:16] ==================== xe_bo_evict_kunit  ====================
[13:19:16] =============== [SKIPPED] xe_bo_evict_kunit ================
[13:19:16] ===================== [SKIPPED] xe_bo ======================
[13:19:16] ==================== args (11 subtests) ====================
[13:19:16] [PASSED] count_args_test
[13:19:16] [PASSED] call_args_example
[13:19:16] [PASSED] call_args_test
[13:19:16] [PASSED] drop_first_arg_example
[13:19:16] [PASSED] drop_first_arg_test
[13:19:16] [PASSED] first_arg_example
[13:19:16] [PASSED] first_arg_test
[13:19:16] [PASSED] last_arg_example
[13:19:16] [PASSED] last_arg_test
[13:19:16] [PASSED] pick_arg_example
[13:19:16] [PASSED] sep_comma_example
[13:19:16] ====================== [PASSED] args =======================
[13:19:16] =================== xe_pci (3 subtests) ====================
[13:19:16] ==================== check_graphics_ip  ====================
[13:19:16] [PASSED] 12.00 Xe_LP
[13:19:16] [PASSED] 12.10 Xe_LP+
[13:19:16] [PASSED] 12.55 Xe_HPG
[13:19:16] [PASSED] 12.60 Xe_HPC
[13:19:16] [PASSED] 12.70 Xe_LPG
[13:19:16] [PASSED] 12.71 Xe_LPG
[13:19:16] [PASSED] 12.74 Xe_LPG+
[13:19:16] [PASSED] 20.01 Xe2_HPG
[13:19:16] [PASSED] 20.02 Xe2_HPG
[13:19:16] [PASSED] 20.04 Xe2_LPG
[13:19:16] [PASSED] 30.00 Xe3_LPG
[13:19:16] [PASSED] 30.01 Xe3_LPG
[13:19:16] [PASSED] 30.03 Xe3_LPG
[13:19:16] [PASSED] 30.04 Xe3_LPG
[13:19:16] [PASSED] 30.05 Xe3_LPG
[13:19:16] [PASSED] 35.11 Xe3p_XPC
[13:19:16] ================ [PASSED] check_graphics_ip ================
[13:19:16] ===================== check_media_ip  ======================
[13:19:16] [PASSED] 12.00 Xe_M
[13:19:16] [PASSED] 12.55 Xe_HPM
[13:19:16] [PASSED] 13.00 Xe_LPM+
[13:19:16] [PASSED] 13.01 Xe2_HPM
[13:19:16] [PASSED] 20.00 Xe2_LPM
[13:19:16] [PASSED] 30.00 Xe3_LPM
[13:19:16] [PASSED] 30.02 Xe3_LPM
[13:19:16] [PASSED] 35.00 Xe3p_LPM
[13:19:16] [PASSED] 35.03 Xe3p_HPM
[13:19:16] ================= [PASSED] check_media_ip ==================
[13:19:16] =================== check_platform_desc  ===================
[13:19:16] [PASSED] 0x9A60 (TIGERLAKE)
[13:19:16] [PASSED] 0x9A68 (TIGERLAKE)
[13:19:16] [PASSED] 0x9A70 (TIGERLAKE)
[13:19:16] [PASSED] 0x9A40 (TIGERLAKE)
[13:19:16] [PASSED] 0x9A49 (TIGERLAKE)
[13:19:16] [PASSED] 0x9A59 (TIGERLAKE)
[13:19:16] [PASSED] 0x9A78 (TIGERLAKE)
[13:19:16] [PASSED] 0x9AC0 (TIGERLAKE)
[13:19:16] [PASSED] 0x9AC9 (TIGERLAKE)
[13:19:16] [PASSED] 0x9AD9 (TIGERLAKE)
[13:19:16] [PASSED] 0x9AF8 (TIGERLAKE)
[13:19:16] [PASSED] 0x4C80 (ROCKETLAKE)
[13:19:16] [PASSED] 0x4C8A (ROCKETLAKE)
[13:19:16] [PASSED] 0x4C8B (ROCKETLAKE)
[13:19:16] [PASSED] 0x4C8C (ROCKETLAKE)
[13:19:16] [PASSED] 0x4C90 (ROCKETLAKE)
[13:19:16] [PASSED] 0x4C9A (ROCKETLAKE)
[13:19:16] [PASSED] 0x4680 (ALDERLAKE_S)
[13:19:16] [PASSED] 0x4682 (ALDERLAKE_S)
[13:19:16] [PASSED] 0x4688 (ALDERLAKE_S)
[13:19:16] [PASSED] 0x468A (ALDERLAKE_S)
[13:19:16] [PASSED] 0x468B (ALDERLAKE_S)
[13:19:16] [PASSED] 0x4690 (ALDERLAKE_S)
[13:19:16] [PASSED] 0x4692 (ALDERLAKE_S)
[13:19:16] [PASSED] 0x4693 (ALDERLAKE_S)
[13:19:16] [PASSED] 0x46A0 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46A1 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46A2 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46A3 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46A6 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46A8 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46AA (ALDERLAKE_P)
[13:19:16] [PASSED] 0x462A (ALDERLAKE_P)
[13:19:16] [PASSED] 0x4626 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x4628 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46B0 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46B1 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46B2 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46B3 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46C0 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46C1 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46C2 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46C3 (ALDERLAKE_P)
[13:19:16] [PASSED] 0x46D0 (ALDERLAKE_N)
[13:19:16] [PASSED] 0x46D1 (ALDERLAKE_N)
[13:19:16] [PASSED] 0x46D2 (ALDERLAKE_N)
[13:19:16] [PASSED] 0x46D3 (ALDERLAKE_N)
[13:19:16] [PASSED] 0x46D4 (ALDERLAKE_N)
[13:19:16] [PASSED] 0xA721 (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7A1 (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7A9 (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7AC (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7AD (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA720 (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7A0 (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7A8 (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7AA (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA7AB (ALDERLAKE_P)
[13:19:16] [PASSED] 0xA780 (ALDERLAKE_S)
[13:19:16] [PASSED] 0xA781 (ALDERLAKE_S)
[13:19:16] [PASSED] 0xA782 (ALDERLAKE_S)
[13:19:16] [PASSED] 0xA783 (ALDERLAKE_S)
[13:19:16] [PASSED] 0xA788 (ALDERLAKE_S)
[13:19:16] [PASSED] 0xA789 (ALDERLAKE_S)
[13:19:16] [PASSED] 0xA78A (ALDERLAKE_S)
[13:19:16] [PASSED] 0xA78B (ALDERLAKE_S)
[13:19:16] [PASSED] 0x4905 (DG1)
[13:19:16] [PASSED] 0x4906 (DG1)
[13:19:16] [PASSED] 0x4907 (DG1)
[13:19:16] [PASSED] 0x4908 (DG1)
[13:19:16] [PASSED] 0x4909 (DG1)
[13:19:16] [PASSED] 0x56C0 (DG2)
[13:19:16] [PASSED] 0x56C2 (DG2)
[13:19:16] [PASSED] 0x56C1 (DG2)
[13:19:16] [PASSED] 0x7D51 (METEORLAKE)
[13:19:16] [PASSED] 0x7DD1 (METEORLAKE)
[13:19:16] [PASSED] 0x7D41 (METEORLAKE)
[13:19:16] [PASSED] 0x7D67 (METEORLAKE)
[13:19:16] [PASSED] 0xB640 (METEORLAKE)
[13:19:16] [PASSED] 0x56A0 (DG2)
[13:19:16] [PASSED] 0x56A1 (DG2)
[13:19:16] [PASSED] 0x56A2 (DG2)
[13:19:16] [PASSED] 0x56BE (DG2)
[13:19:16] [PASSED] 0x56BF (DG2)
[13:19:16] [PASSED] 0x5690 (DG2)
[13:19:16] [PASSED] 0x5691 (DG2)
[13:19:16] [PASSED] 0x5692 (DG2)
[13:19:16] [PASSED] 0x56A5 (DG2)
[13:19:16] [PASSED] 0x56A6 (DG2)
[13:19:16] [PASSED] 0x56B0 (DG2)
[13:19:16] [PASSED] 0x56B1 (DG2)
[13:19:16] [PASSED] 0x56BA (DG2)
[13:19:16] [PASSED] 0x56BB (DG2)
[13:19:16] [PASSED] 0x56BC (DG2)
[13:19:16] [PASSED] 0x56BD (DG2)
[13:19:16] [PASSED] 0x5693 (DG2)
[13:19:16] [PASSED] 0x5694 (DG2)
[13:19:16] [PASSED] 0x5695 (DG2)
[13:19:16] [PASSED] 0x56A3 (DG2)
[13:19:16] [PASSED] 0x56A4 (DG2)
[13:19:16] [PASSED] 0x56B2 (DG2)
[13:19:16] [PASSED] 0x56B3 (DG2)
[13:19:16] [PASSED] 0x5696 (DG2)
[13:19:16] [PASSED] 0x5697 (DG2)
[13:19:16] [PASSED] 0xB69 (PVC)
[13:19:16] [PASSED] 0xB6E (PVC)
[13:19:16] [PASSED] 0xBD4 (PVC)
[13:19:16] [PASSED] 0xBD5 (PVC)
[13:19:16] [PASSED] 0xBD6 (PVC)
[13:19:16] [PASSED] 0xBD7 (PVC)
[13:19:16] [PASSED] 0xBD8 (PVC)
[13:19:16] [PASSED] 0xBD9 (PVC)
[13:19:16] [PASSED] 0xBDA (PVC)
[13:19:16] [PASSED] 0xBDB (PVC)
[13:19:16] [PASSED] 0xBE0 (PVC)
[13:19:16] [PASSED] 0xBE1 (PVC)
[13:19:16] [PASSED] 0xBE5 (PVC)
[13:19:16] [PASSED] 0x7D40 (METEORLAKE)
[13:19:16] [PASSED] 0x7D45 (METEORLAKE)
[13:19:16] [PASSED] 0x7D55 (METEORLAKE)
[13:19:16] [PASSED] 0x7D60 (METEORLAKE)
[13:19:16] [PASSED] 0x7DD5 (METEORLAKE)
[13:19:16] [PASSED] 0x6420 (LUNARLAKE)
[13:19:16] [PASSED] 0x64A0 (LUNARLAKE)
[13:19:16] [PASSED] 0x64B0 (LUNARLAKE)
[13:19:16] [PASSED] 0xE202 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE209 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE20B (BATTLEMAGE)
[13:19:16] [PASSED] 0xE20C (BATTLEMAGE)
[13:19:16] [PASSED] 0xE20D (BATTLEMAGE)
[13:19:16] [PASSED] 0xE210 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE211 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE212 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE216 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE220 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE221 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE222 (BATTLEMAGE)
[13:19:16] [PASSED] 0xE223 (BATTLEMAGE)
[13:19:16] [PASSED] 0xB080 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB081 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB082 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB083 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB084 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB085 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB086 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB087 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB08F (PANTHERLAKE)
[13:19:16] [PASSED] 0xB090 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB0A0 (PANTHERLAKE)
[13:19:16] [PASSED] 0xB0B0 (PANTHERLAKE)
[13:19:16] [PASSED] 0xD740 (NOVALAKE_S)
[13:19:16] [PASSED] 0xD741 (NOVALAKE_S)
[13:19:16] [PASSED] 0xD742 (NOVALAKE_S)
[13:19:16] [PASSED] 0xD743 (NOVALAKE_S)
[13:19:16] [PASSED] 0xD744 (NOVALAKE_S)
[13:19:16] [PASSED] 0xD745 (NOVALAKE_S)
[13:19:16] [PASSED] 0x674C (CRESCENTISLAND)
[13:19:16] [PASSED] 0xFD80 (PANTHERLAKE)
[13:19:16] [PASSED] 0xFD81 (PANTHERLAKE)
[13:19:16] =============== [PASSED] check_platform_desc ===============
[13:19:16] ===================== [PASSED] xe_pci ======================
[13:19:16] =================== xe_rtp (2 subtests) ====================
[13:19:16] =============== xe_rtp_process_to_sr_tests  ================
[13:19:16] [PASSED] coalesce-same-reg
[13:19:16] [PASSED] no-match-no-add
[13:19:16] [PASSED] match-or
[13:19:16] [PASSED] match-or-xfail
[13:19:16] [PASSED] no-match-no-add-multiple-rules
[13:19:16] [PASSED] two-regs-two-entries
[13:19:16] [PASSED] clr-one-set-other
[13:19:16] [PASSED] set-field
[13:19:16] [PASSED] conflict-duplicate
[13:19:16] [PASSED] conflict-not-disjoint
[13:19:16] [PASSED] conflict-reg-type
[13:19:16] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[13:19:16] ================== xe_rtp_process_tests  ===================
[13:19:16] [PASSED] active1
[13:19:16] [PASSED] active2
[13:19:16] [PASSED] active-inactive
[13:19:16] [PASSED] inactive-active
[13:19:16] [PASSED] inactive-1st_or_active-inactive
[13:19:16] [PASSED] inactive-2nd_or_active-inactive
[13:19:16] [PASSED] inactive-last_or_active-inactive
[13:19:16] [PASSED] inactive-no_or_active-inactive
[13:19:16] ============== [PASSED] xe_rtp_process_tests ===============
[13:19:16] ===================== [PASSED] xe_rtp ======================
[13:19:16] ==================== xe_wa (1 subtest) =====================
[13:19:16] ======================== xe_wa_gt  =========================
[13:19:16] [PASSED] TIGERLAKE B0
[13:19:16] [PASSED] DG1 A0
[13:19:16] [PASSED] DG1 B0
[13:19:16] [PASSED] ALDERLAKE_S A0
[13:19:16] [PASSED] ALDERLAKE_S B0
[13:19:16] [PASSED] ALDERLAKE_S C0
[13:19:16] [PASSED] ALDERLAKE_S D0
[13:19:16] [PASSED] ALDERLAKE_P A0
[13:19:16] [PASSED] ALDERLAKE_P B0
[13:19:16] [PASSED] ALDERLAKE_P C0
[13:19:16] [PASSED] ALDERLAKE_S RPLS D0
[13:19:16] [PASSED] ALDERLAKE_P RPLU E0
[13:19:16] [PASSED] DG2 G10 C0
[13:19:16] [PASSED] DG2 G11 B1
[13:19:16] [PASSED] DG2 G12 A1
[13:19:16] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[13:19:16] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[13:19:16] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[13:19:16] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[13:19:16] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[13:19:16] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[13:19:16] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[13:19:16] ==================== [PASSED] xe_wa_gt =====================
[13:19:16] ====================== [PASSED] xe_wa ======================
[13:19:16] ============================================================
[13:19:16] Testing complete. Ran 510 tests: passed: 492, skipped: 18
[13:19:16] Elapsed time: 43.089s total, 4.324s configuring, 38.248s building, 0.464s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[13:19:16] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[13:19:18] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=25
[13:19:47] Starting KUnit Kernel (1/1)...
[13:19:47] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[13:19:47] ============ drm_test_pick_cmdline (2 subtests) ============
[13:19:47] [PASSED] drm_test_pick_cmdline_res_1920_1080_60
[13:19:47] =============== drm_test_pick_cmdline_named  ===============
[13:19:47] [PASSED] NTSC
[13:19:47] [PASSED] NTSC-J
[13:19:47] [PASSED] PAL
[13:19:47] [PASSED] PAL-M
[13:19:47] =========== [PASSED] drm_test_pick_cmdline_named ===========
[13:19:47] ============== [PASSED] drm_test_pick_cmdline ==============
[13:19:47] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[13:19:47] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[13:19:47] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[13:19:47] =========== drm_validate_clone_mode (2 subtests) ===========
[13:19:47] ============== drm_test_check_in_clone_mode  ===============
[13:19:47] [PASSED] in_clone_mode
[13:19:47] [PASSED] not_in_clone_mode
[13:19:47] ========== [PASSED] drm_test_check_in_clone_mode ===========
[13:19:47] =============== drm_test_check_valid_clones  ===============
[13:19:47] [PASSED] not_in_clone_mode
[13:19:47] [PASSED] valid_clone
[13:19:47] [PASSED] invalid_clone
[13:19:47] =========== [PASSED] drm_test_check_valid_clones ===========
[13:19:47] ============= [PASSED] drm_validate_clone_mode =============
[13:19:47] ============= drm_validate_modeset (1 subtest) =============
[13:19:47] [PASSED] drm_test_check_connector_changed_modeset
[13:19:47] ============== [PASSED] drm_validate_modeset ===============
[13:19:47] ====== drm_test_bridge_get_current_state (2 subtests) ======
[13:19:47] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[13:19:47] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[13:19:47] ======== [PASSED] drm_test_bridge_get_current_state ========
[13:19:47] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[13:19:47] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[13:19:47] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[13:19:47] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[13:19:47] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[13:19:47] ============== drm_bridge_alloc (2 subtests) ===============
[13:19:47] [PASSED] drm_test_drm_bridge_alloc_basic
[13:19:47] [PASSED] drm_test_drm_bridge_alloc_get_put
[13:19:47] ================ [PASSED] drm_bridge_alloc =================
[13:19:47] ================== drm_buddy (8 subtests) ==================
[13:19:47] [PASSED] drm_test_buddy_alloc_limit
[13:19:47] [PASSED] drm_test_buddy_alloc_optimistic
[13:19:47] [PASSED] drm_test_buddy_alloc_pessimistic
[13:19:47] [PASSED] drm_test_buddy_alloc_pathological
[13:19:47] [PASSED] drm_test_buddy_alloc_contiguous
[13:19:47] [PASSED] drm_test_buddy_alloc_clear
[13:19:48] [PASSED] drm_test_buddy_alloc_range_bias
[13:19:48] [PASSED] drm_test_buddy_fragmentation_performance
[13:19:48] ==================== [PASSED] drm_buddy ====================
[13:19:48] ============= drm_cmdline_parser (40 subtests) =============
[13:19:48] [PASSED] drm_test_cmdline_force_d_only
[13:19:48] [PASSED] drm_test_cmdline_force_D_only_dvi
[13:19:48] [PASSED] drm_test_cmdline_force_D_only_hdmi
[13:19:48] [PASSED] drm_test_cmdline_force_D_only_not_digital
[13:19:48] [PASSED] drm_test_cmdline_force_e_only
[13:19:48] [PASSED] drm_test_cmdline_res
[13:19:48] [PASSED] drm_test_cmdline_res_vesa
[13:19:48] [PASSED] drm_test_cmdline_res_vesa_rblank
[13:19:48] [PASSED] drm_test_cmdline_res_rblank
[13:19:48] [PASSED] drm_test_cmdline_res_bpp
[13:19:48] [PASSED] drm_test_cmdline_res_refresh
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[13:19:48] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[13:19:48] [PASSED] drm_test_cmdline_res_margins_force_on
[13:19:48] [PASSED] drm_test_cmdline_res_vesa_margins
[13:19:48] [PASSED] drm_test_cmdline_name
[13:19:48] [PASSED] drm_test_cmdline_name_bpp
[13:19:48] [PASSED] drm_test_cmdline_name_option
[13:19:48] [PASSED] drm_test_cmdline_name_bpp_option
[13:19:48] [PASSED] drm_test_cmdline_rotate_0
[13:19:48] [PASSED] drm_test_cmdline_rotate_90
[13:19:48] [PASSED] drm_test_cmdline_rotate_180
[13:19:48] [PASSED] drm_test_cmdline_rotate_270
[13:19:48] [PASSED] drm_test_cmdline_hmirror
[13:19:48] [PASSED] drm_test_cmdline_vmirror
[13:19:48] [PASSED] drm_test_cmdline_margin_options
[13:19:48] [PASSED] drm_test_cmdline_multiple_options
[13:19:48] [PASSED] drm_test_cmdline_bpp_extra_and_option
[13:19:48] [PASSED] drm_test_cmdline_extra_and_option
[13:19:48] [PASSED] drm_test_cmdline_freestanding_options
[13:19:48] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[13:19:48] [PASSED] drm_test_cmdline_panel_orientation
[13:19:48] ================ drm_test_cmdline_invalid  =================
[13:19:48] [PASSED] margin_only
[13:19:48] [PASSED] interlace_only
[13:19:48] [PASSED] res_missing_x
[13:19:48] [PASSED] res_missing_y
[13:19:48] [PASSED] res_bad_y
[13:19:48] [PASSED] res_missing_y_bpp
[13:19:48] [PASSED] res_bad_bpp
[13:19:48] [PASSED] res_bad_refresh
[13:19:48] [PASSED] res_bpp_refresh_force_on_off
[13:19:48] [PASSED] res_invalid_mode
[13:19:48] [PASSED] res_bpp_wrong_place_mode
[13:19:48] [PASSED] name_bpp_refresh
[13:19:48] [PASSED] name_refresh
[13:19:48] [PASSED] name_refresh_wrong_mode
[13:19:48] [PASSED] name_refresh_invalid_mode
[13:19:48] [PASSED] rotate_multiple
[13:19:48] [PASSED] rotate_invalid_val
[13:19:48] [PASSED] rotate_truncated
[13:19:48] [PASSED] invalid_option
[13:19:48] [PASSED] invalid_tv_option
[13:19:48] [PASSED] truncated_tv_option
[13:19:48] ============ [PASSED] drm_test_cmdline_invalid =============
[13:19:48] =============== drm_test_cmdline_tv_options  ===============
[13:19:48] [PASSED] NTSC
[13:19:48] [PASSED] NTSC_443
[13:19:48] [PASSED] NTSC_J
[13:19:48] [PASSED] PAL
[13:19:48] [PASSED] PAL_M
[13:19:48] [PASSED] PAL_N
[13:19:48] [PASSED] SECAM
[13:19:48] [PASSED] MONO_525
[13:19:48] [PASSED] MONO_625
[13:19:48] =========== [PASSED] drm_test_cmdline_tv_options ===========
[13:19:48] =============== [PASSED] drm_cmdline_parser ================
[13:19:48] ========== drmm_connector_hdmi_init (20 subtests) ==========
[13:19:48] [PASSED] drm_test_connector_hdmi_init_valid
[13:19:48] [PASSED] drm_test_connector_hdmi_init_bpc_8
[13:19:48] [PASSED] drm_test_connector_hdmi_init_bpc_10
[13:19:48] [PASSED] drm_test_connector_hdmi_init_bpc_12
[13:19:48] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[13:19:48] [PASSED] drm_test_connector_hdmi_init_bpc_null
[13:19:48] [PASSED] drm_test_connector_hdmi_init_formats_empty
[13:19:48] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[13:19:48] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[13:19:48] [PASSED] supported_formats=0x9 yuv420_allowed=1
[13:19:48] [PASSED] supported_formats=0x9 yuv420_allowed=0
[13:19:48] [PASSED] supported_formats=0x3 yuv420_allowed=1
[13:19:48] [PASSED] supported_formats=0x3 yuv420_allowed=0
[13:19:48] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[13:19:48] [PASSED] drm_test_connector_hdmi_init_null_ddc
[13:19:48] [PASSED] drm_test_connector_hdmi_init_null_product
[13:19:48] [PASSED] drm_test_connector_hdmi_init_null_vendor
[13:19:48] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[13:19:48] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[13:19:48] [PASSED] drm_test_connector_hdmi_init_product_valid
[13:19:48] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[13:19:48] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[13:19:48] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[13:19:48] ========= drm_test_connector_hdmi_init_type_valid  =========
[13:19:48] [PASSED] HDMI-A
[13:19:48] [PASSED] HDMI-B
[13:19:48] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[13:19:48] ======== drm_test_connector_hdmi_init_type_invalid  ========
[13:19:48] [PASSED] Unknown
[13:19:48] [PASSED] VGA
[13:19:48] [PASSED] DVI-I
[13:19:48] [PASSED] DVI-D
[13:19:48] [PASSED] DVI-A
[13:19:48] [PASSED] Composite
[13:19:48] [PASSED] SVIDEO
[13:19:48] [PASSED] LVDS
[13:19:48] [PASSED] Component
[13:19:48] [PASSED] DIN
[13:19:48] [PASSED] DP
[13:19:48] [PASSED] TV
[13:19:48] [PASSED] eDP
[13:19:48] [PASSED] Virtual
[13:19:48] [PASSED] DSI
[13:19:48] [PASSED] DPI
[13:19:48] [PASSED] Writeback
[13:19:48] [PASSED] SPI
[13:19:48] [PASSED] USB
[13:19:48] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[13:19:48] ============ [PASSED] drmm_connector_hdmi_init =============
[13:19:48] ============= drmm_connector_init (3 subtests) =============
[13:19:48] [PASSED] drm_test_drmm_connector_init
[13:19:48] [PASSED] drm_test_drmm_connector_init_null_ddc
[13:19:48] ========= drm_test_drmm_connector_init_type_valid  =========
[13:19:48] [PASSED] Unknown
[13:19:48] [PASSED] VGA
[13:19:48] [PASSED] DVI-I
[13:19:48] [PASSED] DVI-D
[13:19:48] [PASSED] DVI-A
[13:19:48] [PASSED] Composite
[13:19:48] [PASSED] SVIDEO
[13:19:48] [PASSED] LVDS
[13:19:48] [PASSED] Component
[13:19:48] [PASSED] DIN
[13:19:48] [PASSED] DP
[13:19:48] [PASSED] HDMI-A
[13:19:48] [PASSED] HDMI-B
[13:19:48] [PASSED] TV
[13:19:48] [PASSED] eDP
[13:19:48] [PASSED] Virtual
[13:19:48] [PASSED] DSI
[13:19:48] [PASSED] DPI
[13:19:48] [PASSED] Writeback
[13:19:48] [PASSED] SPI
[13:19:48] [PASSED] USB
[13:19:48] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[13:19:48] =============== [PASSED] drmm_connector_init ===============
[13:19:48] ========= drm_connector_dynamic_init (6 subtests) ==========
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_init
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_init_properties
[13:19:48] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[13:19:48] [PASSED] Unknown
[13:19:48] [PASSED] VGA
[13:19:48] [PASSED] DVI-I
[13:19:48] [PASSED] DVI-D
[13:19:48] [PASSED] DVI-A
[13:19:48] [PASSED] Composite
[13:19:48] [PASSED] SVIDEO
[13:19:48] [PASSED] LVDS
[13:19:48] [PASSED] Component
[13:19:48] [PASSED] DIN
[13:19:48] [PASSED] DP
[13:19:48] [PASSED] HDMI-A
[13:19:48] [PASSED] HDMI-B
[13:19:48] [PASSED] TV
[13:19:48] [PASSED] eDP
[13:19:48] [PASSED] Virtual
[13:19:48] [PASSED] DSI
[13:19:48] [PASSED] DPI
[13:19:48] [PASSED] Writeback
[13:19:48] [PASSED] SPI
[13:19:48] [PASSED] USB
[13:19:48] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[13:19:48] ======== drm_test_drm_connector_dynamic_init_name  =========
[13:19:48] [PASSED] Unknown
[13:19:48] [PASSED] VGA
[13:19:48] [PASSED] DVI-I
[13:19:48] [PASSED] DVI-D
[13:19:48] [PASSED] DVI-A
[13:19:48] [PASSED] Composite
[13:19:48] [PASSED] SVIDEO
[13:19:48] [PASSED] LVDS
[13:19:48] [PASSED] Component
[13:19:48] [PASSED] DIN
[13:19:48] [PASSED] DP
[13:19:48] [PASSED] HDMI-A
[13:19:48] [PASSED] HDMI-B
[13:19:48] [PASSED] TV
[13:19:48] [PASSED] eDP
[13:19:48] [PASSED] Virtual
[13:19:48] [PASSED] DSI
[13:19:48] [PASSED] DPI
[13:19:48] [PASSED] Writeback
[13:19:48] [PASSED] SPI
[13:19:48] [PASSED] USB
[13:19:48] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[13:19:48] =========== [PASSED] drm_connector_dynamic_init ============
[13:19:48] ==== drm_connector_dynamic_register_early (4 subtests) =====
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[13:19:48] ====== [PASSED] drm_connector_dynamic_register_early =======
[13:19:48] ======= drm_connector_dynamic_register (7 subtests) ========
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[13:19:48] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[13:19:48] ========= [PASSED] drm_connector_dynamic_register ==========
[13:19:48] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[13:19:48] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[13:19:48] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[13:19:48] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[13:19:48] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[13:19:48] ========== drm_test_get_tv_mode_from_name_valid  ===========
[13:19:48] [PASSED] NTSC
[13:19:48] [PASSED] NTSC-443
[13:19:48] [PASSED] NTSC-J
[13:19:48] [PASSED] PAL
[13:19:48] [PASSED] PAL-M
[13:19:48] [PASSED] PAL-N
[13:19:48] [PASSED] SECAM
[13:19:48] [PASSED] Mono
[13:19:48] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[13:19:48] [PASSED] drm_test_get_tv_mode_from_name_truncated
[13:19:48] ============ [PASSED] drm_get_tv_mode_from_name ============
[13:19:48] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[13:19:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[13:19:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[13:19:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[13:19:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[13:19:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[13:19:48] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[13:19:48] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[13:19:48] [PASSED] VIC 96
[13:19:48] [PASSED] VIC 97
[13:19:48] [PASSED] VIC 101
[13:19:48] [PASSED] VIC 102
[13:19:48] [PASSED] VIC 106
[13:19:48] [PASSED] VIC 107
[13:19:48] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[13:19:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[13:19:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[13:19:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[13:19:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[13:19:48] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[13:19:48] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[13:19:48] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[13:19:48] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[13:19:48] [PASSED] Automatic
[13:19:48] [PASSED] Full
[13:19:48] [PASSED] Limited 16:235
[13:19:48] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[13:19:48] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[13:19:48] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[13:19:48] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[13:19:48] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[13:19:48] [PASSED] RGB
[13:19:48] [PASSED] YUV 4:2:0
[13:19:48] [PASSED] YUV 4:2:2
[13:19:48] [PASSED] YUV 4:4:4
[13:19:48] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[13:19:48] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[13:19:48] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[13:19:48] ============= drm_damage_helper (21 subtests) ==============
[13:19:48] [PASSED] drm_test_damage_iter_no_damage
[13:19:48] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[13:19:48] [PASSED] drm_test_damage_iter_no_damage_src_moved
[13:19:48] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[13:19:48] [PASSED] drm_test_damage_iter_no_damage_not_visible
[13:19:48] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[13:19:48] [PASSED] drm_test_damage_iter_no_damage_no_fb
[13:19:48] [PASSED] drm_test_damage_iter_simple_damage
[13:19:48] [PASSED] drm_test_damage_iter_single_damage
[13:19:48] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[13:19:48] [PASSED] drm_test_damage_iter_single_damage_outside_src
[13:19:48] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[13:19:48] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[13:19:48] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[13:19:48] [PASSED] drm_test_damage_iter_single_damage_src_moved
[13:19:48] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[13:19:48] [PASSED] drm_test_damage_iter_damage
[13:19:48] [PASSED] drm_test_damage_iter_damage_one_intersect
[13:19:48] [PASSED] drm_test_damage_iter_damage_one_outside
[13:19:48] [PASSED] drm_test_damage_iter_damage_src_moved
[13:19:48] [PASSED] drm_test_damage_iter_damage_not_visible
[13:19:48] ================ [PASSED] drm_damage_helper ================
[13:19:48] ============== drm_dp_mst_helper (3 subtests) ==============
[13:19:48] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[13:19:48] [PASSED] Clock 154000 BPP 30 DSC disabled
[13:19:48] [PASSED] Clock 234000 BPP 30 DSC disabled
[13:19:48] [PASSED] Clock 297000 BPP 24 DSC disabled
[13:19:48] [PASSED] Clock 332880 BPP 24 DSC enabled
[13:19:48] [PASSED] Clock 324540 BPP 24 DSC enabled
[13:19:48] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[13:19:48] ============== drm_test_dp_mst_calc_pbn_div  ===============
[13:19:48] [PASSED] Link rate 2000000 lane count 4
[13:19:48] [PASSED] Link rate 2000000 lane count 2
[13:19:48] [PASSED] Link rate 2000000 lane count 1
[13:19:48] [PASSED] Link rate 1350000 lane count 4
[13:19:48] [PASSED] Link rate 1350000 lane count 2
[13:19:48] [PASSED] Link rate 1350000 lane count 1
[13:19:48] [PASSED] Link rate 1000000 lane count 4
[13:19:48] [PASSED] Link rate 1000000 lane count 2
[13:19:48] [PASSED] Link rate 1000000 lane count 1
[13:19:48] [PASSED] Link rate 810000 lane count 4
[13:19:48] [PASSED] Link rate 810000 lane count 2
[13:19:48] [PASSED] Link rate 810000 lane count 1
[13:19:48] [PASSED] Link rate 540000 lane count 4
[13:19:48] [PASSED] Link rate 540000 lane count 2
[13:19:48] [PASSED] Link rate 540000 lane count 1
[13:19:48] [PASSED] Link rate 270000 lane count 4
[13:19:48] [PASSED] Link rate 270000 lane count 2
[13:19:48] [PASSED] Link rate 270000 lane count 1
[13:19:48] [PASSED] Link rate 162000 lane count 4
[13:19:48] [PASSED] Link rate 162000 lane count 2
[13:19:48] [PASSED] Link rate 162000 lane count 1
[13:19:48] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[13:19:48] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[13:19:48] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[13:19:48] [PASSED] DP_POWER_UP_PHY with port number
[13:19:48] [PASSED] DP_POWER_DOWN_PHY with port number
[13:19:48] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[13:19:48] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[13:19:48] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[13:19:48] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[13:19:48] [PASSED] DP_QUERY_PAYLOAD with port number
[13:19:48] [PASSED] DP_QUERY_PAYLOAD with VCPI
[13:19:48] [PASSED] DP_REMOTE_DPCD_READ with port number
[13:19:48] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[13:19:48] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[13:19:48] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[13:19:48] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[13:19:48] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[13:19:48] [PASSED] DP_REMOTE_I2C_READ with port number
[13:19:48] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[13:19:48] [PASSED] DP_REMOTE_I2C_READ with transactions array
[13:19:48] [PASSED] DP_REMOTE_I2C_WRITE with port number
[13:19:48] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[13:19:48] [PASSED] DP_REMOTE_I2C_WRITE with data array
[13:19:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[13:19:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[13:19:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[13:19:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[13:19:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[13:19:48] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[13:19:48] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[13:19:48] ================ [PASSED] drm_dp_mst_helper ================
[13:19:48] ================== drm_exec (7 subtests) ===================
[13:19:48] [PASSED] sanitycheck
[13:19:48] [PASSED] test_lock
[13:19:48] [PASSED] test_lock_unlock
[13:19:48] [PASSED] test_duplicates
[13:19:48] [PASSED] test_prepare
[13:19:48] [PASSED] test_prepare_array
[13:19:48] [PASSED] test_multiple_loops
[13:19:48] ==================== [PASSED] drm_exec =====================
[13:19:48] =========== drm_format_helper_test (17 subtests) ===========
[13:19:48] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[13:19:48] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[13:19:48] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[13:19:48] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[13:19:48] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[13:19:48] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[13:19:48] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[13:19:48] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[13:19:48] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[13:19:48] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[13:19:48] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[13:19:48] ============== drm_test_fb_xrgb8888_to_mono  ===============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[13:19:48] ==================== drm_test_fb_swab  =====================
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ================ [PASSED] drm_test_fb_swab =================
[13:19:48] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[13:19:48] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[13:19:48] [PASSED] single_pixel_source_buffer
[13:19:48] [PASSED] single_pixel_clip_rectangle
[13:19:48] [PASSED] well_known_colors
[13:19:48] [PASSED] destination_pitch
[13:19:48] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[13:19:48] ================= drm_test_fb_clip_offset  =================
[13:19:48] [PASSED] pass through
[13:19:48] [PASSED] horizontal offset
[13:19:48] [PASSED] vertical offset
[13:19:48] [PASSED] horizontal and vertical offset
[13:19:48] [PASSED] horizontal offset (custom pitch)
[13:19:48] [PASSED] vertical offset (custom pitch)
[13:19:48] [PASSED] horizontal and vertical offset (custom pitch)
[13:19:48] ============= [PASSED] drm_test_fb_clip_offset =============
[13:19:48] =================== drm_test_fb_memcpy  ====================
[13:19:48] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[13:19:48] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[13:19:48] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[13:19:48] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[13:19:48] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[13:19:48] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[13:19:48] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[13:19:48] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[13:19:48] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[13:19:48] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[13:19:48] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[13:19:48] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[13:19:48] =============== [PASSED] drm_test_fb_memcpy ================
[13:19:48] ============= [PASSED] drm_format_helper_test ==============
[13:19:48] ================= drm_format (18 subtests) =================
[13:19:48] [PASSED] drm_test_format_block_width_invalid
[13:19:48] [PASSED] drm_test_format_block_width_one_plane
[13:19:48] [PASSED] drm_test_format_block_width_two_plane
[13:19:48] [PASSED] drm_test_format_block_width_three_plane
[13:19:48] [PASSED] drm_test_format_block_width_tiled
[13:19:48] [PASSED] drm_test_format_block_height_invalid
[13:19:48] [PASSED] drm_test_format_block_height_one_plane
[13:19:48] [PASSED] drm_test_format_block_height_two_plane
[13:19:48] [PASSED] drm_test_format_block_height_three_plane
[13:19:48] [PASSED] drm_test_format_block_height_tiled
[13:19:48] [PASSED] drm_test_format_min_pitch_invalid
[13:19:48] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[13:19:48] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[13:19:48] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[13:19:48] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[13:19:48] [PASSED] drm_test_format_min_pitch_two_plane
[13:19:48] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[13:19:48] [PASSED] drm_test_format_min_pitch_tiled
[13:19:48] =================== [PASSED] drm_format ====================
[13:19:48] ============== drm_framebuffer (10 subtests) ===============
[13:19:48] ========== drm_test_framebuffer_check_src_coords  ==========
[13:19:48] [PASSED] Success: source fits into fb
[13:19:48] [PASSED] Fail: overflowing fb with x-axis coordinate
[13:19:48] [PASSED] Fail: overflowing fb with y-axis coordinate
[13:19:48] [PASSED] Fail: overflowing fb with source width
[13:19:48] [PASSED] Fail: overflowing fb with source height
[13:19:48] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[13:19:48] [PASSED] drm_test_framebuffer_cleanup
[13:19:48] =============== drm_test_framebuffer_create  ===============
[13:19:48] [PASSED] ABGR8888 normal sizes
[13:19:48] [PASSED] ABGR8888 max sizes
[13:19:48] [PASSED] ABGR8888 pitch greater than min required
[13:19:48] [PASSED] ABGR8888 pitch less than min required
[13:19:48] [PASSED] ABGR8888 Invalid width
[13:19:48] [PASSED] ABGR8888 Invalid buffer handle
[13:19:48] [PASSED] No pixel format
[13:19:48] [PASSED] ABGR8888 Width 0
[13:19:48] [PASSED] ABGR8888 Height 0
[13:19:48] [PASSED] ABGR8888 Out of bound height * pitch combination
[13:19:48] [PASSED] ABGR8888 Large buffer offset
[13:19:48] [PASSED] ABGR8888 Buffer offset for inexistent plane
[13:19:48] [PASSED] ABGR8888 Invalid flag
[13:19:48] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[13:19:48] [PASSED] ABGR8888 Valid buffer modifier
[13:19:48] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[13:19:48] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[13:19:48] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[13:19:48] [PASSED] NV12 Normal sizes
[13:19:48] [PASSED] NV12 Max sizes
[13:19:48] [PASSED] NV12 Invalid pitch
[13:19:48] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[13:19:48] [PASSED] NV12 different  modifier per-plane
[13:19:48] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[13:19:48] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[13:19:48] [PASSED] NV12 Modifier for inexistent plane
[13:19:48] [PASSED] NV12 Handle for inexistent plane
[13:19:48] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[13:19:48] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[13:19:48] [PASSED] YVU420 Normal sizes
[13:19:48] [PASSED] YVU420 Max sizes
[13:19:48] [PASSED] YVU420 Invalid pitch
[13:19:48] [PASSED] YVU420 Different pitches
[13:19:48] [PASSED] YVU420 Different buffer offsets/pitches
[13:19:48] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[13:19:48] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[13:19:48] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[13:19:48] [PASSED] YVU420 Valid modifier
[13:19:48] [PASSED] YVU420 Different modifiers per plane
[13:19:48] [PASSED] YVU420 Modifier for inexistent plane
[13:19:48] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[13:19:48] [PASSED] X0L2 Normal sizes
[13:19:48] [PASSED] X0L2 Max sizes
[13:19:48] [PASSED] X0L2 Invalid pitch
[13:19:48] [PASSED] X0L2 Pitch greater than minimum required
[13:19:48] [PASSED] X0L2 Handle for inexistent plane
[13:19:48] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[13:19:48] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[13:19:48] [PASSED] X0L2 Valid modifier
[13:19:48] [PASSED] X0L2 Modifier for inexistent plane
[13:19:48] =========== [PASSED] drm_test_framebuffer_create ===========
[13:19:48] [PASSED] drm_test_framebuffer_free
[13:19:48] [PASSED] drm_test_framebuffer_init
[13:19:48] [PASSED] drm_test_framebuffer_init_bad_format
[13:19:48] [PASSED] drm_test_framebuffer_init_dev_mismatch
[13:19:48] [PASSED] drm_test_framebuffer_lookup
[13:19:48] [PASSED] drm_test_framebuffer_lookup_inexistent
[13:19:48] [PASSED] drm_test_framebuffer_modifiers_not_supported
[13:19:48] ================= [PASSED] drm_framebuffer =================
[13:19:48] ================ drm_gem_shmem (8 subtests) ================
[13:19:48] [PASSED] drm_gem_shmem_test_obj_create
[13:19:48] [PASSED] drm_gem_shmem_test_obj_create_private
[13:19:48] [PASSED] drm_gem_shmem_test_pin_pages
[13:19:48] [PASSED] drm_gem_shmem_test_vmap
[13:19:48] [PASSED] drm_gem_shmem_test_get_pages_sgt
[13:19:48] [PASSED] drm_gem_shmem_test_get_sg_table
[13:19:48] [PASSED] drm_gem_shmem_test_madvise
[13:19:48] [PASSED] drm_gem_shmem_test_purge
[13:19:48] ================== [PASSED] drm_gem_shmem ==================
[13:19:48] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[13:19:48] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[13:19:48] [PASSED] Automatic
[13:19:48] [PASSED] Full
[13:19:48] [PASSED] Limited 16:235
[13:19:48] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[13:19:48] [PASSED] drm_test_check_disable_connector
[13:19:48] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[13:19:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[13:19:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[13:19:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[13:19:48] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[13:19:48] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[13:19:48] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[13:19:48] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[13:19:48] [PASSED] drm_test_check_output_bpc_dvi
[13:19:48] [PASSED] drm_test_check_output_bpc_format_vic_1
[13:19:48] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[13:19:48] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[13:19:48] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[13:19:48] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[13:19:48] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[13:19:48] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[13:19:48] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[13:19:48] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[13:19:48] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[13:19:48] [PASSED] drm_test_check_broadcast_rgb_value
[13:19:48] [PASSED] drm_test_check_bpc_8_value
[13:19:48] [PASSED] drm_test_check_bpc_10_value
[13:19:48] [PASSED] drm_test_check_bpc_12_value
[13:19:48] [PASSED] drm_test_check_format_value
[13:19:48] [PASSED] drm_test_check_tmds_char_value
[13:19:48] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[13:19:48] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[13:19:48] [PASSED] drm_test_check_mode_valid
[13:19:48] [PASSED] drm_test_check_mode_valid_reject
[13:19:48] [PASSED] drm_test_check_mode_valid_reject_rate
[13:19:48] [PASSED] drm_test_check_mode_valid_reject_max_clock
[13:19:48] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[13:19:48] ================= drm_managed (2 subtests) =================
[13:19:48] [PASSED] drm_test_managed_release_action
[13:19:48] [PASSED] drm_test_managed_run_action
[13:19:48] =================== [PASSED] drm_managed ===================
[13:19:48] =================== drm_mm (6 subtests) ====================
[13:19:48] [PASSED] drm_test_mm_init
[13:19:48] [PASSED] drm_test_mm_debug
[13:19:48] [PASSED] drm_test_mm_align32
[13:19:48] [PASSED] drm_test_mm_align64
[13:19:48] [PASSED] drm_test_mm_lowest
[13:19:48] [PASSED] drm_test_mm_highest
[13:19:48] ===================== [PASSED] drm_mm ======================
[13:19:48] ============= drm_modes_analog_tv (5 subtests) =============
[13:19:48] [PASSED] drm_test_modes_analog_tv_mono_576i
[13:19:48] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[13:19:48] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[13:19:48] [PASSED] drm_test_modes_analog_tv_pal_576i
[13:19:48] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[13:19:48] =============== [PASSED] drm_modes_analog_tv ===============
[13:19:48] ============== drm_plane_helper (2 subtests) ===============
[13:19:48] =============== drm_test_check_plane_state  ================
[13:19:48] [PASSED] clipping_simple
[13:19:48] [PASSED] clipping_rotate_reflect
[13:19:48] [PASSED] positioning_simple
[13:19:48] [PASSED] upscaling
[13:19:48] [PASSED] downscaling
[13:19:48] [PASSED] rounding1
[13:19:48] [PASSED] rounding2
[13:19:48] [PASSED] rounding3
[13:19:48] [PASSED] rounding4
[13:19:48] =========== [PASSED] drm_test_check_plane_state ============
[13:19:48] =========== drm_test_check_invalid_plane_state  ============
[13:19:48] [PASSED] positioning_invalid
[13:19:48] [PASSED] upscaling_invalid
[13:19:48] [PASSED] downscaling_invalid
[13:19:48] ======= [PASSED] drm_test_check_invalid_plane_state ========
[13:19:48] ================ [PASSED] drm_plane_helper =================
[13:19:48] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[13:19:48] ====== drm_test_connector_helper_tv_get_modes_check  =======
[13:19:48] [PASSED] None
[13:19:48] [PASSED] PAL
[13:19:48] [PASSED] NTSC
[13:19:48] [PASSED] Both, NTSC Default
[13:19:48] [PASSED] Both, PAL Default
[13:19:48] [PASSED] Both, NTSC Default, with PAL on command-line
[13:19:48] [PASSED] Both, PAL Default, with NTSC on command-line
[13:19:48] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[13:19:48] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[13:19:48] ================== drm_rect (9 subtests) ===================
[13:19:48] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[13:19:48] [PASSED] drm_test_rect_clip_scaled_not_clipped
[13:19:48] [PASSED] drm_test_rect_clip_scaled_clipped
[13:19:48] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[13:19:48] ================= drm_test_rect_intersect  =================
[13:19:48] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[13:19:48] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[13:19:48] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[13:19:48] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[13:19:48] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[13:19:48] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[13:19:48] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[13:19:48] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[13:19:48] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[13:19:48] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[13:19:48] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[13:19:48] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[13:19:48] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[13:19:48] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[13:19:48] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[13:19:48] ============= [PASSED] drm_test_rect_intersect =============
[13:19:48] ================ drm_test_rect_calc_hscale  ================
[13:19:48] [PASSED] normal use
[13:19:48] [PASSED] out of max range
[13:19:48] [PASSED] out of min range
[13:19:48] [PASSED] zero dst
[13:19:48] [PASSED] negative src
[13:19:48] [PASSED] negative dst
[13:19:48] ============ [PASSED] drm_test_rect_calc_hscale ============
[13:19:48] ================ drm_test_rect_calc_vscale  ================
[13:19:48] [PASSED] normal use
[13:19:48] [PASSED] out of max range
[13:19:48] [PASSED] out of min range
[13:19:48] [PASSED] zero dst
[13:19:48] [PASSED] negative src
[13:19:48] [PASSED] negative dst
[13:19:48] ============ [PASSED] drm_test_rect_calc_vscale ============
[13:19:48] ================== drm_test_rect_rotate  ===================
[13:19:48] [PASSED] reflect-x
[13:19:48] [PASSED] reflect-y
[13:19:48] [PASSED] rotate-0
[13:19:48] [PASSED] rotate-90
[13:19:48] [PASSED] rotate-180
[13:19:48] [PASSED] rotate-270
[13:19:48] ============== [PASSED] drm_test_rect_rotate ===============
[13:19:48] ================ drm_test_rect_rotate_inv  =================
[13:19:48] [PASSED] reflect-x
[13:19:48] [PASSED] reflect-y
[13:19:48] [PASSED] rotate-0
[13:19:48] [PASSED] rotate-90
[13:19:48] [PASSED] rotate-180
[13:19:48] [PASSED] rotate-270
[13:19:48] ============ [PASSED] drm_test_rect_rotate_inv =============
[13:19:48] ==================== [PASSED] drm_rect =====================
[13:19:48] ============ drm_sysfb_modeset_test (1 subtest) ============
[13:19:48] ============ drm_test_sysfb_build_fourcc_list  =============
[13:19:48] [PASSED] no native formats
[13:19:48] [PASSED] XRGB8888 as native format
[13:19:48] [PASSED] remove duplicates
[13:19:48] [PASSED] convert alpha formats
[13:19:48] [PASSED] random formats
[13:19:48] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[13:19:48] ============= [PASSED] drm_sysfb_modeset_test ==============
[13:19:48] ============================================================
[13:19:48] Testing complete. Ran 622 tests: passed: 622
[13:19:48] Elapsed time: 31.827s total, 1.636s configuring, 29.673s building, 0.462s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[13:19:48] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[13:19:50] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=25
[13:19:59] Starting KUnit Kernel (1/1)...
[13:19:59] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[13:19:59] ================= ttm_device (5 subtests) ==================
[13:19:59] [PASSED] ttm_device_init_basic
[13:19:59] [PASSED] ttm_device_init_multiple
[13:19:59] [PASSED] ttm_device_fini_basic
[13:19:59] [PASSED] ttm_device_init_no_vma_man
[13:19:59] ================== ttm_device_init_pools  ==================
[13:19:59] [PASSED] No DMA allocations, no DMA32 required
[13:19:59] [PASSED] DMA allocations, DMA32 required
[13:19:59] [PASSED] No DMA allocations, DMA32 required
[13:19:59] [PASSED] DMA allocations, no DMA32 required
[13:19:59] ============== [PASSED] ttm_device_init_pools ==============
[13:19:59] =================== [PASSED] ttm_device ====================
[13:19:59] ================== ttm_pool (8 subtests) ===================
[13:19:59] ================== ttm_pool_alloc_basic  ===================
[13:19:59] [PASSED] One page
[13:19:59] [PASSED] More than one page
[13:19:59] [PASSED] Above the allocation limit
[13:19:59] [PASSED] One page, with coherent DMA mappings enabled
[13:19:59] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[13:19:59] ============== [PASSED] ttm_pool_alloc_basic ===============
[13:19:59] ============== ttm_pool_alloc_basic_dma_addr  ==============
[13:19:59] [PASSED] One page
[13:19:59] [PASSED] More than one page
[13:19:59] [PASSED] Above the allocation limit
[13:19:59] [PASSED] One page, with coherent DMA mappings enabled
[13:19:59] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[13:19:59] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[13:19:59] [PASSED] ttm_pool_alloc_order_caching_match
[13:19:59] [PASSED] ttm_pool_alloc_caching_mismatch
[13:19:59] [PASSED] ttm_pool_alloc_order_mismatch
[13:19:59] [PASSED] ttm_pool_free_dma_alloc
[13:19:59] [PASSED] ttm_pool_free_no_dma_alloc
[13:19:59] [PASSED] ttm_pool_fini_basic
[13:19:59] ==================== [PASSED] ttm_pool =====================
[13:19:59] ================ ttm_resource (8 subtests) =================
[13:19:59] ================= ttm_resource_init_basic  =================
[13:19:59] [PASSED] Init resource in TTM_PL_SYSTEM
[13:19:59] [PASSED] Init resource in TTM_PL_VRAM
[13:19:59] [PASSED] Init resource in a private placement
[13:19:59] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[13:19:59] ============= [PASSED] ttm_resource_init_basic =============
[13:19:59] [PASSED] ttm_resource_init_pinned
[13:19:59] [PASSED] ttm_resource_fini_basic
[13:19:59] [PASSED] ttm_resource_manager_init_basic
[13:19:59] [PASSED] ttm_resource_manager_usage_basic
[13:19:59] [PASSED] ttm_resource_manager_set_used_basic
[13:19:59] [PASSED] ttm_sys_man_alloc_basic
[13:19:59] [PASSED] ttm_sys_man_free_basic
[13:19:59] ================== [PASSED] ttm_resource ===================
[13:19:59] =================== ttm_tt (15 subtests) ===================
[13:19:59] ==================== ttm_tt_init_basic  ====================
[13:19:59] [PASSED] Page-aligned size
[13:19:59] [PASSED] Extra pages requested
[13:19:59] ================ [PASSED] ttm_tt_init_basic ================
[13:19:59] [PASSED] ttm_tt_init_misaligned
[13:19:59] [PASSED] ttm_tt_fini_basic
[13:19:59] [PASSED] ttm_tt_fini_sg
[13:19:59] [PASSED] ttm_tt_fini_shmem
[13:19:59] [PASSED] ttm_tt_create_basic
[13:19:59] [PASSED] ttm_tt_create_invalid_bo_type
[13:19:59] [PASSED] ttm_tt_create_ttm_exists
[13:19:59] [PASSED] ttm_tt_create_failed
[13:19:59] [PASSED] ttm_tt_destroy_basic
[13:19:59] [PASSED] ttm_tt_populate_null_ttm
[13:19:59] [PASSED] ttm_tt_populate_populated_ttm
[13:19:59] [PASSED] ttm_tt_unpopulate_basic
[13:19:59] [PASSED] ttm_tt_unpopulate_empty_ttm
[13:19:59] [PASSED] ttm_tt_swapin_basic
[13:19:59] ===================== [PASSED] ttm_tt ======================
[13:19:59] =================== ttm_bo (14 subtests) ===================
[13:19:59] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[13:19:59] [PASSED] Cannot be interrupted and sleeps
[13:19:59] [PASSED] Cannot be interrupted, locks straight away
[13:19:59] [PASSED] Can be interrupted, sleeps
[13:19:59] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[13:19:59] [PASSED] ttm_bo_reserve_locked_no_sleep
[13:19:59] [PASSED] ttm_bo_reserve_no_wait_ticket
[13:19:59] [PASSED] ttm_bo_reserve_double_resv
[13:19:59] [PASSED] ttm_bo_reserve_interrupted
[13:19:59] [PASSED] ttm_bo_reserve_deadlock
[13:19:59] [PASSED] ttm_bo_unreserve_basic
[13:19:59] [PASSED] ttm_bo_unreserve_pinned
[13:19:59] [PASSED] ttm_bo_unreserve_bulk
[13:19:59] [PASSED] ttm_bo_fini_basic
[13:19:59] [PASSED] ttm_bo_fini_shared_resv
[13:19:59] [PASSED] ttm_bo_pin_basic
[13:19:59] [PASSED] ttm_bo_pin_unpin_resource
[13:19:59] [PASSED] ttm_bo_multiple_pin_one_unpin
[13:19:59] ===================== [PASSED] ttm_bo ======================
[13:19:59] ============== ttm_bo_validate (21 subtests) ===============
[13:19:59] ============== ttm_bo_init_reserved_sys_man  ===============
[13:19:59] [PASSED] Buffer object for userspace
[13:19:59] [PASSED] Kernel buffer object
[13:19:59] [PASSED] Shared buffer object
[13:19:59] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[13:19:59] ============== ttm_bo_init_reserved_mock_man  ==============
[13:19:59] [PASSED] Buffer object for userspace
[13:19:59] [PASSED] Kernel buffer object
[13:19:59] [PASSED] Shared buffer object
[13:19:59] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[13:19:59] [PASSED] ttm_bo_init_reserved_resv
[13:19:59] ================== ttm_bo_validate_basic  ==================
[13:19:59] [PASSED] Buffer object for userspace
[13:19:59] [PASSED] Kernel buffer object
[13:19:59] [PASSED] Shared buffer object
[13:19:59] ============== [PASSED] ttm_bo_validate_basic ==============
[13:19:59] [PASSED] ttm_bo_validate_invalid_placement
[13:19:59] ============= ttm_bo_validate_same_placement  ==============
[13:19:59] [PASSED] System manager
[13:19:59] [PASSED] VRAM manager
[13:19:59] ========= [PASSED] ttm_bo_validate_same_placement ==========
[13:19:59] [PASSED] ttm_bo_validate_failed_alloc
[13:19:59] [PASSED] ttm_bo_validate_pinned
[13:19:59] [PASSED] ttm_bo_validate_busy_placement
[13:19:59] ================ ttm_bo_validate_multihop  =================
[13:19:59] [PASSED] Buffer object for userspace
[13:19:59] [PASSED] Kernel buffer object
[13:19:59] [PASSED] Shared buffer object
[13:19:59] ============ [PASSED] ttm_bo_validate_multihop =============
[13:19:59] ========== ttm_bo_validate_no_placement_signaled  ==========
[13:19:59] [PASSED] Buffer object in system domain, no page vector
[13:19:59] [PASSED] Buffer object in system domain with an existing page vector
[13:19:59] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[13:19:59] ======== ttm_bo_validate_no_placement_not_signaled  ========
[13:19:59] [PASSED] Buffer object for userspace
[13:19:59] [PASSED] Kernel buffer object
[13:19:59] [PASSED] Shared buffer object
[13:19:59] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[13:19:59] [PASSED] ttm_bo_validate_move_fence_signaled
[13:19:59] ========= ttm_bo_validate_move_fence_not_signaled  =========
[13:19:59] [PASSED] Waits for GPU
[13:19:59] [PASSED] Tries to lock straight away
[13:19:59] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[13:19:59] [PASSED] ttm_bo_validate_happy_evict
[13:19:59] [PASSED] ttm_bo_validate_all_pinned_evict
[13:19:59] [PASSED] ttm_bo_validate_allowed_only_evict
[13:19:59] [PASSED] ttm_bo_validate_deleted_evict
[13:19:59] [PASSED] ttm_bo_validate_busy_domain_evict
[13:19:59] [PASSED] ttm_bo_validate_evict_gutting
[13:19:59] [PASSED] ttm_bo_validate_recrusive_evict
[13:19:59] ================= [PASSED] ttm_bo_validate =================
[13:19:59] ============================================================
[13:19:59] Testing complete. Ran 101 tests: passed: 101
[13:19:59] Elapsed time: 11.365s total, 1.638s configuring, 9.460s building, 0.225s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2025-11-25 10:17   ` Matthew Auld
@ 2025-11-25 13:39     ` Souza, Jose
  0 siblings, 0 replies; 39+ messages in thread
From: Souza, Jose @ 2025-11-25 13:39 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Upadhyay, Tejas, Auld, Matthew,
	thomas.hellstrom@linux.intel.com

On Tue, 2025-11-25 at 10:17 +0000, Matthew Auld wrote:
> On 25/11/2025 09:43, Tejas Upadhyay wrote:
> > Starting NVL, HW will flush cachelines marked with XA only
> 
> I think it would be good to give a basic overview of what XA is?
> 
> > when media is off. We have a few cases where the kernel will have
> > non-XA cachelines which need a manual flush, as we postpone
> > the invalidation. Flush asap from a correctness POV to ensure a
> > non-accelerated CPU copy to a swap/shmem file will see a coherent
> > view of memory, but also from a security POV, where a later flush
> > can't corrupt the next user of those pages.
> > 
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> >   drivers/gpu/drm/xe/xe_device.c  | 20 ++++++++++++++++++++
> >   drivers/gpu/drm/xe/xe_device.h  |  1 +
> >   drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> >   4 files changed, 25 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index 465cf9fc7ce9..97e1e9d40e96 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> >   
> >   		if (!xe_vm_in_fault_mode(vm)) {
> >   			drm_gpuvm_bo_evict(vm_bo, true);
> > -			continue;
> > +			if (!xe_device_needs_cache_flush(xe))
> > +				continue;
> >   		}
> >   
> >   		if (!idle) {
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 92f883dd8877..6e8335b493e8 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -1079,6 +1079,26 @@ void xe_device_l2_flush(struct xe_device *xe)
> >   	spin_unlock(&gt->global_invl_lock);
> >   }
> >   
> > +/**
> > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > + * @xe: The device to check.
> > + *
> > + * Return: true if the device needs cache flush, false otherwise.
> > + */
> > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > +{
> > +	/*
> > +	 * Starting NVL, HW will flush cachelines marked with XA only when media is off. We have
> 
> I think the wording could be improved here (same for commit message).
> XA is *always* flushed, e.g. at end-of-submission (and maybe other
> places); it's just that internally, as an optimisation, hw doesn't
> need to make that a full flush (which would also include XA) when
> Media is off/powergated, since it then doesn't need to worry about GT
> caches vs Media coherency, only CPU vs GPU coherency, so it can make
> that flush a targeted XA flush, since stuff tagged with XA now means
> it's shared with the CPU.
> 
> > +	 * few cases where kernel will have non-XA cachelines which needs manual flush and this is
> > +	 * one of them as we postpone the invalidation. Flush asap from correctness POV to ensure
> > +	 * non accelerated CPU copy to swap/shmem file will see coherent view of memory, but also
> > +	 * from security POV where later flush can't corrupt the next user of those pages.
> > +	 */
> > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > +		return true;
> > +	return false;
> > +}
> > +
> >   /**
> >    * xe_device_td_flush() - Flush transient L3 cache entries
> >    * @xe: The device
> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > index 32cc6323b7f6..15e67db44b56 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -179,6 +179,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> >   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> >   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> >   
> > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> >   void xe_device_td_flush(struct xe_device *xe);
> >   void xe_device_l2_flush(struct xe_device *xe);
> >   
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > index 0d9130b1958a..a93c7e887cca 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> >   				    false, MAX_SCHEDULE_TIMEOUT);
> >   	XE_WARN_ON(err <= 0);
> >   
> > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> 
> Another option is to ban non-XA or non-2WAY at the uAPI level on such
> platforms, but I guess it also depends on what UMD wants here?
> 
> Jose, I assume Mesa is just going to use XA or 2WAY for userptr on
> such hw? Or do you see a usecase for being more flexible?

For userptr we will continue to use 2way, no plans to change to XA for
now.

> 
> > +	    userptr->initial_bind) {
> >   		err = xe_vm_invalidate_vma(vma);
> >   		XE_WARN_ON(err);
> >   	}

^ permalink raw reply	[flat|nested] 39+ messages in thread

* ✓ Xe.CI.BAT: success for drm/xe/xe3p_lpg: L2 flush optimization
  2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
                   ` (3 preceding siblings ...)
  2025-11-25 13:20 ` ✓ CI.KUnit: success for drm/xe/xe3p_lpg: L2 flush optimization Patchwork
@ 2025-11-25 14:47 ` Patchwork
  2025-11-25 17:42 ` ✓ Xe.CI.Full: " Patchwork
  5 siblings, 0 replies; 39+ messages in thread
From: Patchwork @ 2025-11-25 14:47 UTC (permalink / raw)
  To: Upadhyay, Tejas; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 1918 bytes --]

== Series Details ==

Series: drm/xe/xe3p_lpg: L2 flush optimization
URL   : https://patchwork.freedesktop.org/series/158017/
State : success

== Summary ==

CI Bug Log - changes from xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8_BAT -> xe-pw-158017v1_BAT
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Participating hosts (12 -> 12)
------------------------------

  No changes in participating hosts

Known issues
------------

  Here are the changes found in xe-pw-158017v1_BAT that come from known issues:

### IGT changes ###

#### Possible fixes ####

  * igt@xe_waitfence@engine:
    - bat-dg2-oem2:       [FAIL][1] ([Intel XE#6519]) -> [PASS][2]
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/bat-dg2-oem2/igt@xe_waitfence@engine.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/bat-dg2-oem2/igt@xe_waitfence@engine.html

  * igt@xe_waitfence@reltime:
    - bat-dg2-oem2:       [FAIL][3] ([Intel XE#6520]) -> [PASS][4]
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/bat-dg2-oem2/igt@xe_waitfence@reltime.html
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/bat-dg2-oem2/igt@xe_waitfence@reltime.html

  
  [Intel XE#6519]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6519
  [Intel XE#6520]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6520


Build changes
-------------

  * Linux: xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8 -> xe-pw-158017v1

  IGT_8637: 730ee3dfb26f8d7891fc240b0132a08c5bc7b949 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8: c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8
  xe-pw-158017v1: 158017v1

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/index.html

[-- Attachment #2: Type: text/html, Size: 2517 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
  2025-11-25 10:17   ` Matthew Auld
@ 2025-11-25 15:06   ` Thomas Hellström
  2025-11-25 15:31     ` Upadhyay, Tejas
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Hellström @ 2025-11-25 15:06 UTC (permalink / raw)
  To: Tejas Upadhyay, intel-xe; +Cc: Matthew Auld

Hi.


On Tue, 2025-11-25 at 15:13 +0530, Tejas Upadhyay wrote:
> Starting with NVL, HW will flush cachelines marked with XA only
> when media is off. We have a few cases where the kernel will have
> non-XA cachelines which need a manual flush as we postpone
> the invalidation. Flush asap from a correctness POV to ensure
> a non-accelerated CPU copy to swap/shmem file will see a coherent
> view of memory, but also from a security POV where a later flush
> can't corrupt the next user of those pages.
> 
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>

I had a number of concerns last time this patch was sent to the list,
none of which seems to have been addressed? 

https://lore.kernel.org/intel-xe/d2517d66f571e11a760cb143981b7ca238f5cd58.camel@linux.intel.com/

The main concern is that the code indicates that not all GPU caches are
flushed when all fences are signalled (bo / userptr idle)?

Thanks,
Thomas



> ---
>  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>  drivers/gpu/drm/xe/xe_device.c  | 20 ++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_device.h  |  1 +
>  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>  4 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 465cf9fc7ce9..97e1e9d40e96 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device
> *xe, struct xe_bo *bo,
>  
>  		if (!xe_vm_in_fault_mode(vm)) {
>  			drm_gpuvm_bo_evict(vm_bo, true);
> -			continue;
> +			if (!xe_device_needs_cache_flush(xe))
> +				continue;
>  		}
>  
>  		if (!idle) {
> diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> index 92f883dd8877..6e8335b493e8 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1079,6 +1079,26 @@ void xe_device_l2_flush(struct xe_device *xe)
>  	spin_unlock(&gt->global_invl_lock);
>  }
>  
> +/**
> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> + * @xe: The device to check.
> + *
> + * Return: true if the device needs cache flush, false otherwise.
> + */
> +bool xe_device_needs_cache_flush(struct xe_device *xe)
> +{
> +	/*
> +	 * Starting with NVL, HW will flush cachelines marked with XA
> +	 * only when media is off. We have a few cases where the kernel
> +	 * will have non-XA cachelines which need a manual flush and this
> +	 * is one of them as we postpone the invalidation. Flush asap
> +	 * from a correctness POV to ensure a non-accelerated CPU copy to
> +	 * swap/shmem file will see a coherent view of memory, but also
> +	 * from a security POV where a later flush can't corrupt the
> +	 * next user of those pages.
> +	 */
> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * xe_device_td_flush() - Flush transient L3 cache entries
>   * @xe: The device
> diff --git a/drivers/gpu/drm/xe/xe_device.h
> b/drivers/gpu/drm/xe/xe_device.h
> index 32cc6323b7f6..15e67db44b56 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -179,6 +179,7 @@ void xe_device_snapshot_print(struct xe_device
> *xe, struct drm_printer *p);
>  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
> address);
>  
> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>  void xe_device_td_flush(struct xe_device *xe);
>  void xe_device_l2_flush(struct xe_device *xe);
>  
> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> b/drivers/gpu/drm/xe/xe_userptr.c
> index 0d9130b1958a..a93c7e887cca 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.c
> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm
> *vm, struct xe_userptr_vma *uv
>  				    false, MAX_SCHEDULE_TIMEOUT);
>  	XE_WARN_ON(err <= 0);
>  
> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> +	    userptr->initial_bind) {
>  		err = xe_vm_invalidate_vma(vma);
>  		XE_WARN_ON(err);
>  	}


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2025-11-25 15:06   ` Thomas Hellström
@ 2025-11-25 15:31     ` Upadhyay, Tejas
  2025-11-26 10:26       ` Thomas Hellström
  0 siblings, 1 reply; 39+ messages in thread
From: Upadhyay, Tejas @ 2025-11-25 15:31 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe@lists.freedesktop.org,
	Auld, Matthew



> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Sent: 25 November 2025 20:37
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> xe@lists.freedesktop.org
> Cc: Auld, Matthew <matthew.auld@intel.com>
> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> cachelines manually
> 
> Hi.
> 
> 
> On Tue, 2025-11-25 at 15:13 +0530, Tejas Upadhyay wrote:
> > Starting with NVL, HW will flush cachelines marked with XA only when
> > media is off. We have a few cases where the kernel will have non-XA
> > cachelines which need a manual flush as we postpone the invalidation.
> > Flush asap from a correctness POV to ensure a non-accelerated CPU copy
> > to swap/shmem file will see a coherent view of memory, but also from a
> > security POV where a later flush can't corrupt the next user of those
> > pages.
> >
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> 
> I had a number of concerns last time this patch was sent to the list, none of
> which seems to have been addressed?

Sorry for failing to address your comments earlier.

> 
> https://lore.kernel.org/intel-
> xe/d2517d66f571e11a760cb143981b7ca238f5cd58.camel@linux.intel.com/
> 
> The main concern is that the code indicates that not all GPU caches are flushed
> when all fences are signalled (bo / userptr idle)?

Xe3p introduces a feature where, when media is off, only XA-marked BOs are flushed rather than the whole cache. From the UMD perspective we might have non-XA buffers which we would like to flush before the buffer/user goes away while media is off.

Tejas
 
> 
> Thanks,
> Thomas
> 
> 
> 
> > ---
> >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> >  drivers/gpu/drm/xe/xe_device.c  | 20 ++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> >  4 files changed, 25 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index 465cf9fc7ce9..97e1e9d40e96 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device
> > *xe, struct xe_bo *bo,
> >
> >  		if (!xe_vm_in_fault_mode(vm)) {
> >  			drm_gpuvm_bo_evict(vm_bo, true);
> > -			continue;
> > +			if (!xe_device_needs_cache_flush(xe))
> > +				continue;
> >  		}
> >
> >  		if (!idle) {
> > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > b/drivers/gpu/drm/xe/xe_device.c
> > index 92f883dd8877..6e8335b493e8 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -1079,6 +1079,26 @@ void xe_device_l2_flush(struct xe_device *xe)
> >  	spin_unlock(&gt->global_invl_lock);
> >  }
> >
> > +/**
> > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > + * @xe: The device to check.
> > + *
> > + * Return: true if the device needs cache flush, false otherwise.
> > + */
> > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > +{
> > +	/*
> > +	 * Starting with NVL, HW will flush cachelines marked with XA
> > +	 * only when media is off. We have a few cases where the kernel
> > +	 * will have non-XA cachelines which need a manual flush and this
> > +	 * is one of them as we postpone the invalidation. Flush asap
> > +	 * from a correctness POV to ensure a non-accelerated CPU copy to
> > +	 * swap/shmem file will see a coherent view of memory, but also
> > +	 * from a security POV where a later flush can't corrupt the
> > +	 * next user of those pages.
> > +	 */
> > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > +		return true;
> > +	return false;
> > +}
> > +
> >  /**
> >   * xe_device_td_flush() - Flush transient L3 cache entries
> >   * @xe: The device
> > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > b/drivers/gpu/drm/xe/xe_device.h
> > index 32cc6323b7f6..15e67db44b56 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -179,6 +179,7 @@ void xe_device_snapshot_print(struct xe_device
> > *xe, struct drm_printer *p);
> >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> >
> > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> >  void xe_device_td_flush(struct xe_device *xe);
> >  void xe_device_l2_flush(struct xe_device *xe);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > b/drivers/gpu/drm/xe/xe_userptr.c
> > index 0d9130b1958a..a93c7e887cca 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm
> > *vm, struct xe_userptr_vma *uv
> >  				    false, MAX_SCHEDULE_TIMEOUT);
> >  	XE_WARN_ON(err <= 0);
> >
> > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > +	    userptr->initial_bind) {
> >  		err = xe_vm_invalidate_vma(vma);
> >  		XE_WARN_ON(err);
> >  	}


^ permalink raw reply	[flat|nested] 39+ messages in thread

* ✓ Xe.CI.Full: success for drm/xe/xe3p_lpg: L2 flush optimization
  2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
                   ` (4 preceding siblings ...)
  2025-11-25 14:47 ` ✓ Xe.CI.BAT: " Patchwork
@ 2025-11-25 17:42 ` Patchwork
  5 siblings, 0 replies; 39+ messages in thread
From: Patchwork @ 2025-11-25 17:42 UTC (permalink / raw)
  To: Upadhyay, Tejas; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 41295 bytes --]

== Series Details ==

Series: drm/xe/xe3p_lpg: L2 flush optimization
URL   : https://patchwork.freedesktop.org/series/158017/
State : success

== Summary ==

CI Bug Log - changes from xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8_FULL -> xe-pw-158017v1_FULL
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Participating hosts (4 -> 4)
------------------------------

  No changes in participating hosts

Known issues
------------

  Here are the changes found in xe-pw-158017v1_FULL that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_big_fb@4-tiled-addfb:
    - shard-adlp:         NOTRUN -> [SKIP][1] ([Intel XE#619]) +1 other test skip
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@kms_big_fb@4-tiled-addfb.html

  * igt@kms_big_fb@x-tiled-64bpp-rotate-0:
    - shard-adlp:         NOTRUN -> [DMESG-FAIL][2] ([Intel XE#4543]) +1 other test dmesg-fail
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_big_fb@x-tiled-64bpp-rotate-0.html

  * igt@kms_big_fb@x-tiled-8bpp-rotate-270:
    - shard-adlp:         NOTRUN -> [SKIP][3] ([Intel XE#316])
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_big_fb@x-tiled-8bpp-rotate-270.html

  * igt@kms_big_fb@yf-tiled-32bpp-rotate-180:
    - shard-bmg:          NOTRUN -> [SKIP][4] ([Intel XE#1124]) +1 other test skip
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_big_fb@yf-tiled-32bpp-rotate-180.html

  * igt@kms_big_fb@yf-tiled-addfb-size-overflow:
    - shard-adlp:         NOTRUN -> [SKIP][5] ([Intel XE#610])
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@kms_big_fb@yf-tiled-addfb-size-overflow.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-180-async-flip:
    - shard-adlp:         NOTRUN -> [SKIP][6] ([Intel XE#1124]) +3 other tests skip
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-180-async-flip.html

  * igt@kms_bw@connected-linear-tiling-3-displays-2160x1440p:
    - shard-adlp:         NOTRUN -> [SKIP][7] ([Intel XE#2191])
   [7]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@kms_bw@connected-linear-tiling-3-displays-2160x1440p.html

  * igt@kms_bw@linear-tiling-4-displays-2160x1440p:
    - shard-adlp:         NOTRUN -> [SKIP][8] ([Intel XE#367]) +2 other tests skip
   [8]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_bw@linear-tiling-4-displays-2160x1440p.html

  * igt@kms_ccs@bad-rotation-90-4-tiled-lnl-ccs:
    - shard-adlp:         NOTRUN -> [SKIP][9] ([Intel XE#2907])
   [9]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@kms_ccs@bad-rotation-90-4-tiled-lnl-ccs.html

  * igt@kms_ccs@ccs-on-another-bo-4-tiled-mtl-mc-ccs:
    - shard-bmg:          NOTRUN -> [SKIP][10] ([Intel XE#2887]) +1 other test skip
   [10]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_ccs@ccs-on-another-bo-4-tiled-mtl-mc-ccs.html

  * igt@kms_ccs@ccs-on-another-bo-4-tiled-mtl-rc-ccs:
    - shard-adlp:         NOTRUN -> [SKIP][11] ([Intel XE#455] / [Intel XE#787]) +19 other tests skip
   [11]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_ccs@ccs-on-another-bo-4-tiled-mtl-rc-ccs.html

  * igt@kms_ccs@crc-primary-rotation-180-yf-tiled-ccs@pipe-c-hdmi-a-1:
    - shard-adlp:         NOTRUN -> [SKIP][12] ([Intel XE#787]) +29 other tests skip
   [12]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-9/igt@kms_ccs@crc-primary-rotation-180-yf-tiled-ccs@pipe-c-hdmi-a-1.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc:
    - shard-dg2-set2:     [PASS][13] -> [INCOMPLETE][14] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#4345] / [Intel XE#6168])
   [13]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-dg2-464/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc.html
   [14]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-dg2-434/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-b-hdmi-a-6:
    - shard-dg2-set2:     [PASS][15] -> [INCOMPLETE][16] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#6168])
   [15]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-dg2-464/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-b-hdmi-a-6.html
   [16]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-dg2-434/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc@pipe-b-hdmi-a-6.html

  * igt@kms_chamelium_color@ctm-limited-range:
    - shard-adlp:         NOTRUN -> [SKIP][17] ([Intel XE#306])
   [17]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_chamelium_color@ctm-limited-range.html

  * igt@kms_chamelium_edid@dp-edid-resolution-list:
    - shard-adlp:         NOTRUN -> [SKIP][18] ([Intel XE#373]) +4 other tests skip
   [18]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-9/igt@kms_chamelium_edid@dp-edid-resolution-list.html

  * igt@kms_chamelium_edid@vga-edid-read:
    - shard-lnl:          NOTRUN -> [SKIP][19] ([Intel XE#373])
   [19]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_chamelium_edid@vga-edid-read.html

  * igt@kms_chamelium_hpd@dp-hpd-after-suspend:
    - shard-bmg:          NOTRUN -> [SKIP][20] ([Intel XE#2252])
   [20]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_chamelium_hpd@dp-hpd-after-suspend.html

  * igt@kms_content_protection@dp-mst-type-0:
    - shard-adlp:         NOTRUN -> [SKIP][21] ([Intel XE#307]) +1 other test skip
   [21]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@kms_content_protection@dp-mst-type-0.html

  * igt@kms_content_protection@legacy@pipe-a-dp-2:
    - shard-bmg:          NOTRUN -> [FAIL][22] ([Intel XE#1178])
   [22]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-4/igt@kms_content_protection@legacy@pipe-a-dp-2.html

  * igt@kms_cursor_crc@cursor-rapid-movement-512x170:
    - shard-bmg:          NOTRUN -> [SKIP][23] ([Intel XE#2321])
   [23]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_cursor_crc@cursor-rapid-movement-512x170.html

  * igt@kms_cursor_legacy@2x-long-nonblocking-modeset-vs-cursor-atomic:
    - shard-bmg:          NOTRUN -> [SKIP][24] ([Intel XE#2291])
   [24]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_cursor_legacy@2x-long-nonblocking-modeset-vs-cursor-atomic.html

  * igt@kms_cursor_legacy@cursorb-vs-flipa-atomic-transitions-varying-size:
    - shard-bmg:          [PASS][25] -> [SKIP][26] ([Intel XE#2291]) +4 other tests skip
   [25]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-7/igt@kms_cursor_legacy@cursorb-vs-flipa-atomic-transitions-varying-size.html
   [26]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_cursor_legacy@cursorb-vs-flipa-atomic-transitions-varying-size.html

  * igt@kms_dp_link_training@uhbr-sst:
    - shard-adlp:         NOTRUN -> [SKIP][27] ([Intel XE#4356])
   [27]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@kms_dp_link_training@uhbr-sst.html

  * igt@kms_dp_linktrain_fallback@dsc-fallback:
    - shard-adlp:         NOTRUN -> [SKIP][28] ([Intel XE#4331])
   [28]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_dp_linktrain_fallback@dsc-fallback.html

  * igt@kms_dsc@dsc-with-formats:
    - shard-lnl:          NOTRUN -> [SKIP][29] ([Intel XE#2244])
   [29]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_dsc@dsc-with-formats.html

  * igt@kms_feature_discovery@display-2x:
    - shard-bmg:          NOTRUN -> [SKIP][30] ([Intel XE#2373])
   [30]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_feature_discovery@display-2x.html

  * igt@kms_feature_discovery@psr2:
    - shard-adlp:         NOTRUN -> [SKIP][31] ([Intel XE#1135])
   [31]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_feature_discovery@psr2.html

  * igt@kms_flip@2x-flip-vs-dpms-off-vs-modeset:
    - shard-adlp:         NOTRUN -> [SKIP][32] ([Intel XE#310]) +5 other tests skip
   [32]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@kms_flip@2x-flip-vs-dpms-off-vs-modeset.html

  * igt@kms_flip@2x-plain-flip-interruptible:
    - shard-bmg:          [PASS][33] -> [SKIP][34] ([Intel XE#2316]) +1 other test skip
   [33]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-3/igt@kms_flip@2x-plain-flip-interruptible.html
   [34]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_flip@2x-plain-flip-interruptible.html

  * igt@kms_flip@flip-vs-expired-vblank@a-edp1:
    - shard-lnl:          [PASS][35] -> [FAIL][36] ([Intel XE#301])
   [35]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-lnl-4/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html
   [36]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-1/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-upscaling:
    - shard-adlp:         NOTRUN -> [SKIP][37] ([Intel XE#455]) +7 other tests skip
   [37]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-upscaling.html

  * igt@kms_frontbuffer_tracking@drrs-1p-offscreen-pri-indfb-draw-blt:
    - shard-bmg:          NOTRUN -> [SKIP][38] ([Intel XE#2311]) +2 other tests skip
   [38]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_frontbuffer_tracking@drrs-1p-offscreen-pri-indfb-draw-blt.html

  * igt@kms_frontbuffer_tracking@drrs-1p-primscrn-shrfb-msflip-blt:
    - shard-lnl:          NOTRUN -> [SKIP][39] ([Intel XE#651])
   [39]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_frontbuffer_tracking@drrs-1p-primscrn-shrfb-msflip-blt.html

  * igt@kms_frontbuffer_tracking@drrs-2p-primscrn-indfb-plflip-blt:
    - shard-lnl:          NOTRUN -> [SKIP][40] ([Intel XE#656]) +4 other tests skip
   [40]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_frontbuffer_tracking@drrs-2p-primscrn-indfb-plflip-blt.html

  * igt@kms_frontbuffer_tracking@fbc-1p-offscreen-pri-shrfb-draw-mmap-wc:
    - shard-bmg:          NOTRUN -> [SKIP][41] ([Intel XE#4141])
   [41]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_frontbuffer_tracking@fbc-1p-offscreen-pri-shrfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbc-tiling-4:
    - shard-adlp:         NOTRUN -> [SKIP][42] ([Intel XE#1151])
   [42]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@kms_frontbuffer_tracking@fbc-tiling-4.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-1p-primscrn-pri-shrfb-draw-blt:
    - shard-adlp:         NOTRUN -> [SKIP][43] ([Intel XE#651]) +7 other tests skip
   [43]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-9/igt@kms_frontbuffer_tracking@fbcdrrs-1p-primscrn-pri-shrfb-draw-blt.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-indfb-pgflip-blt:
    - shard-bmg:          NOTRUN -> [SKIP][44] ([Intel XE#2312])
   [44]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-indfb-pgflip-blt.html

  * igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-onoff:
    - shard-adlp:         NOTRUN -> [SKIP][45] ([Intel XE#653]) +5 other tests skip
   [45]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-onoff.html

  * igt@kms_frontbuffer_tracking@fbcpsr-shrfb-scaledprimary:
    - shard-bmg:          NOTRUN -> [SKIP][46] ([Intel XE#2313])
   [46]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_frontbuffer_tracking@fbcpsr-shrfb-scaledprimary.html

  * igt@kms_frontbuffer_tracking@fbcpsr-tiling-y:
    - shard-bmg:          NOTRUN -> [SKIP][47] ([Intel XE#2352])
   [47]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_frontbuffer_tracking@fbcpsr-tiling-y.html

  * igt@kms_frontbuffer_tracking@psr-2p-primscrn-cur-indfb-onoff:
    - shard-adlp:         NOTRUN -> [SKIP][48] ([Intel XE#656]) +21 other tests skip
   [48]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@kms_frontbuffer_tracking@psr-2p-primscrn-cur-indfb-onoff.html

  * igt@kms_multipipe_modeset@basic-max-pipe-crc-check:
    - shard-adlp:         NOTRUN -> [SKIP][49] ([Intel XE#356])
   [49]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@kms_multipipe_modeset@basic-max-pipe-crc-check.html

  * igt@kms_plane_lowres@tiling-yf:
    - shard-lnl:          NOTRUN -> [SKIP][50] ([Intel XE#599])
   [50]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_plane_lowres@tiling-yf.html

  * igt@kms_plane_scaling@2x-scaler-multi-pipe:
    - shard-adlp:         NOTRUN -> [SKIP][51] ([Intel XE#309]) +1 other test skip
   [51]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@kms_plane_scaling@2x-scaler-multi-pipe.html

  * igt@kms_pm_backlight@fade-with-dpms:
    - shard-adlp:         NOTRUN -> [SKIP][52] ([Intel XE#870])
   [52]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@kms_pm_backlight@fade-with-dpms.html

  * igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-fully-sf:
    - shard-lnl:          NOTRUN -> [SKIP][53] ([Intel XE#1406] / [Intel XE#2893] / [Intel XE#4608])
   [53]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-fully-sf.html

  * igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-fully-sf@pipe-b-edp-1:
    - shard-lnl:          NOTRUN -> [SKIP][54] ([Intel XE#1406] / [Intel XE#4608]) +1 other test skip
   [54]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_psr2_sf@fbc-psr2-cursor-plane-move-continuous-exceed-fully-sf@pipe-b-edp-1.html

  * igt@kms_psr2_sf@fbc-psr2-overlay-primary-update-sf-dmg-area:
    - shard-bmg:          NOTRUN -> [SKIP][55] ([Intel XE#1406] / [Intel XE#1489]) +1 other test skip
   [55]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_psr2_sf@fbc-psr2-overlay-primary-update-sf-dmg-area.html

  * igt@kms_psr2_sf@pr-overlay-plane-update-sf-dmg-area:
    - shard-adlp:         NOTRUN -> [SKIP][56] ([Intel XE#1406] / [Intel XE#1489]) +4 other tests skip
   [56]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@kms_psr2_sf@pr-overlay-plane-update-sf-dmg-area.html

  * igt@kms_psr2_su@page_flip-p010:
    - shard-adlp:         NOTRUN -> [SKIP][57] ([Intel XE#1122] / [Intel XE#1406] / [Intel XE#5580]) +1 other test skip
   [57]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@kms_psr2_su@page_flip-p010.html

  * igt@kms_psr@fbc-pr-suspend:
    - shard-lnl:          NOTRUN -> [SKIP][58] ([Intel XE#1406])
   [58]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@kms_psr@fbc-pr-suspend.html

  * igt@kms_psr@fbc-psr-primary-page-flip:
    - shard-adlp:         NOTRUN -> [SKIP][59] ([Intel XE#1406] / [Intel XE#2850] / [Intel XE#929]) +3 other tests skip
   [59]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@kms_psr@fbc-psr-primary-page-flip.html

  * igt@kms_rotation_crc@sprite-rotation-90:
    - shard-adlp:         NOTRUN -> [SKIP][60] ([Intel XE#3414])
   [60]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@kms_rotation_crc@sprite-rotation-90.html

  * igt@kms_setmode@clone-exclusive-crtc:
    - shard-bmg:          [PASS][61] -> [SKIP][62] ([Intel XE#1435])
   [61]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-3/igt@kms_setmode@clone-exclusive-crtc.html
   [62]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_setmode@clone-exclusive-crtc.html

  * igt@xe_ccs@suspend-resume:
    - shard-adlp:         NOTRUN -> [SKIP][63] ([Intel XE#455] / [Intel XE#488] / [Intel XE#5607])
   [63]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@xe_ccs@suspend-resume.html

  * igt@xe_compute@eu-busy-10s:
    - shard-adlp:         NOTRUN -> [SKIP][64] ([Intel XE#6599])
   [64]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@xe_compute@eu-busy-10s.html

  * igt@xe_copy_basic@mem-copy-linear-0xfffe:
    - shard-adlp:         NOTRUN -> [SKIP][65] ([Intel XE#1123])
   [65]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@xe_copy_basic@mem-copy-linear-0xfffe.html

  * igt@xe_eu_stall@blocking-re-enable:
    - shard-adlp:         NOTRUN -> [SKIP][66] ([Intel XE#5626])
   [66]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@xe_eu_stall@blocking-re-enable.html

  * igt@xe_eudebug@basic-vm-access-userptr-faultable:
    - shard-lnl:          NOTRUN -> [SKIP][67] ([Intel XE#4837]) +1 other test skip
   [67]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@xe_eudebug@basic-vm-access-userptr-faultable.html

  * igt@xe_eudebug@discovery-empty-clients:
    - shard-bmg:          NOTRUN -> [SKIP][68] ([Intel XE#4837])
   [68]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@xe_eudebug@discovery-empty-clients.html

  * igt@xe_eudebug_online@set-breakpoint-sigint-debugger:
    - shard-adlp:         NOTRUN -> [SKIP][69] ([Intel XE#4837] / [Intel XE#5565]) +6 other tests skip
   [69]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@xe_eudebug_online@set-breakpoint-sigint-debugger.html

  * igt@xe_eudebug_sriov@deny-eudebug:
    - shard-lnl:          NOTRUN -> [SKIP][70] ([Intel XE#4518])
   [70]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@xe_eudebug_sriov@deny-eudebug.html

  * igt@xe_evict@evict-beng-mixed-many-threads-small:
    - shard-bmg:          [PASS][71] -> [INCOMPLETE][72] ([Intel XE#6321] / [Intel XE#6606])
   [71]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-6/igt@xe_evict@evict-beng-mixed-many-threads-small.html
   [72]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-1/igt@xe_evict@evict-beng-mixed-many-threads-small.html

  * igt@xe_evict@evict-beng-mixed-threads-large:
    - shard-adlp:         NOTRUN -> [SKIP][73] ([Intel XE#261]) +2 other tests skip
   [73]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@xe_evict@evict-beng-mixed-threads-large.html

  * igt@xe_evict@evict-beng-small-multi-vm:
    - shard-adlp:         NOTRUN -> [SKIP][74] ([Intel XE#261] / [Intel XE#5564] / [Intel XE#688]) +2 other tests skip
   [74]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-9/igt@xe_evict@evict-beng-small-multi-vm.html

  * igt@xe_evict@evict-large-cm:
    - shard-adlp:         NOTRUN -> [SKIP][75] ([Intel XE#261] / [Intel XE#5564]) +1 other test skip
   [75]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@xe_evict@evict-large-cm.html

  * igt@xe_evict_ccs@evict-overcommit-parallel-instantfree-reopen:
    - shard-adlp:         NOTRUN -> [SKIP][76] ([Intel XE#688])
   [76]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@xe_evict_ccs@evict-overcommit-parallel-instantfree-reopen.html

  * igt@xe_exec_basic@multigpu-no-exec-basic-defer-mmap:
    - shard-lnl:          NOTRUN -> [SKIP][77] ([Intel XE#1392])
   [77]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@xe_exec_basic@multigpu-no-exec-basic-defer-mmap.html

  * igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-invalidate-race:
    - shard-bmg:          NOTRUN -> [SKIP][78] ([Intel XE#2322]) +1 other test skip
   [78]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-invalidate-race.html

  * igt@xe_exec_basic@multigpu-once-basic-defer-bind:
    - shard-adlp:         NOTRUN -> [SKIP][79] ([Intel XE#1392] / [Intel XE#5575]) +4 other tests skip
   [79]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@xe_exec_basic@multigpu-once-basic-defer-bind.html

  * igt@xe_exec_fault_mode@many-execqueues-bindexecqueue-userptr-prefetch:
    - shard-adlp:         NOTRUN -> [SKIP][80] ([Intel XE#288] / [Intel XE#5561]) +11 other tests skip
   [80]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@xe_exec_fault_mode@many-execqueues-bindexecqueue-userptr-prefetch.html

  * igt@xe_exec_system_allocator@once-mmap-new-race:
    - shard-adlp:         NOTRUN -> [SKIP][81] ([Intel XE#4915]) +184 other tests skip
   [81]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@xe_exec_system_allocator@once-mmap-new-race.html

  * igt@xe_exec_system_allocator@process-many-execqueues-mmap-new-huge-nomemset:
    - shard-lnl:          NOTRUN -> [SKIP][82] ([Intel XE#4943])
   [82]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@xe_exec_system_allocator@process-many-execqueues-mmap-new-huge-nomemset.html

  * igt@xe_fault_injection@probe-fail-guc-xe_guc_mmio_send_recv:
    - shard-lnl:          NOTRUN -> [ABORT][83] ([Intel XE#4757])
   [83]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-4/igt@xe_fault_injection@probe-fail-guc-xe_guc_mmio_send_recv.html

  * igt@xe_mmap@small-bar:
    - shard-bmg:          NOTRUN -> [SKIP][84] ([Intel XE#586])
   [84]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@xe_mmap@small-bar.html

  * igt@xe_oa@buffer-size:
    - shard-adlp:         NOTRUN -> [SKIP][85] ([Intel XE#6032])
   [85]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@xe_oa@buffer-size.html

  * igt@xe_oa@mmio-triggered-reports:
    - shard-adlp:         NOTRUN -> [SKIP][86] ([Intel XE#3573]) +5 other tests skip
   [86]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@xe_oa@mmio-triggered-reports.html

  * igt@xe_pat@display-vs-wb-transient:
    - shard-adlp:         NOTRUN -> [SKIP][87] ([Intel XE#1337] / [Intel XE#5572])
   [87]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-9/igt@xe_pat@display-vs-wb-transient.html

  * igt@xe_peer2peer@read:
    - shard-bmg:          NOTRUN -> [SKIP][88] ([Intel XE#2427])
   [88]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@xe_peer2peer@read.html

  * igt@xe_pm@d3cold-i2c:
    - shard-adlp:         NOTRUN -> [SKIP][89] ([Intel XE#5694])
   [89]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@xe_pm@d3cold-i2c.html

  * igt@xe_pm@d3cold-multiple-execs:
    - shard-adlp:         NOTRUN -> [SKIP][90] ([Intel XE#2284] / [Intel XE#366]) +1 other test skip
   [90]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-2/igt@xe_pm@d3cold-multiple-execs.html

  * igt@xe_pm@s2idle-vm-bind-unbind-all:
    - shard-adlp:         NOTRUN -> [ABORT][91] ([Intel XE#6675]) +4 other tests abort
   [91]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@xe_pm@s2idle-vm-bind-unbind-all.html

  * igt@xe_pm@s3-exec-after:
    - shard-bmg:          NOTRUN -> [ABORT][92] ([Intel XE#6675]) +1 other test abort
   [92]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@xe_pm@s3-exec-after.html

  * igt@xe_pm@s3-vm-bind-userptr:
    - shard-dg2-set2:     [PASS][93] -> [ABORT][94] ([Intel XE#6675])
   [93]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-dg2-434/igt@xe_pm@s3-vm-bind-userptr.html
   [94]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-dg2-466/igt@xe_pm@s3-vm-bind-userptr.html

  * igt@xe_pm_residency@cpg-basic:
    - shard-adlp:         [PASS][95] -> [ABORT][96] ([Intel XE#6675]) +8 other tests abort
   [95]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-adlp-6/igt@xe_pm_residency@cpg-basic.html
   [96]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-9/igt@xe_pm_residency@cpg-basic.html

  * igt@xe_pxp@pxp-stale-bo-bind-post-termination-irq:
    - shard-adlp:         NOTRUN -> [SKIP][97] ([Intel XE#4733] / [Intel XE#5594]) +1 other test skip
   [97]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-8/igt@xe_pxp@pxp-stale-bo-bind-post-termination-irq.html

  * igt@xe_query@multigpu-query-pxp-status:
    - shard-adlp:         NOTRUN -> [SKIP][98] ([Intel XE#944]) +2 other tests skip
   [98]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@xe_query@multigpu-query-pxp-status.html

  * igt@xe_vm@out-of-memory:
    - shard-adlp:         NOTRUN -> [SKIP][99] ([Intel XE#5745])
   [99]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@xe_vm@out-of-memory.html

  
#### Possible fixes ####

  * igt@kms_cursor_legacy@2x-flip-vs-cursor-atomic:
    - shard-bmg:          [SKIP][100] ([Intel XE#2291]) -> [PASS][101] +3 other tests pass
   [100]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-6/igt@kms_cursor_legacy@2x-flip-vs-cursor-atomic.html
   [101]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-1/igt@kms_cursor_legacy@2x-flip-vs-cursor-atomic.html

  * igt@kms_cursor_legacy@flip-vs-cursor-legacy:
    - shard-bmg:          [FAIL][102] ([Intel XE#4633]) -> [PASS][103]
   [102]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-6/igt@kms_cursor_legacy@flip-vs-cursor-legacy.html
   [103]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-5/igt@kms_cursor_legacy@flip-vs-cursor-legacy.html

  * igt@kms_flip@2x-plain-flip-ts-check-interruptible:
    - shard-bmg:          [SKIP][104] ([Intel XE#2316]) -> [PASS][105] +3 other tests pass
   [104]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-6/igt@kms_flip@2x-plain-flip-ts-check-interruptible.html
   [105]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-1/igt@kms_flip@2x-plain-flip-ts-check-interruptible.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1:
    - shard-lnl:          [FAIL][106] ([Intel XE#301]) -> [PASS][107] +2 other tests pass
   [106]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-lnl-2/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html
   [107]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-8/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html

  * igt@kms_frontbuffer_tracking@fbc-1p-rte:
    - shard-adlp:         [DMESG-WARN][108] ([Intel XE#2953] / [Intel XE#4173]) -> [PASS][109]
   [108]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-adlp-3/igt@kms_frontbuffer_tracking@fbc-1p-rte.html
   [109]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-3/igt@kms_frontbuffer_tracking@fbc-1p-rte.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-bmg:          [ABORT][110] -> [PASS][111]
   [110]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-5/igt@kms_hdr@bpc-switch-dpms.html
   [111]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_joiner@basic-force-big-joiner:
    - shard-bmg:          [SKIP][112] ([Intel XE#3012]) -> [PASS][113]
   [112]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-6/igt@kms_joiner@basic-force-big-joiner.html
   [113]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-1/igt@kms_joiner@basic-force-big-joiner.html

  * igt@kms_plane_cursor@viewport:
    - shard-bmg:          [INCOMPLETE][114] -> [PASS][115]
   [114]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-3/igt@kms_plane_cursor@viewport.html
   [115]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_plane_cursor@viewport.html

  * igt@xe_exec_system_allocator@pat-index-madvise-pat-idx-uc-single-vma:
    - shard-lnl:          [FAIL][116] ([Intel XE#5625]) -> [PASS][117]
   [116]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-lnl-3/igt@xe_exec_system_allocator@pat-index-madvise-pat-idx-uc-single-vma.html
   [117]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-lnl-5/igt@xe_exec_system_allocator@pat-index-madvise-pat-idx-uc-single-vma.html

  * igt@xe_pm@s3-vm-bind-prefetch:
    - shard-adlp:         [ABORT][118] ([Intel XE#6675]) -> [PASS][119] +4 other tests pass
   [118]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-adlp-8/igt@xe_pm@s3-vm-bind-prefetch.html
   [119]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-4/igt@xe_pm@s3-vm-bind-prefetch.html

  
#### Warnings ####

  * igt@kms_big_fb@y-tiled-8bpp-rotate-180:
    - shard-adlp:         [DMESG-FAIL][120] ([Intel XE#2953] / [Intel XE#4173] / [Intel XE#4543]) -> [DMESG-FAIL][121] ([Intel XE#4543])
   [120]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-adlp-4/igt@kms_big_fb@y-tiled-8bpp-rotate-180.html
   [121]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-adlp-1/igt@kms_big_fb@y-tiled-8bpp-rotate-180.html

  * igt@kms_content_protection@legacy:
    - shard-bmg:          [SKIP][122] ([Intel XE#2341]) -> [FAIL][123] ([Intel XE#1178])
   [122]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-2/igt@kms_content_protection@legacy.html
   [123]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-4/igt@kms_content_protection@legacy.html

  * igt@kms_frontbuffer_tracking@drrs-2p-pri-indfb-multidraw:
    - shard-bmg:          [SKIP][124] ([Intel XE#2311]) -> [SKIP][125] ([Intel XE#2312]) +4 other tests skip
   [124]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-4/igt@kms_frontbuffer_tracking@drrs-2p-pri-indfb-multidraw.html
   [125]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_frontbuffer_tracking@drrs-2p-pri-indfb-multidraw.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-mmap-wc:
    - shard-bmg:          [SKIP][126] ([Intel XE#2312]) -> [SKIP][127] ([Intel XE#4141]) +9 other tests skip
   [126]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-2/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-mmap-wc.html
   [127]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-4/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-render:
    - shard-bmg:          [SKIP][128] ([Intel XE#4141]) -> [SKIP][129] ([Intel XE#2312]) +1 other test skip
   [128]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-7/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-render.html
   [129]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-2/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc:
    - shard-bmg:          [SKIP][130] ([Intel XE#2312]) -> [SKIP][131] ([Intel XE#2311]) +9 other tests skip
   [130]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-6/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc.html
   [131]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-1/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff:
    - shard-bmg:          [SKIP][132] ([Intel XE#2312]) -> [SKIP][133] ([Intel XE#2313]) +11 other tests skip
   [132]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-2/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff.html
   [133]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-3/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff.html

  * igt@kms_frontbuffer_tracking@psr-2p-scndscrn-shrfb-plflip-blt:
    - shard-bmg:          [SKIP][134] ([Intel XE#2313]) -> [SKIP][135] ([Intel XE#2312]) +8 other tests skip
   [134]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8/shard-bmg-5/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-shrfb-plflip-blt.html
   [135]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/shard-bmg-6/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-shrfb-plflip-blt.html

  
  [Intel XE#1122]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1122
  [Intel XE#1123]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1123
  [Intel XE#1124]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1124
  [Intel XE#1135]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1135
  [Intel XE#1151]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1151
  [Intel XE#1178]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1178
  [Intel XE#1337]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1337
  [Intel XE#1392]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1392
  [Intel XE#1406]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1406
  [Intel XE#1435]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1435
  [Intel XE#1489]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1489
  [Intel XE#1727]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1727
  [Intel XE#2191]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2191
  [Intel XE#2244]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2244
  [Intel XE#2252]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2252
  [Intel XE#2284]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2284
  [Intel XE#2291]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2291
  [Intel XE#2311]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2311
  [Intel XE#2312]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2312
  [Intel XE#2313]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2313
  [Intel XE#2316]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2316
  [Intel XE#2321]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2321
  [Intel XE#2322]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2322
  [Intel XE#2341]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2341
  [Intel XE#2352]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2352
  [Intel XE#2373]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2373
  [Intel XE#2427]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2427
  [Intel XE#261]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/261
  [Intel XE#2850]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2850
  [Intel XE#288]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/288
  [Intel XE#2887]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2887
  [Intel XE#2893]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2893
  [Intel XE#2907]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2907
  [Intel XE#2953]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2953
  [Intel XE#301]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/301
  [Intel XE#3012]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3012
  [Intel XE#306]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/306
  [Intel XE#307]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/307
  [Intel XE#309]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/309
  [Intel XE#310]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/310
  [Intel XE#3113]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3113
  [Intel XE#316]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/316
  [Intel XE#3414]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3414
  [Intel XE#356]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/356
  [Intel XE#3573]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3573
  [Intel XE#366]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/366
  [Intel XE#367]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/367
  [Intel XE#373]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/373
  [Intel XE#4141]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4141
  [Intel XE#4173]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4173
  [Intel XE#4331]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4331
  [Intel XE#4345]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4345
  [Intel XE#4356]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4356
  [Intel XE#4518]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4518
  [Intel XE#4543]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4543
  [Intel XE#455]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/455
  [Intel XE#4608]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4608
  [Intel XE#4633]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4633
  [Intel XE#4733]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4733
  [Intel XE#4757]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4757
  [Intel XE#4837]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4837
  [Intel XE#488]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/488
  [Intel XE#4915]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4915
  [Intel XE#4943]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4943
  [Intel XE#5561]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5561
  [Intel XE#5564]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5564
  [Intel XE#5565]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5565
  [Intel XE#5572]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5572
  [Intel XE#5575]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5575
  [Intel XE#5580]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5580
  [Intel XE#5594]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5594
  [Intel XE#5607]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5607
  [Intel XE#5625]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5625
  [Intel XE#5626]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5626
  [Intel XE#5694]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5694
  [Intel XE#5745]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5745
  [Intel XE#586]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/586
  [Intel XE#599]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/599
  [Intel XE#6032]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6032
  [Intel XE#610]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/610
  [Intel XE#6168]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6168
  [Intel XE#619]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/619
  [Intel XE#6321]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6321
  [Intel XE#651]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/651
  [Intel XE#653]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/653
  [Intel XE#656]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/656
  [Intel XE#6599]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6599
  [Intel XE#6606]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6606
  [Intel XE#6675]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6675
  [Intel XE#688]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/688
  [Intel XE#787]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/787
  [Intel XE#870]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/870
  [Intel XE#929]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/929
  [Intel XE#944]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/944


Build changes
-------------

  * Linux: xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8 -> xe-pw-158017v1

  IGT_8637: 730ee3dfb26f8d7891fc240b0132a08c5bc7b949 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  xe-4143-c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8: c9ffb8a8ab1294c1870e017e0502cb8089d3d8b8
  xe-pw-158017v1: 158017v1

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-158017v1/index.html


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2025-11-25 15:31     ` Upadhyay, Tejas
@ 2025-11-26 10:26       ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2025-11-26 10:26 UTC (permalink / raw)
  To: Upadhyay, Tejas, intel-xe@lists.freedesktop.org, Auld, Matthew

On Tue, 2025-11-25 at 15:31 +0000, Upadhyay, Tejas wrote:
> 
> 
> > -----Original Message-----
> > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Sent: 25 November 2025 20:37
> > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > xe@lists.freedesktop.org
> > Cc: Auld, Matthew <matthew.auld@intel.com>
> > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > cachelines manually
> > 
> > Hi.
> > 
> > 
> > On Tue, 2025-11-25 at 15:13 +0530, Tejas Upadhyay wrote:
> > > Starting with NVL, HW will flush cachelines marked with XA only
> > > when media is off. We have a few cases where the kernel will have
> > > non-XA cachelines which need a manual flush, as we postpone the
> > > invalidation. Flush ASAP: from a correctness POV, to ensure a
> > > non-accelerated CPU copy to a swap/shmem file sees a coherent
> > > view of memory, but also from a security POV, so that a later
> > > flush can't corrupt the next user of those pages.
> > > 
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > 
> > I had a number of concerns last time this patch was sent to the
> > list, none of
> > which seems to have been addressed?
> 
> Sorry for missing to address your comments.
> 
> > 
> > https://lore.kernel.org/intel-
> > xe/d2517d66f571e11a760cb143981b7ca238f5cd58.camel@linux.intel.com/
> > 
> > The main concern is that the code indicates that not all GPU caches
> > are flushed
> > when all fences are signalled (bo / userptr idle)?
> 
> Xe3p introduces a feature whereby, when media is off, only XA-marked
> BOs are flushed rather than the whole cache. From the UMD perspective
> we might have non-XA buffers created which we would like to flush
> before the buffer/user goes away while media is off.

So for non-XA buffers, how would coherency be maintained for
gpu_write() -> cpu_read() from UMD's perspective? For dma-buf?

Also flushing in move_notify() can't be done unless the bo is idle
first, and that would force us to unnecessarily synchronize.

/Thomas


> 
> Tejas
>  
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > 
> > > ---
> > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > >  drivers/gpu/drm/xe/xe_device.c  | 20 ++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> > >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > >  4 files changed, 25 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > b/drivers/gpu/drm/xe/xe_bo.c
> > > index 465cf9fc7ce9..97e1e9d40e96 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct
> > > xe_device
> > > *xe, struct xe_bo *bo,
> > > 
> > >  		if (!xe_vm_in_fault_mode(vm)) {
> > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > -			continue;
> > > +			if (!xe_device_needs_cache_flush(xe))
> > > +				continue;
> > >  		}
> > > 
> > >  		if (!idle) {
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > b/drivers/gpu/drm/xe/xe_device.c index 92f883dd8877..6e8335b493e8
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -1079,6 +1079,26 @@ void xe_device_l2_flush(struct xe_device
> > > *xe)
> > >  	spin_unlock(&gt->global_invl_lock);
> > >  }
> > > 
> > > +/**
> > > + * xe_device_needs_cache_flush - Whether the cache needs to be
> > > flushed
> > > + * @xe: The device to check.
> > > + *
> > > + * Return: true if the device needs cache flush, false
> > > otherwise.
> > > + */
> > > +bool xe_device_needs_cache_flush(struct xe_device *xe) {
> > > +	/*
> > > +	 * Starting NVL, HW will flush cachelines marked with XA
> > > only when media is off. We have
> > > +	 * few cases where kernel will have non-XA cachelines
> > > which
> > > needs manual flush and this is
> > > +	 * one of them as we postpone the invalidation. Flush
> > > asap
> > > from correctness POV to ensure
> > > +	 * non accelerated CPU copy to swap/shmem file will see
> > > coherent view of memory, but also
> > > +	 * from security POV where later flush can't corrupt the
> > > next user of those pages.
> > > +	 */
> > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > +		return true;
> > > +	return false;
> > > +}
> > > +
> > >  /**
> > >   * xe_device_td_flush() - Flush transient L3 cache entries
> > >   * @xe: The device
> > > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > > b/drivers/gpu/drm/xe/xe_device.h index 32cc6323b7f6..15e67db44b56
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > @@ -179,6 +179,7 @@ void xe_device_snapshot_print(struct
> > > xe_device
> > > *xe, struct drm_printer *p);
> > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64
> > > address);
> > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
> > > address);
> > > 
> > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > >  void xe_device_td_flush(struct xe_device *xe);
> > >  void xe_device_l2_flush(struct xe_device *xe);
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > > b/drivers/gpu/drm/xe/xe_userptr.c index
> > > 0d9130b1958a..a93c7e887cca
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct
> > > xe_vm
> > > *vm, struct xe_userptr_vma *uv
> > >  				    false,
> > > MAX_SCHEDULE_TIMEOUT);
> > >  	XE_WARN_ON(err <= 0);
> > > 
> > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > +	if ((xe_vm_in_fault_mode(vm) ||
> > > xe_device_needs_cache_flush(vm->xe)) &&
> > > +	    userptr->initial_bind) {
> > >  		err = xe_vm_invalidate_vma(vma);
> > >  		XE_WARN_ON(err);
> > >  	}
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-10 12:51 [PATCH 0/3] " Tejas Upadhyay
@ 2026-02-10 12:51 ` Tejas Upadhyay
  2026-02-10 21:05   ` Matt Roper
  0 siblings, 1 reply; 39+ messages in thread
From: Tejas Upadhyay @ 2026-02-10 12:51 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, thomas.hellstrom, Tejas Upadhyay

"eXtended Architecture" (XA) tagged memory (memory shared between the CPU
and GPU) is treated differently from other GPU memory when the Media
engine is power-gated.

XA is *always* flushed, for example at end-of-submission (and possibly
other points). Internally, as an optimisation, the HW does not need to
make that a full flush (which would also include XA) when Media is
off/powergated: it no longer has to worry about GT caches vs Media
coherency, only CPU vs GPU coherency, so it can make the flush a
targeted XA flush, since anything tagged XA is by definition shared
with the CPU. The main implication is that we now need to flush non-XA
cachelines before freeing system memory pages, otherwise dirty
cachelines could be flushed after the free (for example if Media
suddenly turns on and does a full flush).

V2(MattA): Expand commit description

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
 drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_device.h  |  1 +
 drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index e9180b01a4e4..4455886b211e 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
 
 		if (!xe_vm_in_fault_mode(vm)) {
 			drm_gpuvm_bo_evict(vm_bo, true);
-			continue;
+			if (!xe_device_needs_cache_flush(xe))
+				continue;
 		}
 
 		if (!idle) {
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 743c18e0c580..da2abed94bc0 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
 	}
 }
 
+/**
+ * xe_device_needs_cache_flush - Whether the cache needs to be flushed
+ * @xe: The device to check.
+ *
+ * Return: true if the device needs cache flush, false otherwise.
+ */
+bool xe_device_needs_cache_flush(struct xe_device *xe)
+{
+	/* XA is *always* flushed, like at the end-of-submission (and maybe other
+	 * places), just that internally as an optimisation hw doesn't need to make
+	 * that a full flush (which will also include XA) when Media is
+	 * off/powergated, since it doesn't need to worry about GT caches vs Media
+	 * coherency, and only CPU vs GPU coherency, so can make that flush a
+	 * targeted XA flush, since stuff tagged with XA now means it's shared with
+	 * the CPU. The main implication is that we now need to somehow flush non-XA before
+	 * freeing system memory pages, otherwise dirty cachelines could be flushed after the free
+	 * (like if Media suddenly turns on and does a full flush)
+	 */
+	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
+		return true;
+	return false;
+}
+
 void xe_device_l2_flush(struct xe_device *xe)
 {
 	struct xe_gt *gt;
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 39464650533b..baf386e0e037 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
 u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
 u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
 
+bool xe_device_needs_cache_flush(struct xe_device *xe);
 void xe_device_td_flush(struct xe_device *xe);
 void xe_device_l2_flush(struct xe_device *xe);
 
diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
index e120323c43bc..b435ea7f9b66 100644
--- a/drivers/gpu/drm/xe/xe_userptr.c
+++ b/drivers/gpu/drm/xe/xe_userptr.c
@@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
 				    false, MAX_SCHEDULE_TIMEOUT);
 	XE_WARN_ON(err <= 0);
 
-	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
+	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
+	    userptr->initial_bind) {
 		err = xe_vm_invalidate_vma(vma);
 		XE_WARN_ON(err);
 	}
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread
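
The platform gate added by this patch is simple enough to restate on its own. Below is a minimal standalone sketch of the same predicate; the `xe_dev` struct and its two fields are simplified stand-ins for the real `struct xe_device` accessors (`GRAPHICS_VER()` / `IS_DGFX()`), not actual driver types:

```c
#include <stdbool.h>

/* Simplified stand-ins for the real xe_device accessors (assumptions). */
struct xe_dev {
	int graphics_ver;	/* e.g. 35 for Xe3p */
	bool is_dgfx;		/* discrete GPU? */
};

/*
 * Mirrors the logic of xe_device_needs_cache_flush() in the patch: only
 * integrated Xe3p+ parts need the manual non-transient cacheline flush,
 * because only there can the hardware downgrade a full device-cache
 * flush to a transient-only (XA) flush while media is power-gated.
 */
static bool needs_cache_flush(const struct xe_dev *xe)
{
	return xe->graphics_ver >= 35 && !xe->is_dgfx;
}
```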

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-10 12:51 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
@ 2026-02-10 21:05   ` Matt Roper
  2026-02-11  0:02     ` Matthew Brost
  0 siblings, 1 reply; 39+ messages in thread
From: Matt Roper @ 2026-02-10 21:05 UTC (permalink / raw)
  To: Tejas Upadhyay; +Cc: intel-xe, matthew.auld, thomas.hellstrom

On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> "eXtended Architecture" (XA) tagged memory—memory shared between the CPU and GPU

I'm pretty sure this expansion of "XA" is wrong; where are you seeing
this definition?  Everything in the bspec indicates that XA means "wb -
transient app" (similar to how "XD" is "wb - transient display").  I'm
not sure why exactly they picked "X" to refer to transient in both of
these cases, but I've never seen any documentation that refers to it as
"extended."

> is treated differently from other GPU memory when the Media engine is power-gated.
> 
> XA is *always* flushed, like at the end-of-submssion (and maybe other

I assume you're referring to the fact that the driver performs flushes
at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
depending on other state/optimizations in the system, those flushes may
flush the entire device cache, or may only flush the subset of cache
data that is not marked as transient.  The way you worded this was
confusing since it makes it sound like cache flushes happen
automatically somewhere in hardware/firmware.

> places), just that internally as an optimisation hw doesn't need to make
> that a full flush (which will also include XA) when Media is
> off/powergated, since it doesn't need to worry about GT caches vs Media
> coherency, and only CPU vs GPU coherency, so can make that flush a
> targeted XA flush, since stuff tagged with XA now means it's shared with
> the CPU. The main implication is that we now need to somehow flush non-XA before
> freeing system memory pages, otherwise dirty cachelines could be flushed after the free
> (like if Media suddenly turns on and does a full flush)

This description seems really confusing.  My understanding is that
marking something as wb-transient-app indicates that it might be
accessed by something other than our graphics/media IP (i.e., accessed
from the CPU, exported to another device, etc.), so transient data truly
does need to be flushed at the points in the driver where a flush
typically happens.

However when something is _not_ transient, then either:
 - it's "private" to the GPU and only our graphics/media IP will be
   accessing it
 - it's bound with a coherent PAT index so that outside observers like
   the CPU can snoop the device cache, even when the cache hasn't been
   flushed

If media is not active, then there's really no need to include
non-transient data when a device cache flush happens since there's no
real need for the data to get to RAM.  So that enables an optimization
(which comes in your next patch), that allows flushes to only operate on
the subset of the device cache tagged as "transient" if media is idle.
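
The optimization being described can be restated as a tiny decision table. The sketch below is purely illustrative; the enum and the `media_active` flag are assumptions for exposition, not a real hardware or GuC interface:

```c
#include <stdbool.h>

enum flush_scope {
	FLUSH_TRANSIENT_ONLY,	/* only XA/XD-tagged cachelines written back */
	FLUSH_FULL,		/* entire device cache written back */
};

/*
 * Illustrative model of the optimization: while media is power-gated,
 * non-transient lines are either GPU-private or mapped with a coherent
 * PAT (the CPU snoops them), so a device-cache flush only needs to
 * cover the transient subset.  Once media is active again, coherency
 * between the GT and media IPs requires a full flush.
 */
static enum flush_scope pick_flush_scope(bool media_active)
{
	return media_active ? FLUSH_FULL : FLUSH_TRANSIENT_ONLY;
}
```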

As you said, we eventually do want to force a flush of the non-transient
data as well once we're freeing the underlying pages.  So how do we do
that?  It's not clear to me how the changes below are accomplishing
that.  Is there a way to explicitly request a full device cache flush
(ignoring the transient vs non-transient tagging)?  Since the GuC
handles the optimization in the next patch (toggling whether flushes are
full flushes vs non-transient flushes depending on whether media is
active), I thought there might be some kind of GuC interface to request
"please do one full flush now, even if media is idle."


Matt

> 
> V2(MattA): Expand commit description
> 
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_device.h  |  1 +
>  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>  4 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index e9180b01a4e4..4455886b211e 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>  
>  		if (!xe_vm_in_fault_mode(vm)) {
>  			drm_gpuvm_bo_evict(vm_bo, true);
> -			continue;
> +			if (!xe_device_needs_cache_flush(xe))
> +				continue;
>  		}
>  
>  		if (!idle) {
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 743c18e0c580..da2abed94bc0 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>  	}
>  }
>  
> +/**
> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> + * @xe: The device to check.
> + *
> + * Return: true if the device needs cache flush, false otherwise.
> + */
> +bool xe_device_needs_cache_flush(struct xe_device *xe)
> +{
> +	/* XA is *always* flushed, like at the end-of-submssion (and maybe other
> +	 * places), just that internally as an optimisation hw doesn't need to make
> +	 * that a full flush (which will also include XA) when Media is
> +	 * off/powergated, since it doesn't need to worry about GT caches vs Media
> +	 * coherency, and only CPU vs GPU coherency, so can make that flush a
> +	 * targeted XA flush, since stuff tagged with XA now means it's shared with
> +	 * the CPU. The main implication is that we now need to somehow flush non-XA before
> +	 * freeing system memory pages, otherwise dirty cachelines could be flushed after the free
> +	 * (like if Media suddenly turns on and does a full flush)
> +	 */
> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> +		return true;
> +	return false;
> +}
> +
>  void xe_device_l2_flush(struct xe_device *xe)
>  {
>  	struct xe_gt *gt;
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 39464650533b..baf386e0e037 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>  
> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>  void xe_device_td_flush(struct xe_device *xe);
>  void xe_device_l2_flush(struct xe_device *xe);
>  
> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> index e120323c43bc..b435ea7f9b66 100644
> --- a/drivers/gpu/drm/xe/xe_userptr.c
> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>  				    false, MAX_SCHEDULE_TIMEOUT);
>  	XE_WARN_ON(err <= 0);
>  
> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> +	    userptr->initial_bind) {
>  		err = xe_vm_invalidate_vma(vma);
>  		XE_WARN_ON(err);
>  	}
> -- 
> 2.52.0
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-10 21:05   ` Matt Roper
@ 2026-02-11  0:02     ` Matthew Brost
  2026-02-11 19:06       ` Upadhyay, Tejas
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Brost @ 2026-02-11  0:02 UTC (permalink / raw)
  To: Matt Roper; +Cc: Tejas Upadhyay, intel-xe, matthew.auld, thomas.hellstrom

On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> > "eXtended Architecture" (XA) tagged memory—memory shared between the CPU and GPU
> 
> I'm pretty sure this expansion of "XA" is wrong; where are you seeing
> this definition?  Everything in the bspec indicates that XA means "wb -
> transient app" (similar to how "XD" is 'wb - transient display").  I'm
> not sure why exactly they picked "X" to refer to transient in both of
> these cases, but I've never seen any documentation that refers to it as
> "extended."
> 
> > is treated differently from other GPU memory when the Media engine is power-gated.
> > 
> > XA is *always* flushed, like at the end-of-submssion (and maybe other
> 
> I assume you're referring to the fact that the driver performs flushes
> at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
> depending on other state/optimizations in the system, those flushes may
> flush the entire device cache, or may only flush the subset of cache
> data that is not marked as transient.  The way you worded this was
> confusing since it makes it sound like cache flushes happen
> automatically somewhere in hardware/firmware.
> 
> > places), just that internally as an optimisation hw doesn't need to make
> > that a full flush (which will also include XA) when Media is
> > off/powergated, since it doesn't need to worry about GT caches vs Media
> > coherency, and only CPU vs GPU coherency, so can make that flush a
> > targeted XA flush, since stuff tagged with XA now means it's shared with
> > the CPU. The main implication is that we now need to somehow flush non-XA before
> > freeing system memory pages, otherwise dirty cachelines could be flushed after the free
> > (like if Media suddenly turns on and does a full flush)
> 
> This description seems really confusing.  My understanding is that
> marking something as wb-transient-app indicates that it might be
> accessed by something other than our graphics/media IP (i.e., accessed
> from the CPU, exported to another device, etc.), so transient data truly
> does need to be flushed at the points in the driver where a flush
> typically happens.
> 
> However when something is _not_ transient, then either:
>  - it's "private" to the GPU and only our graphics/media IP will be
>    accessing it
>  - it's bound with a coherent PAT index so that outside observers like
>    the CPU can snoop the device cache, even when the cache hasn't been
>    flushed
> 
> If media is not active, then there's really no need to include
> non-transient data when an device cache flush happens since there's no
> real need for the data to get to RAM.  So that enables an optimization
> (which comes in your next patch), that allows flushes to only operate on
> the subset of the device cache tagged as "transient" if media is idle.
> 
> As you said, we eventually do want to force a flush of the non-transient
> data as well once we're freeing the underlying pages.  So how do we do
> that?  It's not clear to me how the changes below are accomplishing
> that.  Is there a way to explicitly request a full device cache flush
> (ignoring the transient vs non-transient tagging)?  Since the GuC
> handles the optimization in the next patch (toggling whether flushes are
> full flushes vs non-transient flushes depending on whether media is
> active), I thought there might be some kind of GuC interface to request
> "please do one full flush now, even if media is idle."
> 

I’m not an expert here by any means, but everything above from Matt
seems like valid concerns. Thomas also raised some concerns in the two
previous revisions; again I’m not an expert, but reading through
those, it doesn’t really seem like he received proper answers to his
questions.

A couple of comments below.

> 
> Matt
> 
> > 
> > V2(MattA): Expand commit description
> > 
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> >  4 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index e9180b01a4e4..4455886b211e 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> >  
> >  		if (!xe_vm_in_fault_mode(vm)) {
> >  			drm_gpuvm_bo_evict(vm_bo, true);
> > -			continue;
> > +			if (!xe_device_needs_cache_flush(xe))
> > +				continue;

This will trigger a TLB invalidation (and I assume a cache flush) every
time we move or free memory in the 3D stack if it has a binding. It also
performs a synchronous wait on the BO being idle. Both of these are very
expensive operations. I can’t imagine the granularity we want here is to
do this on every move/free with bindings.

Also, for LR compute with preempt fences, we would trigger the preempt
fences during the wait, so a TLB invalidation after this seems
unnecessary, though perhaps the cache flush is still required?

I think this needs a bit more explanation, because without knowing a lot
about the exact requirements, the implementation does not look correct.

> >  		}
> >  
> >  		if (!idle) {
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 743c18e0c580..da2abed94bc0 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> >  	}
> >  }
> >  
> > +/**
> > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > + * @xe: The device to check.
> > + *
> > + * Return: true if the device needs cache flush, false otherwise.
> > + */
> > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > +{
> > +	/* XA is *always* flushed, like at the end-of-submssion (and maybe other
> > +	 * places), just that internally as an optimisation hw doesn't need to make
> > +	 * that a full flush (which will also include XA) when Media is
> > +	 * off/powergated, since it doesn't need to worry about GT caches vs Media
> > +	 * coherency, and only CPU vs GPU coherency, so can make that flush a
> > +	 * targeted XA flush, since stuff tagged with XA now means it's shared with
> > +	 * the CPU. The main implication is that we now need to somehow flush non-XA before
> > +	 * freeing system memory pages, otherwise dirty cachelines could be flushed after the free
> > +	 * (like if Media suddenly turns on and does a full flush)
> > +	 */
> > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > +		return true;
> > +	return false;
> > +}
> > +
> >  void xe_device_l2_flush(struct xe_device *xe)
> >  {
> >  	struct xe_gt *gt;
> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > index 39464650533b..baf386e0e037 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> >  
> > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> >  void xe_device_td_flush(struct xe_device *xe);
> >  void xe_device_l2_flush(struct xe_device *xe);
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > index e120323c43bc..b435ea7f9b66 100644
> > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> >  				    false, MAX_SCHEDULE_TIMEOUT);
> >  	XE_WARN_ON(err <= 0);
> >  
> > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > +	    userptr->initial_bind) {

Same concern with the LR preempt fence as above — the hardware will be
interrupted via preempt fences, so it doesn’t seem necessary to
invalidate the TLBs but perhaps we need a cflush and TLB invalidation
is the mechanism for that too?

Matt

> >  		err = xe_vm_invalidate_vma(vma);
> >  		XE_WARN_ON(err);
> >  	}
> > -- 
> > 2.52.0
> > 
> 
> -- 
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-11  0:02     ` Matthew Brost
@ 2026-02-11 19:06       ` Upadhyay, Tejas
  2026-02-11 21:11         ` Matt Roper
  0 siblings, 1 reply; 39+ messages in thread
From: Upadhyay, Tejas @ 2026-02-11 19:06 UTC (permalink / raw)
  To: Brost, Matthew, Roper, Matthew D
  Cc: intel-xe@lists.freedesktop.org, Auld,  Matthew,
	thomas.hellstrom@linux.intel.com



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: 11 February 2026 05:32
> To: Roper, Matthew D <matthew.d.roper@intel.com>
> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> thomas.hellstrom@linux.intel.com
> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> cachelines manually
> 
> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> > > "eXtended Architecture" (XA) tagged memory—memory shared between
> the
> > > CPU and GPU
> >
> > I'm pretty sure this expansion of "XA" is wrong; where are you seeing
> > this definition?  Everything in the bspec indicates that XA means "wb
> > - transient app" (similar to how "XD" is 'wb - transient display").
> > I'm not sure why exactly they picked "X" to refer to transient in both
> > of these cases, but I've never seen any documentation that refers to
> > it as "extended."
> >
> > > is treated differently from other GPU memory when the Media engine is
> power-gated.
> > >
> > > XA is *always* flushed, like at the end-of-submssion (and maybe
> > > other
> >
> > I assume you're referring to the fact that the driver performs flushes
> > at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
> > depending on other state/optimizations in the system, those flushes
> > may flush the entire device cache, or may only flush the subset of
> > cache data that is not marked as transient.  The way you worded this
> > was confusing since it makes it sound like cache flushes happen
> > automatically somewhere in hardware/firmware.
> >
> > > places), just that internally as an optimisation hw doesn't need to
> > > make that a full flush (which will also include XA) when Media is
> > > off/powergated, since it doesn't need to worry about GT caches vs
> > > Media coherency, and only CPU vs GPU coherency, so can make that
> > > flush a targeted XA flush, since stuff tagged with XA now means it's
> > > shared with the CPU. The main implication is that we now need to
> > > somehow flush non-XA before freeing system memory pages, otherwise
> > > dirty cachelines could be flushed after the free (like if Media
> > > suddenly turns on and does a full flush)
> >
> > This description seems really confusing.  My understanding is that
> > marking something as wb-transient-app indicates that it might be
> > accessed by something other than our graphics/media IP (i.e., accessed
> > from the CPU, exported to another device, etc.), so transient data
> > truly does need to be flushed at the points in the driver where a
> > flush typically happens.
> >
> > However when something is _not_ transient, then either:
> >  - it's "private" to the GPU and only our graphics/media IP will be
> >    accessing it
> >  - it's bound with a coherent PAT index so that outside observers like
> >    the CPU can snoop the device cache, even when the cache hasn't been
> >    flushed
> >
> > If media is not active, then there's really no need to include
> > non-transient data when an device cache flush happens since there's no
> > real need for the data to get to RAM.  So that enables an optimization
> > (which comes in your next patch), that allows flushes to only operate
> > on the subset of the device cache tagged as "transient" if media is idle.

But what if we have stale non-XA-marked pages for a userptr, and that object moves out at the same time media comes back active? We will end up with a full flush that writes the stale entries back to RAM.

> >
> > As you said, we eventually do want to force a flush of the
> > non-transient data as well once we're freeing the underlying pages.
> > So how do we do that?  It's not clear to me how the changes below are
> > accomplishing that.  Is there a way to explicitly request a full
> > device cache flush (ignoring the transient vs non-transient tagging)?
> > Since the GuC handles the optimization in the next patch (toggling
> > whether flushes are full flushes vs non-transient flushes depending on
> > whether media is active), I thought there might be some kind of GuC
> > interface to request "please do one full flush now, even if media is idle."
> >
> 
> I’m not an expert here by any means, but everything above from Matt seems
> like valid concerns. Thomas also raised some concerns in the two previous
> revisions; again I’m not an expert, but reading through those, it doesn’t really
> seem like he received proper answers to his questions.

It's forcing the flush via the TLB invalidation PPC flag under xe_vm_invalidate_vma().

> 
> A couple of comments below.
> 
> >
> > Matt
> >
> > >
> > > V2(MattA): Expand commit description
> > >
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > index e9180b01a4e4..4455886b211e 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device
> > > *xe, struct xe_bo *bo,
> > >
> > >  		if (!xe_vm_in_fault_mode(vm)) {
> > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > -			continue;
> > > +			if (!xe_device_needs_cache_flush(xe))
> > > +				continue;
> 
> This will trigger a TLB invalidation (and I assume a cache flush) every time we
> move or free memory in the 3D stack if it has a binding. It also performs a
> synchronous wait on the BO being idle. Both of these are very expensive
> operations. I can’t imagine the granularity we want here is to do this on every
> move/free with bindings.
> 
> Also, for LR compute with preempt fences, we would trigger the preempt
> fences during the wait, so a TLB invalidation after this seems unnecessary,
> though perhaps the cache flush is still required?
> 
> I think this needs a bit more explanation, because without knowing a lot about
> the exact requirements, the implementation does not look correct.

The thing is that we are trying to solve a problem with userptr using a non-XA PAT: consider a BO that gets moved while media is not active. As soon as media comes back active, stale cached entries of that object will be written back as part of the full flush, which may corrupt things.
There was a thought that this patch would at least solve the corruption problem, and that the page-reclamation feature, when it landed later, would help performance as well. But now that page reclamation has been merged first, and it is tightly coupled with bind/unbind, some cases like those discussed above (which do not unbind immediately on move/free) are missed by reclamation.

So the thought was to let this solution go in with a small perf hit, and to work with the page-reclamation owner on a cleaner solution together.

Tejas
> 
> > >  		}
> > >
> > >  		if (!idle) {
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > b/drivers/gpu/drm/xe/xe_device.c index 743c18e0c580..da2abed94bc0
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device
> *xe)
> > >  	}
> > >  }
> > >
> > > +/**
> > > + * xe_device_needs_cache_flush - Whether the cache needs to be
> > > +flushed
> > > + * @xe: The device to check.
> > > + *
> > > + * Return: true if the device needs cache flush, false otherwise.
> > > + */
> > > +bool xe_device_needs_cache_flush(struct xe_device *xe) {
> > > +	/* XA is *always* flushed, like at the end-of-submssion (and maybe
> other
> > > +	 * places), just that internally as an optimisation hw doesn't need to
> make
> > > +	 * that a full flush (which will also include XA) when Media is
> > > +	 * off/powergated, since it doesn't need to worry about GT caches vs
> Media
> > > +	 * coherency, and only CPU vs GPU coherency, so can make that flush
> a
> > > +	 * targeted XA flush, since stuff tagged with XA now means it's shared
> with
> > > +	 * the CPU. The main implication is that we now need to somehow
> flush non-XA before
> > > +	 * freeing system memory pages, otherwise dirty cachelines could be
> flushed after the free
> > > +	 * (like if Media suddenly turns on and does a full flush)
> > > +	 */
> > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > +		return true;
> > > +	return false;
> > > +}
> > > +
> > >  void xe_device_l2_flush(struct xe_device *xe)  {
> > >  	struct xe_gt *gt;
> > > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > > b/drivers/gpu/drm/xe/xe_device.h index 39464650533b..baf386e0e037
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device
> > > *xe, struct drm_printer *p);
> > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
> > > address);
> > >
> > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > >  void xe_device_td_flush(struct xe_device *xe);  void
> > > xe_device_l2_flush(struct xe_device *xe);
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > > b/drivers/gpu/drm/xe/xe_userptr.c index e120323c43bc..b435ea7f9b66
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm
> *vm, struct xe_userptr_vma *uv
> > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > >  	XE_WARN_ON(err <= 0);
> > >
> > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm-
> >xe)) &&
> > > +	    userptr->initial_bind) {
> 
> Same concern with the LR preempt fence as above — the hardware will be
> interrupted via preempt fences, so it doesn’t seem necessary to invalidate the
> TLBs but perhaps we need a cflush and TLB invalidation is the mechanism for
> that too?
> 
> Matt
> 
> > >  		err = xe_vm_invalidate_vma(vma);
> > >  		XE_WARN_ON(err);
> > >  	}
> > > --
> > > 2.52.0
> > >
> >
> > --
> > Matt Roper
> > Graphics Software Engineer
> > Linux GPU Platform Enablement
> > Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-11 19:06       ` Upadhyay, Tejas
@ 2026-02-11 21:11         ` Matt Roper
  2026-02-12  9:53           ` Matthew Auld
                             ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Matt Roper @ 2026-02-11 21:11 UTC (permalink / raw)
  To: Upadhyay, Tejas
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org, Auld, Matthew,
	thomas.hellstrom@linux.intel.com

On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: 11 February 2026 05:32
> > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> > thomas.hellstrom@linux.intel.com
> > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > cachelines manually
> > 
> > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> > > > "eXtended Architecture" (XA) tagged memory—memory shared between
> > the
> > > > CPU and GPU
> > >
> > > I'm pretty sure this expansion of "XA" is wrong; where are you seeing
> > > this definition?  Everything in the bspec indicates that XA means "wb
> > > - transient app" (similar to how "XD" is 'wb - transient display").
> > > I'm not sure why exactly they picked "X" to refer to transient in both
> > > of these cases, but I've never seen any documentation that refers to
> > > it as "extended."
> > >
> > > > is treated differently from other GPU memory when the Media engine is
> > power-gated.
> > > >
> > > > XA is *always* flushed, like at the end-of-submssion (and maybe
> > > > other
> > >
> > > I assume you're referring to the fact that the driver performs flushes
> > > at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
> > > depending on other state/optimizations in the system, those flushes
> > > may flush the entire device cache, or may only flush the subset of
> > > cache data that is not marked as transient.  The way you worded this
> > > was confusing since it makes it sound like cache flushes happen
> > > automatically somewhere in hardware/firmware.
> > >
> > > > places), just that internally as an optimisation hw doesn't need to
> > > > make that a full flush (which will also include XA) when Media is
> > > > off/powergated, since it doesn't need to worry about GT caches vs
> > > > Media coherency, and only CPU vs GPU coherency, so can make that
> > > > flush a targeted XA flush, since stuff tagged with XA now means it's
> > > > shared with the CPU. The main implication is that we now need to
> > > > somehow flush non-XA before freeing system memory pages, otherwise
> > > > dirty cachelines could be flushed after the free (like if Media
> > > > suddenly turns on and does a full flush)
> > >
> > > This description seems really confusing.  My understanding is that
> > > marking something as wb-transient-app indicates that it might be
> > > accessed by something other than our graphics/media IP (i.e., accessed
> > > from the CPU, exported to another device, etc.), so transient data
> > > truly does need to be flushed at the points in the driver where a
> > > flush typically happens.
> > >
> > > However when something is _not_ transient, then either:
> > >  - it's "private" to the GPU and only our graphics/media IP will be
> > >    accessing it
> > >  - it's bound with a coherent PAT index so that outside observers like
> > >    the CPU can snoop the device cache, even when the cache hasn't been
> > >    flushed
> > >
> > > If media is not active, then there's really no need to include
> > > non-transient data when an device cache flush happens since there's no
> > > real need for the data to get to RAM.  So that enables an optimization
> > > (which comes in your next patch), that allows flushes to only operate
> > > on the subset of the device cache tagged as "transient" if media is idle.
> 
> But what if we have stale non-XA-marked pages for userptr, and that
> object moves out and at the same time media comes back? That will end up
> in a full flush and flush the stale entries to RAM.

What makes userptr special here?  During general, active usage, userptr
would be data that's accessible by the CPU, so it needs to either be
transient (so CPU can see the data in RAM after explicit flushes) or it
needs to be using a coherent PAT (so that the CPU can just snoop the GPU
cache).  If you marked userptr as both non-XA and non-coherent, then
that sounds likely to be a userspace bug (and probably something we can
catch and reject as an invalid case on any Xe3p or later platforms that
support this) since the CPU wouldn't have any reliable way of seeing GPU
updates.

If something happens that changes the GTT mapping of an object, then
doesn't that already trigger a TLB invalidation when necessary in the
driver today?  It was my understanding that "heavy" TLB invalidations
wait for data values to be globally observable before starting, so I
think that would ensure that any non-XA data makes it to RAM before any
binding changes, object destruction, etc.?  Is there something special
about userptr that makes that case more of a problem?

I just found bspec page 74635 which gives an overview of the various
flush and invalidate cases, and I don't see anything there that makes it
obvious to me that userptr would be special.


> 
> > >
> > > As you said, we eventually do want to force a flush of the
> > > non-transient data as well once we're freeing the underlying pages.
> > > So how do we do that?  It's not clear to me how the changes below are
> > > accomplishing that.  Is there a way to explicitly request a full
> > > device cache flush (ignoring the transient vs non-transient tagging)?
> > > Since the GuC handles the optimization in the next patch (toggling
> > > whether flushes are full flushes vs non-transient flushes depending on
> > > whether media is active), I thought there might be some kind of GuC
> > > interface to request "please do one full flush now, even if media is idle."
> > >
> > 
> > I’m not an expert here by any means, but everything above from Matt seems
> > like valid concerns. Thomas also raised some concerns in the two previous
> > revisions; again I’m not an expert, but reading through those, it doesn’t really
> > seem like he received proper answers to his questions.
> 
> It's forcing a flush via the tlb invalidation PPC flag under xe_invalidate_vma(). 

By the way, what is "PPC?"  It seems like it's another new synonym for
the device cache?  It's already really confusing that some of our
hardware docs use a mix of both "L2" and "L3" to refer to the same
device cache for historical reasons...


Matt

> 
> > 
> > A couple of comments below.
> > 
> > >
> > > Matt
> > >
> > > >
> > > > V2(MattA): Expand commit description
> > > >
> > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > index e9180b01a4e4..4455886b211e 100644
> > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device
> > > > *xe, struct xe_bo *bo,
> > > >
> > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > -			continue;
> > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > +				continue;
> > 
> > This will trigger a TLB invalidation (and I assume a cache flush) every time we
> > move or free memory in the 3D stack if it has a binding. It also performs a
> > synchronous wait on the BO being idle. Both of these are very expensive
> > operations. I can’t imagine the granularity we want here is to do this on every
> > move/free with bindings.
> > 
> > Also, for LR compute with preempt fences, we would trigger the preempt
> > fences during the wait, so a TLB invalidation after this seems unnecessary,
> > though perhaps the cache flush is still required?
> > 
> > I think this needs a bit more explanation, because without knowing a lot about
> > the exact requirements, the implementation does not look correct.
> 
> The thing is that we are trying to solve a problem with userptr using a non-XA pat: consider if that BO got moved while media is not active. As soon as media comes back active, stale cached entries of that object will be flushed as part of the full flush, which may corrupt things.
> The thinking was that with this patch we would at least solve the corruption problem, and later the page_reclamation feature would help with performance as well. But now that the page reclamation feature has been merged earlier and is tightly coupled with bind/unbind, some cases like those discussed above (which do not unbind immediately on move/free) are missed in reclamation.
> 
> So the thought was to let this solution go in with a small perf hit and discuss with the page reclamation owner to come up with a cleaner solution together.
> 
> Tejas
> > 
> > > >  		}
> > > >
> > > >  		if (!idle) {
> > > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > > b/drivers/gpu/drm/xe/xe_device.c index 743c18e0c580..da2abed94bc0
> > > > 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device
> > *xe)
> > > >  	}
> > > >  }
> > > >
> > > > +/**
> > > > + * xe_device_needs_cache_flush - Whether the cache needs to be
> > > > +flushed
> > > > + * @xe: The device to check.
> > > > + *
> > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > + */
> > > > +bool xe_device_needs_cache_flush(struct xe_device *xe) {
> > > > +	/* XA is *always* flushed, like at the end-of-submssion (and maybe
> > other
> > > > +	 * places), just that internally as an optimisation hw doesn't need to
> > make
> > > > +	 * that a full flush (which will also include XA) when Media is
> > > > +	 * off/powergated, since it doesn't need to worry about GT caches vs
> > Media
> > > > +	 * coherency, and only CPU vs GPU coherency, so can make that flush
> > a
> > > > +	 * targeted XA flush, since stuff tagged with XA now means it's shared
> > with
> > > > +	 * the CPU. The main implication is that we now need to somehow
> > flush non-XA before
> > > > +	 * freeing system memory pages, otherwise dirty cachelines could be
> > flushed after the free
> > > > +	 * (like if Media suddenly turns on and does a full flush)
> > > > +	 */
> > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > +		return true;
> > > > +	return false;
> > > > +}
> > > > +
> > > >  void xe_device_l2_flush(struct xe_device *xe)  {
> > > >  	struct xe_gt *gt;
> > > > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > > > b/drivers/gpu/drm/xe/xe_device.h index 39464650533b..baf386e0e037
> > > > 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device
> > > > *xe, struct drm_printer *p);
> > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
> > > > address);
> > > >
> > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > >  void xe_device_td_flush(struct xe_device *xe);  void
> > > > xe_device_l2_flush(struct xe_device *xe);
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > > > b/drivers/gpu/drm/xe/xe_userptr.c index e120323c43bc..b435ea7f9b66
> > > > 100644
> > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm
> > *vm, struct xe_userptr_vma *uv
> > > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > > >  	XE_WARN_ON(err <= 0);
> > > >
> > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm-
> > >xe)) &&
> > > > +	    userptr->initial_bind) {
> > 
> > Same concern with the LR preempt fence as above — the hardware will be
> > interrupted via preempt fences, so it doesn’t seem necessary to invalidate the
> > TLBs but perhaps we need a cflush and TLB invalidation is the mechanism for
> > that too?
> > 
> > Matt
> > 
> > > >  		err = xe_vm_invalidate_vma(vma);
> > > >  		XE_WARN_ON(err);
> > > >  	}
> > > > --
> > > > 2.52.0
> > > >
> > >
> > > --
> > > Matt Roper
> > > Graphics Software Engineer
> > > Linux GPU Platform Enablement
> > > Intel Corporation

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-11 21:11         ` Matt Roper
@ 2026-02-12  9:53           ` Matthew Auld
  2026-02-13 11:17             ` Upadhyay, Tejas
  2026-02-13 16:23           ` Upadhyay, Tejas
  2026-02-13 17:29           ` Matthew Auld
  2 siblings, 1 reply; 39+ messages in thread
From: Matthew Auld @ 2026-02-12  9:53 UTC (permalink / raw)
  To: Matt Roper, Upadhyay, Tejas
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org,
	thomas.hellstrom@linux.intel.com

On 11/02/2026 21:11, Matt Roper wrote:
> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
>>
>>
>>> -----Original Message-----
>>> From: Brost, Matthew <matthew.brost@intel.com>
>>> Sent: 11 February 2026 05:32
>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>>> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
>>> thomas.hellstrom@linux.intel.com
>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>> cachelines manually
>>>
>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
>>>>> "eXtended Architecture" (XA) tagged memory—memory shared between
>>> the
>>>>> CPU and GPU
>>>>
>>>> I'm pretty sure this expansion of "XA" is wrong; where are you seeing
>>>> this definition?  Everything in the bspec indicates that XA means "wb
>>>> - transient app" (similar to how "XD" is 'wb - transient display").
>>>> I'm not sure why exactly they picked "X" to refer to transient in both
>>>> of these cases, but I've never seen any documentation that refers to
>>>> it as "extended."
>>>>
>>>>> is treated differently from other GPU memory when the Media engine is
>>> power-gated.
>>>>>
>>>>> XA is *always* flushed, like at the end-of-submssion (and maybe
>>>>> other
>>>>
>>>> I assume you're referring to the fact that the driver performs flushes
>>>> at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
>>>> depending on other state/optimizations in the system, those flushes
>>>> may flush the entire device cache, or may only flush the subset of
>>>> cache data that is not marked as transient.  The way you worded this
>>>> was confusing since it makes it sound like cache flushes happen
>>>> automatically somewhere in hardware/firmware.
>>>>
>>>>> places), just that internally as an optimisation hw doesn't need to
>>>>> make that a full flush (which will also include XA) when Media is
>>>>> off/powergated, since it doesn't need to worry about GT caches vs
>>>>> Media coherency, and only CPU vs GPU coherency, so can make that
>>>>> flush a targeted XA flush, since stuff tagged with XA now means it's
>>>>> shared with the CPU. The main implication is that we now need to
>>>>> somehow flush non-XA before freeing system memory pages, otherwise
>>>>> dirty cachelines could be flushed after the free (like if Media
>>>>> suddenly turns on and does a full flush)
>>>>
>>>> This description seems really confusing.  My understanding is that
>>>> marking something as wb-transient-app indicates that it might be
>>>> accessed by something other than our graphics/media IP (i.e., accessed
>>>> from the CPU, exported to another device, etc.), so transient data
>>>> truly does need to be flushed at the points in the driver where a
>>>> flush typically happens.
>>>>
>>>> However when something is _not_ transient, then either:
>>>>   - it's "private" to the GPU and only our graphics/media IP will be
>>>>     accessing it
>>>>   - it's bound with a coherent PAT index so that outside observers like
>>>>     the CPU can snoop the device cache, even when the cache hasn't been
>>>>     flushed
>>>>
>>>> If media is not active, then there's really no need to include
>>>> non-transient data when an device cache flush happens since there's no
>>>> real need for the data to get to RAM.  So that enables an optimization
>>>> (which comes in your next patch), that allows flushes to only operate
>>>> on the subset of the device cache tagged as "transient" if media is idle.
>>
>> But what If we have stale non-XA marked pages for userptr, and that
>> object moves out and at the same time media comes back, will end up in
>> full flush and flush the stale entry to RAM.
> 
> What makes userptr special here?  During general, active usage, userptr
> would be data that's accessible by the CPU, so it needs to either be
> transient (so CPU can see the data in RAM after explicit flushes) or it
> needs to be using a coherent PAT (so that the CPU can just snoop the GPU
> cache).  If you marked userptr as both non-XA and non-coherent, then
> that sounds likely to be a userspace bug (and probably something we can
> catch and reject as an invalid case on any Xe3p or later platforms that
> support this) since the CPU wouldn't have any reliable way of seeing GPU
> updates.
> 
> If something happens that changes the GTT mapping of an object, then
> doesn't that already trigger a TLB invalidation when necessary in the
> driver today?  It was my understanding that "heavy" TLB invalidations
> wait for data values to be globally observable before starting, so I
> think that would ensure that any non-XA data makes it to RAM before any
> binding changes, object, destruction, etc.?  Is there something special
> about userptr that makes that case more of a problem?
> 
> I just found bspec page 74635 which gives an overview of the various
> flush and invalidate cases, and I don't see anything there that makes it
> obvious to me that userptr would be special.
> 
> 
>>
>>>>
>>>> As you said, we eventually do want to force a flush of the
>>>> non-transient data as well once we're freeing the underlying pages.
>>>> So how do we do that?  It's not clear to me how the changes below are
>>>> accomplishing that.  Is there a way to explicitly request a full
>>>> device cache flush (ignoring the transient vs non-transient tagging)?
>>>> Since the GuC handles the optimization in the next patch (toggling
>>>> whether flushes are full flushes vs non-transient flushes depending on
>>>> whether media is active), I thought there might be some kind of GuC
>>>> interface to request "please do one full flush now, even if media is idle."
>>>>
>>>
>>> I’m not an expert here by any means, but everything above from Matt seems
>>> like valid concerns. Thomas also raised some concerns in the two previous
>>> revisions; again I’m not an expert, but reading through those, it doesn’t really
>>> seem like he received proper answers to his questions.
>>
>> Its forcing flush via tlb invalidation PPC flag under xe_invalidate_vma( ).
> 
> By the way, what is "PPC?"  It seems like it's another new synonym for
> the device cache?  It's already really confusing that some of our
> hardware docs use a mix of both "L2" and "L3" to refer to the same
> device cache for historical reasons...

Private-physical-cache. It's just what the hw side calls the device-side 
l2/l3 on newer igpu (I think LNL+). I assume it is quite different from 
the implementation on dgpu, from a HW pov, which is maybe why it has a 
special name. On dgpu they just refer to it as plain l2/l3, not PPC. On 
dgpu there is also an additional SMRO (system-memory-read-only) device 
cache, which I assume is caching reads over pci to system memory from 
the GPU side, but that gets flushed at the usual places, like end of 
submission etc. Setting the PPC bit on a GuC TLB inval will only flush 
SMRO on dgpu.

> 
> 
> Matt
> 
>>
>>>
>>> A couple of comments below.
>>>
>>>>
>>>> Matt
>>>>
>>>>>
>>>>> V2(MattA): Expand commit description
>>>>>
>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>   drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>> drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>   4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device
>>>>> *xe, struct xe_bo *bo,
>>>>>
>>>>>   		if (!xe_vm_in_fault_mode(vm)) {
>>>>>   			drm_gpuvm_bo_evict(vm_bo, true);
>>>>> -			continue;
>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>> +				continue;
>>>
>>> This will trigger a TLB invalidation (and I assume a cache flush) every time we
>>> move or free memory in the 3D stack if it has a binding. It also performs a
>>> synchronous wait on the BO being idle. Both of these are very expensive
>>> operations. I can’t imagine the granularity we want here is to do this on every
>>> move/free with bindings.
>>>
>>> Also, for LR compute with preempt fences, we would trigger the preempt
>>> fences during the wait, so a TLB invalidation after this seems unnecessary,
>>> though perhaps the cache flush is still required?
>>>
>>> I think this needs a bit more explanation, because without knowing a lot about
>>> the exact requirements, the implementation does not look correct.
>>
>> The thing is that we are trying to solve problem with userptr with non-XA pat, consider if that BO got moved while media is not active. As soon as media will come back active, stale cached entries of that object will be flushed as part of full flush , which may corrupt things.
>> There was thinking that with this patch we would at least solve the problem of corruption and later when page_reclamation feature comes in will help in performance as well. But now when page reclamation feature is merged earlier and it tightly coupled with bind/unbind some cases like discussed above (which are not doing unbind immediately on move/free) are missed in reclamation.
>>   
>> So thought was to let this solution go in with little perf hit and discuss with page reclamation owner to come with cleaner solution together.
>>
>> Tejas
>>>
>>>>>   		}
>>>>>
>>>>>   		if (!idle) {
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>>>> b/drivers/gpu/drm/xe/xe_device.c index 743c18e0c580..da2abed94bc0
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device
>>> *xe)
>>>>>   	}
>>>>>   }
>>>>>
>>>>> +/**
>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be
>>>>> +flushed
>>>>> + * @xe: The device to check.
>>>>> + *
>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>> + */
>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe) {
>>>>> +	/* XA is *always* flushed, like at the end-of-submssion (and maybe
>>> other
>>>>> +	 * places), just that internally as an optimisation hw doesn't need to
>>> make
>>>>> +	 * that a full flush (which will also include XA) when Media is
>>>>> +	 * off/powergated, since it doesn't need to worry about GT caches vs
>>> Media
>>>>> +	 * coherency, and only CPU vs GPU coherency, so can make that flush
>>> a
>>>>> +	 * targeted XA flush, since stuff tagged with XA now means it's shared
>>> with
>>>>> +	 * the CPU. The main implication is that we now need to somehow
>>> flush non-XA before
>>>>> +	 * freeing system memory pages, otherwise dirty cachelines could be
>>> flushed after the free
>>>>> +	 * (like if Media suddenly turns on and does a full flush)
>>>>> +	 */
>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>> +		return true;
>>>>> +	return false;
>>>>> +}
>>>>> +
>>>>>   void xe_device_l2_flush(struct xe_device *xe)  {
>>>>>   	struct xe_gt *gt;
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>>>> b/drivers/gpu/drm/xe/xe_device.h index 39464650533b..baf386e0e037
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device
>>>>> *xe, struct drm_printer *p);
>>>>>   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
>>>>> address);
>>>>>
>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>   void xe_device_td_flush(struct xe_device *xe);  void
>>>>> xe_device_l2_flush(struct xe_device *xe);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
>>>>> b/drivers/gpu/drm/xe/xe_userptr.c index e120323c43bc..b435ea7f9b66
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm
>>> *vm, struct xe_userptr_vma *uv
>>>>>   				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>   	XE_WARN_ON(err <= 0);
>>>>>
>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm-
>>>> xe)) &&
>>>>> +	    userptr->initial_bind) {
>>>
>>> Same concern with the LR preempt fence as above — the hardware will be
>>> interrupted via preempt fences, so it doesn’t seem necessary to invalidate the
>>> TLBs but perhaps we need a cflush and TLB invalidation is the mechanism for
>>> that too?
>>>
>>> Matt
>>>
>>>>>   		err = xe_vm_invalidate_vma(vma);
>>>>>   		XE_WARN_ON(err);
>>>>>   	}
>>>>> --
>>>>> 2.52.0
>>>>>
>>>>
>>>> --
>>>> Matt Roper
>>>> Graphics Software Engineer
>>>> Linux GPU Platform Enablement
>>>> Intel Corporation
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-12  9:53           ` Matthew Auld
@ 2026-02-13 11:17             ` Upadhyay, Tejas
  2026-02-13 13:27               ` Matthew Auld
  0 siblings, 1 reply; 39+ messages in thread
From: Upadhyay, Tejas @ 2026-02-13 11:17 UTC (permalink / raw)
  To: Auld, Matthew, Roper, Matthew D
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org,
	thomas.hellstrom@linux.intel.com, Mrozek, Michal, Souza, Jose

+ Michal

> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: 12 February 2026 15:24
> To: Roper, Matthew D <matthew.d.roper@intel.com>; Upadhyay, Tejas
> <tejas.upadhyay@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> xe@lists.freedesktop.org; thomas.hellstrom@linux.intel.com
> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> cachelines manually
> 
> On 11/02/2026 21:11, Matt Roper wrote:
> > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: Brost, Matthew <matthew.brost@intel.com>
> >>> Sent: 11 February 2026 05:32
> >>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> >>> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> >>> thomas.hellstrom@linux.intel.com
> >>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> >>> cachelines manually
> >>>
> >>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> >>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> >>>>> "eXtended Architecture" (XA) tagged memory—memory shared
> between
> >>> the
> >>>>> CPU and GPU
> >>>>
> >>>> I'm pretty sure this expansion of "XA" is wrong; where are you
> >>>> seeing this definition?  Everything in the bspec indicates that XA
> >>>> means "wb
> >>>> - transient app" (similar to how "XD" is 'wb - transient display").
> >>>> I'm not sure why exactly they picked "X" to refer to transient in
> >>>> both of these cases, but I've never seen any documentation that
> >>>> refers to it as "extended."
> >>>>
> >>>>> is treated differently from other GPU memory when the Media engine
> >>>>> is
> >>> power-gated.
> >>>>>
> >>>>> XA is *always* flushed, like at the end-of-submssion (and maybe
> >>>>> other
> >>>>
> >>>> I assume you're referring to the fact that the driver performs
> >>>> flushes at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW),
> >>>> and that depending on other state/optimizations in the system,
> >>>> those flushes may flush the entire device cache, or may only flush
> >>>> the subset of cache data that is not marked as transient.  The way
> >>>> you worded this was confusing since it makes it sound like cache
> >>>> flushes happen automatically somewhere in hardware/firmware.
> >>>>
> >>>>> places), just that internally as an optimisation hw doesn't need
> >>>>> to make that a full flush (which will also include XA) when Media
> >>>>> is off/powergated, since it doesn't need to worry about GT caches
> >>>>> vs Media coherency, and only CPU vs GPU coherency, so can make
> >>>>> that flush a targeted XA flush, since stuff tagged with XA now
> >>>>> means it's shared with the CPU. The main implication is that we
> >>>>> now need to somehow flush non-XA before freeing system memory
> >>>>> pages, otherwise dirty cachelines could be flushed after the free
> >>>>> (like if Media suddenly turns on and does a full flush)
> >>>>
> >>>> This description seems really confusing.  My understanding is that
> >>>> marking something as wb-transient-app indicates that it might be
> >>>> accessed by something other than our graphics/media IP (i.e.,
> >>>> accessed from the CPU, exported to another device, etc.), so
> >>>> transient data truly does need to be flushed at the points in the
> >>>> driver where a flush typically happens.
> >>>>
> >>>> However when something is _not_ transient, then either:
> >>>>   - it's "private" to the GPU and only our graphics/media IP will be
> >>>>     accessing it
> >>>>   - it's bound with a coherent PAT index so that outside observers like
> >>>>     the CPU can snoop the device cache, even when the cache hasn't been
> >>>>     flushed
> >>>>
> >>>> If media is not active, then there's really no need to include
> >>>> non-transient data when an device cache flush happens since there's
> >>>> no real need for the data to get to RAM.  So that enables an
> >>>> optimization (which comes in your next patch), that allows flushes
> >>>> to only operate on the subset of the device cache tagged as "transient" if
> media is idle.
> >>
> >> But what If we have stale non-XA marked pages for userptr, and that
> >> object moves out and at the same time media comes back, will end up
> >> in full flush and flush the stale entry to RAM.
> >
> > What makes userptr special here?  During general, active usage,
> > userptr would be data that's accessible by the CPU, so it needs to
> > either be transient (so CPU can see the data in RAM after explicit
> > flushes) or it needs to be using a coherent PAT (so that the CPU can
> > just snoop the GPU cache).  If you marked userptr as both non-XA and
> > non-coherent, then that sounds likely to be a userspace bug (and
> > probably something we can catch and reject as an invalid case on any
> > Xe3p or later platforms that support this) since the CPU wouldn't have
> > any reliable way of seeing GPU updates.
> >
> > If something happens that changes the GTT mapping of an object, then
> > doesn't that already trigger a TLB invalidation when necessary in the
> > driver today?  It was my understanding that "heavy" TLB invalidations
> > wait for data values to be globally observable before starting, so I
> > think that would ensure that any non-XA data makes it to RAM before
> > any binding changes, object, destruction, etc.?  Is there something
> > special about userptr that makes that case more of a problem?
> >
> > I just found bspec page 74635 which gives an overview of the various
> > flush and invalidate cases, and I don't see anything there that makes
> > it obvious to me that userptr would be special.
> >
> >
> >>
> >>>>
> >>>> As you said, we eventually do want to force a flush of the
> >>>> non-transient data as well once we're freeing the underlying pages.
> >>>> So how do we do that?  It's not clear to me how the changes below
> >>>> are accomplishing that.  Is there a way to explicitly request a
> >>>> full device cache flush (ignoring the transient vs non-transient tagging)?
> >>>> Since the GuC handles the optimization in the next patch (toggling
> >>>> whether flushes are full flushes vs non-transient flushes depending
> >>>> on whether media is active), I thought there might be some kind of
> >>>> GuC interface to request "please do one full flush now, even if media is
> idle."
> >>>>
> >>>
> >>> I’m not an expert here by any means, but everything above from Matt
> >>> seems like valid concerns. Thomas also raised some concerns in the
> >>> two previous revisions; again I’m not an expert, but reading through
> >>> those, it doesn’t really seem like he received proper answers to his
> questions.
> >>
> >> Its forcing flush via tlb invalidation PPC flag under xe_invalidate_vma( ).
> >
> > By the way, what is "PPC?"  It seems like it's another new synonym for
> > the device cache?  It's already really confusing that some of our
> > hardware docs use a mix of both "L2" and "L3" to refer to the same
> > device cache for historical reasons...
> 
> Private-physical-cache. It's just what the hw side calls the device-side
> l2/l3 on newer igpu (I think LNL+). I assume it is quite different from the
> implementation on dgpu, from a HW pov, which is maybe why it has a special
> name. On dgpu they just refer to it as plain l2/l3, not PPC. On dgpu there is
> also an additional SMRO (system-memory-read-only) device cache, which I
> assume is caching reads over pci to system memory from the GPU side, but that
> gets flushed at the usual places, like end of submission etc. Setting the PPC bit
> on a GuC TLB inval will only flush SMRO on dgpu.

I had a conversation with Michal and also got confirmation from Jose about their usage of pat_index for userptr; summarising it below:

UMD compute, with userptr:
- Uses pat_index 19 (which is app-transient (XA) and 1-way coherent)
- Flushing: taken care of by the app-transient flush ability in HW if media is off; if media is on, a full flush will happen anyway.

Mesa, with userptr:
- Uses 2-way coherency
- Flushing: 2-way coherency will take care of it

With the above understanding it seems we can drop this patch as it's not needed; instead we need a new patch to validate that the right pat_index coherency is used and, if not, reject the request.
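
Roughly I'm thinking of a check along these lines (just a standalone sketch to show the rule, with made-up names/attributes; the real patch would live in the bind/validation path and derive the attributes from the per-platform PAT table, not hard-code them):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical PAT attributes; the real driver would look these up in
 * the platform PAT table rather than defining them here. */
enum coh_mode {
	COH_NONE,	/* non-coherent */
	COH_1WAY,	/* 1-way coherent, CPU snoops GPU cache */
	COH_2WAY,	/* 2-way coherent */
};

struct pat_attr {
	bool transient;		/* "XA" / wb-transient-app tagged */
	enum coh_mode coh;	/* CPU coherency of this PAT index */
};

/*
 * Reject userptr bindings whose PAT index is neither transient (XA) nor
 * CPU-coherent: with the L2 flush optimization, non-transient
 * non-coherent data can stay in the device cache while media is idle,
 * so the CPU would have no reliable way of observing GPU writes.
 */
bool userptr_pat_valid(const struct pat_attr *attr)
{
	return attr->transient || attr->coh != COH_NONE;
}
```

So e.g. the UMD compute case (XA + 1-way coherent) and the Mesa case (2-way coherent) both pass, and only the non-XA non-coherent combination gets rejected.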

Tejas
> 
> >
> >
> > Matt
> >
> >>
> >>>
> >>> A couple of comments below.
> >>>
> >>>>
> >>>> Matt
> >>>>
> >>>>>
> >>>>> V2(MattA): Expand commit description
> >>>>>
> >>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> >>>>> ---
> >>>>>   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> >>>>>   drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> >>>>> drivers/gpu/drm/xe/xe_device.h  |  1 +
> >>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> >>>>>   4 files changed, 28 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c
> >>>>> b/drivers/gpu/drm/xe/xe_bo.c index e9180b01a4e4..4455886b211e
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
> >>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct
> >>>>> xe_device *xe, struct xe_bo *bo,
> >>>>>
> >>>>>   		if (!xe_vm_in_fault_mode(vm)) {
> >>>>>   			drm_gpuvm_bo_evict(vm_bo, true);
> >>>>> -			continue;
> >>>>> +			if (!xe_device_needs_cache_flush(xe))
> >>>>> +				continue;
> >>>
> >>> This will trigger a TLB invalidation (and I assume a cache flush)
> >>> every time we move or free memory in the 3D stack if it has a
> >>> binding. It also performs a synchronous wait on the BO being idle.
> >>> Both of these are very expensive operations. I can’t imagine the
> >>> granularity we want here is to do this on every move/free with bindings.
> >>>
> >>> Also, for LR compute with preempt fences, we would trigger the
> >>> preempt fences during the wait, so a TLB invalidation after this
> >>> seems unnecessary, though perhaps the cache flush is still required?
> >>>
> >>> I think this needs a bit more explanation, because without knowing a
> >>> lot about the exact requirements, the implementation does not look
> correct.
> >>
> >> The thing is that we are trying to solve problem with userptr with non-XA
> pat, consider if that BO got moved while media is not active. As soon as media
> will come back active, stale cached entries of that object will be flushed as part
> of full flush , which may corrupt things.
> >> There was thinking that with this patch we would at least solve the problem
> of corruption and later when page_reclamation feature comes in will help in
> performance as well. But now when page reclamation feature is merged earlier
> and it tightly coupled with bind/unbind some cases like discussed above
> (which are not doing unbind immediately on move/free) are missed in
> reclamation.
> >>
> >> So thought was to let this solution go in with little perf hit and discuss with
> page reclamation owner to come with cleaner solution together.
> >>
> >> Tejas
> >>>
> >>>>>   		}
> >>>>>
> >>>>>   		if (!idle) {
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
> >>>>> b/drivers/gpu/drm/xe/xe_device.c index
> 743c18e0c580..da2abed94bc0
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_device.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
> >>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct
> >>>>> xe_device
> >>> *xe)
> >>>>>   	}
> >>>>>   }
> >>>>>
> >>>>> +/**
> >>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be
> >>>>> +flushed
> >>>>> + * @xe: The device to check.
> >>>>> + *
> >>>>> + * Return: true if the device needs cache flush, false otherwise.
> >>>>> + */
> >>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe) {
> >>>>> +	/* XA is *always* flushed, like at the end-of-submssion (and
> >>>>> +maybe
> >>> other
> >>>>> +	 * places), just that internally as an optimisation hw doesn't
> >>>>> +need to
> >>> make
> >>>>> +	 * that a full flush (which will also include XA) when Media is
> >>>>> +	 * off/powergated, since it doesn't need to worry about GT
> >>>>> +caches vs
> >>> Media
> >>>>> +	 * coherency, and only CPU vs GPU coherency, so can make that
> >>>>> +flush
> >>> a
> >>>>> +	 * targeted XA flush, since stuff tagged with XA now means it's
> >>>>> +shared
> >>> with
> >>>>> +	 * the CPU. The main implication is that we now need to somehow
> >>> flush non-XA before
> >>>>> +	 * freeing system memory pages, otherwise dirty cachelines could
> >>>>> +be
> >>> flushed after the free
> >>>>> +	 * (like if Media suddenly turns on and does a full flush)
> >>>>> +	 */
> >>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> >>>>> +		return true;
> >>>>> +	return false;
> >>>>> +}
> >>>>> +
> >>>>>   void xe_device_l2_flush(struct xe_device *xe)  {
> >>>>>   	struct xe_gt *gt;
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
> >>>>> b/drivers/gpu/drm/xe/xe_device.h index
> 39464650533b..baf386e0e037
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_device.h
> >>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
> >>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct
> xe_device
> >>>>> *xe, struct drm_printer *p);
> >>>>>   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> >>>>>   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
> >>>>> address);
> >>>>>
> >>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
> >>>>>   void xe_device_td_flush(struct xe_device *xe);  void
> >>>>> xe_device_l2_flush(struct xe_device *xe);
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> >>>>> b/drivers/gpu/drm/xe/xe_userptr.c index
> e120323c43bc..b435ea7f9b66
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> >>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct
> >>>>> xe_vm
> >>> *vm, struct xe_userptr_vma *uv
> >>>>>   				    false, MAX_SCHEDULE_TIMEOUT);
> >>>>>   	XE_WARN_ON(err <= 0);
> >>>>>
> >>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> >>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm-
> >>>> xe)) &&
> >>>>> +	    userptr->initial_bind) {
> >>>
> >>> Same concern with the LR preempt fence as above — the hardware will
> >>> be interrupted via preempt fences, so it doesn’t seem necessary to
> >>> invalidate the TLBs but perhaps we need a cflush and TLB
> >>> invalidation is the mechanism for that too?
> >>>
> >>> Matt
> >>>
> >>>>>   		err = xe_vm_invalidate_vma(vma);
> >>>>>   		XE_WARN_ON(err);
> >>>>>   	}
> >>>>> --
> >>>>> 2.52.0
> >>>>>
> >>>>
> >>>> --
> >>>> Matt Roper
> >>>> Graphics Software Engineer
> >>>> Linux GPU Platform Enablement
> >>>> Intel Corporation
> >


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 11:17             ` Upadhyay, Tejas
@ 2026-02-13 13:27               ` Matthew Auld
  2026-02-13 13:30                 ` Souza, Jose
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Auld @ 2026-02-13 13:27 UTC (permalink / raw)
  To: Upadhyay, Tejas, Roper, Matthew D
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org,
	thomas.hellstrom@linux.intel.com, Mrozek, Michal, Souza, Jose

On 13/02/2026 11:17, Upadhyay, Tejas wrote:
> + Michal
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: 12 February 2026 15:24
>> To: Roper, Matthew D <matthew.d.roper@intel.com>; Upadhyay, Tejas
>> <tejas.upadhyay@intel.com>
>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
>> xe@lists.freedesktop.org; thomas.hellstrom@linux.intel.com
>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>> cachelines manually
>>
>> On 11/02/2026 21:11, Matt Roper wrote:
>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>>> Sent: 11 February 2026 05:32
>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>>>>> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
>>>>> thomas.hellstrom@linux.intel.com
>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>>> cachelines manually
>>>>>
>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
>>>>>>> "eXtended Architecture" (XA) tagged memory—memory shared
>> between
>>>>> the
>>>>>>> CPU and GPU
>>>>>>
>>>>>> I'm pretty sure this expansion of "XA" is wrong; where are you
>>>>>> seeing this definition?  Everything in the bspec indicates that XA
>>>>>> means "wb
>>>>>> - transient app" (similar to how "XD" is 'wb - transient display").
>>>>>> I'm not sure why exactly they picked "X" to refer to transient in
>>>>>> both of these cases, but I've never seen any documentation that
>>>>>> refers to it as "extended."
>>>>>>
>>>>>>> is treated differently from other GPU memory when the Media engine
>>>>>>> is
>>>>> power-gated.
>>>>>>>
>>>>>>> XA is *always* flushed, like at the end-of-submssion (and maybe
>>>>>>> other
>>>>>>
>>>>>> I assume you're referring to the fact that the driver performs
>>>>>> flushes at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW),
>>>>>> and that depending on other state/optimizations in the system,
>>>>>> those flushes may flush the entire device cache, or may only flush
>>>>>> the subset of cache data that is not marked as transient.  The way
>>>>>> you worded this was confusing since it makes it sound like cache
>>>>>> flushes happen automatically somewhere in hardware/firmware.
>>>>>>
>>>>>>> places), just that internally as an optimisation hw doesn't need
>>>>>>> to make that a full flush (which will also include XA) when Media
>>>>>>> is off/powergated, since it doesn't need to worry about GT caches
>>>>>>> vs Media coherency, and only CPU vs GPU coherency, so can make
>>>>>>> that flush a targeted XA flush, since stuff tagged with XA now
>>>>>>> means it's shared with the CPU. The main implication is that we
>>>>>>> now need to somehow flush non-XA before freeing system memory
>>>>>>> pages, otherwise dirty cachelines could be flushed after the free
>>>>>>> (like if Media suddenly turns on and does a full flush)
>>>>>>
>>>>>> This description seems really confusing.  My understanding is that
>>>>>> marking something as wb-transient-app indicates that it might be
>>>>>> accessed by something other than our graphics/media IP (i.e.,
>>>>>> accessed from the CPU, exported to another device, etc.), so
>>>>>> transient data truly does need to be flushed at the points in the
>>>>>> driver where a flush typically happens.
>>>>>>
>>>>>> However when something is _not_ transient, then either:
>>>>>>    - it's "private" to the GPU and only our graphics/media IP will be
>>>>>>      accessing it
>>>>>>    - it's bound with a coherent PAT index so that outside observers like
>>>>>>      the CPU can snoop the device cache, even when the cache hasn't been
>>>>>>      flushed
>>>>>>
>>>>>> If media is not active, then there's really no need to include
>>>>>> non-transient data when an device cache flush happens since there's
>>>>>> no real need for the data to get to RAM.  So that enables an
>>>>>> optimization (which comes in your next patch), that allows flushes
>>>>>> to only operate on the subset of the device cache tagged as "transient" if
>> media is idle.
>>>>
>>>> But what If we have stale non-XA marked pages for userptr, and that
>>>> object moves out and at the same time media comes back, will end up
>>>> in full flush and flush the stale entry to RAM.
>>>
>>> What makes userptr special here?  During general, active usage,
>>> userptr would be data that's accessible by the CPU, so it needs to
>>> either be transient (so CPU can see the data in RAM after explicit
>>> flushes) or it needs to be using a coherent PAT (so that the CPU can
>>> just snoop the GPU cache).  If you marked userptr as both non-XA and
>>> non-coherent, then that sounds likely to be a userspace bug (and
>>> probably something we can catch and reject as an invalid case on any
>>> Xe3p or later platforms that support this) since the CPU wouldn't have
>>> any reliable way of seeing GPU updates.
>>>
>>> If something happens that changes the GTT mapping of an object, then
>>> doesn't that already trigger a TLB invalidation when necessary in the
>>> driver today?  It was my understanding that "heavy" TLB invalidations
>>> wait for data values to be globally observable before starting, so I
>>> think that would ensure that any non-XA data makes it to RAM before
>>> any binding changes, object, destruction, etc.?  Is there something
>>> special about userptr that makes that case more of a problem?
>>>
>>> I just found bspec page 74635 which gives an overview of the various
>>> flush and invalidate cases, and I don't see anything there that makes
>>> it obvious to me that userptr would be special.
>>>
>>>
>>>>
>>>>>>
>>>>>> As you said, we eventually do want to force a flush of the
>>>>>> non-transient data as well once we're freeing the underlying pages.
>>>>>> So how do we do that?  It's not clear to me how the changes below
>>>>>> are accomplishing that.  Is there a way to explicitly request a
>>>>>> full device cache flush (ignoring the transient vs non-transient tagging)?
>>>>>> Since the GuC handles the optimization in the next patch (toggling
>>>>>> whether flushes are full flushes vs non-transient flushes depending
>>>>>> on whether media is active), I thought there might be some kind of
>>>>>> GuC interface to request "please do one full flush now, even if media is
>> idle."
>>>>>>
>>>>>
>>>>> I’m not an expert here by any means, but everything above from Matt
>>>>> seems like valid concerns. Thomas also raised some concerns in the
>>>>> two previous revisions; again I’m not an expert, but reading through
>>>>> those, it doesn’t really seem like he received proper answers to his
>> questions.
>>>>
>>>> It's forcing a flush via the tlb invalidation PPC flag under xe_invalidate_vma().
>>>
>>> By the way, what is "PPC?"  It seems like it's another new synonym for
>>> the device cache?  It's already really confusing that some of our
>>> hardware docs use a mix of both "L2" and "L3" to refer to the same
>>> device cache for historical reasons...
>>
>> Private-physical-cache. It's just what hw side calls the device side
>> l2/l3 on newer igpu (I think LNL+). I assume it is quite different from the
>> implementation on dgpu, from HW pov, which is maybe why it has a special
>> name. On dgpu they just refer to it as plain l2/l3, not PPC. On dgpu there is
>> also additional SMRO (system-memory-read-only) device cache, which I
>> assume is caching reads over pci to system memory from GPU side, but that
>> gets flushed at the usual places, like end of submission etc. Setting the PPC bit
>> on Guc TLB inval will only flush SMRO on dgpu.
> 
> I had a conversation with Michal and also got confirmation from Jose about their usage of pat_index for userptr; summarising it below:
> 
> UMD Compute, with userptr:
> - Uses pat_index 19 (which is App-transient (XA) and 1-way-coh)
> - Flushing: taken care of by the app-transient ability in HW if media is off; if media is on, a full flush will happen anyway.
> 
> Mesa, with userptr:
> - Uses 2-way-coh
> - Flushing: 2-way coherency will take care of it
> 
> With the above understanding it seems we can drop this patch as it's not needed; instead we need a new patch to validate that the right pat-index coherency is used and, if not, reject the request.

I think that only resolves the userptr side. We still need something for
the eviction/shrinker path with normal BOs, which this patch was also
trying to address, right?

> 
> Tejas
>>
>>>
>>>
>>> Matt
>>>
>>>>
>>>>>
>>>>> A couple of comments below.
>>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>>
>>>>>>> V2(MattA): Expand commit description
>>>>>>>
>>>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>    drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>>>> drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>    4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>> b/drivers/gpu/drm/xe/xe_bo.c index e9180b01a4e4..4455886b211e
>>>>>>> 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct
>>>>>>> xe_device *xe, struct xe_bo *bo,
>>>>>>>
>>>>>>>    		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>    			drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>> -			continue;
>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>>>> +				continue;
>>>>>
>>>>> This will trigger a TLB invalidation (and I assume a cache flush)
>>>>> every time we move or free memory in the 3D stack if it has a
>>>>> binding. It also performs a synchronous wait on the BO being idle.
>>>>> Both of these are very expensive operations. I can’t imagine the
>>>>> granularity we want here is to do this on every move/free with bindings.
>>>>>
>>>>> Also, for LR compute with preempt fences, we would trigger the
>>>>> preempt fences during the wait, so a TLB invalidation after this
>>>>> seems unnecessary, though perhaps the cache flush is still required?
>>>>>
>>>>> I think this needs a bit more explanation, because without knowing a
>>>>> lot about the exact requirements, the implementation does not look
>> correct.
>>>>
>>>> The thing is that we are trying to solve problem with userptr with non-XA
>> pat, consider if that BO got moved while media is not active. As soon as media
>> will come back active, stale cached entries of that object will be flushed as part
>> of full flush , which may corrupt things.
>>>> There was thinking that with this patch we would at least solve the problem
>> of corruption and later when page_reclamation feature comes in will help in
>> performance as well. But now when page reclamation feature is merged earlier
>> and it tightly coupled with bind/unbind some cases like discussed above
>> (which are not doing unbind immediately on move/free) are missed in
>> reclamation.
>>>>
>>>> So thought was to let this solution go in with little perf hit and discuss with
>> page reclamation owner to come with cleaner solution together.
>>>>
>>>> Tejas
>>>>>
>>>>>>>    		}
>>>>>>>
>>>>>>>    		if (!idle) {
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>>>>>> b/drivers/gpu/drm/xe/xe_device.c index
>> 743c18e0c580..da2abed94bc0
>>>>>>> 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct
>>>>>>> xe_device
>>>>> *xe)
>>>>>>>    	}
>>>>>>>    }
>>>>>>>
>>>>>>> +/**
>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be
>>>>>>> +flushed
>>>>>>> + * @xe: The device to check.
>>>>>>> + *
>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>> + */
>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe) {
>>>>>>> +	/* XA is *always* flushed, like at the end-of-submssion (and
>>>>>>> +maybe
>>>>> other
>>>>>>> +	 * places), just that internally as an optimisation hw doesn't
>>>>>>> +need to
>>>>> make
>>>>>>> +	 * that a full flush (which will also include XA) when Media is
>>>>>>> +	 * off/powergated, since it doesn't need to worry about GT
>>>>>>> +caches vs
>>>>> Media
>>>>>>> +	 * coherency, and only CPU vs GPU coherency, so can make that
>>>>>>> +flush
>>>>> a
>>>>>>> +	 * targeted XA flush, since stuff tagged with XA now means it's
>>>>>>> +shared
>>>>> with
>>>>>>> +	 * the CPU. The main implication is that we now need to somehow
>>>>> flush non-XA before
>>>>>>> +	 * freeing system memory pages, otherwise dirty cachelines could
>>>>>>> +be
>>>>> flushed after the free
>>>>>>> +	 * (like if Media suddenly turns on and does a full flush)
>>>>>>> +	 */
>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>> +		return true;
>>>>>>> +	return false;
>>>>>>> +}
>>>>>>> +
>>>>>>>    void xe_device_l2_flush(struct xe_device *xe)  {
>>>>>>>    	struct xe_gt *gt;
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>>>>>> b/drivers/gpu/drm/xe/xe_device.h index
>> 39464650533b..baf386e0e037
>>>>>>> 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct
>> xe_device
>>>>>>> *xe, struct drm_printer *p);
>>>>>>>    u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>    u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
>>>>>>> address);
>>>>>>>
>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>    void xe_device_td_flush(struct xe_device *xe);  void
>>>>>>> xe_device_l2_flush(struct xe_device *xe);
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>> b/drivers/gpu/drm/xe/xe_userptr.c index
>> e120323c43bc..b435ea7f9b66
>>>>>>> 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct
>>>>>>> xe_vm
>>>>> *vm, struct xe_userptr_vma *uv
>>>>>>>    				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>    	XE_WARN_ON(err <= 0);
>>>>>>>
>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm-
>>>>>> xe)) &&
>>>>>>> +	    userptr->initial_bind) {
>>>>>
>>>>> Same concern with the LR preempt fence as above — the hardware will
>>>>> be interrupted via preempt fences, so it doesn’t seem necessary to
>>>>> invalidate the TLBs but perhaps we need a cflush and TLB
>>>>> invalidation is the mechanism for that too?
>>>>>
>>>>> Matt
>>>>>
>>>>>>>    		err = xe_vm_invalidate_vma(vma);
>>>>>>>    		XE_WARN_ON(err);
>>>>>>>    	}
>>>>>>> --
>>>>>>> 2.52.0
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Matt Roper
>>>>>> Graphics Software Engineer
>>>>>> Linux GPU Platform Enablement
>>>>>> Intel Corporation
>>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 13:27               ` Matthew Auld
@ 2026-02-13 13:30                 ` Souza, Jose
  0 siblings, 0 replies; 39+ messages in thread
From: Souza, Jose @ 2026-02-13 13:30 UTC (permalink / raw)
  To: Upadhyay, Tejas, Roper, Matthew D, Auld, Matthew
  Cc: intel-xe@lists.freedesktop.org, Brost,  Matthew,
	thomas.hellstrom@linux.intel.com, Mrozek, Michal

On Fri, 2026-02-13 at 13:27 +0000, Matthew Auld wrote:
> On 13/02/2026 11:17, Upadhyay, Tejas wrote:
> > + Michal
> > 
> > > -----Original Message-----
> > > From: Auld, Matthew <matthew.auld@intel.com>
> > > Sent: 12 February 2026 15:24
> > > To: Roper, Matthew D <matthew.d.roper@intel.com>; Upadhyay, Tejas
> > > <tejas.upadhyay@intel.com>
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > > xe@lists.freedesktop.org; thomas.hellstrom@linux.intel.com
> > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker
> > > bo
> > > cachelines manually
> > > 
> > > On 11/02/2026 21:11, Matt Roper wrote:
> > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
> > > > wrote:
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > Sent: 11 February 2026 05:32
> > > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > <matthew.auld@intel.com>;
> > > > > > thomas.hellstrom@linux.intel.com
> > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > userptr/shrinker bo
> > > > > > cachelines manually
> > > > > > 
> > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
> > > > > > > wrote:
> > > > > > > > "eXtended Architecture" (XA) tagged memory—memory
> > > > > > > > shared
> > > between
> > > > > > the
> > > > > > > > CPU and GPU
> > > > > > > 
> > > > > > > I'm pretty sure this expansion of "XA" is wrong; where
> > > > > > > are you
> > > > > > > seeing this definition?  Everything in the bspec
> > > > > > > indicates that XA
> > > > > > > means "wb
> > > > > > > - transient app" (similar to how "XD" is 'wb - transient
> > > > > > > display").
> > > > > > > I'm not sure why exactly they picked "X" to refer to
> > > > > > > transient in
> > > > > > > both of these cases, but I've never seen any
> > > > > > > documentation that
> > > > > > > refers to it as "extended."
> > > > > > > 
> > > > > > > > is treated differently from other GPU memory when the
> > > > > > > > Media engine
> > > > > > > > is
> > > > > > power-gated.
> > > > > > > > 
> > > > > > > > XA is *always* flushed, like at the end-of-submssion
> > > > > > > > (and maybe
> > > > > > > > other
> > > > > > > 
> > > > > > > I assume you're referring to the fact that the driver
> > > > > > > performs
> > > > > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > > > > MI_FLUSH_DW),
> > > > > > > and that depending on other state/optimizations in the
> > > > > > > system,
> > > > > > > those flushes may flush the entire device cache, or may
> > > > > > > only flush
> > > > > > > the subset of cache data that is not marked as
> > > > > > > transient.  The way
> > > > > > > you worded this was confusing since it makes it sound
> > > > > > > like cache
> > > > > > > flushes happen automatically somewhere in
> > > > > > > hardware/firmware.
> > > > > > > 
> > > > > > > > places), just that internally as an optimisation hw
> > > > > > > > doesn't need
> > > > > > > > to make that a full flush (which will also include XA)
> > > > > > > > when Media
> > > > > > > > is off/powergated, since it doesn't need to worry about
> > > > > > > > GT caches
> > > > > > > > vs Media coherency, and only CPU vs GPU coherency, so
> > > > > > > > can make
> > > > > > > > that flush a targeted XA flush, since stuff tagged with
> > > > > > > > XA now
> > > > > > > > means it's shared with the CPU. The main implication is
> > > > > > > > that we
> > > > > > > > now need to somehow flush non-XA before freeing system
> > > > > > > > memory
> > > > > > > > pages, otherwise dirty cachelines could be flushed
> > > > > > > > after the free
> > > > > > > > (like if Media suddenly turns on and does a full flush)
> > > > > > > 
> > > > > > > This description seems really confusing.  My
> > > > > > > understanding is that
> > > > > > > marking something as wb-transient-app indicates that it
> > > > > > > might be
> > > > > > > accessed by something other than our graphics/media IP
> > > > > > > (i.e.,
> > > > > > > accessed from the CPU, exported to another device, etc.),
> > > > > > > so
> > > > > > > transient data truly does need to be flushed at the
> > > > > > > points in the
> > > > > > > driver where a flush typically happens.
> > > > > > > 
> > > > > > > However when something is _not_ transient, then either:
> > > > > > >    - it's "private" to the GPU and only our
> > > > > > > graphics/media IP will be
> > > > > > >      accessing it
> > > > > > >    - it's bound with a coherent PAT index so that outside
> > > > > > > observers like
> > > > > > >      the CPU can snoop the device cache, even when the
> > > > > > > cache hasn't been
> > > > > > >      flushed
> > > > > > > 
> > > > > > > If media is not active, then there's really no need to
> > > > > > > include
> > > > > > > non-transient data when an device cache flush happens
> > > > > > > since there's
> > > > > > > no real need for the data to get to RAM.  So that enables
> > > > > > > an
> > > > > > > optimization (which comes in your next patch), that
> > > > > > > allows flushes
> > > > > > > to only operate on the subset of the device cache tagged
> > > > > > > as "transient" if
> > > media is idle.
> > > > > 
> > > > > But what If we have stale non-XA marked pages for userptr,
> > > > > and that
> > > > > object moves out and at the same time media comes back, will
> > > > > end up
> > > > > in full flush and flush the stale entry to RAM.
> > > > 
> > > > What makes userptr special here?  During general, active usage,
> > > > userptr would be data that's accessible by the CPU, so it needs
> > > > to
> > > > either be transient (so CPU can see the data in RAM after
> > > > explicit
> > > > flushes) or it needs to be using a coherent PAT (so that the
> > > > CPU can
> > > > just snoop the GPU cache).  If you marked userptr as both non-
> > > > XA and
> > > > non-coherent, then that sounds likely to be a userspace bug
> > > > (and
> > > > probably something we can catch and reject as an invalid case
> > > > on any
> > > > Xe3p or later platforms that support this) since the CPU
> > > > wouldn't have
> > > > any reliable way of seeing GPU updates.
> > > > 
> > > > If something happens that changes the GTT mapping of an object,
> > > > then
> > > > doesn't that already trigger a TLB invalidation when necessary
> > > > in the
> > > > driver today?  It was my understanding that "heavy" TLB
> > > > invalidations
> > > > wait for data values to be globally observable before starting,
> > > > so I
> > > > think that would ensure that any non-XA data makes it to RAM
> > > > before
> > > > any binding changes, object, destruction, etc.?  Is there
> > > > something
> > > > special about userptr that makes that case more of a problem?
> > > > 
> > > > I just found bspec page 74635 which gives an overview of the
> > > > various
> > > > flush and invalidate cases, and I don't see anything there that
> > > > makes
> > > > it obvious to me that userptr would be special.
> > > > 
> > > > 
> > > > > 
> > > > > > > 
> > > > > > > As you said, we eventually do want to force a flush of
> > > > > > > the
> > > > > > > non-transient data as well once we're freeing the
> > > > > > > underlying pages.
> > > > > > > So how do we do that?  It's not clear to me how the
> > > > > > > changes below
> > > > > > > are accomplishing that.  Is there a way to explicitly
> > > > > > > request a
> > > > > > > full device cache flush (ignoring the transient vs non-
> > > > > > > transient tagging)?
> > > > > > > Since the GuC handles the optimization in the next patch
> > > > > > > (toggling
> > > > > > > whether flushes are full flushes vs non-transient flushes
> > > > > > > depending
> > > > > > > on whether media is active), I thought there might be
> > > > > > > some kind of
> > > > > > > GuC interface to request "please do one full flush now,
> > > > > > > even if media is
> > > idle."
> > > > > > > 
> > > > > > 
> > > > > > I’m not an expert here by any means, but everything above
> > > > > > from Matt
> > > > > > seems like valid concerns. Thomas also raised some concerns
> > > > > > in the
> > > > > > two previous revisions; again I’m not an expert, but
> > > > > > reading through
> > > > > > those, it doesn’t really seem like he received proper
> > > > > > answers to his
> > > questions.
> > > > > 
> > > > > Its forcing flush via tlb invalidation PPC flag under
> > > > > xe_invalidate_vma( ).
> > > > 
> > > > By the way, what is "PPC?"  It seems like it's another new
> > > > synonym for
> > > > the device cache?  It's already really confusing that some of
> > > > our
> > > > hardware docs use a mix of both "L2" and "L3" to refer to the
> > > > same
> > > > device cache for historical reasons...
> > > 
> > > Private-physical-cache. It's just what hw side calls the device
> > > side
> > > l2/l3 on newer igpu (I think LNL+). I assume it is quite
> > > different from the
> > > implementation on dgpu, from HW pov, which is maybe why it has a
> > > special
> > > name. On dgpu they just refer to it as plain l2/l3, not PPC. On
> > > dgpu there is
> > > also additional SMRO (system-memory-read-only) device cache,
> > > which I
> > > assume is caching reads over pci to system memory from GPU side,
> > > but that
> > > gets flushed at the usual places, like end of submission etc.
> > > Setting the PPC bit
> > > on Guc TLB inval will only flush SMRO on dgpu.
> > 
> > I had a conversation with Michal and also got confirmation from Jose
> > about their usage of pat_index for userptr; summarising it below:
> > 
> > UMD Compute, with userptr :
> > - Uses pat_index 19 (which is App-transient(XA) and 1Way-coh)
> > - Flushing: will be taken care of by the app-transient ability in HW
> > if media is off; if media is on, a full flush will happen anyway.
> > 
> > Mesa, with userptr:
> > - Uses 2way-coh
> > - Flushing: 2-way coherency will take care of it

Mesa uses 1-way coherency; 2-way coherency has a larger impact on iGPUs
not backed by an LLC.

> > 
> > With the above understanding it seems we can drop this patch as it's
> > not needed; instead we need a new patch to validate that the right
> > pat-index coherency is used and, if not, reject the request.
> 
> I think that only resolves the userptr side. We still need something
> for the eviction/shrinker path with normal BOs, which this patch was
> also trying to address, right?
> 
> > 
> > Tejas
> > > 
> > > > 
> > > > 
> > > > Matt
> > > > 
> > > > > 
> > > > > > 
> > > > > > A couple of comments below.
> > > > > > 
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > 
> > > > > > > > V2(MattA): Expand commit description
> > > > > > > > 
> > > > > > > > Signed-off-by: Tejas Upadhyay
> > > > > > > > <tejas.upadhyay@intel.com>
> > > > > > > > ---
> > > > > > > >    drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > > >    drivers/gpu/drm/xe/xe_device.c  | 23
> > > > > > > > +++++++++++++++++++++++
> > > > > > > > drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > > drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > > >    4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > b/drivers/gpu/drm/xe/xe_bo.c index
> > > > > > > > e9180b01a4e4..4455886b211e
> > > > > > > > 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > @@ -689,7 +689,8 @@ static int
> > > > > > > > xe_bo_trigger_rebind(struct
> > > > > > > > xe_device *xe, struct xe_bo *bo,
> > > > > > > > 
> > > > > > > >    		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > > >    			drm_gpuvm_bo_evict(vm_bo,
> > > > > > > > true);
> > > > > > > > -			continue;
> > > > > > > > +			if
> > > > > > > > (!xe_device_needs_cache_flush(xe))
> > > > > > > > +				continue;
> > > > > > 
> > > > > > This will trigger a TLB invalidation (and I assume a cache
> > > > > > flush)
> > > > > > every time we move or free memory in the 3D stack if it has
> > > > > > a
> > > > > > binding. It also performs a synchronous wait on the BO
> > > > > > being idle.
> > > > > > Both of these are very expensive operations. I can’t
> > > > > > imagine the
> > > > > > granularity we want here is to do this on every move/free
> > > > > > with bindings.
> > > > > > 
> > > > > > Also, for LR compute with preempt fences, we would trigger
> > > > > > the
> > > > > > preempt fences during the wait, so a TLB invalidation after
> > > > > > this
> > > > > > seems unnecessary, though perhaps the cache flush is still
> > > > > > required?
> > > > > > 
> > > > > > I think this needs a bit more explanation, because without
> > > > > > knowing a
> > > > > > lot about the exact requirements, the implementation does
> > > > > > not look
> > > correct.
> > > > > 
> > > > > The thing is that we are trying to solve a problem with userptr
> > > > > with a non-XA pat: consider if that BO got moved while media is
> > > > > not active. As soon as media comes back active, stale cached
> > > > > entries of that object will be flushed as part of a full flush,
> > > > > which may corrupt things.
> > > > > The thinking was that with this patch we would at least solve
> > > > > the corruption problem, and later the page_reclamation feature
> > > > > would help with performance as well. But now that the page
> > > > > reclamation feature has been merged earlier and is tightly
> > > > > coupled with bind/unbind, some cases like those discussed above
> > > > > (which do not unbind immediately on move/free) are missed in
> > > > > reclamation.
> > > > > 
> > > > > So the thought was to let this solution go in with a little perf
> > > > > hit and discuss with the page reclamation owner to come up with
> > > > > a cleaner solution together.
> > > > > 
> > > > > Tejas
> > > > > > 
> > > > > > > >    		}
> > > > > > > > 
> > > > > > > >    		if (!idle) {
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > b/drivers/gpu/drm/xe/xe_device.c index
> > > 743c18e0c580..da2abed94bc0
> > > > > > > > 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > @@ -1097,6 +1097,29 @@ static void
> > > > > > > > tdf_request_sync(struct
> > > > > > > > xe_device
> > > > > > *xe)
> > > > > > > >    	}
> > > > > > > >    }
> > > > > > > > 
> > > > > > > > +/**
> > > > > > > > + * xe_device_needs_cache_flush - Whether the cache
> > > > > > > > needs to be
> > > > > > > > +flushed
> > > > > > > > + * @xe: The device to check.
> > > > > > > > + *
> > > > > > > > + * Return: true if the device needs cache flush, false
> > > > > > > > otherwise.
> > > > > > > > + */
> > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > > > {
> > > > > > > > +	/* XA is *always* flushed, like at the end-of-submission
> > > > > > > > +	 * (and maybe other places), just that internally as an
> > > > > > > > +	 * optimisation hw doesn't need to make that a full flush
> > > > > > > > +	 * (which will also include XA) when Media is
> > > > > > > > +	 * off/powergated, since it doesn't need to worry about GT
> > > > > > > > +	 * caches vs Media coherency, and only CPU vs GPU
> > > > > > > > +	 * coherency, so can make that flush a targeted XA flush,
> > > > > > > > +	 * since stuff tagged with XA now means it's shared with
> > > > > > > > +	 * the CPU. The main implication is that we now need to
> > > > > > > > +	 * somehow flush non-XA before freeing system memory
> > > > > > > > +	 * pages, otherwise dirty cachelines could be flushed
> > > > > > > > +	 * after the free (like if Media suddenly turns on and
> > > > > > > > +	 * does a full flush)
> > > > > > > > +	 */
> > > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > > +		return true;
> > > > > > > > +	return false;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >    void xe_device_l2_flush(struct xe_device *xe)  {
> > > > > > > >    	struct xe_gt *gt;
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > b/drivers/gpu/drm/xe/xe_device.h index
> > > 39464650533b..baf386e0e037
> > > > > > > > 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > @@ -184,6 +184,7 @@ void
> > > > > > > > xe_device_snapshot_print(struct
> > > xe_device
> > > > > > > > *xe, struct drm_printer *p);
> > > > > > > >    u64 xe_device_canonicalize_addr(struct xe_device
> > > > > > > > *xe, u64 address);
> > > > > > > >    u64 xe_device_uncanonicalize_addr(struct xe_device
> > > > > > > > *xe, u64
> > > > > > > > address);
> > > > > > > > 
> > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device
> > > > > > > > *xe);
> > > > > > > >    void xe_device_td_flush(struct xe_device *xe);  void
> > > > > > > > xe_device_l2_flush(struct xe_device *xe);
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > b/drivers/gpu/drm/xe/xe_userptr.c index
> > > e120323c43bc..b435ea7f9b66
> > > > > > > > 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > @@ -114,7 +114,8 @@ static void
> > > > > > > > __vma_userptr_invalidate(struct
> > > > > > > > xe_vm
> > > > > > *vm, struct xe_userptr_vma *uv
> > > > > > > >    				    false,
> > > > > > > > MAX_SCHEDULE_TIMEOUT);
> > > > > > > >    	XE_WARN_ON(err <= 0);
> > > > > > > > 
> > > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr-
> > > > > > > > >initial_bind) {
> > > > > > > > +	if ((xe_vm_in_fault_mode(vm) ||
> > > > > > > > xe_device_needs_cache_flush(vm-
> > > > > > > xe)) &&
> > > > > > > > +	    userptr->initial_bind) {
> > > > > > 
> > > > > > Same concern with the LR preempt fence as above — the
> > > > > > hardware will
> > > > > > be interrupted via preempt fences, so it doesn’t seem
> > > > > > necessary to
> > > > > > invalidate the TLBs but perhaps we need a cflush and TLB
> > > > > > invalidation is the mechanism for that too?
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > >    		err = xe_vm_invalidate_vma(vma);
> > > > > > > >    		XE_WARN_ON(err);
> > > > > > > >    	}
> > > > > > > > --
> > > > > > > > 2.52.0
> > > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Matt Roper
> > > > > > > Graphics Software Engineer
> > > > > > > Linux GPU Platform Enablement
> > > > > > > Intel Corporation
> > > > 
> > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-11 21:11         ` Matt Roper
  2026-02-12  9:53           ` Matthew Auld
@ 2026-02-13 16:23           ` Upadhyay, Tejas
  2026-02-13 16:48             ` Souza, Jose
  2026-02-16 10:56             ` Thomas Hellström
  2026-02-13 17:29           ` Matthew Auld
  2 siblings, 2 replies; 39+ messages in thread
From: Upadhyay, Tejas @ 2026-02-13 16:23 UTC (permalink / raw)
  To: Roper, Matthew D, Mrozek, Michal, Souza, Jose
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org, Auld, Matthew,
	thomas.hellstrom@linux.intel.com



> -----Original Message-----
> From: Roper, Matthew D <matthew.d.roper@intel.com>
> Sent: 12 February 2026 02:41
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> thomas.hellstrom@linux.intel.com
> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> cachelines manually
> 
> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: 11 February 2026 05:32
> > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> > > thomas.hellstrom@linux.intel.com
> > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > > cachelines manually
> > >
> > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> > > > > "eXtended Architecture" (XA) tagged memory—memory shared
> between
> > > the
> > > > > CPU and GPU
> > > >
> > > > I'm pretty sure this expansion of "XA" is wrong; where are you
> > > > seeing this definition?  Everything in the bspec indicates that XA
> > > > means "wb
> > > > - transient app" (similar to how "XD" is 'wb - transient display").
> > > > I'm not sure why exactly they picked "X" to refer to transient in
> > > > both of these cases, but I've never seen any documentation that
> > > > refers to it as "extended."
> > > >
> > > > > is treated differently from other GPU memory when the Media
> > > > > engine is
> > > power-gated.
> > > > >
> > > > > > XA is *always* flushed, like at the end-of-submission (and maybe
> > > > > > other
> > > >
> > > > I assume you're referring to the fact that the driver performs
> > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > MI_FLUSH_DW), and that depending on other state/optimizations in
> > > > the system, those flushes may flush the entire device cache, or
> > > > may only flush the subset of cache data that is not marked as
> > > > transient.  The way you worded this was confusing since it makes
> > > > it sound like cache flushes happen automatically somewhere in
> hardware/firmware.
> > > >
> > > > > places), just that internally as an optimisation hw doesn't need
> > > > > to make that a full flush (which will also include XA) when
> > > > > Media is off/powergated, since it doesn't need to worry about GT
> > > > > caches vs Media coherency, and only CPU vs GPU coherency, so can
> > > > > make that flush a targeted XA flush, since stuff tagged with XA
> > > > > now means it's shared with the CPU. The main implication is that
> > > > > we now need to somehow flush non-XA before freeing system memory
> > > > > pages, otherwise dirty cachelines could be flushed after the
> > > > > free (like if Media suddenly turns on and does a full flush)
> > > >
> > > > This description seems really confusing.  My understanding is that
> > > > marking something as wb-transient-app indicates that it might be
> > > > accessed by something other than our graphics/media IP (i.e.,
> > > > accessed from the CPU, exported to another device, etc.), so
> > > > transient data truly does need to be flushed at the points in the
> > > > driver where a flush typically happens.
> > > >
> > > > However when something is _not_ transient, then either:
> > > >  - it's "private" to the GPU and only our graphics/media IP will be
> > > >    accessing it
> > > >  - it's bound with a coherent PAT index so that outside observers like
> > > >    the CPU can snoop the device cache, even when the cache hasn't been
> > > >    flushed
> > > >
> > > > If media is not active, then there's really no need to include
> > > > non-transient data when an device cache flush happens since
> > > > there's no real need for the data to get to RAM.  So that enables
> > > > an optimization (which comes in your next patch), that allows
> > > > flushes to only operate on the subset of the device cache tagged as
> "transient" if media is idle.
> >
> > But what if we have stale non-XA-marked pages for userptr, and that
> > object moves out and at the same time media comes back? We will end up
> > with a full flush that flushes the stale entry to RAM.
> 
> What makes userptr special here?  During general, active usage, userptr would
> be data that's accessible by the CPU, so it needs to either be transient (so CPU
> can see the data in RAM after explicit flushes) or it needs to be using a
> coherent PAT (so that the CPU can just snoop the GPU cache).  If you marked
> userptr as both non-XA and non-coherent, then that sounds likely to be a
> userspace bug (and probably something we can catch and reject as an invalid
> case on any Xe3p or later platforms that support this) since the CPU wouldn't
> have any reliable way of seeing GPU updates.

Right. FYI @Mrozek, Michal @Souza, Jose
For userptr, as explained above, it needs to use either a coherent or an XA pat index, or else the KMD will reject it as an invalid case.
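
A sketch of what such a check could look like; this is illustrative only, not the actual Xe code — the table contents (apart from index 19, described earlier in the thread) and all names are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical coherency modes mirroring the pat_index discussion above. */
enum coh_mode { COH_NONE, COH_1WAY, COH_2WAY };

struct pat_entry {
	enum coh_mode coh;	/* CPU/GPU coherency of this PAT index */
	bool transient_app;	/* "XA" (wb - transient app) tagging */
};

/* Illustrative PAT table fragment: index 19 as app-transient (XA) and
 * 1-way coherent, per the thread; index 0 invented as a GPU-private,
 * non-coherent, non-transient entry. */
static const struct pat_entry pat_table[] = {
	[0]  = { COH_NONE, false },
	[19] = { COH_1WAY, true },
};

/* Reject a userptr bind whose PAT index is neither coherent nor XA:
 * the CPU would have no reliable way of observing GPU writes. */
static bool userptr_pat_index_valid(uint32_t pat_index)
{
	if (pat_index >= sizeof(pat_table) / sizeof(pat_table[0]))
		return false;

	return pat_table[pat_index].coh != COH_NONE ||
	       pat_table[pat_index].transient_app;
}
```

In the real driver such a check would presumably sit on the vm_bind path for userptr VMAs and return -EINVAL rather than false.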

> 
> If something happens that changes the GTT mapping of an object, then
> doesn't that already trigger a TLB invalidation when necessary in the driver
> today?  It was my understanding that "heavy" TLB invalidations wait for data
> values to be globally observable before starting, so I think that would ensure
> that any non-XA data makes it to RAM before any binding changes, object,
> destruction, etc.?  Is there something special about userptr that makes that
> case more of a problem?
> 
> I just found bspec page 74635 which gives an overview of the various flush
> and invalidate cases, and I don't see anything there that makes it obvious to
> me that userptr would be special.
> 
> 
> >
> > > >
> > > > As you said, we eventually do want to force a flush of the
> > > > non-transient data as well once we're freeing the underlying pages.
> > > > So how do we do that?  It's not clear to me how the changes below
> > > > are accomplishing that.  Is there a way to explicitly request a
> > > > full device cache flush (ignoring the transient vs non-transient tagging)?
> > > > Since the GuC handles the optimization in the next patch (toggling
> > > > whether flushes are full flushes vs non-transient flushes
> > > > depending on whether media is active), I thought there might be
> > > > some kind of GuC interface to request "please do one full flush now, even
> if media is idle."
> > > >
> > >
> > > I’m not an expert here by any means, but everything above from Matt
> > > seems like valid concerns. Thomas also raised some concerns in the
> > > two previous revisions; again I’m not an expert, but reading through
> > > those, it doesn’t really seem like he received proper answers to his
> questions.
> >
> > It's forcing a flush via the TLB invalidation PPC flag under xe_invalidate_vma().
> 
> By the way, what is "PPC?"  It seems like it's another new synonym for the
> device cache?  It's already really confusing that some of our hardware docs use
> a mix of both "L2" and "L3" to refer to the same device cache for historical
> reasons...
> 
> 
> Matt
> 
> >
> > >
> > > A couple of comments below.
> > >
> > > >
> > > > Matt
> > > >
> > > > >
> > > > > V2(MattA): Expand commit description
> > > > >
> > > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > > > b/drivers/gpu/drm/xe/xe_bo.c index e9180b01a4e4..4455886b211e
> > > > > 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct
> > > > > xe_device *xe, struct xe_bo *bo,
> > > > >
> > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > -			continue;
> > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > +				continue;

Matt R,
This flush is still needed, as there can be non-XA buffers that are evicted while media is off, whose stale entries can then be flushed when media comes back on. That was not the case earlier, since a full flush was happening at the regular sync points, and that is exactly where this feature brings the optimization now.

Tejas
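
To make the hazard above concrete, the media-dependent flush scope can be modelled as a tiny state check; this is purely an illustrative sketch, with all names invented (it is not the real GuC interface):

```c
#include <assert.h>
#include <stdbool.h>

/* While media is powered off, the hardware only needs CPU vs GPU
 * coherency, so a device-cache flush can be downgraded to a targeted
 * transient ("XA") flush; once media is active again, GT vs Media
 * coherency forces a full flush, which also writes out any stale
 * non-XA cachelines. */
enum flush_scope {
	FLUSH_TRANSIENT_ONLY,	/* only XA-tagged lines reach RAM */
	FLUSH_FULL,		/* every dirty line reaches RAM */
};

static enum flush_scope device_flush_scope(bool media_active)
{
	return media_active ? FLUSH_FULL : FLUSH_TRANSIENT_ONLY;
}

/* The hazard: a dirty non-XA line belonging to an evicted BO is only
 * written back by a FULL flush, i.e. possibly only after its backing
 * page has been freed, unless the driver flushed it manually. */
static bool stale_nonxa_line_can_leak(bool media_active_at_evict,
				      bool flushed_manually)
{
	return !flushed_manually &&
	       device_flush_scope(media_active_at_evict) != FLUSH_FULL;
}
```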

> > >
> > > This will trigger a TLB invalidation (and I assume a cache flush)
> > > every time we move or free memory in the 3D stack if it has a
> > > binding. It also performs a synchronous wait on the BO being idle.
> > > Both of these are very expensive operations. I can’t imagine the
> > > granularity we want here is to do this on every move/free with bindings.
> > >
> > > Also, for LR compute with preempt fences, we would trigger the
> > > preempt fences during the wait, so a TLB invalidation after this
> > > seems unnecessary, though perhaps the cache flush is still required?
> > >
> > > I think this needs a bit more explanation, because without knowing a
> > > lot about the exact requirements, the implementation does not look
> correct.
> >
> > The thing is that we are trying to solve a problem with userptr with a
> > non-XA pat: consider if that BO got moved while media is not active. As
> > soon as media comes back active, stale cached entries of that object
> > will be flushed as part of a full flush, which may corrupt things.
> > The thinking was that with this patch we would at least solve the
> > corruption problem, and later the page_reclamation feature would help
> > with performance as well. But now that the page reclamation feature has
> > been merged earlier and is tightly coupled with bind/unbind, some cases
> > like those discussed above (which do not unbind immediately on
> > move/free) are missed in reclamation.
> >
> > So the thought was to let this solution go in with a little perf hit
> > and discuss with the page reclamation owner to come up with a cleaner
> > solution together.
> >
> > Tejas
> > >
> > > > >  		}
> > > > >
> > > > >  		if (!idle) {
> > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > > > b/drivers/gpu/drm/xe/xe_device.c index
> > > > > 743c18e0c580..da2abed94bc0
> > > > > 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct
> > > > > xe_device
> > > *xe)
> > > > >  	}
> > > > >  }
> > > > >
> > > > > +/**
> > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be
> > > > > +flushed
> > > > > + * @xe: The device to check.
> > > > > + *
> > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > + */
> > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe) {
> > > > > > +	/* XA is *always* flushed, like at the end-of-submission
> > > > > > +	 * (and maybe other places), just that internally as an
> > > > > > +	 * optimisation hw doesn't need to make that a full flush
> > > > > > +	 * (which will also include XA) when Media is
> > > > > > +	 * off/powergated, since it doesn't need to worry about GT
> > > > > > +	 * caches vs Media coherency, and only CPU vs GPU coherency,
> > > > > > +	 * so can make that flush a targeted XA flush, since stuff
> > > > > > +	 * tagged with XA now means it's shared with the CPU. The
> > > > > > +	 * main implication is that we now need to somehow flush
> > > > > > +	 * non-XA before freeing system memory pages, otherwise
> > > > > > +	 * dirty cachelines could be flushed after the free (like if
> > > > > > +	 * Media suddenly turns on and does a full flush)
> > > > > > +	 */
> > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > +		return true;
> > > > > +	return false;
> > > > > +}
> > > > > +
> > > > >  void xe_device_l2_flush(struct xe_device *xe)  {
> > > > >  	struct xe_gt *gt;
> > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > > > > b/drivers/gpu/drm/xe/xe_device.h index
> > > > > 39464650533b..baf386e0e037
> > > > > 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct
> > > > > xe_device *xe, struct drm_printer *p);
> > > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64
> > > > > address);
> > > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64
> > > > > address);
> > > > >
> > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > >  void xe_device_td_flush(struct xe_device *xe);  void
> > > > > xe_device_l2_flush(struct xe_device *xe);
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > b/drivers/gpu/drm/xe/xe_userptr.c index
> > > > > e120323c43bc..b435ea7f9b66
> > > > > 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct
> > > > > xe_vm
> > > *vm, struct xe_userptr_vma *uv
> > > > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > > > >  	XE_WARN_ON(err <= 0);
> > > > >
> > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > +	if ((xe_vm_in_fault_mode(vm) ||
> > > > > +xe_device_needs_cache_flush(vm-
> > > >xe)) &&
> > > > > +	    userptr->initial_bind) {
> > >
> > > Same concern with the LR preempt fence as above — the hardware will
> > > be interrupted via preempt fences, so it doesn’t seem necessary to
> > > invalidate the TLBs but perhaps we need a cflush and TLB
> > > invalidation is the mechanism for that too?
> > >
> > > Matt
> > >
> > > > >  		err = xe_vm_invalidate_vma(vma);
> > > > >  		XE_WARN_ON(err);
> > > > >  	}
> > > > > --
> > > > > 2.52.0
> > > > >
> > > >
> > > > --
> > > > Matt Roper
> > > > Graphics Software Engineer
> > > > Linux GPU Platform Enablement
> > > > Intel Corporation
> 
> --
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 16:23           ` Upadhyay, Tejas
@ 2026-02-13 16:48             ` Souza, Jose
  2026-02-13 17:16               ` Matt Roper
  2026-02-16 10:56             ` Thomas Hellström
  1 sibling, 1 reply; 39+ messages in thread
From: Souza, Jose @ 2026-02-13 16:48 UTC (permalink / raw)
  To: Upadhyay, Tejas, Roper, Matthew D, Mrozek, Michal
  Cc: intel-xe@lists.freedesktop.org, Brost,  Matthew, Auld, Matthew,
	thomas.hellstrom@linux.intel.com

On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> 
> 
> > -----Original Message-----
> > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > Sent: 12 February 2026 02:41
> > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> > thomas.hellstrom@linux.intel.com
> > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > cachelines manually
> > 
> > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > Sent: 11 February 2026 05:32
> > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > <matthew.auld@intel.com>;
> > > > thomas.hellstrom@linux.intel.com
> > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > userptr/shrinker bo
> > > > cachelines manually
> > > > 
> > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
> > > > > wrote:
> > > > > > "eXtended Architecture" (XA) tagged memory—memory shared
> > between
> > > > the
> > > > > > CPU and GPU
> > > > > 
> > > > > I'm pretty sure this expansion of "XA" is wrong; where are
> > > > > you
> > > > > seeing this definition?  Everything in the bspec indicates
> > > > > that XA
> > > > > means "wb
> > > > > - transient app" (similar to how "XD" is 'wb - transient
> > > > > display").
> > > > > I'm not sure why exactly they picked "X" to refer to
> > > > > transient in
> > > > > both of these cases, but I've never seen any documentation
> > > > > that
> > > > > refers to it as "extended."
> > > > > 
> > > > > > is treated differently from other GPU memory when the Media
> > > > > > engine is
> > > > power-gated.
> > > > > > 
> > > > > > XA is *always* flushed, like at the end-of-submission (and
> > > > > > maybe
> > > > > > other
> > > > > 
> > > > > I assume you're referring to the fact that the driver
> > > > > performs
> > > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > > MI_FLUSH_DW), and that depending on other state/optimizations
> > > > > in
> > > > > the system, those flushes may flush the entire device cache,
> > > > > or
> > > > > may only flush the subset of cache data that is not marked as
> > > > > transient.  The way you worded this was confusing since it
> > > > > makes
> > > > > it sound like cache flushes happen automatically somewhere in
> > hardware/firmware.
> > > > > 
> > > > > > places), just that internally as an optimisation hw doesn't
> > > > > > need
> > > > > > to make that a full flush (which will also include XA) when
> > > > > > Media is off/powergated, since it doesn't need to worry
> > > > > > about GT
> > > > > > caches vs Media coherency, and only CPU vs GPU coherency,
> > > > > > so can
> > > > > > make that flush a targeted XA flush, since stuff tagged
> > > > > > with XA
> > > > > > now means it's shared with the CPU. The main implication is
> > > > > > that
> > > > > > we now need to somehow flush non-XA before freeing system
> > > > > > memory
> > > > > > pages, otherwise dirty cachelines could be flushed after
> > > > > > the
> > > > > > free (like if Media suddenly turns on and does a full
> > > > > > flush)
> > > > > 
> > > > > This description seems really confusing.  My understanding is
> > > > > that
> > > > > marking something as wb-transient-app indicates that it might
> > > > > be
> > > > > accessed by something other than our graphics/media IP (i.e.,
> > > > > accessed from the CPU, exported to another device, etc.), so
> > > > > transient data truly does need to be flushed at the points in
> > > > > the
> > > > > driver where a flush typically happens.
> > > > > 
> > > > > However when something is _not_ transient, then either:
> > > > >  - it's "private" to the GPU and only our graphics/media IP
> > > > > will be
> > > > >    accessing it
> > > > >  - it's bound with a coherent PAT index so that outside
> > > > > observers like
> > > > >    the CPU can snoop the device cache, even when the cache
> > > > > hasn't been
> > > > >    flushed
> > > > > 
> > > > > If media is not active, then there's really no need to
> > > > > include
> > > > > non-transient data when an device cache flush happens since
> > > > > there's no real need for the data to get to RAM.  So that
> > > > > enables
> > > > > an optimization (which comes in your next patch), that allows
> > > > > flushes to only operate on the subset of the device cache
> > > > > tagged as
> > "transient" if media is idle.
> > > 
> > > But what if we have stale non-XA-marked pages for userptr, and that
> > > object moves out and at the same time media comes back? We will end
> > > up with a full flush that flushes the stale entry to RAM.
> > 
> > What makes userptr special here?  During general, active usage,
> > userptr would
> > be data that's accessible by the CPU, so it needs to either be
> > transient (so CPU
> > can see the data in RAM after explicit flushes) or it needs to be
> > using a
> > coherent PAT (so that the CPU can just snoop the GPU cache).  If
> > you marked
> > userptr as both non-XA and non-coherent, then that sounds likely to
> > be a
> > userspace bug (and probably something we can catch and reject as an
> > invalid
> > case on any Xe3p or later platforms that support this) since the
> > CPU wouldn't
> > have any reliable way of seeing GPU updates.
> 
> Right. FYI @Mrozek, Michal @Souza, Jose
> For userptr, as explained above, it needs to use either a coherent or
> an XA pat index, or else the KMD will reject it as an invalid case.
> 
> 
The coherency restriction is already in the uAPI:

"Note: For userptr and externally imported dma-buf the kernel expects
either 1WAY or 2WAY for the @pat_index."

Using 1-way is enough, as the Xe KMD does a PIPE_CONTROL that flushes
GPU caches at the end of batch buffers.
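
As a rough mental model of the 1-way vs 2-way distinction (illustrative only; the struct and helper are invented, not uAPI):

```c
#include <assert.h>
#include <stdbool.h>

/* 1-way coherency: the GPU snoops CPU caches, so CPU writes are visible
 * to the GPU without flushing, but the CPU only sees GPU writes after an
 * explicit flush (e.g. the end-of-batch PIPE_CONTROL mentioned above).
 * 2-way coherency: both directions are snooped. */
struct coh_model {
	bool gpu_snoops_cpu;	/* CPU writes visible to GPU without flush */
	bool cpu_snoops_gpu;	/* GPU writes visible to CPU without flush */
};

static const struct coh_model one_way = { .gpu_snoops_cpu = true,
					  .cpu_snoops_gpu = false };
static const struct coh_model two_way = { .gpu_snoops_cpu = true,
					  .cpu_snoops_gpu = true };

/* Does the CPU need a device-cache flush to observe a GPU write? */
static bool cpu_needs_flush(const struct coh_model *m)
{
	return !m->cpu_snoops_gpu;
}
```

Under this model, 1-way suffices for userptr precisely because the driver guarantees that flush at end of batch.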

> 
> > 
> > If something happens that changes the GTT mapping of an object,
> > then
> > doesn't that already trigger a TLB invalidation when necessary in
> > the driver
> > today?  It was my understanding that "heavy" TLB invalidations wait
> > for data
> > values to be globally observable before starting, so I think that
> > would ensure
> > that any non-XA data makes it to RAM before any binding changes,
> > object,
> > destruction, etc.?  Is there something special about userptr that
> > makes that
> > case more of a problem?
> > 
> > I just found bspec page 74635 which gives an overview of the
> > various flush
> > and invalidate cases, and I don't see anything there that makes it
> > obvious to
> > me that userptr would be special.
> > 
> > 
> > > 
> > > > > 
> > > > > As you said, we eventually do want to force a flush of the
> > > > > non-transient data as well once we're freeing the underlying
> > > > > pages.
> > > > > So how do we do that?  It's not clear to me how the changes
> > > > > below
> > > > > are accomplishing that.  Is there a way to explicitly request
> > > > > a
> > > > > full device cache flush (ignoring the transient vs non-
> > > > > transient tagging)?
> > > > > Since the GuC handles the optimization in the next patch
> > > > > (toggling
> > > > > whether flushes are full flushes vs non-transient flushes
> > > > > depending on whether media is active), I thought there might
> > > > > be
> > > > > some kind of GuC interface to request "please do one full
> > > > > > flush now, even if media is idle."
> > > > > 
> > > > 
> > > > I’m not an expert here by any means, but everything above from
> > > > Matt
> > > > seems like valid concerns. Thomas also raised some concerns in
> > > > the
> > > > two previous revisions; again I’m not an expert, but reading
> > > > through
> > > > those, it doesn’t really seem like he received proper answers
> > > > > to his questions.
> > > 
> > > It's forcing the flush via the TLB invalidation PPC flag under
> > > xe_invalidate_vma().
> > 
> > By the way, what is "PPC?"  It seems like it's another new synonym
> > for the
> > device cache?  It's already really confusing that some of our
> > hardware docs use
> > a mix of both "L2" and "L3" to refer to the same device cache for
> > historical
> > reasons...
> > 
> > 
> > Matt
> > 
> > > 
> > > > 
> > > > A couple of comments below.
> > > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > 
> > > > > > V2(MattA): Expand commit description
> > > > > > 
> > > > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > 
> > > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > -			continue;
> > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > +				continue;
> 
> Matt R,
> This flush is still needed, as there can be non-XA buffers that get
> evicted while media is off, and their stale entries can be flushed out
> when media comes back on. That was not the case earlier, since a full
> flush happened at regular sync points, and that is where this feature
> brings the optimization now.
> 
> Tejas
> 
> > > > 
> > > > This will trigger a TLB invalidation (and I assume a cache
> > > > flush)
> > > > every time we move or free memory in the 3D stack if it has a
> > > > binding. It also performs a synchronous wait on the BO being
> > > > idle.
> > > > Both of these are very expensive operations. I can’t imagine
> > > > the
> > > > granularity we want here is to do this on every move/free with
> > > > bindings.
> > > > 
> > > > Also, for LR compute with preempt fences, we would trigger the
> > > > preempt fences during the wait, so a TLB invalidation after
> > > > this
> > > > seems unnecessary, though perhaps the cache flush is still
> > > > required?
> > > > 
> > > > I think this needs a bit more explanation, because without
> > > > knowing a
> > > > lot about the exact requirements, the implementation does not
> > > > look correct.
> > > 
> > > The thing is that we are trying to solve a problem with userptr with
> > > a non-XA PAT: consider if that BO got moved while media is not
> > > active. As soon as media comes back active, stale cached entries of
> > > that object will be flushed as part of a full flush, which may
> > > corrupt things.
> > > The thinking was that with this patch we would at least solve the
> > > corruption problem, and later, when the page reclamation feature
> > > came in, it would help with performance as well. But now the page
> > > reclamation feature has merged earlier, and since it is tightly
> > > coupled with bind/unbind, some cases like those discussed above
> > > (which do not unbind immediately on move/free) are missed by
> > > reclamation.
> > > 
> > > So the thought was to let this solution go in with a small perf hit
> > > and discuss with the page reclamation owner to come up with a
> > > cleaner solution together.
> > > 
> > > Tejas
> > > > 
> > > > > >  		}
> > > > > > 
> > > > > >  		if (!idle) {
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > >  	}
> > > > > >  }
> > > > > > 
> > > > > > +/**
> > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > + * @xe: The device to check.
> > > > > > + *
> > > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > > + */
> > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > +{
> > > > > > +	/* XA is *always* flushed, like at the end-of-submission (and
> > > > > > +	 * maybe other places), just that internally as an optimisation
> > > > > > +	 * hw doesn't need to make that a full flush (which will also
> > > > > > +	 * include XA) when Media is off/powergated, since it doesn't
> > > > > > +	 * need to worry about GT caches vs Media coherency, and only
> > > > > > +	 * CPU vs GPU coherency, so can make that flush a targeted XA
> > > > > > +	 * flush, since stuff tagged with XA now means it's shared with
> > > > > > +	 * the CPU. The main implication is that we now need to somehow
> > > > > > +	 * flush non-XA before freeing system memory pages, otherwise
> > > > > > +	 * dirty cachelines could be flushed after the free (like if
> > > > > > +	 * Media suddenly turns on and does a full flush)
> > > > > > +	 */
> > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > +		return true;
> > > > > > +	return false;
> > > > > > +}
> > > > > > +
> > > > > >  void xe_device_l2_flush(struct xe_device *xe)
> > > > > >  {
> > > > > >  	struct xe_gt *gt;
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > 
> > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > >  void xe_device_td_flush(struct xe_device *xe);
> > > > > >  void xe_device_l2_flush(struct xe_device *xe);
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > >  	XE_WARN_ON(err <= 0);
> > > > > > 
> > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > +	    userptr->initial_bind) {
> > > > 
> > > > Same concern with the LR preempt fence as above — the hardware
> > > > will
> > > > be interrupted via preempt fences, so it doesn’t seem necessary
> > > > to
> > > > invalidate the TLBs but perhaps we need a cflush and TLB
> > > > invalidation is the mechanism for that too?
> > > > 
> > > > Matt
> > > > 
> > > > > >  		err = xe_vm_invalidate_vma(vma);
> > > > > >  		XE_WARN_ON(err);
> > > > > >  	}
> > > > > > --
> > > > > > 2.52.0
> > > > > > 
> > > > > 
> > > > > --
> > > > > Matt Roper
> > > > > Graphics Software Engineer
> > > > > Linux GPU Platform Enablement
> > > > > Intel Corporation
> > 
> > --
> > Matt Roper
> > Graphics Software Engineer
> > Linux GPU Platform Enablement
> > Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 16:48             ` Souza, Jose
@ 2026-02-13 17:16               ` Matt Roper
  2026-02-13 17:31                 ` Souza, Jose
  2026-02-13 17:31                 ` Matthew Auld
  0 siblings, 2 replies; 39+ messages in thread
From: Matt Roper @ 2026-02-13 17:16 UTC (permalink / raw)
  To: Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew, Auld, Matthew, thomas.hellstrom@linux.intel.com

On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > > Sent: 12 February 2026 02:41
> > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > > xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> > > thomas.hellstrom@linux.intel.com
> > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > > cachelines manually
> > > 
> > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: 11 February 2026 05:32
> > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > <matthew.auld@intel.com>;
> > > > > thomas.hellstrom@linux.intel.com
> > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > userptr/shrinker bo
> > > > > cachelines manually
> > > > > 
> > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
> > > > > > wrote:
> > > > > > > "eXtended Architecture" (XA) tagged memory—memory shared
> > > > > > > between the CPU and GPU
> > > > > > 
> > > > > > I'm pretty sure this expansion of "XA" is wrong; where are
> > > > > > you
> > > > > > seeing this definition?  Everything in the bspec indicates
> > > > > > that XA
> > > > > > means "wb
> > > > > > - transient app" (similar to how "XD" is 'wb - transient
> > > > > > display").
> > > > > > I'm not sure why exactly they picked "X" to refer to
> > > > > > transient in
> > > > > > both of these cases, but I've never seen any documentation
> > > > > > that
> > > > > > refers to it as "extended."
> > > > > > 
> > > > > > > is treated differently from other GPU memory when the Media
> > > > > > > engine is power-gated.
> > > > > > > 
> > > > > > > XA is *always* flushed, like at the end-of-submission (and
> > > > > > > maybe
> > > > > > > other
> > > > > > 
> > > > > > I assume you're referring to the fact that the driver
> > > > > > performs
> > > > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > > > MI_FLUSH_DW), and that depending on other state/optimizations
> > > > > > in
> > > > > > the system, those flushes may flush the entire device cache,
> > > > > > or
> > > > > > may only flush the subset of cache data that is not marked as
> > > > > > transient.  The way you worded this was confusing since it
> > > > > > makes
> > > > > > it sound like cache flushes happen automatically somewhere in
> > > hardware/firmware.
> > > > > > 
> > > > > > > places), just that internally as an optimisation hw doesn't
> > > > > > > need
> > > > > > > to make that a full flush (which will also include XA) when
> > > > > > > Media is off/powergated, since it doesn't need to worry
> > > > > > > about GT
> > > > > > > caches vs Media coherency, and only CPU vs GPU coherency,
> > > > > > > so can
> > > > > > > make that flush a targeted XA flush, since stuff tagged
> > > > > > > with XA
> > > > > > > now means it's shared with the CPU. The main implication is
> > > > > > > that
> > > > > > > we now need to somehow flush non-XA before freeing system
> > > > > > > memory
> > > > > > > pages, otherwise dirty cachelines could be flushed after
> > > > > > > the
> > > > > > > free (like if Media suddenly turns on and does a full
> > > > > > > flush)
> > > > > > 
> > > > > > This description seems really confusing.  My understanding is
> > > > > > that
> > > > > > marking something as wb-transient-app indicates that it might
> > > > > > be
> > > > > > accessed by something other than our graphics/media IP (i.e.,
> > > > > > accessed from the CPU, exported to another device, etc.), so
> > > > > > transient data truly does need to be flushed at the points in
> > > > > > the
> > > > > > driver where a flush typically happens.
> > > > > > 
> > > > > > However when something is _not_ transient, then either:
> > > > > >  - it's "private" to the GPU and only our graphics/media IP
> > > > > > will be
> > > > > >    accessing it
> > > > > >  - it's bound with a coherent PAT index so that outside
> > > > > > observers like
> > > > > >    the CPU can snoop the device cache, even when the cache
> > > > > > hasn't been
> > > > > >    flushed
> > > > > > 
> > > > > > If media is not active, then there's really no need to
> > > > > > include
> > > > > > non-transient data when an device cache flush happens since
> > > > > > there's no real need for the data to get to RAM.  So that
> > > > > > enables
> > > > > > an optimization (which comes in your next patch), that allows
> > > > > > flushes to only operate on the subset of the device cache
> > > > > > tagged as
> > > "transient" if media is idle.
> > > > 
> > > > But what if we have stale non-XA marked pages for userptr, and that
> > > > object moves out and at the same time media comes back? That will
> > > > end up in a full flush and flush the stale entries to RAM.
> > > 
> > > What makes userptr special here?  During general, active usage,
> > > userptr would
> > > be data that's accessible by the CPU, so it needs to either be
> > > transient (so CPU
> > > can see the data in RAM after explicit flushes) or it needs to be
> > > using a
> > > coherent PAT (so that the CPU can just snoop the GPU cache).  If
> > > you marked
> > > userptr as both non-XA and non-coherent, then that sounds likely to
> > > be a
> > > userspace bug (and probably something we can catch and reject as an
> > > invalid
> > > case on any Xe3p or later platforms that support this) since the
> > > CPU wouldn't
> > > have any reliable way of seeing GPU updates.
> > 
> > Right. FYI @Mrozek, Michal @Souza, Jose
> > For userptr, as explained above, it needs to use either a coherent or
> > an XA PAT index, or else the KMD will reject it as an invalid case.
> > 
> > 
> The coherency restriction is already in the uAPI:
> 
> "Note: For userptr and externally imported dma-buf the kernel expects
> either 1WAY or 2WAY for the @pat_index."
> 
> Using 1-way is enough, as the Xe KMD does a PIPE_CONTROL flushing GPU
> caches at the end of batch buffers.

But isn't that what we're discussing here?  1-way *won't* necessarily be
enough anymore because PIPE_CONTROL instructions don't flush the entire
cache anymore.  Whenever the GuC determines that media is inactive and
activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. change
behavior to only flush out the subset of data that was marked as
app-transient; anything not marked that way doesn't get flushed now.  So
there's a new requirement here that you ensure you're using an XA PAT
index, or you switch to use 2-way coherency which will allow the CPU to
snoop the GPU's caches.


Matt

> 
> > 
> > > 
> > > If something happens that changes the GTT mapping of an object,
> > > then
> > > doesn't that already trigger a TLB invalidation when necessary in
> > > the driver
> > > today?  It was my understanding that "heavy" TLB invalidations wait
> > > for data
> > > values to be globally observable before starting, so I think that
> > > would ensure
> > > that any non-XA data makes it to RAM before any binding changes,
> > > object,
> > > destruction, etc.?  Is there something special about userptr that
> > > makes that
> > > case more of a problem?
> > > 
> > > I just found bspec page 74635 which gives an overview of the
> > > various flush
> > > and invalidate cases, and I don't see anything there that makes it
> > > obvious to
> > > me that userptr would be special.
> > > 
> > > 
> > > > 
> > > > > > 
> > > > > > As you said, we eventually do want to force a flush of the
> > > > > > non-transient data as well once we're freeing the underlying
> > > > > > pages.
> > > > > > So how do we do that?  It's not clear to me how the changes
> > > > > > below
> > > > > > are accomplishing that.  Is there a way to explicitly request
> > > > > > a
> > > > > > full device cache flush (ignoring the transient vs non-
> > > > > > transient tagging)?
> > > > > > Since the GuC handles the optimization in the next patch
> > > > > > (toggling
> > > > > > whether flushes are full flushes vs non-transient flushes
> > > > > > depending on whether media is active), I thought there might
> > > > > > be
> > > > > > some kind of GuC interface to request "please do one full
> > > > > > flush now, even
> > > if media is idle."
> > > > > > 
> > > > > 
> > > > > I’m not an expert here by any means, but everything above from
> > > > > Matt
> > > > > seems like valid concerns. Thomas also raised some concerns in
> > > > > the
> > > > > two previous revisions; again I’m not an expert, but reading
> > > > > through
> > > > > those, it doesn’t really seem like he received proper answers
> > > > > to his questions.
> > > > 
> > > > It's forcing the flush via the TLB invalidation PPC flag under
> > > > xe_invalidate_vma().
> > > 
> > > By the way, what is "PPC?"  It seems like it's another new synonym
> > > for the
> > > device cache?  It's already really confusing that some of our
> > > hardware docs use
> > > a mix of both "L2" and "L3" to refer to the same device cache for
> > > historical
> > > reasons...
> > > 
> > > 
> > > Matt
> > > 
> > > > 
> > > > > 
> > > > > A couple of comments below.
> > > > > 
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > 
> > > > > > > V2(MattA): Expand commit description
> > > > > > > 
> > > > > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > > >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > > 
> > > > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > -			continue;
> > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > +				continue;
> > 
> > Matt R,
> > This flush is still needed, as there can be non-XA buffers that get
> > evicted while media is off, and their stale entries can be flushed out
> > when media comes back on. That was not the case earlier, since a full
> > flush happened at regular sync points, and that is where this feature
> > brings the optimization now.
> > 
> > Tejas
> > 
> > > > > 
> > > > > This will trigger a TLB invalidation (and I assume a cache
> > > > > flush)
> > > > > every time we move or free memory in the 3D stack if it has a
> > > > > binding. It also performs a synchronous wait on the BO being
> > > > > idle.
> > > > > Both of these are very expensive operations. I can’t imagine
> > > > > the
> > > > > granularity we want here is to do this on every move/free with
> > > > > bindings.
> > > > > 
> > > > > Also, for LR compute with preempt fences, we would trigger the
> > > > > preempt fences during the wait, so a TLB invalidation after
> > > > > this
> > > > > seems unnecessary, though perhaps the cache flush is still
> > > > > required?
> > > > > 
> > > > > I think this needs a bit more explanation, because without
> > > > > knowing a
> > > > > lot about the exact requirements, the implementation does not
> > > > > look correct.
> > > > 
> > > > The thing is that we are trying to solve a problem with userptr
> > > > with a non-XA PAT: consider if that BO got moved while media is not
> > > > active. As soon as media comes back active, stale cached entries of
> > > > that object will be flushed as part of a full flush, which may
> > > > corrupt things.
> > > > The thinking was that with this patch we would at least solve the
> > > > corruption problem, and later, when the page reclamation feature
> > > > came in, it would help with performance as well. But now the page
> > > > reclamation feature has merged earlier, and since it is tightly
> > > > coupled with bind/unbind, some cases like those discussed above
> > > > (which do not unbind immediately on move/free) are missed by
> > > > reclamation.
> > > > 
> > > > So the thought was to let this solution go in with a small perf hit
> > > > and discuss with the page reclamation owner to come up with a
> > > > cleaner solution together.
> > > > 
> > > > Tejas
> > > > > 
> > > > > > >  		}
> > > > > > > 
> > > > > > >  		if (!idle) {
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > >  	}
> > > > > > >  }
> > > > > > > 
> > > > > > > +/**
> > > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > > + * @xe: The device to check.
> > > > > > > + *
> > > > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > > > + */
> > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > > +{
> > > > > > > +	/* XA is *always* flushed, like at the end-of-submission (and
> > > > > > > +	 * maybe other places), just that internally as an optimisation
> > > > > > > +	 * hw doesn't need to make that a full flush (which will also
> > > > > > > +	 * include XA) when Media is off/powergated, since it doesn't
> > > > > > > +	 * need to worry about GT caches vs Media coherency, and only
> > > > > > > +	 * CPU vs GPU coherency, so can make that flush a targeted XA
> > > > > > > +	 * flush, since stuff tagged with XA now means it's shared with
> > > > > > > +	 * the CPU. The main implication is that we now need to somehow
> > > > > > > +	 * flush non-XA before freeing system memory pages, otherwise
> > > > > > > +	 * dirty cachelines could be flushed after the free (like if
> > > > > > > +	 * Media suddenly turns on and does a full flush)
> > > > > > > +	 */
> > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > +		return true;
> > > > > > > +	return false;
> > > > > > > +}
> > > > > > > +
> > > > > > >  void xe_device_l2_flush(struct xe_device *xe)
> > > > > > >  {
> > > > > > >  	struct xe_gt *gt;
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > 
> > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > > >  void xe_device_td_flush(struct xe_device *xe);
> > > > > > >  void xe_device_l2_flush(struct xe_device *xe);
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > > >  	XE_WARN_ON(err <= 0);
> > > > > > > 
> > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > > +	    userptr->initial_bind) {
> > > > > 
> > > > > Same concern with the LR preempt fence as above — the hardware
> > > > > will
> > > > > be interrupted via preempt fences, so it doesn’t seem necessary
> > > > > to
> > > > > invalidate the TLBs but perhaps we need a cflush and TLB
> > > > > invalidation is the mechanism for that too?
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > >  		err = xe_vm_invalidate_vma(vma);
> > > > > > >  		XE_WARN_ON(err);
> > > > > > >  	}
> > > > > > > --
> > > > > > > 2.52.0
> > > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Matt Roper
> > > > > > Graphics Software Engineer
> > > > > > Linux GPU Platform Enablement
> > > > > > Intel Corporation
> > > 
> > > --
> > > Matt Roper
> > > Graphics Software Engineer
> > > Linux GPU Platform Enablement
> > > Intel Corporation

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-11 21:11         ` Matt Roper
  2026-02-12  9:53           ` Matthew Auld
  2026-02-13 16:23           ` Upadhyay, Tejas
@ 2026-02-13 17:29           ` Matthew Auld
  2 siblings, 0 replies; 39+ messages in thread
From: Matthew Auld @ 2026-02-13 17:29 UTC (permalink / raw)
  To: Matt Roper, Upadhyay, Tejas
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org,
	thomas.hellstrom@linux.intel.com

On 11/02/2026 21:11, Matt Roper wrote:
> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
>>
>>
>>> -----Original Message-----
>>> From: Brost, Matthew <matthew.brost@intel.com>
>>> Sent: 11 February 2026 05:32
>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>>> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
>>> thomas.hellstrom@linux.intel.com
>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>> cachelines manually
>>>
>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
>>>>> "eXtended Architecture" (XA) tagged memory—memory shared between
>>>>> the CPU and GPU
>>>>
>>>> I'm pretty sure this expansion of "XA" is wrong; where are you seeing
>>>> this definition?  Everything in the bspec indicates that XA means "wb
>>>> - transient app" (similar to how "XD" is 'wb - transient display").
>>>> I'm not sure why exactly they picked "X" to refer to transient in both
>>>> of these cases, but I've never seen any documentation that refers to
>>>> it as "extended."
>>>>
>>>>> is treated differently from other GPU memory when the Media engine is
>>>>> power-gated.
>>>>>
>>>>> XA is *always* flushed, like at the end-of-submission (and maybe
>>>>> other
>>>>
>>>> I assume you're referring to the fact that the driver performs flushes
>>>> at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
>>>> depending on other state/optimizations in the system, those flushes
>>>> may flush the entire device cache, or may only flush the subset of
>>>> cache data that is not marked as transient.  The way you worded this
>>>> was confusing since it makes it sound like cache flushes happen
>>>> automatically somewhere in hardware/firmware.
>>>>
>>>>> places), just that internally as an optimisation hw doesn't need to
>>>>> make that a full flush (which will also include XA) when Media is
>>>>> off/powergated, since it doesn't need to worry about GT caches vs
>>>>> Media coherency, and only CPU vs GPU coherency, so can make that
>>>>> flush a targeted XA flush, since stuff tagged with XA now means it's
>>>>> shared with the CPU. The main implication is that we now need to
>>>>> somehow flush non-XA before freeing system memory pages, otherwise
>>>>> dirty cachelines could be flushed after the free (like if Media
>>>>> suddenly turns on and does a full flush)
>>>>
>>>> This description seems really confusing.  My understanding is that
>>>> marking something as wb-transient-app indicates that it might be
>>>> accessed by something other than our graphics/media IP (i.e., accessed
>>>> from the CPU, exported to another device, etc.), so transient data
>>>> truly does need to be flushed at the points in the driver where a
>>>> flush typically happens.
>>>>
>>>> However when something is _not_ transient, then either:
>>>>   - it's "private" to the GPU and only our graphics/media IP will be
>>>>     accessing it
>>>>   - it's bound with a coherent PAT index so that outside observers like
>>>>     the CPU can snoop the device cache, even when the cache hasn't been
>>>>     flushed
>>>>
>>>> If media is not active, then there's really no need to include
>>>> non-transient data when an device cache flush happens since there's no
>>>> real need for the data to get to RAM.  So that enables an optimization
>>>> (which comes in your next patch), that allows flushes to only operate
>>>> on the subset of the device cache tagged as "transient" if media is idle.
>>
>> But what if we have stale non-XA marked pages for userptr, and that
>> object moves out and at the same time media comes back? That will end
>> up in a full flush and flush the stale entries to RAM.
> 
> What makes userptr special here?  During general, active usage, userptr
> would be data that's accessible by the CPU, so it needs to either be
> transient (so CPU can see the data in RAM after explicit flushes) or it
> needs to be using a coherent PAT (so that the CPU can just snoop the GPU
> cache).  If you marked userptr as both non-XA and non-coherent, then
> that sounds likely to be a userspace bug (and probably something we can
> catch and reject as an invalid case on any Xe3p or later platforms that
> support this) since the CPU wouldn't have any reliable way of seeing GPU
> updates.

Yes, we have always rejected coh_none + userptr if someone tries that 
with vm_bind.

My understanding is that you can now no longer use vanilla 1WAY and 
expect it to always be flushed from the GPU caches when the fence has 
signalled, i.e. at the end of submission, like when media is off. So you 
have to only allow 2WAY or XA for userptr, or you need a manual sync 
flush here, for security, before the pages potentially vanish.

> 
> If something happens that changes the GTT mapping of an object, then
> doesn't that already trigger a TLB invalidation when necessary in the
> driver today?  It was my understanding that "heavy" TLB invalidations

Right, but the driver is clever and in some cases postpones that TLB 
flush (with the PPC bit set) to the next re-bind, so when you next call 
the exec() ioctl, for example. I think the idea is that you then only 
need one flush instead of two. But that is way too late for this, as we 
need to do the flush before the memory is potentially freed. Otherwise 
the memory can be freed, and when media next wakes up or you trigger a 
full GPU flush on the next exec(), you potentially corrupt the next user 
of that memory.

For the userptr case, this patch was just trying to make that TLB flush 
(with the PPC bit) happen immediately, before we dma-unmap the physical 
pages and before the memory is potentially freed.


> wait for data values to be globally observable before starting, so I
> think that would ensure that any non-XA data makes it to RAM before any
> binding changes, object, destruction, etc.?  Is there something special
> about userptr that makes that case more of a problem?
> 
> I just found bspec page 74635 which gives an overview of the various
> flush and invalidate cases, and I don't see anything there that makes it
> obvious to me that userptr would be special.
> 
> 
>>
>>>>
>>>> As you said, we eventually do want to force a flush of the
>>>> non-transient data as well once we're freeing the underlying pages.
>>>> So how do we do that?  It's not clear to me how the changes below are
>>>> accomplishing that.  Is there a way to explicitly request a full
>>>> device cache flush (ignoring the transient vs non-transient tagging)?
>>>> Since the GuC handles the optimization in the next patch (toggling
>>>> whether flushes are full flushes vs non-transient flushes depending on
>>>> whether media is active), I thought there might be some kind of GuC
>>>> interface to request "please do one full flush now, even if media is idle."
>>>>
>>>
>>> I’m not an expert here by any means, but everything above from Matt seems
>>> like valid concerns. Thomas also raised some concerns in the two previous
>>> revisions; again I’m not an expert, but reading through those, it doesn’t really
>>> seem like he received proper answers to his questions.
>>
>> It's forcing a flush via the TLB invalidation PPC flag under xe_invalidate_vma().
> 
> By the way, what is "PPC?"  It seems like it's another new synonym for
> the device cache?  It's already really confusing that some of our
> hardware docs use a mix of both "L2" and "L3" to refer to the same
> device cache for historical reasons...
> 
> 
> Matt
> 
>>
>>>
>>> A couple of comments below.
>>>
>>>>
>>>> Matt
>>>>
>>>>>
>>>>> V2(MattA): Expand commit description
>>>>>
>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>   drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>>   drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>   drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>   4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>
>>>>>   		if (!xe_vm_in_fault_mode(vm)) {
>>>>>   			drm_gpuvm_bo_evict(vm_bo, true);
>>>>> -			continue;
>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>> +				continue;
>>>
>>> This will trigger a TLB invalidation (and I assume a cache flush) every time we
>>> move or free memory in the 3D stack if it has a binding. It also performs a
>>> synchronous wait on the BO being idle. Both of these are very expensive
>>> operations. I can’t imagine the granularity we want here is to do this on every
>>> move/free with bindings.
>>>
>>> Also, for LR compute with preempt fences, we would trigger the preempt
>>> fences during the wait, so a TLB invalidation after this seems unnecessary,
>>> though perhaps the cache flush is still required?
>>>
>>> I think this needs a bit more explanation, because without knowing a lot about
>>> the exact requirements, the implementation does not look correct.
>>
>> The thing is that we are trying to solve a problem with userptr bound with a non-XA PAT: consider a BO that gets moved while media is not active. As soon as media comes back active, stale cached entries of that object will be flushed as part of the full flush, which may corrupt things.
>> The thinking was that with this patch we would at least solve the corruption problem, and later the page reclamation feature would help performance as well. But now that the page reclamation feature has been merged earlier and is tightly coupled with bind/unbind, some cases like the one discussed above (which do not unbind immediately on move/free) are missed by reclamation.
>>
>> So the thought was to let this solution go in with a small perf hit and work with the page reclamation owner on a cleaner solution together.
>>
>> Tejas
>>>
>>>>>   		}
>>>>>
>>>>>   		if (!idle) {
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>   	}
>>>>>   }
>>>>>
>>>>> +/**
>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>> + * @xe: The device to check.
>>>>> + *
>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>> + */
>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>> +{
>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and maybe
>>>>> +	 * other places), just that internally as an optimisation hw doesn't
>>>>> +	 * need to make that a full flush (which will also include XA) when
>>>>> +	 * Media is off/powergated, since it doesn't need to worry about GT
>>>>> +	 * caches vs Media coherency, and only CPU vs GPU coherency, so can
>>>>> +	 * make that flush a targeted XA flush, since stuff tagged with XA
>>>>> +	 * now means it's shared with the CPU. The main implication is that
>>>>> +	 * we now need to somehow flush non-XA before freeing system memory
>>>>> +	 * pages, otherwise dirty cachelines could be flushed after the free
>>>>> +	 * (like if Media suddenly turns on and does a full flush)
>>>>> +	 */
>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>> +		return true;
>>>>> +	return false;
>>>>> +}
>>>>> +
>>>>>   void xe_device_l2_flush(struct xe_device *xe)
>>>>>   {
>>>>>   	struct xe_gt *gt;
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>> index 39464650533b..baf386e0e037 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>
>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>   void xe_device_td_flush(struct xe_device *xe);
>>>>>   void xe_device_l2_flush(struct xe_device *xe);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>   				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>   	XE_WARN_ON(err <= 0);
>>>>>
>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
>>>>> +	    userptr->initial_bind) {
>>>
>>> Same concern with the LR preempt fence as above — the hardware will be
>>> interrupted via preempt fences, so it doesn’t seem necessary to invalidate the
>>> TLBs but perhaps we need a cflush and TLB invalidation is the mechanism for
>>> that too?
>>>
>>> Matt
>>>
>>>>>   		err = xe_vm_invalidate_vma(vma);
>>>>>   		XE_WARN_ON(err);
>>>>>   	}
>>>>> --
>>>>> 2.52.0
>>>>>
>>>>
>>>> --
>>>> Matt Roper
>>>> Graphics Software Engineer
>>>> Linux GPU Platform Enablement
>>>> Intel Corporation
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 17:16               ` Matt Roper
@ 2026-02-13 17:31                 ` Souza, Jose
  2026-02-13 17:31                 ` Matthew Auld
  1 sibling, 0 replies; 39+ messages in thread
From: Souza, Jose @ 2026-02-13 17:31 UTC (permalink / raw)
  To: Roper, Matthew D
  Cc: intel-xe@lists.freedesktop.org, Upadhyay, Tejas, Brost, Matthew,
	Auld, Matthew, Mrozek,  Michal, thomas.hellstrom@linux.intel.com

On Fri, 2026-02-13 at 09:16 -0800, Matt Roper wrote:
> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
> > On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > Sent: 12 February 2026 02:41
> > > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > <matthew.auld@intel.com>;
> > > > thomas.hellstrom@linux.intel.com
> > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > userptr/shrinker bo
> > > > cachelines manually
> > > > 
> > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
> > > > wrote:
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > Sent: 11 February 2026 05:32
> > > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > <matthew.auld@intel.com>;
> > > > > > thomas.hellstrom@linux.intel.com
> > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > userptr/shrinker bo
> > > > > > cachelines manually
> > > > > > 
> > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
> > > > > > > wrote:
> > > > > > > > "eXtended Architecture" (XA) tagged memory—memory
> > > > > > > > shared
> > > > between
> > > > > > the
> > > > > > > > CPU and GPU
> > > > > > > 
> > > > > > > I'm pretty sure this expansion of "XA" is wrong; where
> > > > > > > are
> > > > > > > you
> > > > > > > seeing this definition?  Everything in the bspec
> > > > > > > indicates
> > > > > > > that XA
> > > > > > > means "wb
> > > > > > > - transient app" (similar to how "XD" is 'wb - transient
> > > > > > > display").
> > > > > > > I'm not sure why exactly they picked "X" to refer to
> > > > > > > transient in
> > > > > > > both of these cases, but I've never seen any
> > > > > > > documentation
> > > > > > > that
> > > > > > > refers to it as "extended."
> > > > > > > 
> > > > > > > > is treated differently from other GPU memory when the
> > > > > > > > Media
> > > > > > > > engine is
> > > > > > power-gated.
> > > > > > > > 
> > > > > > > > XA is *always* flushed, like at the end-of-submssion
> > > > > > > > (and
> > > > > > > > maybe
> > > > > > > > other
> > > > > > > 
> > > > > > > I assume you're referring to the fact that the driver
> > > > > > > performs
> > > > > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > > > > MI_FLUSH_DW), and that depending on other
> > > > > > > state/optimizations
> > > > > > > in
> > > > > > > the system, those flushes may flush the entire device
> > > > > > > cache,
> > > > > > > or
> > > > > > > may only flush the subset of cache data that is not
> > > > > > > marked as
> > > > > > > transient.  The way you worded this was confusing since
> > > > > > > it
> > > > > > > makes
> > > > > > > it sound like cache flushes happen automatically
> > > > > > > somewhere in
> > > > hardware/firmware.
> > > > > > > 
> > > > > > > > places), just that internally as an optimisation hw
> > > > > > > > doesn't
> > > > > > > > need
> > > > > > > > to make that a full flush (which will also include XA)
> > > > > > > > when
> > > > > > > > Media is off/powergated, since it doesn't need to worry
> > > > > > > > about GT
> > > > > > > > caches vs Media coherency, and only CPU vs GPU
> > > > > > > > coherency,
> > > > > > > > so can
> > > > > > > > make that flush a targeted XA flush, since stuff tagged
> > > > > > > > with XA
> > > > > > > > now means it's shared with the CPU. The main
> > > > > > > > implication is
> > > > > > > > that
> > > > > > > > we now need to somehow flush non-XA before freeing
> > > > > > > > system
> > > > > > > > memory
> > > > > > > > pages, otherwise dirty cachelines could be flushed
> > > > > > > > after
> > > > > > > > the
> > > > > > > > free (like if Media suddenly turns on and does a full
> > > > > > > > flush)
> > > > > > > 
> > > > > > > This description seems really confusing.  My
> > > > > > > understanding is
> > > > > > > that
> > > > > > > marking something as wb-transient-app indicates that it
> > > > > > > might
> > > > > > > be
> > > > > > > accessed by something other than our graphics/media IP
> > > > > > > (i.e.,
> > > > > > > accessed from the CPU, exported to another device, etc.),
> > > > > > > so
> > > > > > > transient data truly does need to be flushed at the
> > > > > > > points in
> > > > > > > the
> > > > > > > driver where a flush typically happens.
> > > > > > > 
> > > > > > > However when something is _not_ transient, then either:
> > > > > > >  - it's "private" to the GPU and only our graphics/media
> > > > > > > IP
> > > > > > > will be
> > > > > > >    accessing it
> > > > > > >  - it's bound with a coherent PAT index so that outside
> > > > > > > observers like
> > > > > > >    the CPU can snoop the device cache, even when the
> > > > > > > cache
> > > > > > > hasn't been
> > > > > > >    flushed
> > > > > > > 
> > > > > > > If media is not active, then there's really no need to
> > > > > > > include
> > > > > > > non-transient data when a device cache flush happens
> > > > > > > since
> > > > > > > there's no real need for the data to get to RAM.  So that
> > > > > > > enables
> > > > > > > an optimization (which comes in your next patch), that
> > > > > > > allows
> > > > > > > flushes to only operate on the subset of the device cache
> > > > > > > tagged as
> > > > "transient" if media is idle.
> > > > > 
> > > > > But what if we have stale non-XA marked pages for userptr, and
> > > > > that object moves out and at the same time media comes back? We
> > > > > will end up with a full flush that flushes the stale entries to
> > > > > RAM.
> > > > 
> > > > What makes userptr special here?  During general, active usage,
> > > > userptr would
> > > > be data that's accessible by the CPU, so it needs to either be
> > > > transient (so CPU
> > > > can see the data in RAM after explicit flushes) or it needs to
> > > > be
> > > > using a
> > > > coherent PAT (so that the CPU can just snoop the GPU cache). 
> > > > If
> > > > you marked
> > > > userptr as both non-XA and non-coherent, then that sounds
> > > > likely to
> > > > be a
> > > > userspace bug (and probably something we can catch and reject
> > > > as an
> > > > invalid
> > > > case on any Xe3p or later platforms that support this) since
> > > > the
> > > > CPU wouldn't
> > > > have any reliable way of seeing GPU updates.
> > > 
> > > Right. FYI @Mrozek, Michal @Souza, Jose
> > > For userptr, as explained above, it needs to be bound with either a
> > > coherent or an XA PAT index, or else the KMD will reject it as an
> > > invalid case. 
> > > 
> > > 
> > The coherency restriction is already in the uAPI:
> > 
> > "Note: For userptr and externally imported dma-buf the kernel
> > expects
> > either 1WAY or 2WAY for the @pat_index."
> > 
> > Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing GPU
> > caches
> > at the end of batch buffers.
> 
> But isn't that what we're discussing here?  1-way *won't* necessarily be
> enough anymore because PIPE_CONTROL instructions don't flush the entire
> cache anymore.  Whenever the GuC determines that media is inactive and
> activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. change
> behavior to only flush out the subset of data that was marked as
> app-transient; anything not marked that way doesn't get flushed now.  So
> there's a new requirement here that you ensure you're using an XA PAT
> index, or you switch to use 2-way coherency which will allow the CPU to
> snoop the GPU's caches.

Ah okay.
Mesa is using PAT 19 in Gfx 35, that is 1-way coherency + Transient.

> 
> 
> Matt
> 
> > 
> > > 
> > > > 
> > > > If something happens that changes the GTT mapping of an object,
> > > > then
> > > > doesn't that already trigger a TLB invalidation when necessary
> > > > in
> > > > the driver
> > > > today?  It was my understanding that "heavy" TLB invalidations
> > > > wait
> > > > for data
> > > > values to be globally observable before starting, so I think
> > > > that
> > > > would ensure
> > > > that any non-XA data makes it to RAM before any binding
> > > > changes,
> > > > object,
> > > > destruction, etc.?  Is there something special about userptr
> > > > that
> > > > makes that
> > > > case more of a problem?
> > > > 
> > > > I just found bspec page 74635 which gives an overview of the
> > > > various flush
> > > > and invalidate cases, and I don't see anything there that makes
> > > > it
> > > > obvious to
> > > > me that userptr would be special.
> > > > 
> > > > 
> > > > > 
> > > > > > > 
> > > > > > > As you said, we eventually do want to force a flush of
> > > > > > > the
> > > > > > > non-transient data as well once we're freeing the
> > > > > > > underlying
> > > > > > > pages.
> > > > > > > So how do we do that?  It's not clear to me how the
> > > > > > > changes
> > > > > > > below
> > > > > > > are accomplishing that.  Is there a way to explicitly
> > > > > > > request
> > > > > > > a
> > > > > > > full device cache flush (ignoring the transient vs non-
> > > > > > > transient tagging)?
> > > > > > > Since the GuC handles the optimization in the next patch
> > > > > > > (toggling
> > > > > > > whether flushes are full flushes vs non-transient flushes
> > > > > > > depending on whether media is active), I thought there
> > > > > > > might
> > > > > > > be
> > > > > > > some kind of GuC interface to request "please do one full
> > > > > > > flush now, even
> > > > if media is idle."
> > > > > > > 
> > > > > > 
> > > > > > I’m not an expert here by any means, but everything above
> > > > > > from
> > > > > > Matt
> > > > > > seems like valid concerns. Thomas also raised some concerns
> > > > > > in
> > > > > > the
> > > > > > two previous revisions; again I’m not an expert, but
> > > > > > reading
> > > > > > through
> > > > > > those, it doesn’t really seem like he received proper
> > > > > > answers
> > > > > > to his
> > > > questions.
> > > > > 
> > > > > It's forcing a flush via the TLB invalidation PPC flag under
> > > > > xe_invalidate_vma().
> > > > 
> > > > By the way, what is "PPC?"  It seems like it's another new
> > > > synonym
> > > > for the
> > > > device cache?  It's already really confusing that some of our
> > > > hardware docs use
> > > > a mix of both "L2" and "L3" to refer to the same device cache
> > > > for
> > > > historical
> > > > reasons...
> > > > 
> > > > 
> > > > Matt
> > > > 
> > > > > 
> > > > > > 
> > > > > > A couple of comments below.
> > > > > > 
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > 
> > > > > > > > V2(MattA): Expand commit description
> > > > > > > > 
> > > > > > > > Signed-off-by: Tejas Upadhyay
> > > > > > > > <tejas.upadhyay@intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > > > >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > > > 
> > > > > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > > -			continue;
> > > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > > +				continue;
> > > 
> > > Matt R,
> > > This flush will still be needed, as there can be non-XA buffers
> > > which can be evicted while media is off, and stale entries can be
> > > flushed when media comes back on. This was not the case earlier, as a
> > > full flush was happening at regular sync points, and that is where
> > > this feature brings the optimization now.
> > > 
> > > Tejas
> > > 
> > > > > > 
> > > > > > This will trigger a TLB invalidation (and I assume a cache
> > > > > > flush)
> > > > > > every time we move or free memory in the 3D stack if it has
> > > > > > a
> > > > > > binding. It also performs a synchronous wait on the BO
> > > > > > being
> > > > > > idle.
> > > > > > Both of these are very expensive operations. I can’t
> > > > > > imagine
> > > > > > the
> > > > > > granularity we want here is to do this on every move/free
> > > > > > with
> > > > > > bindings.
> > > > > > 
> > > > > > Also, for LR compute with preempt fences, we would trigger
> > > > > > the
> > > > > > preempt fences during the wait, so a TLB invalidation after
> > > > > > this
> > > > > > seems unnecessary, though perhaps the cache flush is still
> > > > > > required?
> > > > > > 
> > > > > > I think this needs a bit more explanation, because without
> > > > > > knowing a lot about the exact requirements, the implementation
> > > > > > does not look correct.
> > > > > 
> > > > > The thing is that we are trying to solve a problem with userptr
> > > > > bound with a non-XA PAT: consider a BO that gets moved while media
> > > > > is not active. As soon as media comes back active, stale cached
> > > > > entries of that object will be flushed as part of the full flush,
> > > > > which may corrupt things.
> > > > > The thinking was that with this patch we would at least solve the
> > > > > corruption problem, and later the page reclamation feature would
> > > > > help performance as well. But now that the page reclamation
> > > > > feature has been merged earlier and is tightly coupled with
> > > > > bind/unbind, some cases like the one discussed above (which do not
> > > > > unbind immediately on move/free) are missed by reclamation.
> > > > > 
> > > > > So the thought was to let this solution go in with a small perf
> > > > > hit and work with the page reclamation owner on a cleaner solution
> > > > > together.
> > > > > 
> > > > > Tejas
> > > > > > 
> > > > > > > >  		}
> > > > > > > > 
> > > > > > > >  		if (!idle) {
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > > >  	}
> > > > > > > >  }
> > > > > > > > 
> > > > > > > > +/**
> > > > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > > > + * @xe: The device to check.
> > > > > > > > + *
> > > > > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > > > > + */
> > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > > > +{
> > > > > > > > +	/* XA is *always* flushed, like at the end-of-submission (and maybe
> > > > > > > > +	 * other places), just that internally as an optimisation hw doesn't
> > > > > > > > +	 * need to make that a full flush (which will also include XA) when
> > > > > > > > +	 * Media is off/powergated, since it doesn't need to worry about GT
> > > > > > > > +	 * caches vs Media coherency, and only CPU vs GPU coherency, so can
> > > > > > > > +	 * make that flush a targeted XA flush, since stuff tagged with XA
> > > > > > > > +	 * now means it's shared with the CPU. The main implication is that
> > > > > > > > +	 * we now need to somehow flush non-XA before freeing system memory
> > > > > > > > +	 * pages, otherwise dirty cachelines could be flushed after the free
> > > > > > > > +	 * (like if Media suddenly turns on and does a full flush)
> > > > > > > > +	 */
> > > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > > +		return true;
> > > > > > > > +	return false;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  void xe_device_l2_flush(struct xe_device *xe)
> > > > > > > >  {
> > > > > > > >  	struct xe_gt *gt;
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > 
> > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > > > >  void xe_device_td_flush(struct xe_device *xe);
> > > > > > > >  void xe_device_l2_flush(struct xe_device *xe);
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > > > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > > > >  	XE_WARN_ON(err <= 0);
> > > > > > > > 
> > > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > > > +	    userptr->initial_bind) {
> > > > > > 
> > > > > > Same concern with the LR preempt fence as above — the
> > > > > > hardware
> > > > > > will
> > > > > > be interrupted via preempt fences, so it doesn’t seem
> > > > > > necessary
> > > > > > to
> > > > > > invalidate the TLBs but perhaps we need a cflush and TLB
> > > > > > invalidation is the mechanism for that too?
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > >  		err = xe_vm_invalidate_vma(vma);
> > > > > > > >  		XE_WARN_ON(err);
> > > > > > > >  	}
> > > > > > > > --
> > > > > > > > 2.52.0
> > > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Matt Roper
> > > > > > > Graphics Software Engineer
> > > > > > > Linux GPU Platform Enablement
> > > > > > > Intel Corporation
> > > > 
> > > > --
> > > > Matt Roper
> > > > Graphics Software Engineer
> > > > Linux GPU Platform Enablement
> > > > Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 17:16               ` Matt Roper
  2026-02-13 17:31                 ` Souza, Jose
@ 2026-02-13 17:31                 ` Matthew Auld
  2026-02-16 10:23                   ` Thomas Hellström
  1 sibling, 1 reply; 39+ messages in thread
From: Matthew Auld @ 2026-02-13 17:31 UTC (permalink / raw)
  To: Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew, thomas.hellstrom@linux.intel.com

On 13/02/2026 17:16, Matt Roper wrote:
> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Roper, Matthew D <matthew.d.roper@intel.com>
>>>> Sent: 12 February 2026 02:41
>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
>>>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
>>>> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
>>>> thomas.hellstrom@linux.intel.com
>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>> cachelines manually
>>>>
>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>>>> Sent: 11 February 2026 05:32
>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>> <matthew.auld@intel.com>;
>>>>>> thomas.hellstrom@linux.intel.com
>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>> userptr/shrinker bo
>>>>>> cachelines manually
>>>>>>
>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
>>>>>>> wrote:
>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory shared
>>>> between
>>>>>> the
>>>>>>>> CPU and GPU
>>>>>>>
>>>>>>> I'm pretty sure this expansion of "XA" is wrong; where are
>>>>>>> you
>>>>>>> seeing this definition?  Everything in the bspec indicates
>>>>>>> that XA
>>>>>>> means "wb
>>>>>>> - transient app" (similar to how "XD" is 'wb - transient
>>>>>>> display").
>>>>>>> I'm not sure why exactly they picked "X" to refer to
>>>>>>> transient in
>>>>>>> both of these cases, but I've never seen any documentation
>>>>>>> that
>>>>>>> refers to it as "extended."
>>>>>>>
>>>>>>>> is treated differently from other GPU memory when the Media
>>>>>>>> engine is
>>>>>> power-gated.
>>>>>>>>
>>>>>>>> XA is *always* flushed, like at the end-of-submission (and
>>>>>>>> maybe other
>>>>>>>
>>>>>>> I assume you're referring to the fact that the driver
>>>>>>> performs
>>>>>>> flushes at the end of submission (via PIPE_CONTROL or
>>>>>>> MI_FLUSH_DW), and that depending on other state/optimizations
>>>>>>> in
>>>>>>> the system, those flushes may flush the entire device cache,
>>>>>>> or
>>>>>>> may only flush the subset of cache data that is not marked as
>>>>>>> transient.  The way you worded this was confusing since it
>>>>>>> makes
>>>>>>> it sound like cache flushes happen automatically somewhere in
>>>> hardware/firmware.
>>>>>>>
>>>>>>>> places), just that internally as an optimisation hw doesn't
>>>>>>>> need
>>>>>>>> to make that a full flush (which will also include XA) when
>>>>>>>> Media is off/powergated, since it doesn't need to worry
>>>>>>>> about GT
>>>>>>>> caches vs Media coherency, and only CPU vs GPU coherency,
>>>>>>>> so can
>>>>>>>> make that flush a targeted XA flush, since stuff tagged
>>>>>>>> with XA
>>>>>>>> now means it's shared with the CPU. The main implication is
>>>>>>>> that
>>>>>>>> we now need to somehow flush non-XA before freeing system
>>>>>>>> memory
>>>>>>>> pages, otherwise dirty cachelines could be flushed after
>>>>>>>> the
>>>>>>>> free (like if Media suddenly turns on and does a full
>>>>>>>> flush)
>>>>>>>
>>>>>>> This description seems really confusing.  My understanding is
>>>>>>> that
>>>>>>> marking something as wb-transient-app indicates that it might
>>>>>>> be
>>>>>>> accessed by something other than our graphics/media IP (i.e.,
>>>>>>> accessed from the CPU, exported to another device, etc.), so
>>>>>>> transient data truly does need to be flushed at the points in
>>>>>>> the
>>>>>>> driver where a flush typically happens.
>>>>>>>
>>>>>>> However when something is _not_ transient, then either:
>>>>>>>   - it's "private" to the GPU and only our graphics/media IP
>>>>>>> will be
>>>>>>>     accessing it
>>>>>>>   - it's bound with a coherent PAT index so that outside
>>>>>>> observers like
>>>>>>>     the CPU can snoop the device cache, even when the cache
>>>>>>> hasn't been
>>>>>>>     flushed
>>>>>>>
>>>>>>> If media is not active, then there's really no need to
>>>>>>> include
>>>>>>> non-transient data when a device cache flush happens since
>>>>>>> there's no real need for the data to get to RAM.  So that
>>>>>>> enables
>>>>>>> an optimization (which comes in your next patch), that allows
>>>>>>> flushes to only operate on the subset of the device cache
>>>>>>> tagged as
>>>> "transient" if media is idle.
>>>>>
>>>>> But what if we have stale non-XA marked pages for userptr, and
>>>>> that
>>>>> object moves out and at the same time media comes back, will end
>>>>> up in
>>>>> full flush and flush the stale entry to RAM.
>>>>
>>>> What makes userptr special here?  During general, active usage,
>>>> userptr would
>>>> be data that's accessible by the CPU, so it needs to either be
>>>> transient (so CPU
>>>> can see the data in RAM after explicit flushes) or it needs to be
>>>> using a
>>>> coherent PAT (so that the CPU can just snoop the GPU cache).  If
>>>> you marked
>>>> userptr as both non-XA and non-coherent, then that sounds likely to
>>>> be a
>>>> userspace bug (and probably something we can catch and reject as an
>>>> invalid
>>>> case on any Xe3p or later platforms that support this) since the
>>>> CPU wouldn't
>>>> have any reliable way of seeing GPU updates.
>>>
>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>> For userptr, as explained above, it needs to be either coherent or XA
>>> pat index, or else KMD will reject as invalid case.
>>>
>>>
>> The coherency restriction is already in the uAPI:
>>
>> "Note: For userptr and externally imported dma-buf the kernel expects
>> either 1WAY or 2WAY for the @pat_index."
>>
>> Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing GPU caches
>> at the end of batch buffers.
> 
> But isn't that what we're discussing here?  1-way *won't* necessarily be
> enough anymore because PIPE_CONTROL instructions don't flush the entire
> cache anymore.  Whenever the GuC determines that media is inactive and
> activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. change
> behavior to only flush out the subset of data that was marked as
> app-transient; anything not marked that way doesn't get flushed now.  So
> there's a new requirement here that you ensure you're using an XA PAT
> index, or you switch to use 2-way coherency which will allow the CPU to
> snoop the GPU's caches.

That exactly matches my understanding also.

> 
> 
> Matt
> 
>>
>>>
>>>>
>>>> If something happens that changes the GTT mapping of an object,
>>>> then
>>>> doesn't that already trigger a TLB invalidation when necessary in
>>>> the driver
>>>> today?  It was my understanding that "heavy" TLB invalidations wait
>>>> for data
>>>> values to be globally observable before starting, so I think that
>>>> would ensure
>>>> that any non-XA data makes it to RAM before any binding changes,
>>>> object,
>>>> destruction, etc.?  Is there something special about userptr that
>>>> makes that
>>>> case more of a problem?
>>>>
>>>> I just found bspec page 74635 which gives an overview of the
>>>> various flush
>>>> and invalidate cases, and I don't see anything there that makes it
>>>> obvious to
>>>> me that userptr would be special.
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> As you said, we eventually do want to force a flush of the
>>>>>>> non-transient data as well once we're freeing the underlying
>>>>>>> pages.
>>>>>>> So how do we do that?  It's not clear to me how the changes
>>>>>>> below
>>>>>>> are accomplishing that.  Is there a way to explicitly request
>>>>>>> a
>>>>>>> full device cache flush (ignoring the transient vs non-
>>>>>>> transient tagging)?
>>>>>>> Since the GuC handles the optimization in the next patch
>>>>>>> (toggling
>>>>>>> whether flushes are full flushes vs non-transient flushes
>>>>>>> depending on whether media is active), I thought there might
>>>>>>> be
>>>>>>> some kind of GuC interface to request "please do one full
>>>>>>> flush now, even
>>>> if media is idle."
>>>>>>>
>>>>>>
>>>>>> I’m not an expert here by any means, but everything above from
>>>>>> Matt
>>>>>> seems like valid concerns. Thomas also raised some concerns in
>>>>>> the
>>>>>> two previous revisions; again I’m not an expert, but reading
>>>>>> through
>>>>>> those, it doesn’t really seem like he received proper answers
>>>>>> to his
>>>> questions.
>>>>>
>>>>> It's forcing a flush via the TLB invalidation PPC flag under
>>>>> xe_invalidate_vma().
>>>>
>>>> By the way, what is "PPC?"  It seems like it's another new synonym
>>>> for the
>>>> device cache?  It's already really confusing that some of our
>>>> hardware docs use
>>>> a mix of both "L2" and "L3" to refer to the same device cache for
>>>> historical
>>>> reasons...
>>>>
>>>>
>>>> Matt
>>>>
>>>>>
>>>>>>
>>>>>> A couple of comments below.
>>>>>>
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>>
>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>
>>>>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>   drivers/gpu/drm/xe/xe_device.c  | 23
>>>>>>>> +++++++++++++++++++++++
>>>>>>>> drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>   4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>>>>
>>>>>>>>   		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>   			drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>>> -			continue;
>>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>>>>> +				continue;
>>>
>>> Matt R,
>>> This flush will be still needed as there can be non-xa buffers which
>>> can be evicted while media was off and stale entries can be flushed
>>> when media comes back on. Which was not case earlier as full flush
>>> was happening at regular sync points and that’s where this feature is
>>> bringing optimization now.
>>>
>>> Tejas
>>>
>>>>>>
>>>>>> This will trigger a TLB invalidation (and I assume a cache
>>>>>> flush)
>>>>>> every time we move or free memory in the 3D stack if it has a
>>>>>> binding. It also performs a synchronous wait on the BO being
>>>>>> idle.
>>>>>> Both of these are very expensive operations. I can’t imagine
>>>>>> the
>>>>>> granularity we want here is to do this on every move/free with
>>>>>> bindings.
>>>>>>
>>>>>> Also, for LR compute with preempt fences, we would trigger the
>>>>>> preempt fences during the wait, so a TLB invalidation after
>>>>>> this
>>>>>> seems unnecessary, though perhaps the cache flush is still
>>>>>> required?
>>>>>>
>>>>>> I think this needs a bit more explanation, because without
>>>>>> knowing a
>>>>>> lot about the exact requirements, the implementation does not
>>>>>> look
>>>> correct.
>>>>>
>>>>> The thing is that we are trying to solve a problem with userptr
>>>>> with non-XA
>>>> pat, consider if that BO got moved while media is not active. As
>>>> soon as media
>>>> will come back active, stale cached entries of that object will be
>>>> flushed as part
>>>> of full flush, which may corrupt things.
>>>>> There was thinking that with this patch we would at least solve
>>>>> the problem
>>>> of corruption and later when page_reclamation feature comes in will
>>>> help in
>>>> performance as well. But now when page reclamation feature is
>>>> merged earlier
>>>> and it tightly coupled with bind/unbind some cases like discussed
>>>> above
>>>> (which are not doing unbind immediately on move/free) are missed in
>>>> reclamation.
>>>>>
>>>>> So thought was to let this solution go in with little perf hit
>>>>> and discuss with
>>>> page reclamation owner to come with cleaner solution together.
>>>>>
>>>>> Tejas
>>>>>>
>>>>>>>>   		}
>>>>>>>>
>>>>>>>>   		if (!idle) {
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>>>>   	}
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>>>>> + * @xe: The device to check.
>>>>>>>> + *
>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>>> + */
>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>>>>> +{
>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and maybe
>>>>>>>> +	 * other places), just that internally as an optimisation hw doesn't
>>>>>>>> +	 * need to make that a full flush (which will also include XA) when
>>>>>>>> +	 * Media is off/powergated, since it doesn't need to worry about GT
>>>>>>>> +	 * caches vs Media coherency, and only CPU vs GPU coherency, so can
>>>>>>>> +	 * make that flush a targeted XA flush, since stuff tagged with XA
>>>>>>>> +	 * now means it's shared with the CPU. The main implication is that
>>>>>>>> +	 * we now need to somehow flush non-XA before freeing system memory
>>>>>>>> +	 * pages, otherwise dirty cachelines could be flushed after the free
>>>>>>>> +	 * (like if Media suddenly turns on and does a full flush)
>>>>>>>> +	 */
>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>>> +		return true;
>>>>>>>> +	return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>   void xe_device_l2_flush(struct xe_device *xe)  {
>>>>>>>>   	struct xe_gt *gt;
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>> index 39464650533b..baf386e0e037 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>>>>   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>
>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>>   void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>   void xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>>>>   				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>   	XE_WARN_ON(err <= 0);
>>>>>>>>
>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
>>>>>>>> +	    userptr->initial_bind) {
>>>>>>
>>>>>> Same concern with the LR preempt fence as above — the hardware
>>>>>> will
>>>>>> be interrupted via preempt fences, so it doesn’t seem necessary
>>>>>> to
>>>>>> invalidate the TLBs but perhaps we need a cflush and TLB
>>>>>> invalidation is the mechanism for that too?
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>>>   		err = xe_vm_invalidate_vma(vma);
>>>>>>>>   		XE_WARN_ON(err);
>>>>>>>>   	}
>>>>>>>> --
>>>>>>>> 2.52.0
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Matt Roper
>>>>>>> Graphics Software Engineer
>>>>>>> Linux GPU Platform Enablement
>>>>>>> Intel Corporation
>>>>
>>>> --
>>>> Matt Roper
>>>> Graphics Software Engineer
>>>> Linux GPU Platform Enablement
>>>> Intel Corporation
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 17:31                 ` Matthew Auld
@ 2026-02-16 10:23                   ` Thomas Hellström
  2026-02-16 10:58                     ` Matthew Auld
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Hellström @ 2026-02-16 10:23 UTC (permalink / raw)
  To: Matthew Auld, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
> On 13/02/2026 17:16, Matt Roper wrote:
> > On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
> > > On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > Sent: 12 February 2026 02:41
> > > > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > <matthew.auld@intel.com>;
> > > > > thomas.hellstrom@linux.intel.com
> > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > userptr/shrinker bo
> > > > > cachelines manually
> > > > > 
> > > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > Sent: 11 February 2026 05:32
> > > > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > > <matthew.auld@intel.com>;
> > > > > > > thomas.hellstrom@linux.intel.com
> > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > > userptr/shrinker bo
> > > > > > > cachelines manually
> > > > > > > 
> > > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper
> > > > > > > wrote:
> > > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas
> > > > > > > > Upadhyay
> > > > > > > > wrote:
> > > > > > > > > "eXtended Architecture" (XA) tagged memory—memory
> > > > > > > > > shared
> > > > > between
> > > > > > > the
> > > > > > > > > CPU and GPU
> > > > > > > > 
> > > > > > > > I'm pretty sure this expansion of "XA" is wrong; where
> > > > > > > > are
> > > > > > > > you
> > > > > > > > seeing this definition?  Everything in the bspec
> > > > > > > > indicates
> > > > > > > > that XA
> > > > > > > > means "wb
> > > > > > > > - transient app" (similar to how "XD" is 'wb -
> > > > > > > > transient
> > > > > > > > display").
> > > > > > > > I'm not sure why exactly they picked "X" to refer to
> > > > > > > > transient in
> > > > > > > > both of these cases, but I've never seen any
> > > > > > > > documentation
> > > > > > > > that
> > > > > > > > refers to it as "extended."
> > > > > > > > 
> > > > > > > > > is treated differently from other GPU memory when the
> > > > > > > > > Media
> > > > > > > > > engine is
> > > > > > > power-gated.
> > > > > > > > > 
> > > > > > > > > > XA is *always* flushed, like at the end-of-submission
> > > > > > > > > (and
> > > > > > > > > maybe
> > > > > > > > > other
> > > > > > > > 
> > > > > > > > I assume you're referring to the fact that the driver
> > > > > > > > performs
> > > > > > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > > > > > MI_FLUSH_DW), and that depending on other
> > > > > > > > state/optimizations
> > > > > > > > in
> > > > > > > > the system, those flushes may flush the entire device
> > > > > > > > cache,
> > > > > > > > or
> > > > > > > > may only flush the subset of cache data that is not
> > > > > > > > marked as
> > > > > > > > transient.  The way you worded this was confusing since
> > > > > > > > it
> > > > > > > > makes
> > > > > > > > it sound like cache flushes happen automatically
> > > > > > > > somewhere in
> > > > > hardware/firmware.
> > > > > > > > 
> > > > > > > > > places), just that internally as an optimisation hw
> > > > > > > > > doesn't
> > > > > > > > > need
> > > > > > > > > to make that a full flush (which will also include
> > > > > > > > > XA) when
> > > > > > > > > Media is off/powergated, since it doesn't need to
> > > > > > > > > worry
> > > > > > > > > about GT
> > > > > > > > > caches vs Media coherency, and only CPU vs GPU
> > > > > > > > > coherency,
> > > > > > > > > so can
> > > > > > > > > make that flush a targeted XA flush, since stuff
> > > > > > > > > tagged
> > > > > > > > > with XA
> > > > > > > > > now means it's shared with the CPU. The main
> > > > > > > > > implication is
> > > > > > > > > that
> > > > > > > > > we now need to somehow flush non-XA before freeing
> > > > > > > > > system
> > > > > > > > > memory
> > > > > > > > > pages, otherwise dirty cachelines could be flushed
> > > > > > > > > after
> > > > > > > > > the
> > > > > > > > > free (like if Media suddenly turns on and does a full
> > > > > > > > > flush)
> > > > > > > > 
> > > > > > > > This description seems really confusing.  My
> > > > > > > > understanding is
> > > > > > > > that
> > > > > > > > marking something as wb-transient-app indicates that it
> > > > > > > > might
> > > > > > > > be
> > > > > > > > accessed by something other than our graphics/media IP
> > > > > > > > (i.e.,
> > > > > > > > accessed from the CPU, exported to another device,
> > > > > > > > etc.), so
> > > > > > > > transient data truly does need to be flushed at the
> > > > > > > > points in
> > > > > > > > the
> > > > > > > > driver where a flush typically happens.
> > > > > > > > 
> > > > > > > > However when something is _not_ transient, then either:
> > > > > > > >   - it's "private" to the GPU and only our
> > > > > > > > graphics/media IP
> > > > > > > > will be
> > > > > > > >     accessing it
> > > > > > > >   - it's bound with a coherent PAT index so that
> > > > > > > > outside
> > > > > > > > observers like
> > > > > > > >     the CPU can snoop the device cache, even when the
> > > > > > > > cache
> > > > > > > > hasn't been
> > > > > > > >     flushed
> > > > > > > > 
> > > > > > > > If media is not active, then there's really no need to
> > > > > > > > include
> > > > > > > > non-transient data when a device cache flush happens
> > > > > > > > since
> > > > > > > > there's no real need for the data to get to RAM.  So
> > > > > > > > that
> > > > > > > > enables
> > > > > > > > an optimization (which comes in your next patch), that
> > > > > > > > allows
> > > > > > > > flushes to only operate on the subset of the device
> > > > > > > > cache
> > > > > > > > tagged as
> > > > > "transient" if media is idle.
> > > > > > 
> > > > > > But what if we have stale non-XA marked pages for userptr,
> > > > > > and
> > > > > > that
> > > > > > object moves out and at the same time media comes back,
> > > > > > will end
> > > > > > up in
> > > > > > full flush and flush the stale entry to RAM.
> > > > > 
> > > > > What makes userptr special here?  During general, active
> > > > > usage,
> > > > > userptr would
> > > > > be data that's accessible by the CPU, so it needs to either
> > > > > be
> > > > > transient (so CPU
> > > > > can see the data in RAM after explicit flushes) or it needs
> > > > > to be
> > > > > using a
> > > > > coherent PAT (so that the CPU can just snoop the GPU cache). 
> > > > > If
> > > > > you marked
> > > > > userptr as both non-XA and non-coherent, then that sounds
> > > > > likely to
> > > > > be a
> > > > > userspace bug (and probably something we can catch and reject
> > > > > as an
> > > > > invalid
> > > > > case on any Xe3p or later platforms that support this) since
> > > > > the
> > > > > CPU wouldn't
> > > > > have any reliable way of seeing GPU updates.
> > > > 
> > > > Right. FYI @Mrozek, Michal @Souza, Jose
> > > > For userptr, as explained above, it needs to be either coherent
> > > > or XA
> > > > pat index, or else KMD will reject as invalid case.
> > > > 
> > > > 
> > > The coherency restriction is already in the uAPI:
> > > 
> > > "Note: For userptr and externally imported dma-buf the kernel
> > > expects
> > > either 1WAY or 2WAY for the @pat_index."
> > > 
> > > Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing GPU
> > > caches
> > > at the end of batch buffers.
> > 
> > But isn't that what we're discussing here?  1-way *won't*
> > necessarily be
> > enough anymore because PIPE_CONTROL instructions don't flush the
> > entire
> > cache anymore.  Whenever the GuC determines that media is inactive
> > and
> > activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. change
> > behavior to only flush out the subset of data that was marked as
> > app-transient; anything not marked that way doesn't get flushed
> > now.  So
> > there's a new requirement here that you ensure you're using an XA
> > PAT
> > index, or you switch to use 2-way coherency which will allow the
> > CPU to
> > snoop the GPU's caches.
> 
> That exactly matches my understanding also.

This only ever affects IGFX, right? Since AFAIU we don't have 2-way
coherency with DGFX?

It sounds like the same PAT restriction is needed also for imported
dma-buf, right?

/Thomas


> 
> > 
> > 
> > Matt
> > 
> > > 
> > > > 
> > > > > 
> > > > > If something happens that changes the GTT mapping of an
> > > > > object,
> > > > > then
> > > > > doesn't that already trigger a TLB invalidation when
> > > > > necessary in
> > > > > the driver
> > > > > today?  It was my understanding that "heavy" TLB
> > > > > invalidations wait
> > > > > for data
> > > > > values to be globally observable before starting, so I think
> > > > > that
> > > > > would ensure
> > > > > that any non-XA data makes it to RAM before any binding
> > > > > changes,
> > > > > object,
> > > > > destruction, etc.?  Is there something special about userptr
> > > > > that
> > > > > makes that
> > > > > case more of a problem?
> > > > > 
> > > > > I just found bspec page 74635 which gives an overview of the
> > > > > various flush
> > > > > and invalidate cases, and I don't see anything there that
> > > > > makes it
> > > > > obvious to
> > > > > me that userptr would be special.
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > > 
> > > > > > > > As you said, we eventually do want to force a flush of
> > > > > > > > the
> > > > > > > > non-transient data as well once we're freeing the
> > > > > > > > underlying
> > > > > > > > pages.
> > > > > > > > So how do we do that?  It's not clear to me how the
> > > > > > > > changes
> > > > > > > > below
> > > > > > > > are accomplishing that.  Is there a way to explicitly
> > > > > > > > request
> > > > > > > > a
> > > > > > > > full device cache flush (ignoring the transient vs non-
> > > > > > > > transient tagging)?
> > > > > > > > Since the GuC handles the optimization in the next
> > > > > > > > patch
> > > > > > > > (toggling
> > > > > > > > whether flushes are full flushes vs non-transient
> > > > > > > > flushes
> > > > > > > > depending on whether media is active), I thought there
> > > > > > > > might
> > > > > > > > be
> > > > > > > > some kind of GuC interface to request "please do one
> > > > > > > > full
> > > > > > > > flush now, even
> > > > > if media is idle."
> > > > > > > > 
> > > > > > > 
> > > > > > > I’m not an expert here by any means, but everything above
> > > > > > > from
> > > > > > > Matt
> > > > > > > seems like valid concerns. Thomas also raised some
> > > > > > > concerns in
> > > > > > > the
> > > > > > > two previous revisions; again I’m not an expert, but
> > > > > > > reading
> > > > > > > through
> > > > > > > those, it doesn’t really seem like he received proper
> > > > > > > answers
> > > > > > > to his
> > > > > questions.
> > > > > > 
> > > > > > It's forcing a flush via the TLB invalidation PPC flag under
> > > > > > xe_invalidate_vma().
> > > > > 
> > > > > By the way, what is "PPC?"  It seems like it's another new
> > > > > synonym
> > > > > for the
> > > > > device cache?  It's already really confusing that some of our
> > > > > hardware docs use
> > > > > a mix of both "L2" and "L3" to refer to the same device cache
> > > > > for
> > > > > historical
> > > > > reasons...
> > > > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > A couple of comments below.
> > > > > > > 
> > > > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > V2(MattA): Expand commit description
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Tejas Upadhyay
> > > > > > > > > <tejas.upadhyay@intel.com>
> > > > > > > > > ---
> > > > > > > > >   drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > > > >   drivers/gpu/drm/xe/xe_device.c  | 23
> > > > > > > > > +++++++++++++++++++++++
> > > > > > > > > drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > > > drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > > > >   4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > > > > > 
> > > > > > > > > >   		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > > > > >   			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > > > > -			continue;
> > > > > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > > > > +				continue;
> > > > 
> > > > Matt R,
> > > > This flush will be still needed as there can be non-xa buffers
> > > > which
> > > > can be evicted while media was off and stale entries can be
> > > > flushed
> > > > when media comes back on. Which was not case earlier as full
> > > > flush
> > > > was happening at regular sync points and that’s where this
> > > > feature is
> > > > bringing optimization now.
> > > > 
> > > > Tejas
> > > > 
> > > > > > > 
> > > > > > > This will trigger a TLB invalidation (and I assume a
> > > > > > > cache
> > > > > > > flush)
> > > > > > > every time we move or free memory in the 3D stack if it
> > > > > > > has a
> > > > > > > binding. It also performs a synchronous wait on the BO
> > > > > > > being
> > > > > > > idle.
> > > > > > > Both of these are very expensive operations. I can’t
> > > > > > > imagine
> > > > > > > the
> > > > > > > granularity we want here is to do this on every move/free
> > > > > > > with
> > > > > > > bindings.
> > > > > > > 
> > > > > > > Also, for LR compute with preempt fences, we would
> > > > > > > trigger the
> > > > > > > preempt fences during the wait, so a TLB invalidation
> > > > > > > after
> > > > > > > this
> > > > > > > seems unnecessary, though perhaps the cache flush is
> > > > > > > still
> > > > > > > required?
> > > > > > > 
> > > > > > > I think this needs a bit more explanation, because
> > > > > > > without
> > > > > > > knowing a
> > > > > > > lot about the exact requirements, the implementation does
> > > > > > > not
> > > > > > > look
> > > > > correct.
> > > > > > 
> > > > > > The thing is that we are trying to solve a problem with
> > > > > > userptr
> > > > > > with non-XA
> > > > > pat, consider if that BO got moved while media is not active.
> > > > > As
> > > > > soon as media
> > > > > will come back active, stale cached entries of that object
> > > > > will be
> > > > > flushed as part
> > > > > of full flush, which may corrupt things.
> > > > > > There was thinking that with this patch we would at least
> > > > > > solve
> > > > > > the problem
> > > > > of corruption and later when page_reclamation feature comes
> > > > > in will
> > > > > help in
> > > > > performance as well. But now when page reclamation feature is
> > > > > merged earlier
> > > > > and it tightly coupled with bind/unbind some cases like
> > > > > discussed
> > > > > above
> > > > > (which are not doing unbind immediately on move/free) are
> > > > > missed in
> > > > > reclamation.
> > > > > > 
> > > > > > So thought was to let this solution go in with little perf
> > > > > > hit
> > > > > > and discuss with
> > > > > page reclamation owner to come with cleaner solution
> > > > > together.
> > > > > > 
> > > > > > Tejas
> > > > > > > 
> > > > > > > > >   		}
> > > > > > > > > 
> > > > > > > > >   		if (!idle) {
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > > > > >   	}
> > > > > > > > > >   }
> > > > > > > > > 
> > > > > > > > > +/**
> > > > > > > > > + * xe_device_needs_cache_flush - Whether the cache
> > > > > > > > > needs
> > > > > > > > > to be
> > > > > > > > > +flushed
> > > > > > > > > + * @xe: The device to check.
> > > > > > > > > + *
> > > > > > > > > + * Return: true if the device needs cache flush,
> > > > > > > > > false
> > > > > > > > > otherwise.
> > > > > > > > > + */
> > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device
> > > > > > > > > *xe) {
> > > > > > > > > +	/* XA is *always* flushed, like at the end-
> > > > > > > > > of-
> > > > > > > > > submssion (and
> > > > > > > > > +maybe
> > > > > > > other
> > > > > > > > > +	 * places), just that internally as an
> > > > > > > > > optimisation hw doesn't
> > > > > > > > > +need to
> > > > > > > make
> > > > > > > > > +	 * that a full flush (which will also
> > > > > > > > > include XA)
> > > > > > > > > when Media is
> > > > > > > > > +	 * off/powergated, since it doesn't need to
> > > > > > > > > worry
> > > > > > > > > about GT
> > > > > > > > > +caches vs
> > > > > > > Media
> > > > > > > > > +	 * coherency, and only CPU vs GPU coherency,
> > > > > > > > > so
> > > > > > > > > can make
> > > > > that
> > > > > > > > > +flush
> > > > > > > a
> > > > > > > > > +	 * targeted XA flush, since stuff tagged
> > > > > > > > > with XA
> > > > > > > > > now means
> > > > > > > > > +it's shared
> > > > > > > with
> > > > > > > > > +	 * the CPU. The main implication is that we
> > > > > > > > > now
> > > > > > > > > need to
> > > > > > > > > +somehow
> > > > > > > flush non-XA before
> > > > > > > > > +	 * freeing system memory pages, otherwise
> > > > > > > > > dirty
> > > > > > > > > cachelines
> > > > > > > > > +could be
> > > > > > > flushed after the free
> > > > > > > > > +	 * (like if Media suddenly turns on and does
> > > > > > > > > a
> > > > > > > > > full flush)
> > > > > > > > > +	 */
> > > > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > > > +		return true;
> > > > > > > > > +	return false;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > >   void xe_device_l2_flush(struct xe_device *xe)
> > > > > > > > > >   {
> > > > > > > > > >   	struct xe_gt *gt;
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > > > > > >   u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > >   u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > > 
> > > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > > > > > >   void xe_device_td_flush(struct xe_device *xe);
> > > > > > > > > >   void xe_device_l2_flush(struct xe_device *xe);
> > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > > > > > >   				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > > > > > >   	XE_WARN_ON(err <= 0);
> > > > > > > > > > 
> > > > > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > > > > > +	    userptr->initial_bind) {
> > > > > > > 
> > > > > > > Same concern with the LR preempt fence as above — the
> > > > > > > hardware
> > > > > > > will
> > > > > > > be interrupted via preempt fences, so it doesn’t seem
> > > > > > > necessary
> > > > > > > to
> > > > > > > invalidate the TLBs but perhaps we need a cflush and TLB
> > > > > > > invalidation is the mechanism for that too?
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > >   		err = xe_vm_invalidate_vma(vma);
> > > > > > > > >   		XE_WARN_ON(err);
> > > > > > > > >   	}
> > > > > > > > > --
> > > > > > > > > 2.52.0
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Matt Roper
> > > > > > > > Graphics Software Engineer
> > > > > > > > Linux GPU Platform Enablement
> > > > > > > > Intel Corporation
> > > > > 
> > > > > --
> > > > > Matt Roper
> > > > > Graphics Software Engineer
> > > > > Linux GPU Platform Enablement
> > > > > Intel Corporation
> > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-13 16:23           ` Upadhyay, Tejas
  2026-02-13 16:48             ` Souza, Jose
@ 2026-02-16 10:56             ` Thomas Hellström
  2026-02-16 11:26               ` Upadhyay, Tejas
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Hellström @ 2026-02-16 10:56 UTC (permalink / raw)
  To: Upadhyay, Tejas, Roper, Matthew D, Mrozek, Michal, Souza, Jose
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org, Auld, Matthew

On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> 
> 
> > -----Original Message-----
> > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > Sent: 12 February 2026 02:41
> > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> > thomas.hellstrom@linux.intel.com
> > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > cachelines manually
> > 
> > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > Sent: 11 February 2026 05:32
> > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > <matthew.auld@intel.com>;
> > > > thomas.hellstrom@linux.intel.com
> > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > userptr/shrinker bo
> > > > cachelines manually
> > > > 
> > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
> > > > > wrote:
> > > > > > "eXtended Architecture" (XA) tagged memory—memory shared
> > between
> > > > the
> > > > > > CPU and GPU
> > > > > 
> > > > > I'm pretty sure this expansion of "XA" is wrong; where are
> > > > > you
> > > > > seeing this definition?  Everything in the bspec indicates
> > > > > that XA
> > > > > means "wb
> > > > > - transient app" (similar to how "XD" is 'wb - transient
> > > > > display").
> > > > > I'm not sure why exactly they picked "X" to refer to
> > > > > transient in
> > > > > both of these cases, but I've never seen any documentation
> > > > > that
> > > > > refers to it as "extended."
> > > > > 
> > > > > > is treated differently from other GPU memory when the Media
> > > > > > engine is
> > > > power-gated.
> > > > > > 
> > > > > > XA is *always* flushed, like at the end-of-submssion (and
> > > > > > maybe
> > > > > > other
> > > > > 
> > > > > I assume you're referring to the fact that the driver
> > > > > performs
> > > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > > MI_FLUSH_DW), and that depending on other state/optimizations
> > > > > in
> > > > > the system, those flushes may flush the entire device cache,
> > > > > or
> > > > > may only flush the subset of cache data that is not marked as
> > > > > transient.  The way you worded this was confusing since it
> > > > > makes
> > > > > it sound like cache flushes happen automatically somewhere in
> > hardware/firmware.
> > > > > 
> > > > > > places), just that internally as an optimisation hw doesn't
> > > > > > need
> > > > > > to make that a full flush (which will also include XA) when
> > > > > > Media is off/powergated, since it doesn't need to worry
> > > > > > about GT
> > > > > > caches vs Media coherency, and only CPU vs GPU coherency,
> > > > > > so can
> > > > > > make that flush a targeted XA flush, since stuff tagged
> > > > > > with XA
> > > > > > now means it's shared with the CPU. The main implication is
> > > > > > that
> > > > > > we now need to somehow flush non-XA before freeing system
> > > > > > memory
> > > > > > pages, otherwise dirty cachelines could be flushed after
> > > > > > the
> > > > > > free (like if Media suddenly turns on and does a full
> > > > > > flush)
> > > > > 
> > > > > This description seems really confusing.  My understanding is
> > > > > that
> > > > > marking something as wb-transient-app indicates that it might
> > > > > be
> > > > > accessed by something other than our graphics/media IP (i.e.,
> > > > > accessed from the CPU, exported to another device, etc.), so
> > > > > transient data truly does need to be flushed at the points in
> > > > > the
> > > > > driver where a flush typically happens.
> > > > > 
> > > > > However when something is _not_ transient, then either:
> > > > >  - it's "private" to the GPU and only our graphics/media IP
> > > > > will be
> > > > >    accessing it
> > > > >  - it's bound with a coherent PAT index so that outside
> > > > > observers like
> > > > >    the CPU can snoop the device cache, even when the cache
> > > > > hasn't been
> > > > >    flushed
> > > > > 
> > > > > If media is not active, then there's really no need to
> > > > > include
> > > > > non-transient data when an device cache flush happens since
> > > > > there's no real need for the data to get to RAM.  So that
> > > > > enables
> > > > > an optimization (which comes in your next patch), that allows
> > > > > flushes to only operate on the subset of the device cache
> > > > > tagged as
> > "transient" if media is idle.
> > > 
> > > But what if we have stale non-XA marked pages for a userptr, and
> > > that object moves out while, at the same time, media comes back?
> > > We will end up in a full flush that flushes the stale entries to
> > > RAM.
> > 
> > What makes userptr special here?  During general, active usage,
> > userptr would
> > be data that's accessible by the CPU, so it needs to either be
> > transient (so CPU
> > can see the data in RAM after explicit flushes) or it needs to be
> > using a
> > coherent PAT (so that the CPU can just snoop the GPU cache).  If
> > you marked
> > userptr as both non-XA and non-coherent, then that sounds likely to
> > be a
> > userspace bug (and probably something we can catch and reject as an
> > invalid
> > case on any Xe3p or later platforms that support this) since the
> > CPU wouldn't
> > have any reliable way of seeing GPU updates.
> 
> Right. FYI @Mrozek, Michal @Souza, Jose
> For userptr, as explained above, it needs to use either a coherent or
> an XA pat index, or else the KMD will reject it as an invalid case.
> 
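The rejection rule being agreed on above (a userptr bind whose PAT index is neither coherent nor transient/XA has no reliable CPU view, so the KMD should refuse it) can be sketched as a tiny userspace toy model. All names below (`pat_attrs`, `validate_userptr_pat`) are invented for illustration; this is not the actual Xe uAPI validation code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical attributes derived from a PAT index. */
struct pat_attrs {
	bool coherent;  /* 1-way/2-way: CPU can snoop the device cache */
	bool transient; /* "XA" (wb - transient app): always flushed */
};

/*
 * Toy model of the proposed check: with the L2 flush optimization,
 * non-transient data may never reach RAM while media is powergated,
 * so a non-coherent, non-XA userptr binding leaves the CPU with no
 * reliable way of seeing GPU writes.
 */
static int validate_userptr_pat(const struct pat_attrs *a)
{
	if (!a->coherent && !a->transient)
		return -EINVAL; /* reject as an invalid case */
	return 0;
}
```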
> > 
> > If something happens that changes the GTT mapping of an object,
> > then
> > doesn't that already trigger a TLB invalidation when necessary in
> > the driver
> > today?  It was my understanding that "heavy" TLB invalidations wait
> > for data
> > values to be globally observable before starting, so I think that
> > would ensure
> > that any non-XA data makes it to RAM before any binding changes,
> > object,
> > destruction, etc.?  Is there something special about userptr that
> > makes that
> > case more of a problem?
> > 
> > I just found bspec page 74635 which gives an overview of the
> > various flush
> > and invalidate cases, and I don't see anything there that makes it
> > obvious to
> > me that userptr would be special.
> > 
> > 
> > > 
> > > > > 
> > > > > As you said, we eventually do want to force a flush of the
> > > > > non-transient data as well once we're freeing the underlying
> > > > > pages.
> > > > > So how do we do that?  It's not clear to me how the changes
> > > > > below
> > > > > are accomplishing that.  Is there a way to explicitly request
> > > > > a
> > > > > full device cache flush (ignoring the transient vs non-
> > > > > transient tagging)?
> > > > > Since the GuC handles the optimization in the next patch
> > > > > (toggling
> > > > > whether flushes are full flushes vs non-transient flushes
> > > > > depending on whether media is active), I thought there might
> > > > > be
> > > > > some kind of GuC interface to request "please do one full
> > > > > flush now, even
> > if media is idle."
> > > > > 
> > > > 
> > > > I’m not an expert here by any means, but everything above from
> > > > Matt
> > > > seems like valid concerns. Thomas also raised some concerns in
> > > > the
> > > > two previous revisions; again I’m not an expert, but reading
> > > > through
> > > > those, it doesn’t really seem like he received proper answers
> > > > to his
> > questions.
> > > 
> > > It's forcing the flush via the TLB invalidation PPC flag under
> > > xe_invalidate_vma().
> > 
> > By the way, what is "PPC?"  It seems like it's another new synonym
> > for the
> > device cache?  It's already really confusing that some of our
> > hardware docs use
> > a mix of both "L2" and "L3" to refer to the same device cache for
> > historical
> > reasons...
> > 
> > 
> > Matt
> > 
> > > 
> > > > 
> > > > A couple of comments below.
> > > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > 
> > > > > > V2(MattA): Expand commit description
> > > > > > 
> > > > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > 
> > > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > -			continue;
> > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > +				continue;
> 
> Matt R,
> This flush is still needed, since non-XA buffers can be evicted while
> media is off and their stale cachelines can then be flushed out when
> media comes back on. That was not the case earlier, when a full flush
> happened at the regular sync points, which is exactly where this
> feature brings its optimization now.
> 
> Tejas

This flush would effectively eliminate any asynchronous migration. So
again, if this is only affecting IGFX we're synchronizing on shrinking
anyway, but in general I think we should avoid synchronizing here.

Could this instead be done asynchronously similar to how we flush TLB?
so it would be part of the migration fence?
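As a rough illustration of that idea, a userspace toy model of deferring the flush until the migration fence signals could look like the following. Nothing here is real driver code: `migration_fence`, `fence_add_flush` and `flush_cb` are invented for the sketch, and the real thing would presumably ride on the same dma-fence machinery the TLB-invalidation path already uses:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a migration fence with a completion callback. */
struct migration_fence {
	bool signaled;
	void (*cb)(struct migration_fence *f);
	int flushed; /* counts device-cache flushes performed */
};

/* Runs when the fence signals, not inline in the eviction path. */
static void flush_cb(struct migration_fence *f)
{
	f->flushed++; /* stands in for the non-XA cacheline flush */
}

/* Attach the flush instead of waiting for idle synchronously. */
static void fence_add_flush(struct migration_fence *f)
{
	f->cb = flush_cb;
}

static void fence_signal(struct migration_fence *f)
{
	if (f->cb)
		f->cb(f); /* flush happens as part of the fence */
	f->signaled = true;
}
```

The point of the sketch is only that attaching the flush to the fence keeps the eviction path itself non-blocking.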

/Thomas

> 
> > > > 
> > > > This will trigger a TLB invalidation (and I assume a cache
> > > > flush)
> > > > every time we move or free memory in the 3D stack if it has a
> > > > binding. It also performs a synchronous wait on the BO being
> > > > idle.
> > > > Both of these are very expensive operations. I can’t imagine
> > > > the
> > > > granularity we want here is to do this on every move/free with
> > > > bindings.
> > > > 
> > > > Also, for LR compute with preempt fences, we would trigger the
> > > > preempt fences during the wait, so a TLB invalidation after
> > > > this
> > > > seems unnecessary, though perhaps the cache flush is still
> > > > required?
> > > > 
> > > > I think this needs a bit more explanation, because without
> > > > knowing a
> > > > lot about the exact requirements, the implementation does not
> > > > look
> > correct.
> > > 
> > > The thing is that we are trying to solve a problem with userptr with a
> > > non-XA pat: consider if that BO got moved while media is not active. As
> > > soon as media comes back active, stale cached entries of that object
> > > will be flushed as part of a full flush, which may corrupt things.
> > > There was a thinking that with this patch we would at least solve the
> > > problem of corruption, and later, when the page_reclamation feature
> > > comes in, it will help with performance as well. But now the page
> > > reclamation feature is merged earlier and is tightly coupled with
> > > bind/unbind, so some cases like those discussed above (which do not
> > > unbind immediately on move/free) are missed in reclamation.
> > > 
> > > So the thought was to let this solution go in with a little perf hit
> > > and discuss with the page reclamation owner to come up with a cleaner
> > > solution together.
> > > 
> > > Tejas
> > > > 
> > > > > >  		}
> > > > > > 
> > > > > >  		if (!idle) {
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > >  	}
> > > > > >  }
> > > > > > 
> > > > > > +/**
> > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > + * @xe: The device to check.
> > > > > > + *
> > > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > > + */
> > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > +{
> > > > > > +	/* XA is *always* flushed, like at the end-of-submssion (and maybe other
> > > > > > +	 * places), just that internally as an optimisation hw doesn't need to make
> > > > > > +	 * that a full flush (which will also include XA) when Media is
> > > > > > +	 * off/powergated, since it doesn't need to worry about GT caches vs Media
> > > > > > +	 * coherency, and only CPU vs GPU coherency, so can make that flush a
> > > > > > +	 * targeted XA flush, since stuff tagged with XA now means it's shared with
> > > > > > +	 * the CPU. The main implication is that we now need to somehow flush non-XA
> > > > > > +	 * before freeing system memory pages, otherwise dirty cachelines could be
> > > > > > +	 * flushed after the free (like if Media suddenly turns on and does a full
> > > > > > +	 * flush)
> > > > > > +	 */
> > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > +		return true;
> > > > > > +	return false;
> > > > > > +}
> > > > > > +
> > > > > >  void xe_device_l2_flush(struct xe_device *xe)
> > > > > >  {
> > > > > >  	struct xe_gt *gt;
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > 
> > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > >  void xe_device_td_flush(struct xe_device *xe);
> > > > > >  void xe_device_l2_flush(struct xe_device *xe);
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > >  	XE_WARN_ON(err <= 0);
> > > > > > 
> > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > +	    userptr->initial_bind) {
> > > > 
> > > > Same concern with the LR preempt fence as above — the hardware
> > > > will
> > > > be interrupted via preempt fences, so it doesn’t seem necessary
> > > > to
> > > > invalidate the TLBs but perhaps we need a cflush and TLB
> > > > invalidation is the mechanism for that too?
> > > > 
> > > > Matt
> > > > 
> > > > > >  		err = xe_vm_invalidate_vma(vma);
> > > > > >  		XE_WARN_ON(err);
> > > > > >  	}
> > > > > > --
> > > > > > 2.52.0
> > > > > > 
> > > > > 
> > > > > --
> > > > > Matt Roper
> > > > > Graphics Software Engineer
> > > > > Linux GPU Platform Enablement
> > > > > Intel Corporation
> > 
> > --
> > Matt Roper
> > Graphics Software Engineer
> > Linux GPU Platform Enablement
> > Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 10:23                   ` Thomas Hellström
@ 2026-02-16 10:58                     ` Matthew Auld
  2026-02-16 12:07                       ` Thomas Hellström
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Auld @ 2026-02-16 10:58 UTC (permalink / raw)
  To: Thomas Hellström, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On 16/02/2026 10:23, Thomas Hellström wrote:
> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
>> On 13/02/2026 17:16, Matt Roper wrote:
>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>> Sent: 12 February 2026 02:41
>>>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
>>>>>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>> <matthew.auld@intel.com>;
>>>>>> thomas.hellstrom@linux.intel.com
>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>> userptr/shrinker bo
>>>>>> cachelines manually
>>>>>>
>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>>>>>> Sent: 11 February 2026 05:32
>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>>>> <matthew.auld@intel.com>;
>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>>>> userptr/shrinker bo
>>>>>>>> cachelines manually
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper
>>>>>>>> wrote:
>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas
>>>>>>>>> Upadhyay
>>>>>>>>> wrote:
>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory
>>>>>>>>>> shared
>>>>>> between
>>>>>>>> the
>>>>>>>>>> CPU and GPU
>>>>>>>>>
>>>>>>>>> I'm pretty sure this expansion of "XA" is wrong; where
>>>>>>>>> are
>>>>>>>>> you
>>>>>>>>> seeing this definition?  Everything in the bspec
>>>>>>>>> indicates
>>>>>>>>> that XA
>>>>>>>>> means "wb
>>>>>>>>> - transient app" (similar to how "XD" is 'wb -
>>>>>>>>> transient
>>>>>>>>> display").
>>>>>>>>> I'm not sure why exactly they picked "X" to refer to
>>>>>>>>> transient in
>>>>>>>>> both of these cases, but I've never seen any
>>>>>>>>> documentation
>>>>>>>>> that
>>>>>>>>> refers to it as "extended."
>>>>>>>>>
>>>>>>>>>> is treated differently from other GPU memory when the
>>>>>>>>>> Media
>>>>>>>>>> engine is
>>>>>>>> power-gated.
>>>>>>>>>>
>>>>>>>>>> XA is *always* flushed, like at the end-of-submssion
>>>>>>>>>> (and
>>>>>>>>>> maybe
>>>>>>>>>> other
>>>>>>>>>
>>>>>>>>> I assume you're referring to the fact that the driver
>>>>>>>>> performs
>>>>>>>>> flushes at the end of submission (via PIPE_CONTROL or
>>>>>>>>> MI_FLUSH_DW), and that depending on other
>>>>>>>>> state/optimizations
>>>>>>>>> in
>>>>>>>>> the system, those flushes may flush the entire device
>>>>>>>>> cache,
>>>>>>>>> or
>>>>>>>>> may only flush the subset of cache data that is not
>>>>>>>>> marked as
>>>>>>>>> transient.  The way you worded this was confusing since
>>>>>>>>> it
>>>>>>>>> makes
>>>>>>>>> it sound like cache flushes happen automatically
>>>>>>>>> somewhere in
>>>>>> hardware/firmware.
>>>>>>>>>
>>>>>>>>>> places), just that internally as an optimisation hw
>>>>>>>>>> doesn't
>>>>>>>>>> need
>>>>>>>>>> to make that a full flush (which will also include
>>>>>>>>>> XA) when
>>>>>>>>>> Media is off/powergated, since it doesn't need to
>>>>>>>>>> worry
>>>>>>>>>> about GT
>>>>>>>>>> caches vs Media coherency, and only CPU vs GPU
>>>>>>>>>> coherency,
>>>>>>>>>> so can
>>>>>>>>>> make that flush a targeted XA flush, since stuff
>>>>>>>>>> tagged
>>>>>>>>>> with XA
>>>>>>>>>> now means it's shared with the CPU. The main
>>>>>>>>>> implication is
>>>>>>>>>> that
>>>>>>>>>> we now need to somehow flush non-XA before freeing
>>>>>>>>>> system
>>>>>>>>>> memory
>>>>>>>>>> pages, otherwise dirty cachelines could be flushed
>>>>>>>>>> after
>>>>>>>>>> the
>>>>>>>>>> free (like if Media suddenly turns on and does a full
>>>>>>>>>> flush)
>>>>>>>>>
>>>>>>>>> This description seems really confusing.  My
>>>>>>>>> understanding is
>>>>>>>>> that
>>>>>>>>> marking something as wb-transient-app indicates that it
>>>>>>>>> might
>>>>>>>>> be
>>>>>>>>> accessed by something other than our graphics/media IP
>>>>>>>>> (i.e.,
>>>>>>>>> accessed from the CPU, exported to another device,
>>>>>>>>> etc.), so
>>>>>>>>> transient data truly does need to be flushed at the
>>>>>>>>> points in
>>>>>>>>> the
>>>>>>>>> driver where a flush typically happens.
>>>>>>>>>
>>>>>>>>> However when something is _not_ transient, then either:
>>>>>>>>>    - it's "private" to the GPU and only our
>>>>>>>>> graphics/media IP
>>>>>>>>> will be
>>>>>>>>>      accessing it
>>>>>>>>>    - it's bound with a coherent PAT index so that
>>>>>>>>> outside
>>>>>>>>> observers like
>>>>>>>>>      the CPU can snoop the device cache, even when the
>>>>>>>>> cache
>>>>>>>>> hasn't been
>>>>>>>>>      flushed
>>>>>>>>>
>>>>>>>>> If media is not active, then there's really no need to
>>>>>>>>> include
>>>>>>>>> non-transient data when an device cache flush happens
>>>>>>>>> since
>>>>>>>>> there's no real need for the data to get to RAM.  So
>>>>>>>>> that
>>>>>>>>> enables
>>>>>>>>> an optimization (which comes in your next patch), that
>>>>>>>>> allows
>>>>>>>>> flushes to only operate on the subset of the device
>>>>>>>>> cache
>>>>>>>>> tagged as
>>>>>> "transient" if media is idle.
>>>>>>>
>>>>>>> But what If we have stale non-XA marked pages for userptr,
>>>>>>> and
>>>>>>> that
>>>>>>> object moves out and at the same time media comes back,
>>>>>>> will end
>>>>>>> up in
>>>>>>> full flush and flush the stale entry to RAM.
>>>>>>
>>>>>> What makes userptr special here?  During general, active
>>>>>> usage,
>>>>>> userptr would
>>>>>> be data that's accessible by the CPU, so it needs to either
>>>>>> be
>>>>>> transient (so CPU
>>>>>> can see the data in RAM after explicit flushes) or it needs
>>>>>> to be
>>>>>> using a
>>>>>> coherent PAT (so that the CPU can just snoop the GPU cache).
>>>>>> If
>>>>>> you marked
>>>>>> userptr as both non-XA and non-coherent, then that sounds
>>>>>> likely to
>>>>>> be a
>>>>>> userspace bug (and probably something we can catch and reject
>>>>>> as an
>>>>>> invalid
>>>>>> case on any Xe3p or later platforms that support this) since
>>>>>> the
>>>>>> CPU wouldn't
>>>>>> have any reliable way of seeing GPU updates.
>>>>>
>>>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>>>> For userptr, as explained above, it needs to be either coherent
>>>>> or XA
>>>>> pat index, or else KMD will reject as invalid case.
>>>>>
>>>>>
>>>> The coherency restriction is already in the uAPI:
>>>>
>>>> "Note: For userptr and externally imported dma-buf the kernel
>>>> expects
>>>> either 1WAY or 2WAY for the @pat_index."
>>>>
>>>> Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing GPU
>>>> caches
>>>> at the end of batch buffers.
>>>
>>> But isn't that what we're discussing here?  1-way *won't*
>>> necessarily be
>>> enough anymore because PIPE_CONTROL instructions don't flush the
>>> entire
>>> cache anymore.  Whenever the GuC determines that media is inactive
>>> and
>>> activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. change
>>> behavior to only flush out the subset of data that was marked as
>>> app-transient; anything not marked that way doesn't get flushed
>>> now.  So
>>> there's a new requirement here that you ensure you're using an XA
>>> PAT
>>> index, or you switch to use 2-way coherency which will allow the
>>> CPU to
>>> snoop the GPU's caches.
>>
>> That exactly matches my understanding also.
> 
> This only ever affects IGFX, right? Since AFAIU we don't have 2-way
> coherency with DGFX?

Yeah, this should be igpu only. I seem to also recall that on dgpu, 
Media is coherent with l2/l3, but also I don't think system memory can 
be cached in l2/l3 (only VRAM), which I assume is why there is the 
special SMRO (system-memory-read-only) cache only on dgpu, which is 
flushed when the fence signals, unlike the l2/l3.

> 
> It sounds like the same PAT restriction is needed also for imported
> dma-buf, right?

Good point. Looks like we are missing that still. Otherwise we can run 
into the same issues with stale l2/l3/ppc.

> 
> /Thomas
> 
> 
>>
>>>
>>>
>>> Matt
>>>
>>>>
>>>>>
>>>>>>
>>>>>> If something happens that changes the GTT mapping of an
>>>>>> object,
>>>>>> then
>>>>>> doesn't that already trigger a TLB invalidation when
>>>>>> necessary in
>>>>>> the driver
>>>>>> today?  It was my understanding that "heavy" TLB
>>>>>> invalidations wait
>>>>>> for data
>>>>>> values to be globally observable before starting, so I think
>>>>>> that
>>>>>> would ensure
>>>>>> that any non-XA data makes it to RAM before any binding
>>>>>> changes,
>>>>>> object,
>>>>>> destruction, etc.?  Is there something special about userptr
>>>>>> that
>>>>>> makes that
>>>>>> case more of a problem?
>>>>>>
>>>>>> I just found bspec page 74635 which gives an overview of the
>>>>>> various flush
>>>>>> and invalidate cases, and I don't see anything there that
>>>>>> makes it
>>>>>> obvious to
>>>>>> me that userptr would be special.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> As you said, we eventually do want to force a flush of
>>>>>>>>> the
>>>>>>>>> non-transient data as well once we're freeing the
>>>>>>>>> underlying
>>>>>>>>> pages.
>>>>>>>>> So how do we do that?  It's not clear to me how the
>>>>>>>>> changes
>>>>>>>>> below
>>>>>>>>> are accomplishing that.  Is there a way to explicitly
>>>>>>>>> request
>>>>>>>>> a
>>>>>>>>> full device cache flush (ignoring the transient vs non-
>>>>>>>>> transient tagging)?
>>>>>>>>> Since the GuC handles the optimization in the next
>>>>>>>>> patch
>>>>>>>>> (toggling
>>>>>>>>> whether flushes are full flushes vs non-transient
>>>>>>>>> flushes
>>>>>>>>> depending on whether media is active), I thought there
>>>>>>>>> might
>>>>>>>>> be
>>>>>>>>> some kind of GuC interface to request "please do one
>>>>>>>>> full
>>>>>>>>> flush now, even
>>>>>> if media is idle."
>>>>>>>>>
>>>>>>>>
>>>>>>>> I’m not an expert here by any means, but everything above from Matt
>>>>>>>> seems like valid concerns. Thomas also raised some concerns in the
>>>>>>>> two previous revisions; again I’m not an expert, but reading through
>>>>>>>> those, it doesn’t really seem like he received proper answers to his
>>>>>>>> questions.
>>>>>>>
>>>>>>> It's forcing the flush via the TLB invalidation PPC flag under
>>>>>>> xe_vm_invalidate_vma().
>>>>>>
>>>>>> By the way, what is "PPC?"  It seems like it's another new synonym
>>>>>> for the device cache?  It's already really confusing that some of our
>>>>>> hardware docs use a mix of both "L2" and "L3" to refer to the same
>>>>>> device cache for historical reasons...
>>>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> A couple of comments below.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Tejas Upadhyay
>>>>>>>>>> <tejas.upadhyay@intel.com>
>>>>>>>>>> ---
>>>>>>>>>>    drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>>>    drivers/gpu/drm/xe/xe_device.c  | 23
>>>>>>>>>> +++++++++++++++++++++++
>>>>>>>>>> drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>>>> drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>>>    4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>>>>>>
>>>>>>>>>>  		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>>>  			drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>>>>> -			continue;
>>>>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>>>>>>> +				continue;
>>>>>
>>>>> Matt R,
>>>>> This flush is still needed because non-XA buffers can be evicted
>>>>> while media was off, and their stale cachelines can then be flushed
>>>>> out when media comes back on. That was not the case earlier, since a
>>>>> full flush happened at regular sync points, and that’s where this
>>>>> feature brings the optimization now.
>>>>>
>>>>> Tejas
>>>>>
>>>>>>>>
>>>>>>>> This will trigger a TLB invalidation (and I assume a cache flush)
>>>>>>>> every time we move or free memory in the 3D stack if it has a
>>>>>>>> binding. It also performs a synchronous wait on the BO being idle.
>>>>>>>> Both of these are very expensive operations. I can’t imagine the
>>>>>>>> granularity we want here is to do this on every move/free with
>>>>>>>> bindings.
>>>>>>>>
>>>>>>>> Also, for LR compute with preempt fences, we would trigger the
>>>>>>>> preempt fences during the wait, so a TLB invalidation after this
>>>>>>>> seems unnecessary, though perhaps the cache flush is still required?
>>>>>>>>
>>>>>>>> I think this needs a bit more explanation, because without knowing a
>>>>>>>> lot about the exact requirements, the implementation does not look
>>>>>>>> correct.
>>>>>>>
>>>>>>> The thing is that we are trying to solve a problem with userptr
>>>>>>> using a non-XA PAT: consider that BO getting moved while media is
>>>>>>> not active. As soon as media comes back active, stale cached
>>>>>>> entries of that object will be flushed out as part of the full
>>>>>>> flush, which may corrupt things.
>>>>>>>
>>>>>>> The thinking was that with this patch we would at least solve the
>>>>>>> corruption problem, and later, when the page-reclamation feature
>>>>>>> came in, it would help with performance as well. But now that the
>>>>>>> page-reclamation feature has been merged earlier and is tightly
>>>>>>> coupled with bind/unbind, some cases like those discussed above
>>>>>>> (which do not unbind immediately on move/free) are missed in
>>>>>>> reclamation.
>>>>>>>
>>>>>>> So the thought was to let this solution go in with a small perf
>>>>>>> hit and discuss with the page-reclamation owner to come up with a
>>>>>>> cleaner solution together.
>>>>>>>
>>>>>>> Tejas
>>>>>>>>
>>>>>>>>>>    		}
>>>>>>>>>>
>>>>>>>>>>    		if (!idle) {
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>>>>>>  	}
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>>>>>>> + * @xe: The device to check.
>>>>>>>>>> + *
>>>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>>>>> + */
>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>>>>>>> +{
>>>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and maybe other
>>>>>>>>>> +	 * places), just that internally as an optimisation hw doesn't need to make
>>>>>>>>>> +	 * that a full flush (which will also include XA) when Media is
>>>>>>>>>> +	 * off/powergated, since it doesn't need to worry about GT caches vs Media
>>>>>>>>>> +	 * coherency, and only CPU vs GPU coherency, so can make that flush a
>>>>>>>>>> +	 * targeted XA flush, since stuff tagged with XA now means it's shared with
>>>>>>>>>> +	 * the CPU. The main implication is that we now need to somehow flush non-XA
>>>>>>>>>> +	 * before freeing system memory pages, otherwise dirty cachelines could be
>>>>>>>>>> +	 * flushed after the free (like if Media suddenly turns on and does a
>>>>>>>>>> +	 * full flush)
>>>>>>>>>> +	 */
>>>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>>>>> +		return true;
>>>>>>>>>> +	return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe)
>>>>>>>>>>  {
>>>>>>>>>>  	struct xe_gt *gt;
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>> index 39464650533b..baf386e0e037 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>>>>>>  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>
>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>>>>  void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>>>>>>  				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>>>  	XE_WARN_ON(err <= 0);
>>>>>>>>>>
>>>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
>>>>>>>>>> +	    userptr->initial_bind) {
>>>>>>>>
>>>>>>>> Same concern with the LR preempt fence as above — the hardware will
>>>>>>>> be interrupted via preempt fences, so it doesn’t seem necessary to
>>>>>>>> invalidate the TLBs, but perhaps we need a cflush and TLB
>>>>>>>> invalidation is the mechanism for that too?
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>>>    		err = xe_vm_invalidate_vma(vma);
>>>>>>>>>>    		XE_WARN_ON(err);
>>>>>>>>>>    	}
>>>>>>>>>> --
>>>>>>>>>> 2.52.0
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Matt Roper
>>>>>>>>> Graphics Software Engineer
>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>> Intel Corporation
>>>>>>
>>>>>> --
>>>>>> Matt Roper
>>>>>> Graphics Software Engineer
>>>>>> Linux GPU Platform Enablement
>>>>>> Intel Corporation
>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 10:56             ` Thomas Hellström
@ 2026-02-16 11:26               ` Upadhyay, Tejas
  0 siblings, 0 replies; 39+ messages in thread
From: Upadhyay, Tejas @ 2026-02-16 11:26 UTC (permalink / raw)
  To: Thomas Hellström, Roper, Matthew D, Mrozek, Michal,
	Souza, Jose
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org, Auld, Matthew



> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Sent: 16 February 2026 16:27
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; Roper, Matthew D
> <matthew.d.roper@intel.com>; Mrozek, Michal <michal.mrozek@intel.com>;
> Souza, Jose <jose.souza@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>
> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> cachelines manually
> 
> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> >
> >
> > > -----Original Message-----
> > > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > > Sent: 12 February 2026 02:41
> > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > > xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> > > thomas.hellstrom@linux.intel.com
> > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > > cachelines manually
> > >
> > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: 11 February 2026 05:32
> > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com
> > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker
> > > > > bo cachelines manually
> > > > >
> > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
> > > > > > wrote:
> > > > > > > "eXtended Architecture" (XA) tagged memory—memory shared
> > > > > > > between the CPU and GPU
> > > > > >
> > > > > > I'm pretty sure this expansion of "XA" is wrong; where are you
> > > > > > seeing this definition?  Everything in the bspec indicates
> > > > > > that XA means "wb - transient app" (similar to how "XD" is
> > > > > > "wb - transient display").
> > > > > > I'm not sure why exactly they picked "X" to refer to transient
> > > > > > in both of these cases, but I've never seen any documentation
> > > > > > that refers to it as "extended."
> > > > > >
> > > > > > > is treated differently from other GPU memory when the Media
> > > > > > > engine is power-gated.
> > > > > > >
> > > > > > > XA is *always* flushed, like at the end-of-submission (and
> > > > > > > maybe other
> > > > > >
> > > > > > I assume you're referring to the fact that the driver performs
> > > > > > flushes at the end of submission (via PIPE_CONTROL or
> > > > > > MI_FLUSH_DW), and that depending on other state/optimizations
> > > > > > in the system, those flushes may flush the entire device
> > > > > > cache, or may only flush the subset of cache data that is not
> > > > > > marked as transient.  The way you worded this was confusing
> > > > > > since it makes it sound like cache flushes happen
> > > > > > automatically somewhere in
> > > hardware/firmware.
> > > > > >
> > > > > > > places), just that internally as an optimisation hw doesn't
> > > > > > > need to make that a full flush (which will also include XA)
> > > > > > > when Media is off/powergated, since it doesn't need to worry
> > > > > > > about GT caches vs Media coherency, and only CPU vs GPU
> > > > > > > coherency, so can make that flush a targeted XA flush, since
> > > > > > > stuff tagged with XA now means it's shared with the CPU. The
> > > > > > > main implication is that we now need to somehow flush non-XA
> > > > > > > before freeing system memory pages, otherwise dirty
> > > > > > > cachelines could be flushed after the free (like if Media
> > > > > > > suddenly turns on and does a full
> > > > > > > flush)
> > > > > >
> > > > > > This description seems really confusing.  My understanding is
> > > > > > that marking something as wb-transient-app indicates that it
> > > > > > might be accessed by something other than our graphics/media
> > > > > > IP (i.e., accessed from the CPU, exported to another device,
> > > > > > etc.), so transient data truly does need to be flushed at the
> > > > > > points in the driver where a flush typically happens.
> > > > > >
> > > > > > However when something is _not_ transient, then either:
> > > > > >  - it's "private" to the GPU and only our graphics/media IP
> > > > > > will be
> > > > > >    accessing it
> > > > > >  - it's bound with a coherent PAT index so that outside
> > > > > > observers like
> > > > > >    the CPU can snoop the device cache, even when the cache
> > > > > > hasn't been
> > > > > >    flushed
> > > > > >
> > > > > > If media is not active, then there's really no need to include
> > > > > > non-transient data when a device cache flush happens since
> > > > > > there's no real need for the data to get to RAM.  So that
> > > > > > enables an optimization (which comes in your next patch), that
> > > > > > allows flushes to only operate on the subset of the device
> > > > > > cache tagged as "transient" if media is idle.
> > > >
> > > > But what if we have stale non-XA-marked pages for a userptr, and
> > > > that object moves out while at the same time media comes back? We
> > > > will end up with a full flush that writes the stale entries to RAM.
> > >
> > > What makes userptr special here?  During general, active usage,
> > > userptr would be data that's accessible by the CPU, so it needs to
> > > either be transient (so CPU can see the data in RAM after explicit
> > > flushes) or it needs to be using a coherent PAT (so that the CPU can
> > > just snoop the GPU cache).  If you marked userptr as both non-XA and
> > > non-coherent, then that sounds likely to be a userspace bug (and
> > > probably something we can catch and reject as an invalid case on any
> > > Xe3p or later platforms that support this) since the CPU wouldn't
> > > have any reliable way of seeing GPU updates.
> >
> > Right. FYI @Mrozek, Michal, @Souza, Jose: for userptr, as explained
> > above, it needs to use either a coherent or an XA PAT index, or else
> > the KMD will reject it as an invalid case.
> >
> > >
> > > If something happens that changes the GTT mapping of an object, then
> > > doesn't that already trigger a TLB invalidation when necessary in
> > > the driver today?  It was my understanding that "heavy" TLB
> > > invalidations wait for data values to be globally observable before
> > > starting, so I think that would ensure that any non-XA data makes it
> > > to RAM before any binding changes, object, destruction, etc.?  Is
> > > there something special about userptr that makes that case more of a
> > > problem?
> > >
> > > I just found bspec page 74635 which gives an overview of the various
> > > flush and invalidate cases, and I don't see anything there that
> > > makes it obvious to me that userptr would be special.
> > >
> > >
> > > >
> > > > > >
> > > > > > As you said, we eventually do want to force a flush of the
> > > > > > non-transient data as well once we're freeing the underlying
> > > > > > pages.
> > > > > > So how do we do that?  It's not clear to me how the changes
> > > > > > below are accomplishing that.  Is there a way to explicitly
> > > > > > request a full device cache flush (ignoring the transient vs
> > > > > > non- transient tagging)?
> > > > > > Since the GuC handles the optimization in the next patch
> > > > > > (toggling whether flushes are full flushes vs non-transient
> > > > > > flushes depending on whether media is active), I thought there
> > > > > > might be some kind of GuC interface to request "please do one
> > > > > > full flush now, even if media is idle."
> > > > > >
> > > > >
> > > > > I’m not an expert here by any means, but everything above from
> > > > > Matt seems like valid concerns. Thomas also raised some concerns
> > > > > in the two previous revisions; again I’m not an expert, but
> > > > > reading through those, it doesn’t really seem like he received
> > > > > proper answers to his questions.
> > > >
> > > > It's forcing the flush via the TLB invalidation PPC flag under
> > > > xe_vm_invalidate_vma().
> > >
> > > By the way, what is "PPC?"  It seems like it's another new synonym
> > > for the device cache?  It's already really confusing that some of
> > > our hardware docs use a mix of both "L2" and "L3" to refer to the
> > > same device cache for historical reasons...
> > >
> > >
> > > Matt
> > >
> > > >
> > > > >
> > > > > A couple of comments below.
> > > > >
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > >
> > > > > > > V2(MattA): Expand commit description
> > > > > > >
> > > > > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > >  drivers/gpu/drm/xe/xe_device.c  | 23
> > > > > > > +++++++++++++++++++++++
> > > > > > > drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > >
> > > > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > -			continue;
> > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > +				continue;
> >
> > Matt R,
> > This flush is still needed because non-XA buffers can be evicted while
> > media was off, and their stale cachelines can then be flushed out when
> > media comes back on. That was not the case earlier, since a full flush
> > happened at regular sync points, and that’s where this feature brings
> > the optimization now.
> >
> > Tejas
> 
> This flush would effectively eliminate any asynchronous migration. Then
> again, if this only affects IGFX we're synchronizing on shrinking anyway,
> but in general I think we should avoid synchronizing here.
> 
> Could this instead be done asynchronously, similar to how we flush the
> TLBs, so that it would be part of the migration fence?

How do we make sure, or even know, when media comes back on so that we can
do the full flush then? Deferring to an async flush would carry the same
risk, right?

Tejas

> 
> /Thomas
> 
> >
> > > > >
> > > > > This will trigger a TLB invalidation (and I assume a cache
> > > > > flush)
> > > > > every time we move or free memory in the 3D stack if it has a
> > > > > binding. It also performs a synchronous wait on the BO being
> > > > > idle.
> > > > > Both of these are very expensive operations. I can’t imagine the
> > > > > granularity we want here is to do this on every move/free with
> > > > > bindings.
> > > > >
> > > > > Also, for LR compute with preempt fences, we would trigger the
> > > > > preempt fences during the wait, so a TLB invalidation after this
> > > > > seems unnecessary, though perhaps the cache flush is still
> > > > > required?
> > > > >
> > > > > I think this needs a bit more explanation, because without
> > > > > knowing a lot about the exact requirements, the implementation
> > > > > does not look correct.
> > > >
> > > > The thing is that we are trying to solve a problem with userptr
> > > > using a non-XA PAT: consider that BO getting moved while media is
> > > > not active. As soon as media comes back active, stale cached entries
> > > > of that object will be flushed out as part of the full flush, which
> > > > may corrupt things.
> > > > The thinking was that with this patch we would at least solve the
> > > > corruption problem, and later, when the page-reclamation feature
> > > > came in, it would help with performance as well. But now that the
> > > > page-reclamation feature has been merged earlier and is tightly
> > > > coupled with bind/unbind, some cases like those discussed above
> > > > (which do not unbind immediately on move/free) are missed in
> > > > reclamation.
> > > >
> > > > So the thought was to let this solution go in with a small perf hit
> > > > and discuss with the page-reclamation owner to come up with a
> > > > cleaner solution together.
> > > >
> > > > Tejas
> > > > >
> > > > > > >  		}
> > > > > > >
> > > > > > >  		if (!idle) {
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > >  	}
> > > > > > >  }
> > > > > > >
> > > > > > > +/**
> > > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > > + * @xe: The device to check.
> > > > > > > + *
> > > > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > > > + */
> > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > > +{
> > > > > > > +	/* XA is *always* flushed, like at the end-of-submission (and maybe other
> > > > > > > +	 * places), just that internally as an optimisation hw doesn't need to make
> > > > > > > +	 * that a full flush (which will also include XA) when Media is
> > > > > > > +	 * off/powergated, since it doesn't need to worry about GT caches vs Media
> > > > > > > +	 * coherency, and only CPU vs GPU coherency, so can make that flush a
> > > > > > > +	 * targeted XA flush, since stuff tagged with XA now means it's shared with
> > > > > > > +	 * the CPU. The main implication is that we now need to somehow flush non-XA
> > > > > > > +	 * before freeing system memory pages, otherwise dirty cachelines could be
> > > > > > > +	 * flushed after the free (like if Media suddenly turns on and does a
> > > > > > > +	 * full flush)
> > > > > > > +	 */
> > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > +		return true;
> > > > > > > +	return false;
> > > > > > > +}
> > > > > > > +
> > > > > > >  void xe_device_l2_flush(struct xe_device *xe)
> > > > > > >  {
> > > > > > >  	struct xe_gt *gt;
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > >
> > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > > >  void xe_device_td_flush(struct xe_device *xe);
> > > > > > >  void xe_device_l2_flush(struct xe_device *xe);
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > > >  				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > > >  	XE_WARN_ON(err <= 0);
> > > > > > >
> > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > > +	    userptr->initial_bind) {
> > > > >
> > > > > Same concern with the LR preempt fence as above — the hardware
> > > > > will be interrupted via preempt fences, so it doesn’t seem
> > > > > necessary to invalidate the TLBs but perhaps we need a cflush
> > > > > and TLB invalidation is the mechanism for that too?
> > > > >
> > > > > Matt
> > > > >
> > > > > > >  		err = xe_vm_invalidate_vma(vma);
> > > > > > >  		XE_WARN_ON(err);
> > > > > > >  	}
> > > > > > > --
> > > > > > > 2.52.0
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Matt Roper
> > > > > > Graphics Software Engineer
> > > > > > Linux GPU Platform Enablement
> > > > > > Intel Corporation
> > >
> > > --
> > > Matt Roper
> > > Graphics Software Engineer
> > > Linux GPU Platform Enablement
> > > Intel Corporation

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 10:58                     ` Matthew Auld
@ 2026-02-16 12:07                       ` Thomas Hellström
  2026-02-16 14:55                         ` Matthew Auld
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Hellström @ 2026-02-16 12:07 UTC (permalink / raw)
  To: Matthew Auld, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
> On 16/02/2026 10:23, Thomas Hellström wrote:
> > On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
> > > On 13/02/2026 17:16, Matt Roper wrote:
> > > > On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
> > > > > On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > > Sent: 12 February 2026 02:41
> > > > > > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > > > > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > > <matthew.auld@intel.com>;
> > > > > > > thomas.hellstrom@linux.intel.com
> > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > > userptr/shrinker bo
> > > > > > > cachelines manually
> > > > > > > 
> > > > > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > > > Sent: 11 February 2026 05:32
> > > > > > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>;
> > > > > > > > > intel-
> > > > > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > > > > <matthew.auld@intel.com>;
> > > > > > > > > thomas.hellstrom@linux.intel.com
> > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > > > > userptr/shrinker bo
> > > > > > > > > cachelines manually
> > > > > > > > > 
> > > > > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper
> > > > > > > > > wrote:
> > > > > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas
> > > > > > > > > > Upadhyay
> > > > > > > > > > wrote:
> > > > > > > > > > > "eXtended Architecture" (XA) tagged memory—memory
> > > > > > > > > > > shared between the CPU and GPU
> > > > > > > > > >
> > > > > > > > > > I'm pretty sure this expansion of "XA" is wrong; where are
> > > > > > > > > > you seeing this definition?  Everything in the bspec
> > > > > > > > > > indicates that XA means "wb - transient app" (similar to
> > > > > > > > > > how "XD" is "wb - transient display").  I'm not sure why
> > > > > > > > > > exactly they picked "X" to refer to transient in both of
> > > > > > > > > > these cases, but I've never seen any documentation that
> > > > > > > > > > refers to it as "extended."
> > > > > > > > > > 
> > > > > > > > > > > is treated differently from other GPU memory when the
> > > > > > > > > > > Media engine is power-gated.
> > > > > > > > > > >
> > > > > > > > > > > XA is *always* flushed, like at the end-of-submission
> > > > > > > > > > > (and maybe other
> > > > > > > > > > 
> > > > > > > > > > I assume you're referring to the fact that the driver
> > > > > > > > > > performs flushes at the end of submission (via
> > > > > > > > > > PIPE_CONTROL or MI_FLUSH_DW), and that depending on other
> > > > > > > > > > state/optimizations in the system, those flushes may
> > > > > > > > > > flush the entire device cache, or may only flush the
> > > > > > > > > > subset of cache data that is not marked as transient.
> > > > > > > > > > The way you worded this was confusing since it makes it
> > > > > > > > > > sound like cache flushes happen automatically somewhere
> > > > > > > > > > in hardware/firmware.
> > > > > > > > > > 
> > > > > > > > > > > places), just that internally as an optimisation hw
> > > > > > > > > > > doesn't need to make that a full flush (which will also
> > > > > > > > > > > include XA) when Media is off/powergated, since it
> > > > > > > > > > > doesn't need to worry about GT caches vs Media
> > > > > > > > > > > coherency, and only CPU vs GPU coherency, so can make
> > > > > > > > > > > that flush a targeted XA flush, since stuff tagged with
> > > > > > > > > > > XA now means it's shared with the CPU. The main
> > > > > > > > > > > implication is that we now need to somehow flush non-XA
> > > > > > > > > > > before freeing system memory pages, otherwise dirty
> > > > > > > > > > > cachelines could be flushed after the free (like if
> > > > > > > > > > > Media suddenly turns on and does a full flush)
> > > > > > > > > > 
> > > > > > > > > > This description seems really confusing.  My
> > > > > > > > > > understanding is
> > > > > > > > > > that
> > > > > > > > > > marking something as wb-transient-app indicates
> > > > > > > > > > that it
> > > > > > > > > > might
> > > > > > > > > > be
> > > > > > > > > > accessed by something other than our graphics/media
> > > > > > > > > > IP
> > > > > > > > > > (i.e.,
> > > > > > > > > > accessed from the CPU, exported to another device,
> > > > > > > > > > etc.), so
> > > > > > > > > > transient data truly does need to be flushed at the
> > > > > > > > > > points in
> > > > > > > > > > the
> > > > > > > > > > driver where a flush typically happens.
> > > > > > > > > > 
> > > > > > > > > > However when something is _not_ transient, then
> > > > > > > > > > either:
> > > > > > > > > >    - it's "private" to the GPU and only our
> > > > > > > > > > graphics/media IP
> > > > > > > > > > will be
> > > > > > > > > >      accessing it
> > > > > > > > > >    - it's bound with a coherent PAT index so that
> > > > > > > > > > outside
> > > > > > > > > > observers like
> > > > > > > > > >      the CPU can snoop the device cache, even when
> > > > > > > > > > the
> > > > > > > > > > cache
> > > > > > > > > > hasn't been
> > > > > > > > > >      flushed
> > > > > > > > > > 
> > > > > > > > > > If media is not active, then there's really no need
> > > > > > > > > > to
> > > > > > > > > > include
> > > > > > > > > > non-transient data when a device cache flush
> > > > > > > > > > happens
> > > > > > > > > > since
> > > > > > > > > > there's no real need for the data to get to RAM. 
> > > > > > > > > > So
> > > > > > > > > > that
> > > > > > > > > > enables
> > > > > > > > > > an optimization (which comes in your next patch),
> > > > > > > > > > that
> > > > > > > > > > allows
> > > > > > > > > > flushes to only operate on the subset of the device
> > > > > > > > > > cache
> > > > > > > > > > tagged as
> > > > > > > "transient" if media is idle.
> > > > > > > > 
> > > > > > > > But what If we have stale non-XA marked pages for
> > > > > > > > userptr,
> > > > > > > > and
> > > > > > > > that
> > > > > > > > object moves out and at the same time media comes back,
> > > > > > > > will end
> > > > > > > > up in
> > > > > > > > full flush and flush the stale entry to RAM.
> > > > > > > 
> > > > > > > What makes userptr special here?  During general, active
> > > > > > > usage,
> > > > > > > userptr would
> > > > > > > be data that's accessible by the CPU, so it needs to
> > > > > > > either
> > > > > > > be
> > > > > > > transient (so CPU
> > > > > > > can see the data in RAM after explicit flushes) or it
> > > > > > > needs
> > > > > > > to be
> > > > > > > using a
> > > > > > > coherent PAT (so that the CPU can just snoop the GPU
> > > > > > > cache).
> > > > > > > If
> > > > > > > you marked
> > > > > > > userptr as both non-XA and non-coherent, then that sounds
> > > > > > > likely to
> > > > > > > be a
> > > > > > > userspace bug (and probably something we can catch and
> > > > > > > reject
> > > > > > > as an
> > > > > > > invalid
> > > > > > > case on any Xe3p or later platforms that support this)
> > > > > > > since
> > > > > > > the
> > > > > > > CPU wouldn't
> > > > > > > have any reliable way of seeing GPU updates.
> > > > > > 
> > > > > > Right. FYI @Mrozek, Michal @Souza, Jose
> > > > > > For userptr, as explained above, it needs to be either
> > > > > > coherent
> > > > > > or XA
> > > > > > pat index, or else KMD will reject as invalid case.
> > > > > > 
> > > > > > 
> > > > > The coherency restriction is already in the uAPI:
> > > > > 
> > > > > "Note: For userptr and externally imported dma-buf the kernel
> > > > > expects
> > > > > either 1WAY or 2WAY for the @pat_index."
> > > > > 
> > > > > Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing
> > > > > GPU
> > > > > caches
> > > > > at the end of batch buffers.
> > > > 
> > > > But isn't that what we're discussing here?  1-way *won't*
> > > > necessarily be
> > > > enough anymore because PIPE_CONTROL instructions don't flush
> > > > the
> > > > entire
> > > > cache anymore.  Whenever the GuC determines that media is
> > > > inactive
> > > > and
> > > > activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc.
> > > > change
> > > > behavior to only flush out the subset of data that was marked
> > > > as
> > > > app-transient; anything not marked that way doesn't get flushed
> > > > now.  So
> > > > there's a new requirement here that you ensure you're using an
> > > > XA
> > > > PAT
> > > > index, or you switch to use 2-way coherency which will allow
> > > > the
> > > > CPU to
> > > > snoop the GPU's caches.
> > > 
> > > That exactly matches my understanding also.
> > 
> > This only ever affects IGFX, right? Since AFAIU we don't have 2-way
> > coherency with DGFX?
> 
> Yeah, this should be igpu only. I seem to also recall that on dgpu, 
> Media is coherent with l2/l3, but also I don't think system memory
> can 
> be cached in l2/l3 (only VRAM), which I assume is why there is the 
> special SMRO (system-memory-read-only) cache only on dgpu, which is 
> flushed when the fence signals, unlike the l2/l3.

Yes that sounds reasonable.

> 
> > 
> > It sounds like the same PAT restriction is needed also for imported
> > dma-buf, right?
> 
> Good point. Looks like we are missing that still. Otherwise we can
> run 
> into the same issues with stale l2/l3/ppc.

So if this affects only system memory could we instead of relying on 2-
way coherency or XA, just flush at dma unmap time, because that's
typically just before releasing the pages.
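
As a side note, the platform gate patch 1/3 adds boils down to the check
below, so a dma-unmap-time flush could reuse the same gate and stay a
no-op everywhere else. The struct fields and macros here are simplified
standalone stand-ins for the real struct xe_device, GRAPHICS_VER() and
IS_DGFX(), just to make the condition concrete:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the driver's struct xe_device and the
 * GRAPHICS_VER()/IS_DGFX() macros; the real code lives in xe_device.[ch]. */
struct xe_device {
	int graphics_ver;
	bool is_dgfx;
};

#define GRAPHICS_VER(xe)	((xe)->graphics_ver)
#define IS_DGFX(xe)		((xe)->is_dgfx)

/*
 * Mirrors the gate from patch 1/3: the manual cacheline flush is only
 * needed on Xe3p+ (graphics version >= 35) integrated parts, since the
 * targeted-XA flush optimization never kicks in on discrete, where
 * system memory isn't cached in l2/l3 anyway.
 */
bool xe_device_needs_cache_flush(struct xe_device *xe)
{
	return GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe);
}
```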

The exception, though, is dma-buf where the exporter can actually
release memory before all importers have given up their dma-mappings.

/Thomas

> 
> > 
> > /Thomas
> > 
> > 
> > > 
> > > > 
> > > > 
> > > > Matt
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > If something happens that changes the GTT mapping of an
> > > > > > > object,
> > > > > > > then
> > > > > > > doesn't that already trigger a TLB invalidation when
> > > > > > > necessary in
> > > > > > > the driver
> > > > > > > today?  It was my understanding that "heavy" TLB
> > > > > > > invalidations wait
> > > > > > > for data
> > > > > > > values to be globally observable before starting, so I
> > > > > > > think
> > > > > > > that
> > > > > > > would ensure
> > > > > > > that any non-XA data makes it to RAM before any binding
> > > > > > > changes,
> > > > > > > object,
> > > > > > > destruction, etc.?  Is there something special about
> > > > > > > userptr
> > > > > > > that
> > > > > > > makes that
> > > > > > > case more of a problem?
> > > > > > > 
> > > > > > > I just found bspec page 74635 which gives an overview of
> > > > > > > the
> > > > > > > various flush
> > > > > > > and invalidate cases, and I don't see anything there that
> > > > > > > makes it
> > > > > > > obvious to
> > > > > > > me that userptr would be special.
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > As you said, we eventually do want to force a flush
> > > > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > non-transient data as well once we're freeing the
> > > > > > > > > > underlying
> > > > > > > > > > pages.
> > > > > > > > > > So how do we do that?  It's not clear to me how the
> > > > > > > > > > changes
> > > > > > > > > > below
> > > > > > > > > > are accomplishing that.  Is there a way to
> > > > > > > > > > explicitly
> > > > > > > > > > request
> > > > > > > > > > a
> > > > > > > > > > full device cache flush (ignoring the transient vs
> > > > > > > > > > non-
> > > > > > > > > > transient tagging)?
> > > > > > > > > > Since the GuC handles the optimization in the next
> > > > > > > > > > patch
> > > > > > > > > > (toggling
> > > > > > > > > > whether flushes are full flushes vs non-transient
> > > > > > > > > > flushes
> > > > > > > > > > depending on whether media is active), I thought
> > > > > > > > > > there
> > > > > > > > > > might
> > > > > > > > > > be
> > > > > > > > > > some kind of GuC interface to request "please do
> > > > > > > > > > one
> > > > > > > > > > full
> > > > > > > > > > flush now, even
> > > > > > > if media is idle."
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I’m not an expert here by any means, but everything
> > > > > > > > > above
> > > > > > > > > from
> > > > > > > > > Matt
> > > > > > > > > seems like valid concerns. Thomas also raised some
> > > > > > > > > concerns in
> > > > > > > > > the
> > > > > > > > > two previous revisions; again I’m not an expert, but
> > > > > > > > > reading
> > > > > > > > > through
> > > > > > > > > those, it doesn’t really seem like he received proper
> > > > > > > > > answers
> > > > > > > > > to his
> > > > > > > questions.
> > > > > > > > 
> > > > > > > > It's forcing flush via tlb invalidation PPC flag under
> > > > > > > > xe_invalidate_vma( ).
> > > > > > > 
> > > > > > > By the way, what is "PPC?"  It seems like it's another
> > > > > > > new
> > > > > > > synonym
> > > > > > > for the
> > > > > > > device cache?  It's already really confusing that some of
> > > > > > > our
> > > > > > > hardware docs use
> > > > > > > a mix of both "L2" and "L3" to refer to the same device
> > > > > > > cache
> > > > > > > for
> > > > > > > historical
> > > > > > > reasons...
> > > > > > > 
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > A couple of comments below.
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Matt
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > V2(MattA): Expand commit description
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Tejas Upadhyay
> > > > > > > > > > > <tejas.upadhyay@intel.com>
> > > > > > > > > > > ---
> > > > > > > > > > >    drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > > > > > >    drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > > > > > > >    drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > > > > >    drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > > > > > >    4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > > > > > > 
> > > > > > > > > > >    		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > > > > > >    			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > > > > > -			continue;
> > > > > > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > > > > > +				continue;
> > > > > > 
> > > > > > Matt R,
> > > > > > This flush will be still needed as there can be non-xa
> > > > > > buffers
> > > > > > which
> > > > > > can be evicted while media was off and stale entries can be
> > > > > > flushed
> > > > > > when media comes back on. Which was not case earlier as
> > > > > > full
> > > > > > flush
> > > > > > was happening at regular sync points and that’s where this
> > > > > > feature is
> > > > > > bringing optimization now.
> > > > > > 
> > > > > > Tejas
> > > > > > 
> > > > > > > > > 
> > > > > > > > > This will trigger a TLB invalidation (and I assume a
> > > > > > > > > cache
> > > > > > > > > flush)
> > > > > > > > > every time we move or free memory in the 3D stack if
> > > > > > > > > it
> > > > > > > > > has a
> > > > > > > > > binding. It also performs a synchronous wait on the
> > > > > > > > > BO
> > > > > > > > > being
> > > > > > > > > idle.
> > > > > > > > > Both of these are very expensive operations. I can’t
> > > > > > > > > imagine
> > > > > > > > > the
> > > > > > > > > granularity we want here is to do this on every
> > > > > > > > > move/free
> > > > > > > > > with
> > > > > > > > > bindings.
> > > > > > > > > 
> > > > > > > > > Also, for LR compute with preempt fences, we would
> > > > > > > > > trigger the
> > > > > > > > > preempt fences during the wait, so a TLB invalidation
> > > > > > > > > after
> > > > > > > > > this
> > > > > > > > > seems unnecessary, though perhaps the cache flush is
> > > > > > > > > still
> > > > > > > > > required?
> > > > > > > > > 
> > > > > > > > > I think this needs a bit more explanation, because
> > > > > > > > > without
> > > > > > > > > knowing a
> > > > > > > > > lot about the exact requirements, the implementation
> > > > > > > > > does
> > > > > > > > > not
> > > > > > > > > look
> > > > > > > correct.
> > > > > > > > 
> > > > > > > > The thing is that we are trying to solve problem with
> > > > > > > > userptr
> > > > > > > > with non-XA
> > > > > > > pat, consider if that BO got moved while media is not
> > > > > > > active.
> > > > > > > As
> > > > > > > soon as media
> > > > > > > will come back active, stale cached entries of that
> > > > > > > object
> > > > > > > will be
> > > > > > > flushed as part
> > > > > > > of full flush , which may corrupt things.
> > > > > > > > There was thinking that with this patch we would at
> > > > > > > > least
> > > > > > > > solve
> > > > > > > > the problem
> > > > > > > of corruption and later when page_reclamation feature
> > > > > > > comes
> > > > > > > in will
> > > > > > > help in
> > > > > > > performance as well. But now when page reclamation
> > > > > > > feature is
> > > > > > > merged earlier
> > > > > > > and it tightly coupled with bind/unbind some cases like
> > > > > > > discussed
> > > > > > > above
> > > > > > > (which are not doing unbind immediately on move/free) are
> > > > > > > missed in
> > > > > > > reclamation.
> > > > > > > > 
> > > > > > > > So thought was to let this solution go in with little
> > > > > > > > perf
> > > > > > > > hit
> > > > > > > > and discuss with
> > > > > > > page reclamation owner to come with cleaner solution
> > > > > > > together.
> > > > > > > > 
> > > > > > > > Tejas
> > > > > > > > > 
> > > > > > > > > > >    		}
> > > > > > > > > > > 
> > > > > > > > > > >    		if (!idle) {
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > > > > > >    	}
> > > > > > > > > > >    }
> > > > > > > > > > > 
> > > > > > > > > > > +/**
> > > > > > > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > > > > > > + * @xe: The device to check.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > > > > > > > + */
> > > > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > > > > > > +{
> > > > > > > > > > > +	/* XA is *always* flushed, like at the end-of-submission (and maybe other
> > > > > > > > > > > +	 * places), just that internally as an optimisation hw doesn't need to
> > > > > > > > > > > +	 * make that a full flush (which will also include XA) when Media is
> > > > > > > > > > > +	 * off/powergated, since it doesn't need to worry about GT caches vs
> > > > > > > > > > > +	 * Media coherency, and only CPU vs GPU coherency, so can make that
> > > > > > > > > > > +	 * flush a targeted XA flush, since stuff tagged with XA now means it's
> > > > > > > > > > > +	 * shared with the CPU. The main implication is that we now need to
> > > > > > > > > > > +	 * somehow flush non-XA before freeing system memory pages, otherwise
> > > > > > > > > > > +	 * dirty cachelines could be flushed after the free (like if Media
> > > > > > > > > > > +	 * suddenly turns on and does a full flush)
> > > > > > > > > > > +	 */
> > > > > > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > > > > > +		return true;
> > > > > > > > > > > +	return false;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > >    void xe_device_l2_flush(struct xe_device *xe) 
> > > > > > > > > > > {
> > > > > > > > > > >    	struct xe_gt *gt;
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > > > > > > >    u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > > >    u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > > > 
> > > > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > > > > > > >    void xe_device_td_flush(struct xe_device *xe);
> > > > > > > > > > >    void xe_device_l2_flush(struct xe_device *xe);
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > > > > > > >    				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > > > > > > >    	XE_WARN_ON(err <= 0);
> > > > > > > > > > > 
> > > > > > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > > > > > > +	    userptr->initial_bind) {
> > > > > > > > > 
> > > > > > > > > Same concern with the LR preempt fence as above — the
> > > > > > > > > hardware
> > > > > > > > > will
> > > > > > > > > be interrupted via preempt fences, so it doesn’t seem
> > > > > > > > > necessary
> > > > > > > > > to
> > > > > > > > > invalidate the TLBs but perhaps we need a cflush and
> > > > > > > > > TLB
> > > > > > > > > invalidation is the mechanism for that too?
> > > > > > > > > 
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > > >    		err = xe_vm_invalidate_vma(vma);
> > > > > > > > > > >    		XE_WARN_ON(err);
> > > > > > > > > > >    	}
> > > > > > > > > > > --
> > > > > > > > > > > 2.52.0
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Matt Roper
> > > > > > > > > > Graphics Software Engineer
> > > > > > > > > > Linux GPU Platform Enablement
> > > > > > > > > > Intel Corporation
> > > > > > > 
> > > > > > > --
> > > > > > > Matt Roper
> > > > > > > Graphics Software Engineer
> > > > > > > Linux GPU Platform Enablement
> > > > > > > Intel Corporation
> > > > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 12:07                       ` Thomas Hellström
@ 2026-02-16 14:55                         ` Matthew Auld
  2026-02-16 15:38                           ` Thomas Hellström
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Auld @ 2026-02-16 14:55 UTC (permalink / raw)
  To: Thomas Hellström, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On 16/02/2026 12:07, Thomas Hellström wrote:
> On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
>> On 16/02/2026 10:23, Thomas Hellström wrote:
>>> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
>>>> On 13/02/2026 17:16, Matt Roper wrote:
>>>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
>>>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>> Sent: 12 February 2026 02:41
>>>>>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
>>>>>>>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>>>> <matthew.auld@intel.com>;
>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>>>> userptr/shrinker bo
>>>>>>>> cachelines manually
>>>>>>>>
>>>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>>>>>>>> Sent: 11 February 2026 05:32
>>>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>;
>>>>>>>>>> intel-
>>>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>>>>>> <matthew.auld@intel.com>;
>>>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>>>>>> userptr/shrinker bo
>>>>>>>>>> cachelines manually
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper
>>>>>>>>>> wrote:
>>>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas
>>>>>>>>>>> Upadhyay
>>>>>>>>>>> wrote:
>>>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory
>>>>>>>>>>>> shared
>>>>>>>> between
>>>>>>>>>> the
>>>>>>>>>>>> CPU and GPU
>>>>>>>>>>>
>>>>>>>>>>> I'm pretty sure this expansion of "XA" is wrong;
>>>>>>>>>>> where
>>>>>>>>>>> are
>>>>>>>>>>> you
>>>>>>>>>>> seeing this definition?  Everything in the bspec
>>>>>>>>>>> indicates
>>>>>>>>>>> that XA
>>>>>>>>>>> means "wb
>>>>>>>>>>> - transient app" (similar to how "XD" is 'wb -
>>>>>>>>>>> transient
>>>>>>>>>>> display").
>>>>>>>>>>> I'm not sure why exactly they picked "X" to refer
>>>>>>>>>>> to
>>>>>>>>>>> transient in
>>>>>>>>>>> both of these cases, but I've never seen any
>>>>>>>>>>> documentation
>>>>>>>>>>> that
>>>>>>>>>>> refers to it as "extended."
>>>>>>>>>>>
>>>>>>>>>>>> is treated differently from other GPU memory when
>>>>>>>>>>>> the
>>>>>>>>>>>> Media
>>>>>>>>>>>> engine is
>>>>>>>>>> power-gated.
>>>>>>>>>>>>
>>>>>>>>>>>> XA is *always* flushed, like at the end-of-
>>>>>>>>>>>> submission
>>>>>>>>>>>> (and
>>>>>>>>>>>> maybe
>>>>>>>>>>>> other
>>>>>>>>>>>
>>>>>>>>>>> I assume you're referring to the fact that the
>>>>>>>>>>> driver
>>>>>>>>>>> performs
>>>>>>>>>>> flushes at the end of submission (via PIPE_CONTROL
>>>>>>>>>>> or
>>>>>>>>>>> MI_FLUSH_DW), and that depending on other
>>>>>>>>>>> state/optimizations
>>>>>>>>>>> in
>>>>>>>>>>> the system, those flushes may flush the entire
>>>>>>>>>>> device
>>>>>>>>>>> cache,
>>>>>>>>>>> or
>>>>>>>>>>> may only flush the subset of cache data that is not
>>>>>>>>>>> marked as
>>>>>>>>>>> transient.  The way you worded this was confusing
>>>>>>>>>>> since
>>>>>>>>>>> it
>>>>>>>>>>> makes
>>>>>>>>>>> it sound like cache flushes happen automatically
>>>>>>>>>>> somewhere in
>>>>>>>> hardware/firmware.
>>>>>>>>>>>
>>>>>>>>>>>> places), just that internally as an optimisation
>>>>>>>>>>>> hw
>>>>>>>>>>>> doesn't
>>>>>>>>>>>> need
>>>>>>>>>>>> to make that a full flush (which will also
>>>>>>>>>>>> include
>>>>>>>>>>>> XA) when
>>>>>>>>>>>> Media is off/powergated, since it doesn't need to
>>>>>>>>>>>> worry
>>>>>>>>>>>> about GT
>>>>>>>>>>>> caches vs Media coherency, and only CPU vs GPU
>>>>>>>>>>>> coherency,
>>>>>>>>>>>> so can
>>>>>>>>>>>> make that flush a targeted XA flush, since stuff
>>>>>>>>>>>> tagged
>>>>>>>>>>>> with XA
>>>>>>>>>>>> now means it's shared with the CPU. The main
>>>>>>>>>>>> implication is
>>>>>>>>>>>> that
>>>>>>>>>>>> we now need to somehow flush non-XA before
>>>>>>>>>>>> freeing
>>>>>>>>>>>> system
>>>>>>>>>>>> memory
>>>>>>>>>>>> pages, otherwise dirty cachelines could be
>>>>>>>>>>>> flushed
>>>>>>>>>>>> after
>>>>>>>>>>>> the
>>>>>>>>>>>> free (like if Media suddenly turns on and does a
>>>>>>>>>>>> full
>>>>>>>>>>>> flush)
>>>>>>>>>>>
>>>>>>>>>>> This description seems really confusing.  My
>>>>>>>>>>> understanding is
>>>>>>>>>>> that
>>>>>>>>>>> marking something as wb-transient-app indicates
>>>>>>>>>>> that it
>>>>>>>>>>> might
>>>>>>>>>>> be
>>>>>>>>>>> accessed by something other than our graphics/media
>>>>>>>>>>> IP
>>>>>>>>>>> (i.e.,
>>>>>>>>>>> accessed from the CPU, exported to another device,
>>>>>>>>>>> etc.), so
>>>>>>>>>>> transient data truly does need to be flushed at the
>>>>>>>>>>> points in
>>>>>>>>>>> the
>>>>>>>>>>> driver where a flush typically happens.
>>>>>>>>>>>
>>>>>>>>>>> However when something is _not_ transient, then
>>>>>>>>>>> either:
>>>>>>>>>>>     - it's "private" to the GPU and only our
>>>>>>>>>>> graphics/media IP
>>>>>>>>>>> will be
>>>>>>>>>>>       accessing it
>>>>>>>>>>>     - it's bound with a coherent PAT index so that
>>>>>>>>>>> outside
>>>>>>>>>>> observers like
>>>>>>>>>>>       the CPU can snoop the device cache, even when
>>>>>>>>>>> the
>>>>>>>>>>> cache
>>>>>>>>>>> hasn't been
>>>>>>>>>>>       flushed
>>>>>>>>>>>
>>>>>>>>>>> If media is not active, then there's really no need
>>>>>>>>>>> to
>>>>>>>>>>> include
>>>>>>>>>>> non-transient data when a device cache flush
>>>>>>>>>>> happens
>>>>>>>>>>> since
>>>>>>>>>>> there's no real need for the data to get to RAM.
>>>>>>>>>>> So
>>>>>>>>>>> that
>>>>>>>>>>> enables
>>>>>>>>>>> an optimization (which comes in your next patch),
>>>>>>>>>>> that
>>>>>>>>>>> allows
>>>>>>>>>>> flushes to only operate on the subset of the device
>>>>>>>>>>> cache
>>>>>>>>>>> tagged as
>>>>>>>> "transient" if media is idle.
>>>>>>>>>
>>>>>>>>> But what If we have stale non-XA marked pages for
>>>>>>>>> userptr,
>>>>>>>>> and
>>>>>>>>> that
>>>>>>>>> object moves out and at the same time media comes back,
>>>>>>>>> will end
>>>>>>>>> up in
>>>>>>>>> full flush and flush the stale entry to RAM.
>>>>>>>>
>>>>>>>> What makes userptr special here?  During general, active
>>>>>>>> usage,
>>>>>>>> userptr would
>>>>>>>> be data that's accessible by the CPU, so it needs to
>>>>>>>> either
>>>>>>>> be
>>>>>>>> transient (so CPU
>>>>>>>> can see the data in RAM after explicit flushes) or it
>>>>>>>> needs
>>>>>>>> to be
>>>>>>>> using a
>>>>>>>> coherent PAT (so that the CPU can just snoop the GPU
>>>>>>>> cache).
>>>>>>>> If
>>>>>>>> you marked
>>>>>>>> userptr as both non-XA and non-coherent, then that sounds
>>>>>>>> likely to
>>>>>>>> be a
>>>>>>>> userspace bug (and probably something we can catch and
>>>>>>>> reject
>>>>>>>> as an
>>>>>>>> invalid
>>>>>>>> case on any Xe3p or later platforms that support this)
>>>>>>>> since
>>>>>>>> the
>>>>>>>> CPU wouldn't
>>>>>>>> have any reliable way of seeing GPU updates.
>>>>>>>
>>>>>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>>>>>> For userptr, as explained above, it needs to be either
>>>>>>> coherent
>>>>>>> or XA
>>>>>>> pat index, or else KMD will reject as invalid case.
>>>>>>>
>>>>>>>
>>>>>> The coherency restriction is already in the uAPI:
>>>>>>
>>>>>> "Note: For userptr and externally imported dma-buf the kernel
>>>>>> expects
>>>>>> either 1WAY or 2WAY for the @pat_index."
>>>>>>
>>>>>> Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing
>>>>>> GPU
>>>>>> caches
>>>>>> at the end of batch buffers.
>>>>>
>>>>> But isn't that what we're discussing here?  1-way *won't*
>>>>> necessarily be
>>>>> enough anymore because PIPE_CONTROL instructions don't flush
>>>>> the
>>>>> entire
>>>>> cache anymore.  Whenever the GuC determines that media is
>>>>> inactive
>>>>> and
>>>>> activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc.
>>>>> change
>>>>> behavior to only flush out the subset of data that was marked
>>>>> as
>>>>> app-transient; anything not marked that way doesn't get flushed
>>>>> now.  So
>>>>> there's a new requirement here that you ensure you're using an
>>>>> XA
>>>>> PAT
>>>>> index, or you switch to use 2-way coherency which will allow
>>>>> the
>>>>> CPU to
>>>>> snoop the GPU's caches.
>>>>
>>>> That exactly matches my understanding also.
>>>
>>> This only ever affects IGFX, right? Since AFAIU we don't have 2-way
>>> coherency with DGFX?
>>
>> Yeah, this should be igpu only. I seem to also recall that on dgpu,
>> Media is coherent with l2/l3, but also I don't think system memory
>> can
>> be cached in l2/l3 (only VRAM), which I assume is why there is the
>> special SMRO (system-memory-read-only) cache only on dgpu, which is
>> flushed when the fence signals, unlike the l2/l3.
> 
> Yes that sounds reasonable.
> 
>>
>>>
>>> It sounds like the same PAT restriction is needed also for imported
>>> dma-buf, right?
>>
>> Good point. Looks like we are missing that still. Otherwise we can
>> run
>> into the same issues with stale l2/l3/ppc.
> 
> So if this affects only system memory could we instead of relying on 2-
> way coherency or XA, just flush at dma unmap time, because that's
> typically just before releasing the pages.

Yeah, I think we could make it work, from security pov, similar to 
userptr, with the right manual flushes in KMD. Maybe just a question if 
userspace wants such a model? Anything cached in l2/l3 might require 
manual flushing by userspace (if that is even possible)?
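
To make the PAT restriction we keep circling around concrete, the
bind-time check for userptr/imported dma-buf would presumably look
something like the sketch below. The enum and field names are made up
for illustration, not the real xe_pat interface:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model of the bind-time validation being discussed: on
 * Xe3p+ igpu, a userptr or imported dma-buf binding whose PAT entry is
 * neither 2-way coherent nor tagged XA (wb-transient-app) leaves the CPU
 * with no reliable way to observe GPU writes, so the bind should be
 * rejected. Names here are hypothetical stand-ins.
 */
enum coh_mode { COH_NONE, COH_1WAY, COH_2WAY };

struct pat_entry {
	enum coh_mode coh;
	bool xa;		/* wb-transient-app */
};

bool userptr_pat_valid(const struct pat_entry *pe, bool l2_opt_active)
{
	/* Pre-Xe3p / dgpu: the existing uAPI rule (1WAY or 2WAY) suffices */
	if (!l2_opt_active)
		return pe->coh != COH_NONE;
	/* Xe3p+ igpu: flushes may only cover XA, so 1-way alone isn't enough */
	return pe->coh == COH_2WAY || pe->xa;
}
```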

> 
> The exception, though, is dma-buf where the exporter can actually
> release memory before all importers have given up their dma-mappings.
> 
> /Thomas
> 
>>
>>>
>>> /Thomas
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> If something happens that changes the GTT mapping of an
>>>>>>>> object,
>>>>>>>> then
>>>>>>>> doesn't that already trigger a TLB invalidation when
>>>>>>>> necessary in
>>>>>>>> the driver
>>>>>>>> today?  It was my understanding that "heavy" TLB
>>>>>>>> invalidations wait
>>>>>>>> for data
>>>>>>>> values to be globally observable before starting, so I
>>>>>>>> think
>>>>>>>> that
>>>>>>>> would ensure
>>>>>>>> that any non-XA data makes it to RAM before any binding
>>>>>>>> changes,
>>>>>>>> object,
>>>>>>>> destruction, etc.?  Is there something special about
>>>>>>>> userptr
>>>>>>>> that
>>>>>>>> makes that
>>>>>>>> case more of a problem?
>>>>>>>>
>>>>>>>> I just found bspec page 74635 which gives an overview of
>>>>>>>> the
>>>>>>>> various flush
>>>>>>>> and invalidate cases, and I don't see anything there that
>>>>>>>> makes it
>>>>>>>> obvious to
>>>>>>>> me that userptr would be special.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> As you said, we eventually do want to force a flush
>>>>>>>>>>> of
>>>>>>>>>>> the
>>>>>>>>>>> non-transient data as well once we're freeing the
>>>>>>>>>>> underlying
>>>>>>>>>>> pages.
>>>>>>>>>>> So how do we do that?  It's not clear to me how the
>>>>>>>>>>> changes
>>>>>>>>>>> below
>>>>>>>>>>> are accomplishing that.  Is there a way to
>>>>>>>>>>> explicitly
>>>>>>>>>>> request
>>>>>>>>>>> a
>>>>>>>>>>> full device cache flush (ignoring the transient vs
>>>>>>>>>>> non-
>>>>>>>>>>> transient tagging)?
>>>>>>>>>>> Since the GuC handles the optimization in the next
>>>>>>>>>>> patch
>>>>>>>>>>> (toggling
>>>>>>>>>>> whether flushes are full flushes vs non-transient
>>>>>>>>>>> flushes
>>>>>>>>>>> depending on whether media is active), I thought
>>>>>>>>>>> there
>>>>>>>>>>> might
>>>>>>>>>>> be
>>>>>>>>>>> some kind of GuC interface to request "please do
>>>>>>>>>>> one
>>>>>>>>>>> full
>>>>>>>>>>> flush now, even
>>>>>>>> if media is idle."
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I’m not an expert here by any means, but everything
>>>>>>>>>> above
>>>>>>>>>> from
>>>>>>>>>> Matt
>>>>>>>>>> seems like valid concerns. Thomas also raised some
>>>>>>>>>> concerns in
>>>>>>>>>> the
>>>>>>>>>> two previous revisions; again I’m not an expert, but
>>>>>>>>>> reading
>>>>>>>>>> through
>>>>>>>>>> those, it doesn’t really seem like he received proper
>>>>>>>>>> answers
>>>>>>>>>> to his
>>>>>>>> questions.
>>>>>>>>>
>>>>>>>>> It's forcing a flush via the TLB invalidation PPC flag
>>>>>>>>> under xe_invalidate_vma().
>>>>>>>>
>>>>>>>> By the way, what is "PPC?"  It seems like it's another
>>>>>>>> new
>>>>>>>> synonym
>>>>>>>> for the
>>>>>>>> device cache?  It's already really confusing that some of
>>>>>>>> our
>>>>>>>> hardware docs use
>>>>>>>> a mix of both "L2" and "L3" to refer to the same device
>>>>>>>> cache
>>>>>>>> for
>>>>>>>> historical
>>>>>>>> reasons...
>>>>>>>>
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> A couple of comments below.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Tejas Upadhyay
>>>>>>>>>>>> <tejas.upadhyay@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>>>>>  4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>>>>>>>> 
>>>>>>>>>>>>  		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>>>>>  			drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>>>>>>> -			continue;
>>>>>>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>>>>>>>>> +				continue;
>>>>>>>
>>>>>>> Matt R,
>>>>>>> This flush is still needed: non-XA buffers can be evicted
>>>>>>> while media is off, and their stale entries can then be
>>>>>>> flushed out when media comes back on. That was not the case
>>>>>>> earlier, since a full flush happened at regular sync points,
>>>>>>> and that is exactly where this feature brings its
>>>>>>> optimization.
>>>>>>>
>>>>>>> Tejas
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This will trigger a TLB invalidation (and I assume a
>>>>>>>>>> cache
>>>>>>>>>> flush)
>>>>>>>>>> every time we move or free memory in the 3D stack if
>>>>>>>>>> it
>>>>>>>>>> has a
>>>>>>>>>> binding. It also performs a synchronous wait on the
>>>>>>>>>> BO
>>>>>>>>>> being
>>>>>>>>>> idle.
>>>>>>>>>> Both of these are very expensive operations. I can’t
>>>>>>>>>> imagine
>>>>>>>>>> the
>>>>>>>>>> granularity we want here is to do this on every
>>>>>>>>>> move/free
>>>>>>>>>> with
>>>>>>>>>> bindings.
>>>>>>>>>>
>>>>>>>>>> Also, for LR compute with preempt fences, we would
>>>>>>>>>> trigger the
>>>>>>>>>> preempt fences during the wait, so a TLB invalidation
>>>>>>>>>> after
>>>>>>>>>> this
>>>>>>>>>> seems unnecessary, though perhaps the cache flush is
>>>>>>>>>> still
>>>>>>>>>> required?
>>>>>>>>>>
>>>>>>>>>> I think this needs a bit more explanation, because
>>>>>>>>>> without
>>>>>>>>>> knowing a
>>>>>>>>>> lot about the exact requirements, the implementation
>>>>>>>>>> does
>>>>>>>>>> not
>>>>>>>>>> look
>>>>>>>> correct.
>>>>>>>>>
>>>>>>>>> The thing is that we are trying to solve a problem with
>>>>>>>>> userptr using a non-XA PAT: consider a BO that gets moved
>>>>>>>>> while media is not active. As soon as media becomes active
>>>>>>>>> again, stale cached entries of that object will be flushed
>>>>>>>>> as part of the full flush, which may corrupt things.
>>>>>>>>> There was a thinking that with this patch we would at least
>>>>>>>>> solve the corruption problem, and that the page-reclamation
>>>>>>>>> feature, once it came in, would help performance as well.
>>>>>>>>> But now that page reclamation has merged earlier and is
>>>>>>>>> tightly coupled with bind/unbind, some cases like the ones
>>>>>>>>> discussed above (which do not unbind immediately on
>>>>>>>>> move/free) are missed by reclamation.
>>>>>>>>>
>>>>>>>>> So the thought was to let this solution go in with a small
>>>>>>>>> perf hit and then work with the page-reclamation owner on a
>>>>>>>>> cleaner solution together.
>>>>>>>>>
>>>>>>>>> Tejas
>>>>>>>>>>
>>>>>>>>>>>>     		}
>>>>>>>>>>>>
>>>>>>>>>>>>     		if (!idle) {
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>>>>>>>>  	}
>>>>>>>>>>>>  }
>>>>>>>>>>>> 
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>>>>>>>>> + * @xe: The device to check.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>>>>>>> + */
>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and
>>>>>>>>>>>> +	 * maybe other places), just that internally as an optimisation
>>>>>>>>>>>> +	 * hw doesn't need to make that a full flush (which will also
>>>>>>>>>>>> +	 * include XA) when Media is off/powergated, since it doesn't
>>>>>>>>>>>> +	 * need to worry about GT caches vs Media coherency, and only
>>>>>>>>>>>> +	 * CPU vs GPU coherency, so can make that flush a targeted XA
>>>>>>>>>>>> +	 * flush, since stuff tagged with XA now means it's shared with
>>>>>>>>>>>> +	 * the CPU. The main implication is that we now need to somehow
>>>>>>>>>>>> +	 * flush non-XA before freeing system memory pages, otherwise
>>>>>>>>>>>> +	 * dirty cachelines could be flushed after the free (like if
>>>>>>>>>>>> +	 * Media suddenly turns on and does a full flush)
>>>>>>>>>>>> +	 */
>>>>>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>>>>>>> +		return true;
>>>>>>>>>>>> +	return false;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>     void xe_device_l2_flush(struct xe_device *xe)
>>>>>>>>>>>> {
>>>>>>>>>>>>     	struct xe_gt *gt;
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>> index 39464650533b..baf386e0e037 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>>>>>>>>  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>>>  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>>> 
>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>>>>>>  void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>>>>>>>>  				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>>>>>  	XE_WARN_ON(err <= 0);
>>>>>>>>>>>> 
>>>>>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
>>>>>>>>>>>> +	    userptr->initial_bind) {
>>>>>>>>>>
>>>>>>>>>> Same concern with the LR preempt fence as above — the
>>>>>>>>>> hardware
>>>>>>>>>> will
>>>>>>>>>> be interrupted via preempt fences, so it doesn’t seem
>>>>>>>>>> necessary
>>>>>>>>>> to
>>>>>>>>>> invalidate the TLBs but perhaps we need a cflush and
>>>>>>>>>> TLB
>>>>>>>>>> invalidation is the mechanism for that too?
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>>>     		err = xe_vm_invalidate_vma(vma);
>>>>>>>>>>>>     		XE_WARN_ON(err);
>>>>>>>>>>>>     	}
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.52.0
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Matt Roper
>>>>>>>>>>> Graphics Software Engineer
>>>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>>>> Intel Corporation
>>>>>>>>
>>>>>>>> --
>>>>>>>> Matt Roper
>>>>>>>> Graphics Software Engineer
>>>>>>>> Linux GPU Platform Enablement
>>>>>>>> Intel Corporation
>>>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 14:55                         ` Matthew Auld
@ 2026-02-16 15:38                           ` Thomas Hellström
  2026-02-16 16:41                             ` Matthew Auld
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Hellström @ 2026-02-16 15:38 UTC (permalink / raw)
  To: Matthew Auld, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On Mon, 2026-02-16 at 14:55 +0000, Matthew Auld wrote:
> On 16/02/2026 12:07, Thomas Hellström wrote:
> > On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
> > > On 16/02/2026 10:23, Thomas Hellström wrote:
> > > > On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
> > > > > On 13/02/2026 17:16, Matt Roper wrote:
> > > > > > On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose
> > > > > > wrote:
> > > > > > > On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > > > > Sent: 12 February 2026 02:41
> > > > > > > > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > > > > > > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> > > > > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > > > > <matthew.auld@intel.com>;
> > > > > > > > > thomas.hellstrom@linux.intel.com
> > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > > > > userptr/shrinker bo
> > > > > > > > > cachelines manually
> > > > > > > > > 
> > > > > > > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay,
> > > > > > > > > Tejas
> > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > > > > > Sent: 11 February 2026 05:32
> > > > > > > > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > > > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>;
> > > > > > > > > > > intel-
> > > > > > > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > > > > > > <matthew.auld@intel.com>;
> > > > > > > > > > > thomas.hellstrom@linux.intel.com
> > > > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > > > > > > userptr/shrinker bo
> > > > > > > > > > > cachelines manually
> > > > > > > > > > > 
> > > > > > > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt
> > > > > > > > > > > Roper
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas
> > > > > > > > > > > > Upadhyay
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > "eXtended Architecture" (XA) tagged
> > > > > > > > > > > > > memory—memory
> > > > > > > > > > > > > shared
> > > > > > > > > between
> > > > > > > > > > > the
> > > > > > > > > > > > > CPU and GPU
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm pretty sure this expansion of "XA" is
> > > > > > > > > > > > wrong;
> > > > > > > > > > > > where
> > > > > > > > > > > > are
> > > > > > > > > > > > you
> > > > > > > > > > > > seeing this definition?  Everything in the
> > > > > > > > > > > > bspec
> > > > > > > > > > > > indicates
> > > > > > > > > > > > that XA
> > > > > > > > > > > > means "wb
> > > > > > > > > > > > - transient app" (similar to how "XD" is 'wb -
> > > > > > > > > > > > transient
> > > > > > > > > > > > display").
> > > > > > > > > > > > I'm not sure why exactly they picked "X" to
> > > > > > > > > > > > refer
> > > > > > > > > > > > to
> > > > > > > > > > > > transient in
> > > > > > > > > > > > both of these cases, but I've never seen any
> > > > > > > > > > > > documentation
> > > > > > > > > > > > that
> > > > > > > > > > > > refers to it as "extended."
> > > > > > > > > > > > 
> > > > > > > > > > > > > is treated differently from other GPU memory
> > > > > > > > > > > > > when
> > > > > > > > > > > > > the
> > > > > > > > > > > > > Media
> > > > > > > > > > > > > engine is
> > > > > > > > > > > power-gated.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > XA is *always* flushed, like at the end-of-
> > > > > > > > > > > > > submission
> > > > > > > > > > > > > (and
> > > > > > > > > > > > > maybe
> > > > > > > > > > > > > other
> > > > > > > > > > > > 
> > > > > > > > > > > > I assume you're referring to the fact that the
> > > > > > > > > > > > driver
> > > > > > > > > > > > performs
> > > > > > > > > > > > flushes at the end of submission (via
> > > > > > > > > > > > PIPE_CONTROL
> > > > > > > > > > > > or
> > > > > > > > > > > > MI_FLUSH_DW), and that depending on other
> > > > > > > > > > > > state/optimizations
> > > > > > > > > > > > in
> > > > > > > > > > > > the system, those flushes may flush the entire
> > > > > > > > > > > > device
> > > > > > > > > > > > cache,
> > > > > > > > > > > > or
> > > > > > > > > > > > may only flush the subset of cache data that is
> > > > > > > > > > > > not
> > > > > > > > > > > > marked as
> > > > > > > > > > > > transient.  The way you worded this was
> > > > > > > > > > > > confusing
> > > > > > > > > > > > since
> > > > > > > > > > > > it
> > > > > > > > > > > > makes
> > > > > > > > > > > > it sound like cache flushes happen
> > > > > > > > > > > > automatically
> > > > > > > > > > > > somewhere in
> > > > > > > > > hardware/firmware.
> > > > > > > > > > > > 
> > > > > > > > > > > > > places), just that internally as an
> > > > > > > > > > > > > optimisation
> > > > > > > > > > > > > hw
> > > > > > > > > > > > > doesn't
> > > > > > > > > > > > > need
> > > > > > > > > > > > > to make that a full flush (which will also
> > > > > > > > > > > > > include
> > > > > > > > > > > > > XA) when
> > > > > > > > > > > > > Media is off/powergated, since it doesn't
> > > > > > > > > > > > > need to
> > > > > > > > > > > > > worry
> > > > > > > > > > > > > about GT
> > > > > > > > > > > > > caches vs Media coherency, and only CPU vs
> > > > > > > > > > > > > GPU
> > > > > > > > > > > > > coherency,
> > > > > > > > > > > > > so can
> > > > > > > > > > > > > make that flush a targeted XA flush, since
> > > > > > > > > > > > > stuff
> > > > > > > > > > > > > tagged
> > > > > > > > > > > > > with XA
> > > > > > > > > > > > > now means it's shared with the CPU. The main
> > > > > > > > > > > > > implication is
> > > > > > > > > > > > > that
> > > > > > > > > > > > > we now need to somehow flush non-XA before
> > > > > > > > > > > > > freeing
> > > > > > > > > > > > > system
> > > > > > > > > > > > > memory
> > > > > > > > > > > > > pages, otherwise dirty cachelines could be
> > > > > > > > > > > > > flushed
> > > > > > > > > > > > > after
> > > > > > > > > > > > > the
> > > > > > > > > > > > > free (like if Media suddenly turns on and
> > > > > > > > > > > > > does a
> > > > > > > > > > > > > full
> > > > > > > > > > > > > flush)
> > > > > > > > > > > > 
> > > > > > > > > > > > This description seems really confusing.  My
> > > > > > > > > > > > understanding is
> > > > > > > > > > > > that
> > > > > > > > > > > > marking something as wb-transient-app indicates
> > > > > > > > > > > > that it
> > > > > > > > > > > > might
> > > > > > > > > > > > be
> > > > > > > > > > > > accessed by something other than our
> > > > > > > > > > > > graphics/media
> > > > > > > > > > > > IP
> > > > > > > > > > > > (i.e.,
> > > > > > > > > > > > accessed from the CPU, exported to another
> > > > > > > > > > > > device,
> > > > > > > > > > > > etc.), so
> > > > > > > > > > > > transient data truly does need to be flushed at
> > > > > > > > > > > > the
> > > > > > > > > > > > points in
> > > > > > > > > > > > the
> > > > > > > > > > > > driver where a flush typically happens.
> > > > > > > > > > > > 
> > > > > > > > > > > > However when something is _not_ transient, then
> > > > > > > > > > > > either:
> > > > > > > > > > > >     - it's "private" to the GPU and only our
> > > > > > > > > > > > graphics/media IP
> > > > > > > > > > > > will be
> > > > > > > > > > > >       accessing it
> > > > > > > > > > > >     - it's bound with a coherent PAT index so
> > > > > > > > > > > > that
> > > > > > > > > > > > outside
> > > > > > > > > > > > observers like
> > > > > > > > > > > >       the CPU can snoop the device cache, even
> > > > > > > > > > > > when
> > > > > > > > > > > > the
> > > > > > > > > > > > cache
> > > > > > > > > > > > hasn't been
> > > > > > > > > > > >       flushed
> > > > > > > > > > > > 
> > > > > > > > > > > > If media is not active, then there's really no
> > > > > > > > > > > > need
> > > > > > > > > > > > to
> > > > > > > > > > > > include
> > > > > > > > > > > > non-transient data when an device cache flush
> > > > > > > > > > > > happens
> > > > > > > > > > > > since
> > > > > > > > > > > > there's no real need for the data to get to
> > > > > > > > > > > > RAM.
> > > > > > > > > > > > So
> > > > > > > > > > > > that
> > > > > > > > > > > > enables
> > > > > > > > > > > > an optimization (which comes in your next
> > > > > > > > > > > > patch),
> > > > > > > > > > > > that
> > > > > > > > > > > > allows
> > > > > > > > > > > > flushes to only operate on the subset of the
> > > > > > > > > > > > device
> > > > > > > > > > > > cache
> > > > > > > > > > > > tagged as
> > > > > > > > > "transient" if media is idle.
> > > > > > > > > > 
> > > > > > > > > > But what if we have stale non-XA-marked pages for a
> > > > > > > > > > userptr, that object moves out, and at the same time
> > > > > > > > > > media comes back? We will end up doing a full flush,
> > > > > > > > > > which flushes the stale entries to RAM.
> > > > > > > > > 
> > > > > > > > > What makes userptr special here?  During general,
> > > > > > > > > active
> > > > > > > > > usage,
> > > > > > > > > userptr would
> > > > > > > > > be data that's accessible by the CPU, so it needs to
> > > > > > > > > either
> > > > > > > > > be
> > > > > > > > > transient (so CPU
> > > > > > > > > can see the data in RAM after explicit flushes) or it
> > > > > > > > > needs
> > > > > > > > > to be
> > > > > > > > > using a
> > > > > > > > > coherent PAT (so that the CPU can just snoop the GPU
> > > > > > > > > cache).
> > > > > > > > > If
> > > > > > > > > you marked
> > > > > > > > > userptr as both non-XA and non-coherent, then that
> > > > > > > > > sounds
> > > > > > > > > likely to
> > > > > > > > > be a
> > > > > > > > > userspace bug (and probably something we can catch
> > > > > > > > > and
> > > > > > > > > reject
> > > > > > > > > as an
> > > > > > > > > invalid
> > > > > > > > > case on any Xe3p or later platforms that support
> > > > > > > > > this)
> > > > > > > > > since
> > > > > > > > > the
> > > > > > > > > CPU wouldn't
> > > > > > > > > have any reliable way of seeing GPU updates.
> > > > > > > > 
> > > > > > > > Right. FYI @Mrozek, Michal @Souza, Jose
> > > > > > > > For userptr, as explained above, it needs to use either
> > > > > > > > a coherent or an XA PAT index, or else KMD will reject
> > > > > > > > it as an invalid case.
> > > > > > > > 
> > > > > > > > 
> > > > > > > The coherency restriction is already in the uAPI:
> > > > > > > 
> > > > > > > "Note: For userptr and externally imported dma-buf the
> > > > > > > kernel
> > > > > > > expects
> > > > > > > either 1WAY or 2WAY for the @pat_index."
> > > > > > > 
> > > > > > > Using 1 way is enough as Xe KMD does a PIPE_CONTROL
> > > > > > > flushing
> > > > > > > GPU
> > > > > > > caches
> > > > > > > at the end of batch buffers.
> > > > > > 
> > > > > > But isn't that what we're discussing here?  1-way *won't*
> > > > > > necessarily be
> > > > > > enough anymore because PIPE_CONTROL instructions don't
> > > > > > flush
> > > > > > the
> > > > > > entire
> > > > > > cache anymore.  Whenever the GuC determines that media is
> > > > > > inactive
> > > > > > and
> > > > > > activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc.
> > > > > > change
> > > > > > behavior to only flush out the subset of data that was
> > > > > > marked
> > > > > > as
> > > > > > app-transient; anything not marked that way doesn't get
> > > > > > flushed
> > > > > > now.  So
> > > > > > there's a new requirement here that you ensure you're using
> > > > > > an
> > > > > > XA
> > > > > > PAT
> > > > > > index, or you switch to use 2-way coherency which will
> > > > > > allow
> > > > > > the
> > > > > > CPU to
> > > > > > snoop the GPU's caches.
> > > > > 
> > > > > That exactly matches my understanding also.
> > > > 
> > > > This only ever affects IGFX, right? Since AFAIU we don't have
> > > > 2-way
> > > > coherency with DGFX?
> > > 
> > > Yeah, this should be igpu only. I seem to also recall that on
> > > dgpu,
> > > Media is coherent with l2/l3, but also I don't think system
> > > memory
> > > can
> > > be cached in l2/l3 (only VRAM), which I assume is why there is
> > > the
> > > special SMRO (system-memory-read-only) cache only on dgpu, which
> > > is
> > > flushed when the fence signals, unlike the l2/l3.
> > 
> > Yes that sounds reasonable.
> > 
> > > 
> > > > 
> > > > It sounds like the same PAT restriction is needed also for
> > > > imported
> > > > dma-buf, right?
> > > 
> > > Good point. Looks like we are missing that still. Otherwise we
> > > can
> > > run
> > > into the same issues with stale l2/l3/ppc.
> > 
> > So if this affects only system memory could we instead of relying
> > on 2-
> > way coherency or XA, just flush at dma unmap time, because that's
> > typically just before releasing the pages.
> 
> Yeah, I think we could make it work, from security pov, similar to 
> userptr, with the right manual flushes in KMD. Maybe just a question
> if 
> userspace wants such a model? Anything cached in l2/l3 might require 
> manual flushing by userspace (if that is even possible)?

So that would mean if user-space wants gpu-cpu coherency at fence
synchronization points, they'd have to use either 2-way or XA pat
indices, but not enforced by KMD.

For imported dma-buf kernel requires 2-way or XA for security due to
the relaxed dma-buf unmap.

For SVM/System allocator we'd require 2-way or XA.

Otherwise KMD security is enforced by flush at dma-unmap time?

/Thomas
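
Put as a (hypothetical) acceptance check, the rule sketched above
would look something like this; the enum values and helper names are
illustrative, not the real xe_pat tables:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the PAT acceptance rule being proposed in this thread
 * for Xe3p+ integrated parts, where flushes may skip non-transient
 * (non-XA) cachelines while media is powergated.  CPU-shared
 * mappings (userptr, imported dma-buf, SVM) must either be
 * CPU-snoopable (2-way) or tagged XA so they are flushed at sync
 * points.
 */
enum toy_coh_mode { TOY_COH_NONE, TOY_COH_1WAY, TOY_COH_2WAY };

struct toy_pat_entry {
	enum toy_coh_mode coh;
	bool transient_app;	/* "XA" tagged */
};

static bool toy_pat_ok_for_cpu_shared(const struct toy_pat_entry *e)
{
	return e->coh == TOY_COH_2WAY || e->transient_app;
}
```

Note 1-way alone fails the check, matching Matt's point that 1-way is
no longer sufficient once flushes stop covering non-XA lines.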

> 
> > 
> > The exception, though, is dma-buf where the exporter can actually
> > release memory before all importers have given up their dma-
> > mappings.
> > 
> > /Thomas
> > 
> > > 
> > > > 
> > > > /Thomas
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > If something happens that changes the GTT mapping of
> > > > > > > > > an
> > > > > > > > > object,
> > > > > > > > > then
> > > > > > > > > doesn't that already trigger a TLB invalidation when
> > > > > > > > > necessary in
> > > > > > > > > the driver
> > > > > > > > > today?  It was my understanding that "heavy" TLB
> > > > > > > > > invalidations wait
> > > > > > > > > for data
> > > > > > > > > values to be globally observable before starting, so
> > > > > > > > > I
> > > > > > > > > think
> > > > > > > > > that
> > > > > > > > > would ensure
> > > > > > > > > that any non-XA data makes it to RAM before any
> > > > > > > > > binding
> > > > > > > > > changes,
> > > > > > > > > object,
> > > > > > > > > destruction, etc.?  Is there something special about
> > > > > > > > > userptr
> > > > > > > > > that
> > > > > > > > > makes that
> > > > > > > > > case more of a problem?
> > > > > > > > > 
> > > > > > > > > I just found bspec page 74635 which gives an overview
> > > > > > > > > of
> > > > > > > > > the
> > > > > > > > > various flush
> > > > > > > > > and invalidate cases, and I don't see anything there
> > > > > > > > > that
> > > > > > > > > makes it
> > > > > > > > > obvious to
> > > > > > > > > me that userptr would be special.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > As you said, we eventually do want to force a
> > > > > > > > > > > > flush
> > > > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > non-transient data as well once we're freeing
> > > > > > > > > > > > the
> > > > > > > > > > > > underlying
> > > > > > > > > > > > pages.
> > > > > > > > > > > > So how do we do that?  It's not clear to me how
> > > > > > > > > > > > the
> > > > > > > > > > > > changes
> > > > > > > > > > > > below
> > > > > > > > > > > > are accomplishing that.  Is there a way to
> > > > > > > > > > > > explicitly
> > > > > > > > > > > > request
> > > > > > > > > > > > a
> > > > > > > > > > > > full device cache flush (ignoring the transient
> > > > > > > > > > > > vs
> > > > > > > > > > > > non-
> > > > > > > > > > > > transient tagging)?
> > > > > > > > > > > > Since the GuC handles the optimization in the
> > > > > > > > > > > > next
> > > > > > > > > > > > patch
> > > > > > > > > > > > (toggling
> > > > > > > > > > > > whether flushes are full flushes vs non-
> > > > > > > > > > > > transient
> > > > > > > > > > > > flushes
> > > > > > > > > > > > depending on whether media is active), I
> > > > > > > > > > > > thought
> > > > > > > > > > > > there
> > > > > > > > > > > > might
> > > > > > > > > > > > be
> > > > > > > > > > > > some kind of GuC interface to request "please
> > > > > > > > > > > > do
> > > > > > > > > > > > one
> > > > > > > > > > > > full
> > > > > > > > > > > > flush now, even
> > > > > > > > > if media is idle."
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > I’m not an expert here by any means, but
> > > > > > > > > > > everything
> > > > > > > > > > > above
> > > > > > > > > > > from
> > > > > > > > > > > Matt
> > > > > > > > > > > seems like valid concerns. Thomas also raised
> > > > > > > > > > > some
> > > > > > > > > > > concerns in
> > > > > > > > > > > the
> > > > > > > > > > > two previous revisions; again I’m not an expert,
> > > > > > > > > > > but
> > > > > > > > > > > reading
> > > > > > > > > > > through
> > > > > > > > > > > those, it doesn’t really seem like he received
> > > > > > > > > > > proper
> > > > > > > > > > > answers
> > > > > > > > > > > to his
> > > > > > > > > questions.
> > > > > > > > > > 
> > > > > > > > > > It's forcing a flush via the TLB invalidation PPC
> > > > > > > > > > flag under xe_invalidate_vma().
> > > > > > > > > 
> > > > > > > > > By the way, what is "PPC?"  It seems like it's
> > > > > > > > > another
> > > > > > > > > new
> > > > > > > > > synonym
> > > > > > > > > for the
> > > > > > > > > device cache?  It's already really confusing that
> > > > > > > > > some of
> > > > > > > > > our
> > > > > > > > > hardware docs use
> > > > > > > > > a mix of both "L2" and "L3" to refer to the same
> > > > > > > > > device
> > > > > > > > > cache
> > > > > > > > > for
> > > > > > > > > historical
> > > > > > > > > reasons...
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > A couple of comments below.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Matt
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > V2(MattA): Expand commit description
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Tejas Upadhyay
> > > > > > > > > > > > > <tejas.upadhyay@intel.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > > > > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > > > > > > > > 
> > > > > > > > > > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > > > > > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > > > > > > > -			continue;
> > > > > > > > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > > > > > > > +				continue;
> > > > > > > > 
> > > > > > > > Matt R,
> > > > > > > > This flush will be still needed as there can be non-xa
> > > > > > > > buffers
> > > > > > > > which
> > > > > > > > can be evicted while media was off and stale entries
> > > > > > > > can be
> > > > > > > > flushed
> > > > > > > > when media comes back on. Which was not case earlier as
> > > > > > > > full
> > > > > > > > flush
> > > > > > > > was happening at regular sync points and that’s where
> > > > > > > > this
> > > > > > > > feature is
> > > > > > > > bringing optimization now.
> > > > > > > > 
> > > > > > > > Tejas
> > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > This will trigger a TLB invalidation (and I
> > > > > > > > > > > assume a
> > > > > > > > > > > cache
> > > > > > > > > > > flush)
> > > > > > > > > > > every time we move or free memory in the 3D stack
> > > > > > > > > > > if
> > > > > > > > > > > it
> > > > > > > > > > > has a
> > > > > > > > > > > binding. It also performs a synchronous wait on
> > > > > > > > > > > the
> > > > > > > > > > > BO
> > > > > > > > > > > being
> > > > > > > > > > > idle.
> > > > > > > > > > > Both of these are very expensive operations. I
> > > > > > > > > > > can’t
> > > > > > > > > > > imagine
> > > > > > > > > > > the
> > > > > > > > > > > granularity we want here is to do this on every
> > > > > > > > > > > move/free
> > > > > > > > > > > with
> > > > > > > > > > > bindings.
> > > > > > > > > > > 
> > > > > > > > > > > Also, for LR compute with preempt fences, we
> > > > > > > > > > > would
> > > > > > > > > > > trigger the
> > > > > > > > > > > preempt fences during the wait, so a TLB
> > > > > > > > > > > invalidation
> > > > > > > > > > > after
> > > > > > > > > > > this
> > > > > > > > > > > seems unnecessary, though perhaps the cache flush
> > > > > > > > > > > is
> > > > > > > > > > > still
> > > > > > > > > > > required?
> > > > > > > > > > > 
> > > > > > > > > > > I think this needs a bit more explanation,
> > > > > > > > > > > because
> > > > > > > > > > > without
> > > > > > > > > > > knowing a
> > > > > > > > > > > lot about the exact requirements, the
> > > > > > > > > > > implementation
> > > > > > > > > > > does
> > > > > > > > > > > not
> > > > > > > > > > > look
> > > > > > > > > correct.
> > > > > > > > > > 
> > > > > > > > > > The thing is that we are trying to solve problem
> > > > > > > > > > with
> > > > > > > > > > userptr
> > > > > > > > > > with non-XA
> > > > > > > > > pat, consider if that BO got moved while media is not
> > > > > > > > > active.
> > > > > > > > > As
> > > > > > > > > soon as media
> > > > > > > > > will come back active, stale cached entries of that
> > > > > > > > > object
> > > > > > > > > will be
> > > > > > > > > flushed as part
> > > > > > > > > of full flush, which may corrupt things.
> > > > > > > > > > There was thinking that with this patch we would at
> > > > > > > > > > least
> > > > > > > > > > solve
> > > > > > > > > > the problem
> > > > > > > > > of corruption and later when page_reclamation feature
> > > > > > > > > comes
> > > > > > > > > in will
> > > > > > > > > help in
> > > > > > > > > performance as well. But now when page reclamation
> > > > > > > > > feature is
> > > > > > > > > merged earlier
> > > > > > > > > and it tightly coupled with bind/unbind some cases
> > > > > > > > > like
> > > > > > > > > discussed
> > > > > > > > > above
> > > > > > > > > (which are not doing unbind immediately on move/free)
> > > > > > > > > are
> > > > > > > > > missed in
> > > > > > > > > reclamation.
> > > > > > > > > > 
> > > > > > > > > > So thought was to let this solution go in with
> > > > > > > > > > little
> > > > > > > > > > perf
> > > > > > > > > > hit
> > > > > > > > > > and discuss with
> > > > > > > > > page reclamation owner to come with cleaner solution
> > > > > > > > > together.
> > > > > > > > > > 
> > > > > > > > > > Tejas
> > > > > > > > > > > 
> > > > > > > > > > > > >     		}
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     		if (!idle) {
> > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > > > > > > > >     	}
> > > > > > > > > > > > >     }
> > > > > > > > > > > > > 
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > > > > > > > > + * @xe: The device to check.
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * Return: true if the device needs a cache flush, false otherwise.
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +	/* XA is *always* flushed, like at the end-of-submission (and maybe other
> > > > > > > > > > > > > +	 * places), just that internally as an optimisation hw doesn't need to make
> > > > > > > > > > > > > +	 * that a full flush (which will also include XA) when Media is
> > > > > > > > > > > > > +	 * off/powergated, since it doesn't need to worry about GT caches vs Media
> > > > > > > > > > > > > +	 * coherency, and only CPU vs GPU coherency, so can make that flush a
> > > > > > > > > > > > > +	 * targeted XA flush, since stuff tagged with XA now means it's shared with
> > > > > > > > > > > > > +	 * the CPU. The main implication is that we now need to somehow flush non-XA
> > > > > > > > > > > > > +	 * before freeing system memory pages, otherwise dirty cachelines could be
> > > > > > > > > > > > > +	 * flushed after the free (like if Media suddenly turns on and does a full
> > > > > > > > > > > > > +	 * flush)
> > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > > > > > > > +		return true;
> > > > > > > > > > > > > +	return false;
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > > > >     void xe_device_l2_flush(struct xe_device
> > > > > > > > > > > > > *xe)
> > > > > > > > > > > > > {
> > > > > > > > > > > > >     	struct xe_gt *gt;
> > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > > > > > > > > >     u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > > > > >     u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > > > > > > > > >     void xe_device_td_flush(struct xe_device *xe);
> > > > > > > > > > > > >     void xe_device_l2_flush(struct xe_device *xe);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > > > > > > > > >     				    false, MAX_SCHEDULE_TIMEOUT);
> > > > > > > > > > > > >     	XE_WARN_ON(err <= 0);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > > > > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > > > > > > > > +	    userptr->initial_bind) {
> > > > > > > > > > > 
> > > > > > > > > > > Same concern with the LR preempt fence as above —
> > > > > > > > > > > the
> > > > > > > > > > > hardware
> > > > > > > > > > > will
> > > > > > > > > > > be interrupted via preempt fences, so it doesn’t
> > > > > > > > > > > seem
> > > > > > > > > > > necessary
> > > > > > > > > > > to
> > > > > > > > > > > invalidate the TLBs but perhaps we need a cflush
> > > > > > > > > > > and
> > > > > > > > > > > TLB
> > > > > > > > > > > invalidation is the mechanism for that too?
> > > > > > > > > > > 
> > > > > > > > > > > Matt
> > > > > > > > > > > 
> > > > > > > > > > > > >     		err =
> > > > > > > > > > > > > xe_vm_invalidate_vma(vma);
> > > > > > > > > > > > >     		XE_WARN_ON(err);
> > > > > > > > > > > > >     	}
> > > > > > > > > > > > > --
> > > > > > > > > > > > > 2.52.0
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > --
> > > > > > > > > > > > Matt Roper
> > > > > > > > > > > > Graphics Software Engineer
> > > > > > > > > > > > Linux GPU Platform Enablement
> > > > > > > > > > > > Intel Corporation
> > > > > > > > > 
> > > > > > > > > --
> > > > > > > > > Matt Roper
> > > > > > > > > Graphics Software Engineer
> > > > > > > > > Linux GPU Platform Enablement
> > > > > > > > > Intel Corporation
> > > > > > 


* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 15:38                           ` Thomas Hellström
@ 2026-02-16 16:41                             ` Matthew Auld
  2026-02-17  6:19                               ` Upadhyay, Tejas
  2026-02-17 17:04                               ` Thomas Hellström
  0 siblings, 2 replies; 39+ messages in thread
From: Matthew Auld @ 2026-02-16 16:41 UTC (permalink / raw)
  To: Thomas Hellström, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On 16/02/2026 15:38, Thomas Hellström wrote:
> On Mon, 2026-02-16 at 14:55 +0000, Matthew Auld wrote:
>> On 16/02/2026 12:07, Thomas Hellström wrote:
>>> On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
>>>> On 16/02/2026 10:23, Thomas Hellström wrote:
>>>>> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
>>>>>> On 13/02/2026 17:16, Matt Roper wrote:
>>>>>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose
>>>>>>> wrote:
>>>>>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>> Sent: 12 February 2026 02:41
>>>>>>>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
>>>>>>>>>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
>>>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>>>>>> <matthew.auld@intel.com>;
>>>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>>>>>> userptr/shrinker bo
>>>>>>>>>> cachelines manually
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay,
>>>>>>>>>> Tejas
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>>>>>>>>>> Sent: 11 February 2026 05:32
>>>>>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>;
>>>>>>>>>>>> intel-
>>>>>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
>>>>>>>>>>>> <matthew.auld@intel.com>;
>>>>>>>>>>>> thomas.hellstrom@linux.intel.com
>>>>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
>>>>>>>>>>>> userptr/shrinker bo
>>>>>>>>>>>> cachelines manually
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt
>>>>>>>>>>>> Roper
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas
>>>>>>>>>>>>> Upadhyay
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory shared between
>>>>>>>>>>>>>> the CPU and GPU
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm pretty sure this expansion of "XA" is
>>>>>>>>>>>>> wrong;
>>>>>>>>>>>>> where
>>>>>>>>>>>>> are
>>>>>>>>>>>>> you
>>>>>>>>>>>>> seeing this definition?  Everything in the
>>>>>>>>>>>>> bspec
>>>>>>>>>>>>> indicates
>>>>>>>>>>>>> that XA
>>>>>>>>>>>>> means "wb
>>>>>>>>>>>>> - transient app" (similar to how "XD" is 'wb -
>>>>>>>>>>>>> transient
>>>>>>>>>>>>> display").
>>>>>>>>>>>>> I'm not sure why exactly they picked "X" to
>>>>>>>>>>>>> refer
>>>>>>>>>>>>> to
>>>>>>>>>>>>> transient in
>>>>>>>>>>>>> both of these cases, but I've never seen any
>>>>>>>>>>>>> documentation
>>>>>>>>>>>>> that
>>>>>>>>>>>>> refers to it as "extended."
>>>>>>>>>>>>>
>>>>>>>>>>>>>> is treated differently from other GPU memory
>>>>>>>>>>>>>> when
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> Media
>>>>>>>>>>>>>> engine is
>>>>>>>>>>>> power-gated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> XA is *always* flushed, like at the end-of-submission (and maybe other
>>>>>>>>>>>>>
>>>>>>>>>>>>> I assume you're referring to the fact that the
>>>>>>>>>>>>> driver
>>>>>>>>>>>>> performs
>>>>>>>>>>>>> flushes at the end of submission (via
>>>>>>>>>>>>> PIPE_CONTROL
>>>>>>>>>>>>> or
>>>>>>>>>>>>> MI_FLUSH_DW), and that depending on other
>>>>>>>>>>>>> state/optimizations
>>>>>>>>>>>>> in
>>>>>>>>>>>>> the system, those flushes may flush the entire
>>>>>>>>>>>>> device
>>>>>>>>>>>>> cache,
>>>>>>>>>>>>> or
>>>>>>>>>>>>> may only flush the subset of cache data that is
>>>>>>>>>>>>> not
>>>>>>>>>>>>> marked as
>>>>>>>>>>>>> transient.  The way you worded this was
>>>>>>>>>>>>> confusing
>>>>>>>>>>>>> since
>>>>>>>>>>>>> it
>>>>>>>>>>>>> makes
>>>>>>>>>>>>> it sound like cache flushes happen
>>>>>>>>>>>>> automatically
>>>>>>>>>>>>> somewhere in
>>>>>>>>>> hardware/firmware.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> places), just that internally as an optimisation hw doesn't need
>>>>>>>>>>>>>> to make that a full flush (which will also include XA) when
>>>>>>>>>>>>>> Media is off/powergated, since it doesn't need to worry about GT
>>>>>>>>>>>>>> caches vs Media coherency, and only CPU vs GPU coherency, so can
>>>>>>>>>>>>>> make that flush a targeted XA flush, since stuff tagged with XA
>>>>>>>>>>>>>> now means it's shared with the CPU. The main implication is that
>>>>>>>>>>>>>> we now need to somehow flush non-XA before freeing system memory
>>>>>>>>>>>>>> pages, otherwise dirty cachelines could be flushed after the
>>>>>>>>>>>>>> free (like if Media suddenly turns on and does a full flush)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This description seems really confusing.  My
>>>>>>>>>>>>> understanding is
>>>>>>>>>>>>> that
>>>>>>>>>>>>> marking something as wb-transient-app indicates
>>>>>>>>>>>>> that it
>>>>>>>>>>>>> might
>>>>>>>>>>>>> be
>>>>>>>>>>>>> accessed by something other than our
>>>>>>>>>>>>> graphics/media
>>>>>>>>>>>>> IP
>>>>>>>>>>>>> (i.e.,
>>>>>>>>>>>>> accessed from the CPU, exported to another
>>>>>>>>>>>>> device,
>>>>>>>>>>>>> etc.), so
>>>>>>>>>>>>> transient data truly does need to be flushed at
>>>>>>>>>>>>> the
>>>>>>>>>>>>> points in
>>>>>>>>>>>>> the
>>>>>>>>>>>>> driver where a flush typically happens.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However when something is _not_ transient, then
>>>>>>>>>>>>> either:
>>>>>>>>>>>>>      - it's "private" to the GPU and only our
>>>>>>>>>>>>> graphics/media IP
>>>>>>>>>>>>> will be
>>>>>>>>>>>>>        accessing it
>>>>>>>>>>>>>      - it's bound with a coherent PAT index so
>>>>>>>>>>>>> that
>>>>>>>>>>>>> outside
>>>>>>>>>>>>> observers like
>>>>>>>>>>>>>        the CPU can snoop the device cache, even
>>>>>>>>>>>>> when
>>>>>>>>>>>>> the
>>>>>>>>>>>>> cache
>>>>>>>>>>>>> hasn't been
>>>>>>>>>>>>>        flushed
>>>>>>>>>>>>>
>>>>>>>>>>>>> If media is not active, then there's really no
>>>>>>>>>>>>> need
>>>>>>>>>>>>> to
>>>>>>>>>>>>> include
>>>>>>>>>>>>> non-transient data when a device cache flush
>>>>>>>>>>>>> happens
>>>>>>>>>>>>> since
>>>>>>>>>>>>> there's no real need for the data to get to
>>>>>>>>>>>>> RAM.
>>>>>>>>>>>>> So
>>>>>>>>>>>>> that
>>>>>>>>>>>>> enables
>>>>>>>>>>>>> an optimization (which comes in your next
>>>>>>>>>>>>> patch),
>>>>>>>>>>>>> that
>>>>>>>>>>>>> allows
>>>>>>>>>>>>> flushes to only operate on the subset of the
>>>>>>>>>>>>> device
>>>>>>>>>>>>> cache
>>>>>>>>>>>>> tagged as
>>>>>>>>>> "transient" if media is idle.
>>>>>>>>>>>
>>>>>>>>>>> But what If we have stale non-XA marked pages for
>>>>>>>>>>> userptr,
>>>>>>>>>>> and
>>>>>>>>>>> that
>>>>>>>>>>> object moves out and at the same time media comes
>>>>>>>>>>> back,
>>>>>>>>>>> will end
>>>>>>>>>>> up in
>>>>>>>>>>> full flush and flush the stale entry to RAM.
>>>>>>>>>>
>>>>>>>>>> What makes userptr special here?  During general,
>>>>>>>>>> active
>>>>>>>>>> usage,
>>>>>>>>>> userptr would
>>>>>>>>>> be data that's accessible by the CPU, so it needs to
>>>>>>>>>> either
>>>>>>>>>> be
>>>>>>>>>> transient (so CPU
>>>>>>>>>> can see the data in RAM after explicit flushes) or it
>>>>>>>>>> needs
>>>>>>>>>> to be
>>>>>>>>>> using a
>>>>>>>>>> coherent PAT (so that the CPU can just snoop the GPU
>>>>>>>>>> cache).
>>>>>>>>>> If
>>>>>>>>>> you marked
>>>>>>>>>> userptr as both non-XA and non-coherent, then that
>>>>>>>>>> sounds
>>>>>>>>>> likely to
>>>>>>>>>> be a
>>>>>>>>>> userspace bug (and probably something we can catch
>>>>>>>>>> and
>>>>>>>>>> reject
>>>>>>>>>> as an
>>>>>>>>>> invalid
>>>>>>>>>> case on any Xe3p or later platforms that support
>>>>>>>>>> this)
>>>>>>>>>> since
>>>>>>>>>> the
>>>>>>>>>> CPU wouldn't
>>>>>>>>>> have any reliable way of seeing GPU updates.
>>>>>>>>>
>>>>>>>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>>>>>>>> For userptr, as explained above, it needs to be either
>>>>>>>>> coherent
>>>>>>>>> or XA
>>>>>>>>> pat index, or else KMD will reject as invalid case.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> The coherency restriction is already in the uAPI:
>>>>>>>>
>>>>>>>> "Note: For userptr and externally imported dma-buf the
>>>>>>>> kernel
>>>>>>>> expects
>>>>>>>> either 1WAY or 2WAY for the @pat_index."
>>>>>>>>
>>>>>>>> Using 1 way is enough as Xe KMD does a PIPE_CONTROL
>>>>>>>> flushing
>>>>>>>> GPU
>>>>>>>> caches
>>>>>>>> at the end of batch buffers.
>>>>>>>
>>>>>>> But isn't that what we're discussing here?  1-way *won't*
>>>>>>> necessarily be
>>>>>>> enough anymore because PIPE_CONTROL instructions don't
>>>>>>> flush
>>>>>>> the
>>>>>>> entire
>>>>>>> cache anymore.  Whenever the GuC determines that media is
>>>>>>> inactive
>>>>>>> and
>>>>>>> activates the optimization, PIPE_CONTROL, MI_FLUSH_DW, etc.
>>>>>>> change
>>>>>>> behavior to only flush out the subset of data that was
>>>>>>> marked
>>>>>>> as
>>>>>>> app-transient; anything not marked that way doesn't get
>>>>>>> flushed
>>>>>>> now.  So
>>>>>>> there's a new requirement here that you ensure you're using
>>>>>>> an
>>>>>>> XA
>>>>>>> PAT
>>>>>>> index, or you switch to use 2-way coherency which will
>>>>>>> allow
>>>>>>> the
>>>>>>> CPU to
>>>>>>> snoop the GPU's caches.
>>>>>>
>>>>>> That exactly matches my understanding also.
>>>>>
>>>>> This only ever affects IGFX, right? Since AFAIU we don't have
>>>>> 2-way
>>>>> coherency with DGFX?
>>>>
>>>> Yeah, this should be igpu only. I seem to also recall that on
>>>> dgpu,
>>>> Media is coherent with l2/l3, but also I don't think system
>>>> memory
>>>> can
>>>> be cached in l2/l3 (only VRAM), which I assume is why there is
>>>> the
>>>> special SMRO (system-memory-read-only) cache only on dgpu, which
>>>> is
>>>> flushed when the fence signals, unlike the l2/l3.
>>>
>>> Yes that sounds reasonable.
>>>
>>>>
>>>>>
>>>>> It sounds like the same PAT restriction is needed also for
>>>>> imported
>>>>> dma-buf, right?
>>>>
>>>> Good point. Looks like we are missing that still. Otherwise we
>>>> can
>>>> run
>>>> into the same issues with stale l2/l3/ppc.
>>>
>>> So if this affects only system memory could we instead of relying
>>> on 2-
>>> way coherency or XA, just flush at dma unmap time, because that's
>>> typically just before releasing the pages.
>>
>> Yeah, I think we could make it work, from security pov, similar to
>> userptr, with the right manual flushes in KMD. Maybe just a question
>> if
>> userspace wants such a model? Anything cached in l2/l3 might require
>> manual flushing by userspace (if that is even possible)?
> 
> So that would mean if user-space wants gpu-cpu coherency at fence
> synchronization points, they'd have to use either 2-way or XA pat
> indices, but not enforced by KMD.

Yeah, looking at BSpec 74635 (Media off case), I'm only really seeing 
MEM_SET which userspace could potentially use by itself? But then it's 
unclear if they mean to actually clear-the-memory (which is not what we 
want) or using the special evict mode, but that seems to be talking more 
about flushing to local memory, so not completely sure what that does on 
igpu. If it's the evict mode then should in theory be possible for 
userspace to do a manual flush, but that would have to be done per-bo/vma?
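
The Media-off flush behaviour described in this sub-thread can be modelled
with a tiny stand-alone user-space C sketch (purely illustrative; all struct
and function names are invented here, and this is not Xe driver code): at a
sync point, a full flush writes back every dirty line while media is active,
but only the XA-tagged ("wb - transient app") lines while media is
powergated, so dirty non-XA lines linger in the cache.

```c
#include <assert.h>
#include <stdbool.h>

#define NLINES 2

/* Hypothetical model of one device-cache line; XA == "wb - transient app". */
struct cline {
	bool xa;    /* tagged as transient-app (XA) */
	bool dirty; /* modified since the last writeback */
	int data;
};

struct cache_model {
	struct cline line[NLINES];
	int ram[NLINES]; /* backing system memory, one word per line */
};

/*
 * Sync-point flush as discussed above: a full flush while media is
 * active, but only a targeted XA flush while media is powergated.
 */
static void sync_point_flush(struct cache_model *c, bool media_active)
{
	for (int i = 0; i < NLINES; i++) {
		struct cline *l = &c->line[i];

		if (l->dirty && (media_active || l->xa)) {
			c->ram[i] = l->data; /* writeback to RAM */
			l->dirty = false;
		}
	}
}
```

The deferred writeback of the non-XA line in this model is exactly the
"Media suddenly turns on and does a full flush" case: if the backing pages
were freed in between, the late writeback would land in memory that now
belongs to someone else.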

> 
> For imported dma-buf kernel requires 2-way or XA for security due to
> the relaxed dma-buf unmap.
> 
> For SVM/System allocator we'd require 2-way or XA.
> 
> Otherwise KMD security is enforced by flush at dma-unmap time?

Yeah, that is my understanding. Otherwise I don't currently see what 
prevents the dirty non-XA cache lines being flushed at some random point 
later, after we have already freed the corresponding system memory, 
potentially nuking the next user who allocates those pages.
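
The policy this sub-thread seems to be converging on could be summarised as
a validity check, sketched below (enum and function names are invented for
illustration only; the real uAPI/KMD checks may differ): mappings whose
pages can be released outside KMD control (userptr, imported dma-buf,
SVM/system allocator) would need a 2-way-coherent or XA PAT, while ordinary
BO mappings could instead rely on a KMD flush at dma-unmap time, before the
pages are freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names -- these just encode the rules discussed above. */
enum pat_kind { PAT_NONCOH, PAT_1WAY, PAT_2WAY, PAT_XA };
enum mapping_kind { MAP_BO, MAP_USERPTR, MAP_DMABUF_IMPORT, MAP_SVM };

/*
 * Mappings whose pages can be released without the KMD having a chance
 * to flush must be CPU-visible either via snooping (2-way coherent) or
 * via the always-flushed XA path.
 */
static bool pat_ok_for(enum mapping_kind map, enum pat_kind pat)
{
	switch (map) {
	case MAP_USERPTR:
	case MAP_DMABUF_IMPORT:
	case MAP_SVM:
		return pat == PAT_2WAY || pat == PAT_XA;
	default:
		/*
		 * Ordinary BOs: the KMD can flush at dma-unmap time before
		 * the pages are freed, so any PAT is acceptable here.
		 */
		return true;
	}
}
```

Note that under the new optimisation even 1-way coherency is no longer
sufficient for the first group, since sync-point flushes stop covering
non-XA lines while media is powergated.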

> 
> /Thomas
> 
>>
>>>
>>> The exception, though, is dma-buf where the exporter can actually
>>> release memory before all importers have given up their dma-
>>> mappings.
>>>
>>> /Thomas
>>>
>>>>
>>>>>
>>>>> /Thomas
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If something happens that changes the GTT mapping of
>>>>>>>>>> an
>>>>>>>>>> object,
>>>>>>>>>> then
>>>>>>>>>> doesn't that already trigger a TLB invalidation when
>>>>>>>>>> necessary in
>>>>>>>>>> the driver
>>>>>>>>>> today?  It was my understanding that "heavy" TLB
>>>>>>>>>> invalidations wait
>>>>>>>>>> for data
>>>>>>>>>> values to be globally observable before starting, so
>>>>>>>>>> I
>>>>>>>>>> think
>>>>>>>>>> that
>>>>>>>>>> would ensure
>>>>>>>>>> that any non-XA data makes it to RAM before any
>>>>>>>>>> binding
>>>>>>>>>> changes,
>>>>>>>>>> object,
>>>>>>>>>> destruction, etc.?  Is there something special about
>>>>>>>>>> userptr
>>>>>>>>>> that
>>>>>>>>>> makes that
>>>>>>>>>> case more of a problem?
>>>>>>>>>>
>>>>>>>>>> I just found bspec page 74635 which gives an overview
>>>>>>>>>> of
>>>>>>>>>> the
>>>>>>>>>> various flush
>>>>>>>>>> and invalidate cases, and I don't see anything there
>>>>>>>>>> that
>>>>>>>>>> makes it
>>>>>>>>>> obvious to
>>>>>>>>>> me that userptr would be special.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you said, we eventually do want to force a
>>>>>>>>>>>>> flush
>>>>>>>>>>>>> of
>>>>>>>>>>>>> the
>>>>>>>>>>>>> non-transient data as well once we're freeing
>>>>>>>>>>>>> the
>>>>>>>>>>>>> underlying
>>>>>>>>>>>>> pages.
>>>>>>>>>>>>> So how do we do that?  It's not clear to me how
>>>>>>>>>>>>> the
>>>>>>>>>>>>> changes
>>>>>>>>>>>>> below
>>>>>>>>>>>>> are accomplishing that.  Is there a way to
>>>>>>>>>>>>> explicitly
>>>>>>>>>>>>> request
>>>>>>>>>>>>> a
>>>>>>>>>>>>> full device cache flush (ignoring the transient
>>>>>>>>>>>>> vs
>>>>>>>>>>>>> non-
>>>>>>>>>>>>> transient tagging)?
>>>>>>>>>>>>> Since the GuC handles the optimization in the
>>>>>>>>>>>>> next
>>>>>>>>>>>>> patch
>>>>>>>>>>>>> (toggling
>>>>>>>>>>>>> whether flushes are full flushes vs non-
>>>>>>>>>>>>> transient
>>>>>>>>>>>>> flushes
>>>>>>>>>>>>> depending on whether media is active), I
>>>>>>>>>>>>> thought
>>>>>>>>>>>>> there
>>>>>>>>>>>>> might
>>>>>>>>>>>>> be
>>>>>>>>>>>>> some kind of GuC interface to request "please
>>>>>>>>>>>>> do
>>>>>>>>>>>>> one
>>>>>>>>>>>>> full
>>>>>>>>>>>>> flush now, even
>>>>>>>>>> if media is idle."
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I’m not an expert here by any means, but
>>>>>>>>>>>> everything
>>>>>>>>>>>> above
>>>>>>>>>>>> from
>>>>>>>>>>>> Matt
>>>>>>>>>>>> seems like valid concerns. Thomas also raised
>>>>>>>>>>>> some
>>>>>>>>>>>> concerns in
>>>>>>>>>>>> the
>>>>>>>>>>>> two previous revisions; again I’m not an expert,
>>>>>>>>>>>> but
>>>>>>>>>>>> reading
>>>>>>>>>>>> through
>>>>>>>>>>>> those, it doesn’t really seem like he received
>>>>>>>>>>>> proper
>>>>>>>>>>>> answers
>>>>>>>>>>>> to his
>>>>>>>>>> questions.
>>>>>>>>>>>
>>>>>>>>>>> It's forcing a flush via the TLB invalidation PPC flag under
>>>>>>>>>>> xe_invalidate_vma().
>>>>>>>>>>
>>>>>>>>>> By the way, what is "PPC?"  It seems like it's
>>>>>>>>>> another
>>>>>>>>>> new
>>>>>>>>>> synonym
>>>>>>>>>> for the
>>>>>>>>>> device cache?  It's already really confusing that
>>>>>>>>>> some of
>>>>>>>>>> our
>>>>>>>>>> hardware docs use
>>>>>>>>>> a mix of both "L2" and "L3" to refer to the same
>>>>>>>>>> device
>>>>>>>>>> cache
>>>>>>>>>> for
>>>>>>>>>> historical
>>>>>>>>>> reasons...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> A couple of comments below.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Matt
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Tejas Upadhyay
>>>>>>>>>>>>>> <tejas.upadhyay@intel.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>      drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>>>>>>>      drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>>>>>>>>>>>      drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>>>>>>>>      drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>>>>>>>      4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>>>>>>>      			drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>>>>>>>>> -			continue;
>>>>>>>>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>>>>>>>>>>> +				continue;
>>>>>>>>>
>>>>>>>>> Matt R,
>>>>>>>>> This flush will be still needed as there can be non-xa
>>>>>>>>> buffers
>>>>>>>>> which
>>>>>>>>> can be evicted while media was off and stale entries
>>>>>>>>> can be
>>>>>>>>> flushed
>>>>>>>>> when media comes back on. Which was not case earlier as
>>>>>>>>> full
>>>>>>>>> flush
>>>>>>>>> was happening at regular sync points and that’s where
>>>>>>>>> this
>>>>>>>>> feature is
>>>>>>>>> bringing optimization now.
>>>>>>>>>
>>>>>>>>> Tejas
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This will trigger a TLB invalidation (and I
>>>>>>>>>>>> assume a
>>>>>>>>>>>> cache
>>>>>>>>>>>> flush)
>>>>>>>>>>>> every time we move or free memory in the 3D stack
>>>>>>>>>>>> if
>>>>>>>>>>>> it
>>>>>>>>>>>> has a
>>>>>>>>>>>> binding. It also performs a synchronous wait on
>>>>>>>>>>>> the
>>>>>>>>>>>> BO
>>>>>>>>>>>> being
>>>>>>>>>>>> idle.
>>>>>>>>>>>> Both of these are very expensive operations. I
>>>>>>>>>>>> can’t
>>>>>>>>>>>> imagine
>>>>>>>>>>>> the
>>>>>>>>>>>> granularity we want here is to do this on every
>>>>>>>>>>>> move/free
>>>>>>>>>>>> with
>>>>>>>>>>>> bindings.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, for LR compute with preempt fences, we
>>>>>>>>>>>> would
>>>>>>>>>>>> trigger the
>>>>>>>>>>>> preempt fences during the wait, so a TLB
>>>>>>>>>>>> invalidation
>>>>>>>>>>>> after
>>>>>>>>>>>> this
>>>>>>>>>>>> seems unnecessary, though perhaps the cache flush
>>>>>>>>>>>> is
>>>>>>>>>>>> still
>>>>>>>>>>>> required?
>>>>>>>>>>>>
>>>>>>>>>>>> I think this needs a bit more explanation,
>>>>>>>>>>>> because
>>>>>>>>>>>> without
>>>>>>>>>>>> knowing a
>>>>>>>>>>>> lot about the exact requirements, the
>>>>>>>>>>>> implementation
>>>>>>>>>>>> does
>>>>>>>>>>>> not
>>>>>>>>>>>> look
>>>>>>>>>> correct.
>>>>>>>>>>>
>>>>>>>>>>> The thing is that we are trying to solve problem
>>>>>>>>>>> with
>>>>>>>>>>> userptr
>>>>>>>>>>> with non-XA
>>>>>>>>>> pat, consider if that BO got moved while media is not
>>>>>>>>>> active.
>>>>>>>>>> As
>>>>>>>>>> soon as media
>>>>>>>>>> will come back active, stale cached entries of that
>>>>>>>>>> object
>>>>>>>>>> will be
>>>>>>>>>> flushed as part
>>>>>>>>>> of full flush, which may corrupt things.
>>>>>>>>>>> There was thinking that with this patch we would at
>>>>>>>>>>> least
>>>>>>>>>>> solve
>>>>>>>>>>> the problem
>>>>>>>>>> of corruption and later when page_reclamation feature
>>>>>>>>>> comes
>>>>>>>>>> in will
>>>>>>>>>> help in
>>>>>>>>>> performance as well. But now when page reclamation
>>>>>>>>>> feature is
>>>>>>>>>> merged earlier
>>>>>>>>>> and it tightly coupled with bind/unbind some cases
>>>>>>>>>> like
>>>>>>>>>> discussed
>>>>>>>>>> above
>>>>>>>>>> (which are not doing unbind immediately on move/free)
>>>>>>>>>> are
>>>>>>>>>> missed in
>>>>>>>>>> reclamation.
>>>>>>>>>>>
>>>>>>>>>>> So thought was to let this solution go in with
>>>>>>>>>>> little
>>>>>>>>>>> perf
>>>>>>>>>>> hit
>>>>>>>>>>> and discuss with
>>>>>>>>>> page reclamation owner to come with cleaner solution
>>>>>>>>>> together.
>>>>>>>>>>>
>>>>>>>>>>> Tejas
>>>>>>>>>>>>
>>>>>>>>>>>>>>      		}
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      		if (!idle) {
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>>>>>>>>>>      	}
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>>>>>>>>>>> + * @xe: The device to check.
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and maybe
>>>>>>>>>>>>>> +	 * other places), just that internally as an optimisation hw doesn't
>>>>>>>>>>>>>> +	 * need to make that a full flush (which will also include XA) when
>>>>>>>>>>>>>> +	 * Media is off/powergated, since it doesn't need to worry about GT
>>>>>>>>>>>>>> +	 * caches vs Media coherency, and only CPU vs GPU coherency, so can
>>>>>>>>>>>>>> +	 * make that flush a targeted XA flush, since stuff tagged with XA
>>>>>>>>>>>>>> +	 * now means it's shared with the CPU. The main implication is that
>>>>>>>>>>>>>> +	 * we now need to somehow flush non-XA before freeing system memory
>>>>>>>>>>>>>> +	 * pages, otherwise dirty cachelines could be flushed after the free
>>>>>>>>>>>>>> +	 * (like if Media suddenly turns on and does a full flush)
>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>>>>>>>>> +		return true;
>>>>>>>>>>>>>> +	return false;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>      void xe_device_l2_flush(struct xe_device *xe)
>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>      	struct xe_gt *gt;
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>>>> index 39464650533b..baf386e0e037 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>>>>>>>>>>      u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>>>>>      u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>>>>>>>>      void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>>>>>>>      void xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>>>>>>>>>>      				    false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>>>>>>>      	XE_WARN_ON(err <= 0);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
>>>>>>>>>>>>>> +	    userptr->initial_bind) {
>>>>>>>>>>>>
>>>>>>>>>>>> Same concern with the LR preempt fence as above — the hardware
>>>>>>>>>>>> will be interrupted via preempt fences, so it doesn’t seem
>>>>>>>>>>>> necessary to invalidate the TLBs, but perhaps we need a cflush
>>>>>>>>>>>> and TLB invalidation is the mechanism for that too?
>>>>>>>>>>>>
>>>>>>>>>>>> Matt
>>>>>>>>>>>>
>>>>>>>>>>>>>>      		err = xe_vm_invalidate_vma(vma);
>>>>>>>>>>>>>>      		XE_WARN_ON(err);
>>>>>>>>>>>>>>      	}
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> 2.52.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Matt Roper
>>>>>>>>>>>>> Graphics Software Engineer
>>>>>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>>>>>> Intel Corporation
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Matt Roper
>>>>>>>>>> Graphics Software Engineer
>>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>>> Intel Corporation
>>>>>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 16:41                             ` Matthew Auld
@ 2026-02-17  6:19                               ` Upadhyay, Tejas
  2026-02-17  9:53                                 ` Thomas Hellström
  2026-02-17 17:04                               ` Thomas Hellström
  1 sibling, 1 reply; 39+ messages in thread
From: Upadhyay, Tejas @ 2026-02-17  6:19 UTC (permalink / raw)
  To: Auld, Matthew, Thomas Hellström, Roper, Matthew D,
	Souza, Jose
  Cc: Mrozek, Michal, intel-xe@lists.freedesktop.org, Brost, Matthew



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: 16 February 2026 22:12
> To: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Roper, Matthew
> D <matthew.d.roper@intel.com>; Souza, Jose <jose.souza@intel.com>
> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; Mrozek, Michal
> <michal.mrozek@intel.com>; intel-xe@lists.freedesktop.org; Brost, Matthew
> <matthew.brost@intel.com>
> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> cachelines manually
> 
> On 16/02/2026 15:38, Thomas Hellström wrote:
> > On Mon, 2026-02-16 at 14:55 +0000, Matthew Auld wrote:
> >> On 16/02/2026 12:07, Thomas Hellström wrote:
> >>> On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
> >>>> On 16/02/2026 10:23, Thomas Hellström wrote:
> >>>>> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
> >>>>>> On 13/02/2026 17:16, Matt Roper wrote:
> >>>>>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose
> >>>>>>> wrote:
> >>>>>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>>>>>>>> Sent: 12 February 2026 02:41
> >>>>>>>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> >>>>>>>>>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> >>>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
> >>>>>>>>>> <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com
> >>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> >>>>>>>>>> userptr/shrinker bo cachelines manually
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Brost, Matthew <matthew.brost@intel.com>
> >>>>>>>>>>>> Sent: 11 February 2026 05:32
> >>>>>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>>>>>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>;
> >>>>>>>>>>>> intel-
> >>>>>>>>>>>> xe@lists.freedesktop.org; Auld, Matthew
> >>>>>>>>>>>> <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com
> >>>>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> >>>>>>>>>>>> userptr/shrinker bo cachelines manually
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory shared
> >>>>>>>>>>>>>> between the CPU and GPU
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm pretty sure this expansion of "XA" is wrong; where are
> >>>>>>>>>>>>> you seeing this definition?  Everything in the bspec
> >>>>>>>>>>>>> indicates that XA means "wb
> >>>>>>>>>>>>> - transient app" (similar to how "XD" is 'wb - transient
> >>>>>>>>>>>>> display").
> >>>>>>>>>>>>> I'm not sure why exactly they picked "X" to refer to
> >>>>>>>>>>>>> transient in both of these cases, but I've never seen any
> >>>>>>>>>>>>> documentation that refers to it as "extended."
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> is treated differently from other GPU memory when the
> >>>>>>>>>>>>>> Media engine is
> >>>>>>>>>>>> power-gated.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> XA is *always* flushed, like at the end-of-submission
> >>>>>>>>>>>>>> (and maybe other
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I assume you're referring to the fact that the driver
> >>>>>>>>>>>>> performs flushes at the end of submission (via
> >>>>>>>>>>>>> PIPE_CONTROL or MI_FLUSH_DW), and that depending on other
> >>>>>>>>>>>>> state/optimizations in the system, those flushes may flush
> >>>>>>>>>>>>> the entire device cache, or may only flush the subset of
> >>>>>>>>>>>>> cache data that is not marked as transient.  The way you
> >>>>>>>>>>>>> worded this was confusing since it makes it sound like
> >>>>>>>>>>>>> cache flushes happen automatically somewhere in
> >>>>>>>>>> hardware/firmware.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> places), just that internally as an optimisation hw
> >>>>>>>>>>>>>> doesn't need to make that a full flush (which will also
> >>>>>>>>>>>>>> include
> >>>>>>>>>>>>>> XA) when
> >>>>>>>>>>>>>> Media is off/powergated, since it doesn't need to worry
> >>>>>>>>>>>>>> about GT caches vs Media coherency, and only CPU vs GPU
> >>>>>>>>>>>>>> coherency, so can make that flush a targeted XA flush,
> >>>>>>>>>>>>>> since stuff tagged with XA now means it's shared with the
> >>>>>>>>>>>>>> CPU. The main implication is that we now need to somehow
> >>>>>>>>>>>>>> flush non-XA before freeing system memory pages,
> >>>>>>>>>>>>>> otherwise dirty cachelines could be flushed after the
> >>>>>>>>>>>>>> free (like if Media suddenly turns on and does a full
> >>>>>>>>>>>>>> flush)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This description seems really confusing.  My understanding
> >>>>>>>>>>>>> is that marking something as wb-transient-app indicates
> >>>>>>>>>>>>> that it might be accessed by something other than our
> >>>>>>>>>>>>> graphics/media IP (i.e., accessed from the CPU, exported
> >>>>>>>>>>>>> to another device, etc.), so transient data truly does
> >>>>>>>>>>>>> need to be flushed at the points in the driver where a
> >>>>>>>>>>>>> flush typically happens.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However when something is _not_ transient, then
> >>>>>>>>>>>>> either:
> >>>>>>>>>>>>>      - it's "private" to the GPU and only our
> >>>>>>>>>>>>> graphics/media IP will be
> >>>>>>>>>>>>>        accessing it
> >>>>>>>>>>>>>      - it's bound with a coherent PAT index so that
> >>>>>>>>>>>>> outside observers like
> >>>>>>>>>>>>>        the CPU can snoop the device cache, even when the
> >>>>>>>>>>>>> cache hasn't been
> >>>>>>>>>>>>>        flushed
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If media is not active, then there's really no need to
> >>>>>>>>>>>>> include non-transient data when a device cache flush
> >>>>>>>>>>>>> happens, since there's no real need for the data to get to
> >>>>>>>>>>>>> RAM. So that enables an optimization (which comes in your
> >>>>>>>>>>>>> next patch) that allows flushes to only operate on the
> >>>>>>>>>>>>> subset of the device cache tagged as "transient" if media
> >>>>>>>>>>>>> is idle.
> >>>>>>>>>>>
> >>>>>>>>>>> But what if we have stale non-XA marked pages for a userptr,
> >>>>>>>>>>> and that object moves out and at the same time media comes
> >>>>>>>>>>> back? We will end up with a full flush that flushes the stale
> >>>>>>>>>>> entries to RAM.
> >>>>>>>>>>
> >>>>>>>>>> What makes userptr special here?  During general, active
> >>>>>>>>>> usage, userptr would be data that's accessible by the CPU, so
> >>>>>>>>>> it needs to either be transient (so CPU can see the data in
> >>>>>>>>>> RAM after explicit flushes) or it needs to be using a
> >>>>>>>>>> coherent PAT (so that the CPU can just snoop the GPU cache).
> >>>>>>>>>> If you marked userptr as both non-XA and non-coherent, then
> >>>>>>>>>> that sounds likely to be a userspace bug (and probably
> >>>>>>>>>> something we can catch and reject as an invalid case on any
> >>>>>>>>>> Xe3p or later platforms that support this) since the CPU
> >>>>>>>>>> wouldn't have any reliable way of seeing GPU updates.
> >>>>>>>>>
> >>>>>>>>> Right. FYI @Mrozek, Michal @Souza, Jose: for userptr, as
> >>>>>>>>> explained above, it needs to be either a coherent or an XA PAT
> >>>>>>>>> index, or else the KMD will reject it as an invalid case.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> The coherency restriction is already in the uAPI:
> >>>>>>>>
> >>>>>>>> "Note: For userptr and externally imported dma-buf the kernel
> >>>>>>>> expects either 1WAY or 2WAY for the @pat_index."
> >>>>>>>>
> >>>>>>>> Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing
> >>>>>>>> GPU caches at the end of batch buffers.
> >>>>>>>
> >>>>>>> But isn't that what we're discussing here?  1-way *won't*
> >>>>>>> necessarily be enough anymore because PIPE_CONTROL instructions
> >>>>>>> don't flush the entire cache anymore.  Whenever the GuC
> >>>>>>> determines that media is inactive and activates the
> >>>>>>> optimization, PIPE_CONTROL, MI_FLUSH_DW, etc. change behavior
> >>>>>>> to only flush out the subset of data that was marked as
> >>>>>>> app-transient; anything not marked that way doesn't get flushed
> >>>>>>> now.  So there's a new requirement here that you ensure you're
> >>>>>>> using an XA PAT index, or you switch to use 2-way coherency
> >>>>>>> which will allow the CPU to snoop the GPU's caches.
> >>>>>>
> >>>>>> That exactly matches my understanding also.
> >>>>>
> >>>>> This only ever affects IGFX, right? Since AFAIU we don't have
> >>>>> 2-way coherency with DGFX?
> >>>>
> >>>> Yeah, this should be igpu only. I seem to also recall that on dgpu,
> >>>> Media is coherent with l2/l3, but also I don't think system memory
> >>>> can be cached in l2/l3 (only VRAM), which I assume is why there is
> >>>> the special SMRO (system-memory-read-only) cache only on dgpu,
> >>>> which is flushed when the fence signals, unlike the l2/l3.
> >>>
> >>> Yes that sounds reasonable.
> >>>
> >>>>
> >>>>>
> >>>>> It sounds like the same PAT restriction is needed also for
> >>>>> imported dma-buf, right?
> >>>>
> >>>> Good point. Looks like we are missing that still. Otherwise we can
> >>>> run into the same issues with stale l2/l3/ppc.
> >>>
> >>> So if this affects only system memory could we instead of relying on
> >>> 2- way coherency or XA, just flush at dma unmap time, because that's
> >>> typically just before releasing the pages.
> >>
> >> Yeah, I think we could make it work, from security pov, similar to
> >> userptr, with the right manual flushes in KMD. Maybe just a question
> >> if userspace wants such a model? Anything cached in l2/l3 might
> >> require manual flushing by userspace (if that is even possible)?
> >
> > So that would mean if user-space wants gpu-cpu coherency at fence
> > synchronization points, they'd have to use either 2-way or XA pat
> > indices, but not enforced by KMD.
> 
> Yeah, looking at BSpec 74635 (Media off case), I'm only really seeing
> MEM_SET, which userspace could potentially use by itself? But then it's
> unclear if they mean to actually clear the memory (which is not what we
> want) or to use the special evict mode, but that seems to be talking more
> about flushing to local memory, so I'm not completely sure what that does
> on igpu. If it's the evict mode then it should in theory be possible for
> userspace to do a manual flush, but that would have to be done per-bo/vma?

MEM_SET says a range needs to be specified as part of the command, with evict mode.

> 
> >
> > For imported dma-buf kernel requires 2-way or XA for security due to
> > the relaxed dma-buf unmap.
> >
> > For SVM/System allocator we'd require 2-way or XA.
> >
> > Otherwise KMD security is enforced by flush at dma-unmap time?
> 
> Yeah, that is my understanding. Otherwise I don't currently see what prevents
> the dirty non-XA cache lines being flushed at some random point later, after
> we have already freed the corresponding system memory, potentially nuking
> the next user who allocates those pages.

Hmm, so it means we can drop this patch completely and do something like below:

In xe_migrate_dma_unmap():

	dma_unmap_page();
	if ((pat_index != 18 && pat_index != 19) && coh_mode != 2_WAY)
		/* manual flush */

Tejas
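The proposal above boils down to a small predicate over the mapping's PAT
index and coherency mode. A rough, self-contained sketch (the enum and
function names here are invented for illustration and are not real xe KMD
identifiers; 18/19 are the XA PAT indices assumed in the message above):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch only, not xe KMD code. Per the discussion:
 * XA-tagged mappings are always flushed at sync points, and 2-way
 * coherent mappings let the CPU snoop the GPU caches, so only the
 * remaining combination needs a manual cache flush before the
 * system memory pages are handed back at dma-unmap time.
 */
enum coh_mode { COH_NONE, COH_1WAY, COH_2WAY };

static bool is_xa_pat(int pat_index)
{
	/* Assumed XA-tagged PAT indices from the discussion above */
	return pat_index == 18 || pat_index == 19;
}

static bool needs_manual_flush(int pat_index, enum coh_mode coh)
{
	return !is_xa_pat(pat_index) && coh != COH_2WAY;
}
```

The interesting design point is that the check sits at dma-unmap rather
than at bind/unbind, so it catches the move/free paths that the
page-reclamation feature misses.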

> 
> >
> > /Thomas
> >
> >>
> >>>
> >>> The exception, though, is dma-buf where the exporter can actually
> >>> release memory before all importers have given up their
> >>> dma-mappings.
> >>>
> >>> /Thomas
> >>>
> >>>>
> >>>>>
> >>>>> /Thomas
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> If something happens that changes the GTT mapping of an
> >>>>>>>>>> object, then doesn't that already trigger a TLB invalidation
> >>>>>>>>>> when necessary in the driver today?  It was my understanding
> >>>>>>>>>> that "heavy" TLB invalidations wait for data values to be
> >>>>>>>>>> globally observable before starting, so I think that would
> >>>>>>>>>> ensure that any non-XA data makes it to RAM before any binding
> >>>>>>>>>> changes, object destruction, etc.?  Is there something special
> >>>>>>>>>> about userptr that makes that case more of a problem?
> >>>>>>>>>>
> >>>>>>>>>> I just found bspec page 74635 which gives an overview of the
> >>>>>>>>>> various flush and invalidate cases, and I don't see anything
> >>>>>>>>>> there that makes it obvious to me that userptr would be
> >>>>>>>>>> special.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>> As you said, we eventually do want to force a flush of the
> >>>>>>>>>>>>> non-transient data as well once we're freeing the
> >>>>>>>>>>>>> underlying pages.  So how do we do that?  It's not clear to
> >>>>>>>>>>>>> me how the changes below are accomplishing that.  Is there
> >>>>>>>>>>>>> a way to explicitly request a full device cache flush
> >>>>>>>>>>>>> (ignoring the transient vs non-transient tagging)?  Since
> >>>>>>>>>>>>> the GuC handles the optimization in the next patch
> >>>>>>>>>>>>> (toggling whether flushes are full flushes vs non-transient
> >>>>>>>>>>>>> flushes depending on whether media is active), I thought
> >>>>>>>>>>>>> there might be some kind of GuC interface to request
> >>>>>>>>>>>>> "please do one full flush now, even if media is idle."
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I’m not an expert here by any means, but everything above
> >>>>>>>>>>>> from Matt seems like valid concerns. Thomas also raised some
> >>>>>>>>>>>> concerns in the two previous revisions; again I’m not an
> >>>>>>>>>>>> expert, but reading through those, it doesn’t really seem
> >>>>>>>>>>>> like he received proper answers to his questions.
> >>>>>>>>>>>
> >>>>>>>>>>> It's forcing the flush via the TLB invalidation PPC flag
> >>>>>>>>>>> under xe_vm_invalidate_vma().
> >>>>>>>>>>
> >>>>>>>>>> By the way, what is "PPC"?  It seems like it's another new
> >>>>>>>>>> synonym for the device cache?  It's already really confusing
> >>>>>>>>>> that some of our hardware docs use a mix of both "L2" and
> >>>>>>>>>> "L3" to refer to the same device cache for historical
> >>>>>>>>>> reasons...
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Matt
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> A couple of comments below.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Matt
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> V2(MattA): Expand commit description
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> >>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> >>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.h  |  1 +
> >>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> >>>>>>>>>>>>>>  4 files changed, 28 insertions(+), 2 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c
> >>>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_bo.c
> >>>>>>>>>>>>>> index e9180b01a4e4..4455886b211e 100644
> >>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
> >>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
> >>>>>>>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  		if (!xe_vm_in_fault_mode(vm)) {
> >>>>>>>>>>>>>>  			drm_gpuvm_bo_evict(vm_bo, true);
> >>>>>>>>>>>>>> -			continue;
> >>>>>>>>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
> >>>>>>>>>>>>>> +				continue;
> >>>>>>>>>
> >>>>>>>>> Matt R,
> >>>>>>>>> This flush will still be needed, as there can be non-XA buffers
> >>>>>>>>> which can be evicted while media was off, whose stale entries
> >>>>>>>>> can be flushed when media comes back on. That was not the case
> >>>>>>>>> earlier, as a full flush was happening at regular sync points,
> >>>>>>>>> and that’s where this feature is bringing the optimization now.
> >>>>>>>>>
> >>>>>>>>> Tejas
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> This will trigger a TLB invalidation (and I assume a cache
> >>>>>>>>>>>> flush) every time we move or free memory in the 3D stack if
> >>>>>>>>>>>> it has a binding. It also performs a synchronous wait on the
> >>>>>>>>>>>> BO being idle. Both of these are very expensive operations.
> >>>>>>>>>>>> I can’t imagine the granularity we want here is to do this
> >>>>>>>>>>>> on every move/free with bindings.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also, for LR compute with preempt fences, we would trigger
> >>>>>>>>>>>> the preempt fences during the wait, so a TLB invalidation
> >>>>>>>>>>>> after this seems unnecessary, though perhaps the cache flush
> >>>>>>>>>>>> is still required?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think this needs a bit more explanation, because without
> >>>>>>>>>>>> knowing a lot about the exact requirements, the
> >>>>>>>>>>>> implementation does not look correct.
> >>>>>>>>>>>
> >>>>>>>>>>> The thing is that we are trying to solve the problem of
> >>>>>>>>>>> userptr with a non-XA PAT: consider a BO that gets moved
> >>>>>>>>>>> while media is not active. As soon as media comes back
> >>>>>>>>>>> active, stale cached entries of that object will be flushed
> >>>>>>>>>>> to RAM as part of the full flush, which may corrupt things.
> >>>>>>>>>>>
> >>>>>>>>>>> The thinking was that with this patch we would at least solve
> >>>>>>>>>>> the corruption problem, and later, when the page-reclamation
> >>>>>>>>>>> feature came in, it would help performance as well. But now
> >>>>>>>>>>> that the page-reclamation feature has been merged first, and
> >>>>>>>>>>> it is tightly coupled with bind/unbind, some cases like those
> >>>>>>>>>>> discussed above (which do not unbind immediately on
> >>>>>>>>>>> move/free) are missed by reclamation.
> >>>>>>>>>>>
> >>>>>>>>>>> So the thought was to let this solution go in with a small
> >>>>>>>>>>> perf hit and discuss with the page-reclamation owner to come
> >>>>>>>>>>> up with a cleaner solution together.
> >>>>>>>>>>>
> >>>>>>>>>>> Tejas
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>      		}
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>      		if (!idle) {
> >>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
> >>>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.c
> >>>>>>>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
> >>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
> >>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
> >>>>>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> >>>>>>>>>>>>>>      	}
> >>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> +/**
> >>>>>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> >>>>>>>>>>>>>> + * @xe: The device to check.
> >>>>>>>>>>>>>> + *
> >>>>>>>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
> >>>>>>>>>>>>>> + */
> >>>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
> >>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and maybe
> >>>>>>>>>>>>>> +	 * other places), just that internally as an optimisation hw doesn't
> >>>>>>>>>>>>>> +	 * need to make that a full flush (which will also include XA) when
> >>>>>>>>>>>>>> +	 * Media is off/powergated, since it doesn't need to worry about GT
> >>>>>>>>>>>>>> +	 * caches vs Media coherency, and only CPU vs GPU coherency, so can
> >>>>>>>>>>>>>> +	 * make that flush a targeted XA flush, since stuff tagged with XA
> >>>>>>>>>>>>>> +	 * now means it's shared with the CPU. The main implication is that
> >>>>>>>>>>>>>> +	 * we now need to somehow flush non-XA before freeing system memory
> >>>>>>>>>>>>>> +	 * pages, otherwise dirty cachelines could be flushed after the free
> >>>>>>>>>>>>>> +	 * (like if Media suddenly turns on and does a full flush)
> >>>>>>>>>>>>>> +	 */
> >>>>>>>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> >>>>>>>>>>>>>> +		return true;
> >>>>>>>>>>>>>> +	return false;
> >>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>      void xe_device_l2_flush(struct xe_device *xe)
> >>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>      	struct xe_gt *gt;
> >>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
> >>>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_device.h
> >>>>>>>>>>>>>> index 39464650533b..baf386e0e037 100644
> >>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
> >>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
> >>>>>>>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> >>>>>>>>>>>>>>      u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> >>>>>>>>>>>>>>      u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
> >>>>>>>>>>>>>>      void xe_device_td_flush(struct xe_device *xe);
> >>>>>>>>>>>>>>      void xe_device_l2_flush(struct xe_device *xe);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c
> >>>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_userptr.c
> >>>>>>>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
> >>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
> >>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
> >>>>>>>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> >>>>>>>>>>>>>>      				    false, MAX_SCHEDULE_TIMEOUT);
> >>>>>>>>>>>>>>      	XE_WARN_ON(err <= 0);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> >>>>>>>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> >>>>>>>>>>>>>> +	    userptr->initial_bind) {
> >>>>>>>>>>>>
> >>>>>>>>>>>> Same concern with the LR preempt fence as above —
> >>>>>>>>>>>> the
> >>>>>>>>>>>> hardware
> >>>>>>>>>>>> will
> >>>>>>>>>>>> be interrupted via preempt fences, so it doesn’t
> >>>>>>>>>>>> seem
> >>>>>>>>>>>> necessary
> >>>>>>>>>>>> to
> >>>>>>>>>>>> invalidate the TLBs but perhaps we need a cflush
> >>>>>>>>>>>> and
> >>>>>>>>>>>> TLB
> >>>>>>>>>>>> invalidation is the mechanism for that too?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Matt
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>      		err =
> >>>>>>>>>>>>>> xe_vm_invalidate_vma(vma);
> >>>>>>>>>>>>>>      		XE_WARN_ON(err);
> >>>>>>>>>>>>>>      	}
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> 2.52.0
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Matt Roper
> >>>>>>>>>>>>> Graphics Software Engineer
> >>>>>>>>>>>>> Linux GPU Platform Enablement
> >>>>>>>>>>>>> Intel Corporation
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Matt Roper
> >>>>>>>>>> Graphics Software Engineer
> >>>>>>>>>> Linux GPU Platform Enablement
> >>>>>>>>>> Intel Corporation
> >>>>>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-17  6:19                               ` Upadhyay, Tejas
@ 2026-02-17  9:53                                 ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2026-02-17  9:53 UTC (permalink / raw)
  To: Upadhyay, Tejas, Auld, Matthew, Roper, Matthew D, Souza, Jose
  Cc: Mrozek, Michal, intel-xe@lists.freedesktop.org, Brost, Matthew

On Tue, 2026-02-17 at 06:19 +0000, Upadhyay, Tejas wrote:
> 
> 
> > -----Original Message-----
> > From: Auld, Matthew <matthew.auld@intel.com>
> > Sent: 16 February 2026 22:12
> > To: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Roper,
> > Matthew
> > D <matthew.d.roper@intel.com>; Souza, Jose <jose.souza@intel.com>
> > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; Mrozek, Michal
> > <michal.mrozek@intel.com>; intel-xe@lists.freedesktop.org; Brost,
> > Matthew
> > <matthew.brost@intel.com>
> > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > cachelines manually
> > 
> > On 16/02/2026 15:38, Thomas Hellström wrote:
> > > On Mon, 2026-02-16 at 14:55 +0000, Matthew Auld wrote:
> > > > On 16/02/2026 12:07, Thomas Hellström wrote:
> > > > > On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
> > > > > > On 16/02/2026 10:23, Thomas Hellström wrote:
> > > > > > > On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
> > > > > > > > On 13/02/2026 17:16, Matt Roper wrote:
> > > > > > > > > On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose
> > > > > > > > > wrote:
> > > > > > > > > > On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas
> > > > > > > > > > wrote:
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Roper, Matthew D
> > > > > > > > > > > > <matthew.d.roper@intel.com>
> > > > > > > > > > > > Sent: 12 February 2026 02:41
> > > > > > > > > > > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > > > > > > > > > > Cc: Brost, Matthew <matthew.brost@intel.com>;
> > > > > > > > > > > > intel-
> > > > > > > > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > > > > > > > <matthew.auld@intel.com>;
> > > > > > > > > > > > thomas.hellstrom@linux.intel.com
> > > > > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush
> > > > > > > > > > > > userptr/shrinker bo cachelines manually
> > > > > > > > > > > > 
> > > > > > > > > > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000,
> > > > > > > > > > > > Upadhyay, Tejas
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Brost, Matthew
> > > > > > > > > > > > > > <matthew.brost@intel.com>
> > > > > > > > > > > > > > Sent: 11 February 2026 05:32
> > > > > > > > > > > > > > To: Roper, Matthew D
> > > > > > > > > > > > > > <matthew.d.roper@intel.com>
> > > > > > > > > > > > > > Cc: Upadhyay, Tejas
> > > > > > > > > > > > > > <tejas.upadhyay@intel.com>;
> > > > > > > > > > > > > > intel-
> > > > > > > > > > > > > > xe@lists.freedesktop.org; Auld, Matthew
> > > > > > > > > > > > > > <matthew.auld@intel.com>;
> > > > > > > > > > > > > > thomas.hellstrom@linux.intel.com
> > > > > > > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg:
> > > > > > > > > > > > > > flush
> > > > > > > > > > > > > > userptr/shrinker bo cachelines manually
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800,
> > > > > > > > > > > > > > Matt Roper
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530,
> > > > > > > > > > > > > > > Tejas Upadhyay
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > "eXtended Architecture" (XA) tagged
> > > > > > > > > > > > > > > > memory—memory
> > shared
> > > > > > > > > > > > between
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > CPU and GPU
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I'm pretty sure this expansion of "XA" is
> > > > > > > > > > > > > > > wrong; where are
> > > > > > > > > > > > > > > you seeing this definition?  Everything
> > > > > > > > > > > > > > > in the bspec
> > > > > > > > > > > > > > > indicates that XA means "wb
> > > > > > > > > > > > > > > - transient app" (similar to how "XD" is
> > > > > > > > > > > > > > > "wb - transient
> > > > > > > > > > > > > > > display").
> > > > > > > > > > > > > > > I'm not sure why exactly they picked "X"
> > > > > > > > > > > > > > > to refer to
> > > > > > > > > > > > > > > transient in both of these cases, but
> > > > > > > > > > > > > > > I've never seen any
> > > > > > > > > > > > > > > documentation that refers to it as
> > > > > > > > > > > > > > > "extended."
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > is treated differently from other GPU
> > > > > > > > > > > > > > > > memory when the
> > > > > > > > > > > > > > > > Media engine is
> > > > > > > > > > > > > > power-gated.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > XA is *always* flushed, like at the
> > > > > > > > > > > > > > > > end-of-submission
> > > > > > > > > > > > > > > > (and maybe other
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I assume you're referring to the fact
> > > > > > > > > > > > > > > that the driver
> > > > > > > > > > > > > > > performs flushes at the end of submission
> > > > > > > > > > > > > > > (via
> > > > > > > > > > > > > > > PIPE_CONTROL or MI_FLUSH_DW), and that
> > > > > > > > > > > > > > > depending on
> > other
> > > > > > > > > > > > > > > state/optimizations in the system, those
> > > > > > > > > > > > > > > flushes may flush
> > > > > > > > > > > > > > > the entire device cache, or may only
> > > > > > > > > > > > > > > flush the subset of
> > > > > > > > > > > > > > > cache data that is not marked as
> > > > > > > > > > > > > > > transient.  The way you
> > > > > > > > > > > > > > > worded this was confusing since it makes
> > > > > > > > > > > > > > > it sound like
> > > > > > > > > > > > > > > cache flushes happen automatically
> > > > > > > > > > > > > > > somewhere in
> > > > > > > > > > > > hardware/firmware.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > places), just that internally as an
> > > > > > > > > > > > > > > > optimisation hw
> > > > > > > > > > > > > > > > doesn't need to make that a full flush
> > > > > > > > > > > > > > > > (which will also
> > > > > > > > > > > > > > > > include
> > > > > > > > > > > > > > > > XA) when
> > > > > > > > > > > > > > > > Media is off/powergated, since it
> > > > > > > > > > > > > > > > doesn't need to worry
> > > > > > > > > > > > > > > > about GT caches vs Media coherency, and
> > > > > > > > > > > > > > > > only CPU vs GPU
> > > > > > > > > > > > > > > > coherency, so can make that flush a
> > > > > > > > > > > > > > > > targeted XA flush,
> > > > > > > > > > > > > > > > since stuff tagged with XA now means
> > > > > > > > > > > > > > > > it's shared with the
> > > > > > > > > > > > > > > > CPU. The main implication is that we
> > > > > > > > > > > > > > > > now need to somehow
> > > > > > > > > > > > > > > > flush non-XA before freeing system
> > > > > > > > > > > > > > > > memory pages,
> > > > > > > > > > > > > > > > otherwise dirty cachelines could be
> > > > > > > > > > > > > > > > flushed after the
> > > > > > > > > > > > > > > > free (like if Media suddenly turns on
> > > > > > > > > > > > > > > > and does a full
> > > > > > > > > > > > > > > > flush)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This description seems really confusing. 
> > > > > > > > > > > > > > > My understanding
> > > > > > > > > > > > > > > is that marking something as wb-
> > > > > > > > > > > > > > > transient-app indicates
> > > > > > > > > > > > > > > that it might be accessed by something
> > > > > > > > > > > > > > > other than our
> > > > > > > > > > > > > > > graphics/media IP (i.e., accessed from
> > > > > > > > > > > > > > > the CPU, exported
> > > > > > > > > > > > > > > to another device, etc.), so transient
> > > > > > > > > > > > > > > data truly does
> > > > > > > > > > > > > > > need to be flushed at the points in the
> > > > > > > > > > > > > > > driver where a
> > > > > > > > > > > > > > > flush typically happens.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > However when something is _not_
> > > > > > > > > > > > > > > transient, then
> > > > > > > > > > > > > > > either:
> > > > > > > > > > > > > > >      - it's "private" to the GPU and only
> > > > > > > > > > > > > > > our
> > > > > > > > > > > > > > > graphics/media IP will be
> > > > > > > > > > > > > > >        accessing it
> > > > > > > > > > > > > > >      - it's bound with a coherent PAT
> > > > > > > > > > > > > > > index so that
> > > > > > > > > > > > > > > outside observers like
> > > > > > > > > > > > > > >        the CPU can snoop the device
> > > > > > > > > > > > > > > cache, even when the
> > > > > > > > > > > > > > > cache hasn't been
> > > > > > > > > > > > > > >        flushed
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > If media is not active, then there's
> > > > > > > > > > > > > > > really no need to
> > > > > > > > > > > > > > > include non-transient data when a device
> > > > > > > > > > > > > > > cache flush
> > > > > > > > > > > > > > > happens since there's no real need for
> > > > > > > > > > > > > > > the data to get to
> > > > > > > > > > > > > > > RAM.
> > > > > > > > > > > > > > > So
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > enables
> > > > > > > > > > > > > > > an optimization (which comes in your next
> > > > > > > > > > > > > > > patch), that
> > > > > > > > > > > > > > > allows flushes to only operate on the
> > > > > > > > > > > > > > > subset of the device
> > > > > > > > > > > > > > > cache tagged as
> > > > > > > > > > > > "transient" if media is idle.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > But what if we have stale non-XA marked pages
> > > > > > > > > > > > > for userptr, and that object moves out while at
> > > > > > > > > > > > > the same time media comes back? We will end up
> > > > > > > > > > > > > with a full flush, which flushes the stale
> > > > > > > > > > > > > entries to RAM.
> > > > > > > > > > > > 
> > > > > > > > > > > > What makes userptr special here?  During
> > > > > > > > > > > > general, active
> > > > > > > > > > > > usage, userptr would be data that's accessible
> > > > > > > > > > > > by the CPU, so
> > > > > > > > > > > > it needs to either be transient (so CPU can see
> > > > > > > > > > > > the data in
> > > > > > > > > > > > RAM after explicit flushes) or it needs to be
> > > > > > > > > > > > using a
> > > > > > > > > > > > coherent PAT (so that the CPU can just snoop
> > > > > > > > > > > > the GPU cache).
> > > > > > > > > > > > If
> > > > > > > > > > > > you marked
> > > > > > > > > > > > userptr as both non-XA and non-coherent, then
> > > > > > > > > > > > that sounds
> > > > > > > > > > > > likely to be a userspace bug (and probably
> > > > > > > > > > > > something we can
> > > > > > > > > > > > catch and reject as an invalid case on any Xe3p
> > > > > > > > > > > > or later
> > > > > > > > > > > > platforms that support
> > > > > > > > > > > > this)
> > > > > > > > > > > > since
> > > > > > > > > > > > the
> > > > > > > > > > > > CPU wouldn't
> > > > > > > > > > > > have any reliable way of seeing GPU updates.
> > > > > > > > > > > 
> > > > > > > > > > > Right. FYI @Mrozek, Michal @Souza, Jose: for
> > > > > > > > > > > userptr, as explained above, it needs to be either
> > > > > > > > > > > a coherent or an XA pat index, or else KMD will
> > > > > > > > > > > reject it as an invalid case.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > The coherency restriction is already in the uAPI:
> > > > > > > > > > 
> > > > > > > > > > "Note: For userptr and externally imported dma-buf
> > > > > > > > > > the kernel
> > > > > > > > > > expects either 1WAY or 2WAY for the @pat_index."
> > > > > > > > > > 
> > > > > > > > > > Using 1 way is enough as Xe KMD does a PIPE_CONTROL
> > > > > > > > > > flushing
> > > > > > > > > > GPU caches at the end of batch buffers.
> > > > > > > > > 
> > > > > > > > > But isn't that what we're discussing here?  1-way
> > > > > > > > > *won't*
> > > > > > > > > necessarily be enough anymore because PIPE_CONTROL
> > > > > > > > > instructions
> > > > > > > > > don't flush the entire cache anymore.  Whenever the
> > > > > > > > > GuC
> > > > > > > > > determines that media is inactive and activates the
> > > > > > > > > optimization, PIPE_CONTROL, MI_FLUSH_DW, etc.
> > > > > > > > > change
> > > > > > > > > behavior to only flush out the subset of data that
> > > > > > > > > was marked as
> > > > > > > > > app-transient; anything not marked that way doesn't
> > > > > > > > > get flushed
> > > > > > > > > now.  So there's a new requirement here that you
> > > > > > > > > ensure you're
> > > > > > > > > using an XA PAT index, or you switch to use 2-way
> > > > > > > > > coherency
> > > > > > > > > which will allow the CPU to snoop the GPU's caches.
> > > > > > > > 
> > > > > > > > That exactly matches my understanding also.
> > > > > > > 
> > > > > > > This only ever affects IGFX, right? Since AFAIU we don't
> > > > > > > have
> > > > > > > 2-way coherency with DGFX?
> > > > > > 
> > > > > > Yeah, this should be igpu only. I seem to also recall that
> > > > > > on dgpu,
> > > > > > Media is coherent with l2/l3, but also I don't think system
> > > > > > memory
> > > > > > can be cached in l2/l3 (only VRAM), which I assume is why
> > > > > > there is
> > > > > > the special SMRO (system-memory-read-only) cache only on
> > > > > > dgpu,
> > > > > > which is flushed when the fence signals, unlike the l2/l3.
> > > > > 
> > > > > Yes that sounds reasonable.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > It sounds like the same PAT restriction is needed also
> > > > > > > for
> > > > > > > imported dma-buf, right?
> > > > > > 
> > > > > > Good point. Looks like we are missing that still. Otherwise
> > > > > > we can
> > > > > > run into the same issues with stale l2/l3/ppc.
> > > > > 
> > > > > So if this affects only system memory could we instead of
> > > > > relying on
> > > > > 2-way coherency or XA, just flush at dma unmap time, because
> > > > > that's
> > > > > typically just before releasing the pages.
> > > > 
> > > > Yeah, I think we could make it work, from security pov, similar
> > > > to
> > > > userptr, with the right manual flushes in KMD. Maybe just a
> > > > question
> > > > if userspace wants such a model? Anything cached in l2/l3 might
> > > > require manual flushing by userspace (if that is even
> > > > possible)?
> > > 
> > > So that would mean if user-space wants gpu-cpu coherency at fence
> > > synchronization points, they'd have to use either 2-way or XA pat
> > > indices, but not enforced by KMD.
> > 
> > Yeah, looking at BSpec 74635 (Media off case), I'm only really
> > seeing
> > MEM_SET which userspace could potentially use by itself? But then
> > it's unclear
> > if they mean to actually clear-the-memory (which is not what we
> > want) or using the special evict mode, but that seems to be talking
> > more about
> > flushing to local memory, so not completely sure what that does on
> > igpu. If it's
> > the evict mode then it should in theory be possible for userspace to do
> > a manual
> > flush, but that would have to be done per-bo/vma?
> 
> MEM_SET says the range needs to be specified as part of the command,
> with evict mode.
> 
> > 
> > > 
> > > For imported dma-buf the kernel requires 2-way or XA for security due
> > > to
> > > the relaxed dma-buf unmap.
> > > 
> > > For SVM/System allocator we'd require 2-way or XA.
> > > 
> > > Otherwise KMD security is enforced by flush at dma-unmap time?
> > 
> > Yeah, that is my understanding. Otherwise I don't currently see
> > what prevents
> > the dirty non-XA cache lines being flushed at some random point
> > later, after
> > we have already freed the corresponding system memory, potentially
> > nuking
> > the next user who allocates those pages.
> 
> Hmm, so it means we can drop this patch completely and do something
> like the below:
> 
> In xe_migrate_dma_unmap():
> 
> dma_unmap_page()
> if (pat_index != 18 or 19 && coh_mode != 2_way)
> 	/* manual_flush */

I think for userptr you'd want to add this in
__vma_userptr_invalidate(), just before drm_gpusvm_unmap_pages().

For bos you'd want to add it in xe_tt_unmap_sg() just before
dma_unmap_sgtable(). But if you want to do it conditionally, you would
need a flag in the struct xe_ttm_tt that is set whenever the owning bo
is mapped in such a way that content has not been flushed.

/Thomas

> 
> Tejas
> 
> > 
> > > 
> > > /Thomas
> > > 
> > > > 
> > > > > 
> > > > > The exception, though, is dma-buf where the exporter can
> > > > > actually
> > > > > release memory before all importers have given up their dma-
> > > > > mappings.
> > > > > 
> > > > > /Thomas
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > /Thomas
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > If something happens that changes the GTT
> > > > > > > > > > > > mapping of
> > > > > > > > > > > > an
> > > > > > > > > > > > object,
> > > > > > > > > > > > then
> > > > > > > > > > > > doesn't that already trigger a TLB invalidation
> > > > > > > > > > > > when
> > > > > > > > > > > > necessary in
> > > > > > > > > > > > the driver
> > > > > > > > > > > > today?  It was my understanding that "heavy"
> > > > > > > > > > > > TLB
> > > > > > > > > > > > invalidations wait
> > > > > > > > > > > > for data
> > > > > > > > > > > > values to be globally observable before
> > > > > > > > > > > > starting, so
> > > > > > > > > > > > I
> > > > > > > > > > > > think
> > > > > > > > > > > > that
> > > > > > > > > > > > would ensure
> > > > > > > > > > > > that any non-XA data makes it to RAM before any
> > > > > > > > > > > > binding
> > > > > > > > > > > > changes,
> > > > > > > > > > > > object,
> > > > > > > > > > > > destruction, etc.?  Is there something special
> > > > > > > > > > > > about
> > > > > > > > > > > > userptr
> > > > > > > > > > > > that
> > > > > > > > > > > > makes that
> > > > > > > > > > > > case more of a problem?
> > > > > > > > > > > > 
> > > > > > > > > > > > I just found bspec page 74635 which gives an
> > > > > > > > > > > > overview
> > > > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > various flush
> > > > > > > > > > > > and invalidate cases, and I don't see anything
> > > > > > > > > > > > there
> > > > > > > > > > > > that
> > > > > > > > > > > > makes it
> > > > > > > > > > > > obvious to
> > > > > > > > > > > > me that userptr would be special.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > As you said, we eventually do want to
> > > > > > > > > > > > > > > force a
> > > > > > > > > > > > > > > flush
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > non-transient data as well once we're
> > > > > > > > > > > > > > > freeing
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > underlying
> > > > > > > > > > > > > > > pages.
> > > > > > > > > > > > > > > So how do we do that?  It's not clear to
> > > > > > > > > > > > > > > me how
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > changes
> > > > > > > > > > > > > > > below
> > > > > > > > > > > > > > > are accomplishing that.  Is there a way
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > explicitly
> > > > > > > > > > > > > > > request
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > full device cache flush (ignoring the
> > > > > > > > > > > > > > > transient
> > > > > > > > > > > > > > > vs
> > > > > > > > > > > > > > > non-
> > > > > > > > > > > > > > > transient tagging)?
> > > > > > > > > > > > > > > Since the GuC handles the optimization in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > next
> > > > > > > > > > > > > > > patch
> > > > > > > > > > > > > > > (toggling
> > > > > > > > > > > > > > > whether flushes are full flushes vs non-
> > > > > > > > > > > > > > > transient
> > > > > > > > > > > > > > > flushes
> > > > > > > > > > > > > > > depending on whether media is active), I
> > > > > > > > > > > > > > > thought
> > > > > > > > > > > > > > > there
> > > > > > > > > > > > > > > might
> > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > some kind of GuC interface to request
> > > > > > > > > > > > > > > "please
> > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > one
> > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > flush now, even
> > > > > > > > > > > > if media is idle."
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I’m not an expert here by any means, but
> > > > > > > > > > > > > > everything
> > > > > > > > > > > > > > above
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > > seems like valid concerns. Thomas also
> > > > > > > > > > > > > > raised
> > > > > > > > > > > > > > some
> > > > > > > > > > > > > > concerns in
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > two previous revisions; again I’m not an
> > > > > > > > > > > > > > expert,
> > > > > > > > > > > > > > but
> > > > > > > > > > > > > > reading
> > > > > > > > > > > > > > through
> > > > > > > > > > > > > > those, it doesn’t really seem like he
> > > > > > > > > > > > > > received
> > > > > > > > > > > > > > proper
> > > > > > > > > > > > > > answers
> > > > > > > > > > > > > > to his
> > > > > > > > > > > > questions.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It's forcing a flush via the TLB invalidation
> > > > > > > > > > > > > PPC flag under xe_invalidate_vma().
> > > > > > > > > > > > 
> > > > > > > > > > > > By the way, what is "PPC?"  It seems like it's
> > > > > > > > > > > > another
> > > > > > > > > > > > new
> > > > > > > > > > > > synonym
> > > > > > > > > > > > for the
> > > > > > > > > > > > device cache?  It's already really confusing
> > > > > > > > > > > > that
> > > > > > > > > > > > some of
> > > > > > > > > > > > our
> > > > > > > > > > > > hardware docs use
> > > > > > > > > > > > a mix of both "L2" and "L3" to refer to the
> > > > > > > > > > > > same
> > > > > > > > > > > > device
> > > > > > > > > > > > cache
> > > > > > > > > > > > for
> > > > > > > > > > > > historical
> > > > > > > > > > > > reasons...
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Matt
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > A couple of comments below.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > V2(MattA): Expand commit description
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > >      drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > > > > > > > > > > >      drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > > > > > > > > > > > >      drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > > > > > > > > > >      drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > > > > > > > > > > >      4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >      		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > > > > > > > > > > >      			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > > > > > > > > > > -			continue;
> > > > > > > > > > > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > > > > > > > > > > +				continue;
> > > > > > > > > > > 
> > > > > > > > > > > Matt R,
> > > > > > > > > > > This flush will still be needed, as there can be
> > > > > > > > > > > non-XA buffers which are evicted while media is
> > > > > > > > > > > off, and their stale entries can then be flushed
> > > > > > > > > > > when media comes back on. This was not the case
> > > > > > > > > > > earlier, since a full flush happened at regular
> > > > > > > > > > > sync points, which is exactly where this feature
> > > > > > > > > > > brings the optimization now.
> > > > > > > > > > > 
> > > > > > > > > > > Tejas
> > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > This will trigger a TLB invalidation (and I
> > > > > > > > > > > > > > assume a
> > > > > > > > > > > > > > cache
> > > > > > > > > > > > > > flush)
> > > > > > > > > > > > > > every time we move or free memory in the 3D
> > > > > > > > > > > > > > stack
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > has a
> > > > > > > > > > > > > > binding. It also performs a synchronous
> > > > > > > > > > > > > > wait on
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > BO
> > > > > > > > > > > > > > being
> > > > > > > > > > > > > > idle.
> > > > > > > > > > > > > > Both of these are very expensive
> > > > > > > > > > > > > > operations. I
> > > > > > > > > > > > > > can’t
> > > > > > > > > > > > > > imagine
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > granularity we want here is to do this on
> > > > > > > > > > > > > > every
> > > > > > > > > > > > > > move/free
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > bindings.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Also, for LR compute with preempt fences,
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > trigger the
> > > > > > > > > > > > > > preempt fences during the wait, so a TLB
> > > > > > > > > > > > > > invalidation
> > > > > > > > > > > > > > after
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > seems unnecessary, though perhaps the cache
> > > > > > > > > > > > > > flush
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > still
> > > > > > > > > > > > > > required?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I think this needs a bit more explanation,
> > > > > > > > > > > > > > because
> > > > > > > > > > > > > > without
> > > > > > > > > > > > > > knowing a
> > > > > > > > > > > > > > lot about the exact requirements, the
> > > > > > > > > > > > > > implementation
> > > > > > > > > > > > > > does
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > look
> > > > > > > > > > > > correct.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The thing is that we are trying to solve the
> > > > > > > > > > > > > problem with userptr using a non-XA PAT:
> > > > > > > > > > > > > consider if that BO got moved while media is
> > > > > > > > > > > > > not active. As soon as media comes back
> > > > > > > > > > > > > active, stale cached entries of that object
> > > > > > > > > > > > > will be flushed as part of the full flush,
> > > > > > > > > > > > > which may corrupt things.
> > > > > > > > > > > > > There was thinking that with this patch we
> > > > > > > > > > > > > would at least solve the corruption problem,
> > > > > > > > > > > > > and later, when the page reclamation feature
> > > > > > > > > > > > > comes in, it would help performance as well.
> > > > > > > > > > > > > But now that the page reclamation feature was
> > > > > > > > > > > > > merged earlier and is tightly coupled with
> > > > > > > > > > > > > bind/unbind, some cases like those discussed
> > > > > > > > > > > > > above (which do not unbind immediately on
> > > > > > > > > > > > > move/free) are missed in reclamation.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > So the thought was to let this solution go in
> > > > > > > > > > > > > with a little perf hit, and to discuss with
> > > > > > > > > > > > > the page reclamation owner to come up with a
> > > > > > > > > > > > > cleaner solution together.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Tejas
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >      		}
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >      		if (!idle) {
> > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > > > > > > > > > > >      	}
> > > > > > > > > > > > > > > >      }
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > + * xe_device_needs_cache_flush -
> > > > > > > > > > > > > > > > Whether the
> > > > > > > > > > > > > > > > cache
> > > > > > > > > > > > > > > > needs
> > > > > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > +flushed
> > > > > > > > > > > > > > > > + * @xe: The device to check.
> > > > > > > > > > > > > > > > + *
> > > > > > > > > > > > > > > > + * Return: true if the device needs
> > > > > > > > > > > > > > > > cache
> > > > > > > > > > > > > > > > flush,
> > > > > > > > > > > > > > > > false
> > > > > > > > > > > > > > > > otherwise.
> > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > > xe_device_needs_cache_flush(struct
> > > > > > > > > > > > > > > > xe_device
> > > > > > > > > > > > > > > > *xe) {
> > > > > > > > > > > > > > > > +	/* XA is *always* flushed,
> > > > > > > > > > > > > > > > like at
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > end-
> > > > > > > > > > > > > > > > of-
> > > > > > > > > > > > > > > > submssion (and
> > > > > > > > > > > > > > > > +maybe
> > > > > > > > > > > > > > other
> > > > > > > > > > > > > > > > +	 * places), just that
> > > > > > > > > > > > > > > > internally as
> > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > optimisation hw doesn't
> > > > > > > > > > > > > > > > +need to
> > > > > > > > > > > > > > make
> > > > > > > > > > > > > > > > +	 * that a full flush (which
> > > > > > > > > > > > > > > > will
> > > > > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > include XA)
> > > > > > > > > > > > > > > > when Media is
> > > > > > > > > > > > > > > > +	 * off/powergated, since it
> > > > > > > > > > > > > > > > doesn't
> > > > > > > > > > > > > > > > need
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > worry
> > > > > > > > > > > > > > > > about GT
> > > > > > > > > > > > > > > > +caches vs
> > > > > > > > > > > > > > Media
> > > > > > > > > > > > > > > > +	 * coherency, and only CPU vs
> > > > > > > > > > > > > > > > GPU
> > > > > > > > > > > > > > > > coherency,
> > > > > > > > > > > > > > > > so
> > > > > > > > > > > > > > > > can make
> > > > > > > > > > > > that
> > > > > > > > > > > > > > > > +flush
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > +	 * targeted XA flush, since
> > > > > > > > > > > > > > > > stuff
> > > > > > > > > > > > > > > > tagged
> > > > > > > > > > > > > > > > with XA
> > > > > > > > > > > > > > > > now means
> > > > > > > > > > > > > > > > +it's shared
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > +	 * the CPU. The main
> > > > > > > > > > > > > > > > implication is
> > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > now
> > > > > > > > > > > > > > > > need to
> > > > > > > > > > > > > > > > +somehow
> > > > > > > > > > > > > > flush non-XA before
> > > > > > > > > > > > > > > > +	 * freeing system memory
> > > > > > > > > > > > > > > > pages,
> > > > > > > > > > > > > > > > otherwise
> > > > > > > > > > > > > > > > dirty
> > > > > > > > > > > > > > > > cachelines
> > > > > > > > > > > > > > > > +could be
> > > > > > > > > > > > > > flushed after the free
> > > > > > > > > > > > > > > > +	 * (like if Media suddenly
> > > > > > > > > > > > > > > > turns on
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > full flush)
> > > > > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > > > > +	if (GRAPHICS_VER(xe) >= 35 &&
> > > > > > > > > > > > > > > > !IS_DGFX(xe))
> > > > > > > > > > > > > > > > +		return true;
> > > > > > > > > > > > > > > > +	return false;
> > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >      void xe_device_l2_flush(struct
> > > > > > > > > > > > > > > > xe_device
> > > > > > > > > > > > > > > > *xe)
> > > > > > > > > > > > > > > > {
> > > > > > > > > > > > > > > >      	struct xe_gt *gt;
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > > > > b/drivers/gpu/drm/xe/xe_device.h index
> > > > > > > > > > > > > > > > 39464650533b..baf386e0e037
> > > > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > > > > @@ -184,6 +184,7 @@ void
> > > > > > > > > > > > > > > > xe_device_snapshot_print(struct
> > > > > > > > > > > > > > > > xe_device *xe, struct drm_printer *p);
> > > > > > > > > > > > > > > >      u64
> > > > > > > > > > > > > > > > xe_device_canonicalize_addr(struct
> > > > > > > > > > > > > > > > xe_device
> > > > > > > > > > > > > > > > *xe, u64
> > > > > > > > > > > > > > > > address);
> > > > > > > > > > > > > > > >      u64
> > > > > > > > > > > > > > > > xe_device_uncanonicalize_addr(struct
> > > > > > > > > > > > > > > > xe_device
> > > > > > > > > > > > > > > > *xe,
> > > > > > > > > > > > > > > > u64
> > > > > > > > > > > > > > > > address);
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > > xe_device_needs_cache_flush(struct
> > > > > > > > > > > > > > > > xe_device
> > > > > > > > > > > > > > > > *xe);
> > > > > > > > > > > > > > > >      void xe_device_td_flush(struct
> > > > > > > > > > > > > > > > xe_device
> > > > > > > > > > > > > > > > *xe);
> > > > > > > > > > > > > > > > void
> > > > > > > > > > > > > > > > xe_device_l2_flush(struct xe_device
> > > > > > > > > > > > > > > > *xe);
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > > > > b/drivers/gpu/drm/xe/xe_userptr.c index
> > > > > > > > > > > > > > > > e120323c43bc..b435ea7f9b66
> > > > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > > > > @@ -114,7 +114,8 @@ static void
> > > > > > > > > > > > > > > > __vma_userptr_invalidate(struct
> > > > > > > > > > > > > > > > xe_vm
> > > > > > > > > > > > > > *vm, struct xe_userptr_vma *uv
> > > > > > > > > > > > > > > >      				   
> > > > > > > > > > > > > > > > false,
> > > > > > > > > > > > > > > > MAX_SCHEDULE_TIMEOUT);
> > > > > > > > > > > > > > > >      	XE_WARN_ON(err <= 0);
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > -	if (xe_vm_in_fault_mode(vm) &&
> > > > > > > > > > > > > > > > userptr-
> > > > > > > > > > > > > > > > > initial_bind) {
> > > > > > > > > > > > > > > > +	if ((xe_vm_in_fault_mode(vm)
> > > > > > > > > > > > > > > > ||
> > > > > > > > > > > > > > > > +xe_device_needs_cache_flush(vm-
> > > > > > > > > > > > > > > xe)) &&
> > > > > > > > > > > > > > > > +	    userptr->initial_bind) {
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Same concern with the LR preempt fence as above — the hardware will be
> > > > > > > > > > > > > > interrupted via preempt fences, so it doesn’t seem necessary to
> > > > > > > > > > > > > > invalidate the TLBs, but perhaps we need a cflush and TLB invalidation
> > > > > > > > > > > > > > is the mechanism for that too?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >      		err = xe_vm_invalidate_vma(vma);
> > > > > > > > > > > > > > > >      		XE_WARN_ON(err);
> > > > > > > > > > > > > > > >      	}
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > 2.52.0
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Matt Roper
> > > > > > > > > > > > > > > Graphics Software Engineer
> > > > > > > > > > > > > > > Linux GPU Platform Enablement
> > > > > > > > > > > > > > > Intel Corporation
> > > > > > > > > > > > 
> > > > > > > > > > > > --
> > > > > > > > > > > > Matt Roper
> > > > > > > > > > > > Graphics Software Engineer
> > > > > > > > > > > > Linux GPU Platform Enablement
> > > > > > > > > > > > Intel Corporation
> > > > > > > > > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-16 16:41                             ` Matthew Auld
  2026-02-17  6:19                               ` Upadhyay, Tejas
@ 2026-02-17 17:04                               ` Thomas Hellström
  2026-02-17 18:41                                 ` Matthew Auld
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Hellström @ 2026-02-17 17:04 UTC (permalink / raw)
  To: Matthew Auld, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On Mon, 2026-02-16 at 16:41 +0000, Matthew Auld wrote:
> On 16/02/2026 15:38, Thomas Hellström wrote:
> > On Mon, 2026-02-16 at 14:55 +0000, Matthew Auld wrote:
> > > On 16/02/2026 12:07, Thomas Hellström wrote:
> > > > On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
> > > > > On 16/02/2026 10:23, Thomas Hellström wrote:
> > > > > > On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
> > > > > > > On 13/02/2026 17:16, Matt Roper wrote:
> > > > > > > > On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
> > > > > > > > > On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > > > > > > Sent: 12 February 2026 02:41
> > > > > > > > > > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > > > > > > > > > Cc: Brost, Matthew <matthew.brost@intel.com>; intel-xe@lists.freedesktop.org;
> > > > > > > > > > > Auld, Matthew <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com
> > > > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > > > > > > > > > > cachelines manually
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > > > > > > > > > Sent: 11 February 2026 05:32
> > > > > > > > > > > > > > > To: Roper, Matthew D <matthew.d.roper@intel.com>
> > > > > > > > > > > > > > > Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-xe@lists.freedesktop.org;
> > > > > > > > > > > > > > > Auld, Matthew <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com
> > > > > > > > > > > > > > > Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
> > > > > > > > > > > > > > > cachelines manually
> > > > > > > > > > > > > > > On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
> > > > > > > > > > > > > > > > On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
> > > > > > > > > > > > > > > "eXtended Architecture" (XA) tagged
> > > > > > > > > > > > > > > memory—memory
> > > > > > > > > > > > > > > shared
> > > > > > > > > > > between
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > CPU and GPU
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I'm pretty sure this expansion of "XA" is
> > > > > > > > > > > > > > wrong;
> > > > > > > > > > > > > > where
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > seeing this definition?  Everything in the
> > > > > > > > > > > > > > bspec
> > > > > > > > > > > > > > indicates
> > > > > > > > > > > > > > that XA
> > > > > > > > > > > > > > means "wb
> > > > > > > > > > > > > > - transient app" (similar to how "XD" is
> > > > > > > > > > > > > > 'wb -
> > > > > > > > > > > > > > transient
> > > > > > > > > > > > > > display").
> > > > > > > > > > > > > > I'm not sure why exactly they picked "X" to
> > > > > > > > > > > > > > refer
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > transient in
> > > > > > > > > > > > > > both of these cases, but I've never seen
> > > > > > > > > > > > > > any
> > > > > > > > > > > > > > documentation
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > refers to it as "extended."
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > is treated differently from other GPU memory when the Media engine is power-gated.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > XA is *always* flushed, like at the end-of-submission (and maybe other
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I assume you're referring to the fact that the driver performs flushes
> > > > > > > > > > > > > > at the end of submission (via PIPE_CONTROL or MI_FLUSH_DW), and that
> > > > > > > > > > > > > > depending on other state/optimizations in the system, those flushes
> > > > > > > > > > > > > > may flush the entire device cache, or may only flush the subset of
> > > > > > > > > > > > > > cache data that is not marked as transient.  The way you worded this
> > > > > > > > > > > > > > was confusing since it makes it sound like cache flushes happen
> > > > > > > > > > > > > > automatically somewhere in hardware/firmware.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > places), just that internally as an optimisation hw doesn't need to
> > > > > > > > > > > > > > > make that a full flush (which will also include XA) when Media is
> > > > > > > > > > > > > > > off/powergated, since it doesn't need to worry about GT caches vs
> > > > > > > > > > > > > > > Media coherency, and only CPU vs GPU coherency, so can make that
> > > > > > > > > > > > > > > flush a targeted XA flush, since stuff tagged with XA now means it's
> > > > > > > > > > > > > > > shared with the CPU. The main implication is that we now need to
> > > > > > > > > > > > > > > somehow flush non-XA before freeing system memory pages, otherwise
> > > > > > > > > > > > > > > dirty cachelines could be flushed after the free (like if Media
> > > > > > > > > > > > > > > suddenly turns on and does a full flush)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > This description seems really confusing.  My understanding is that
> > > > > > > > > > > > > > marking something as wb-transient-app indicates that it might be
> > > > > > > > > > > > > > accessed by something other than our graphics/media IP (i.e.,
> > > > > > > > > > > > > > accessed from the CPU, exported to another device, etc.), so
> > > > > > > > > > > > > > transient data truly does need to be flushed at the points in the
> > > > > > > > > > > > > > driver where a flush typically happens.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > However when something is _not_ transient, then either:
> > > > > > > > > > > > > >      - it's "private" to the GPU and only our graphics/media IP will
> > > > > > > > > > > > > >        be accessing it
> > > > > > > > > > > > > >      - it's bound with a coherent PAT index so that outside observers
> > > > > > > > > > > > > >        like the CPU can snoop the device cache, even when the cache
> > > > > > > > > > > > > >        hasn't been flushed
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If media is not active, then there's really no need to include
> > > > > > > > > > > > > > non-transient data when a device cache flush happens since there's
> > > > > > > > > > > > > > no real need for the data to get to RAM.  So that enables an
> > > > > > > > > > > > > > optimization (which comes in your next patch), that allows flushes
> > > > > > > > > > > > > > to only operate on the subset of the device cache tagged as
> > > > > > > > > > > > > > "transient" if media is idle.
> > > > > > > > > > > > 
> > > > > > > > > > > > But what if we have stale non-XA marked pages for userptr, and that
> > > > > > > > > > > > object moves out and at the same time media comes back? We will end
> > > > > > > > > > > > up in a full flush and flush the stale entry to RAM.
> > > > > > > > > > > 
> > > > > > > > > > > What makes userptr special here?  During general, active usage,
> > > > > > > > > > > userptr would be data that's accessible by the CPU, so it needs to
> > > > > > > > > > > either be transient (so CPU can see the data in RAM after explicit
> > > > > > > > > > > flushes) or it needs to be using a coherent PAT (so that the CPU can
> > > > > > > > > > > just snoop the GPU cache).  If you marked userptr as both non-XA and
> > > > > > > > > > > non-coherent, then that sounds likely to be a userspace bug (and
> > > > > > > > > > > probably something we can catch and reject as an invalid case on any
> > > > > > > > > > > Xe3p or later platforms that support this) since the CPU wouldn't
> > > > > > > > > > > have any reliable way of seeing GPU updates.
> > > > > > > > > > 
> > > > > > > > > > Right. FYI @Mrozek, Michal @Souza, Jose
> > > > > > > > > > For userptr, as explained above, it needs to be either a coherent or
> > > > > > > > > > XA pat index, or else KMD will reject it as an invalid case.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > The coherency restriction is already in the uAPI:
> > > > > > > > > 
> > > > > > > > > "Note: For userptr and externally imported dma-buf the kernel expects
> > > > > > > > > either 1WAY or 2WAY for the @pat_index."
> > > > > > > > > 
> > > > > > > > > Using 1 way is enough as Xe KMD does a PIPE_CONTROL flushing GPU
> > > > > > > > > caches at the end of batch buffers.
> > > > > > > > 
> > > > > > > > But isn't that what we're discussing here?  1-way *won't* necessarily
> > > > > > > > be enough anymore because PIPE_CONTROL instructions don't flush the
> > > > > > > > entire cache anymore.  Whenever the GuC determines that media is
> > > > > > > > inactive and activates the optimization, PIPE_CONTROL, MI_FLUSH_DW,
> > > > > > > > etc. change behavior to only flush out the subset of data that was
> > > > > > > > marked as app-transient; anything not marked that way doesn't get
> > > > > > > > flushed now.  So there's a new requirement here that you ensure you're
> > > > > > > > using an XA PAT index, or you switch to use 2-way coherency which will
> > > > > > > > allow the CPU to snoop the GPU's caches.
> > > > > > > 
> > > > > > > That exactly matches my understanding also.
> > > > > > 
> > > > > > This only ever affects IGFX, right? Since AFAIU we don't have 2-way
> > > > > > coherency with DGFX?
> > > > > 
> > > > > Yeah, this should be igpu only. I seem to also recall that on dgpu,
> > > > > Media is coherent with l2/l3, but also I don't think system memory can
> > > > > be cached in l2/l3 (only VRAM), which I assume is why there is the
> > > > > special SMRO (system-memory-read-only) cache only on dgpu, which is
> > > > > flushed when the fence signals, unlike the l2/l3.
> > > > 
> > > > Yes that sounds reasonable.
> > > > 
> > > > > 
> > > > > > 
> > > > > > It sounds like the same PAT restriction is needed also for imported
> > > > > > dma-buf, right?
> > > > > 
> > > > > Good point. Looks like we are missing that still. Otherwise we can
> > > > > run into the same issues with stale l2/l3/ppc.
> > > > 
> > > > So if this affects only system memory, could we, instead of relying on
> > > > 2-way coherency or XA, just flush at dma unmap time, because that's
> > > > typically just before releasing the pages?
> > > 
> > > Yeah, I think we could make it work, from a security pov, similar to
> > > userptr, with the right manual flushes in KMD. Maybe just a question of
> > > whether userspace wants such a model? Anything cached in l2/l3 might
> > > require manual flushing by userspace (if that is even possible)?
> > 
> > So that would mean if user-space wants gpu-cpu coherency at fence
> > synchronization points, they'd have to use either 2-way or XA pat
> > indices, but not enforced by KMD.
> 
> Yeah, looking at BSpec 74635 (Media off case), I'm only really seeing
> MEM_SET which userspace could potentially use by itself? But then it's
> unclear if they mean to actually clear-the-memory (which is not what we
> want) or using the special evict mode, but that seems to be talking more
> about flushing to local memory, so not completely sure what that does on
> igpu. If it's the evict mode then it should in theory be possible for
> userspace to do a manual flush, but that would have to be done
> per-bo/vma?
> 
> > 
> > For imported dma-buf the kernel requires 2-way or XA for security due
> > to the relaxed dma-buf unmap.
> > 
> > For SVM/System allocator we'd require 2-way or XA.
> > 
> > Otherwise KMD security is enforced by flush at dma-unmap time?
> 
> Yeah, that is my understanding. Otherwise I don't currently see what
> prevents the dirty non-XA cache lines being flushed at some random point
> later, after we have already freed the corresponding system memory,
> potentially nuking the next user who allocates those pages.

So I've discussed a bit more with Tejas, and since the virtual addresses
are needed for the flush, flushing at dma-unmap time doesn't really
work. And since this is IGFX only, where we sync on moves, a flush in
xe_bo_trigger_rebind() should be completely OK, at least until an
affected DGFX part appears, where we might want to look at async TLB
flushes.

And for simplicity, then go for the PAT restriction also for userptr,
SVM and imported dma-buf.

Thoughts?

Thomas

> 
> > 
> > /Thomas
> > 
> > > 
> > > > 
> > > > The exception, though, is dma-buf where the exporter can
> > > > actually
> > > > release memory before all importers have given up their dma-
> > > > mappings.
> > > > 
> > > > /Thomas
> > > > 
> > > > > 
> > > > > > 
> > > > > > /Thomas
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > If something happens that changes the GTT mapping of an object, then
> > > > > > > > > > > doesn't that already trigger a TLB invalidation when necessary in the
> > > > > > > > > > > driver today?  It was my understanding that "heavy" TLB invalidations
> > > > > > > > > > > wait for data values to be globally observable before starting, so I
> > > > > > > > > > > think that would ensure that any non-XA data makes it to RAM before
> > > > > > > > > > > any binding changes, object destruction, etc.?  Is there something
> > > > > > > > > > > special about userptr that makes that case more of a problem?
> > > > > > > > > > > 
> > > > > > > > > > > I just found bspec page 74635 which gives an overview of the various
> > > > > > > > > > > flush and invalidate cases, and I don't see anything there that makes
> > > > > > > > > > > it obvious to me that userptr would be special.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > As you said, we eventually do want to force a flush of the
> > > > > > > > > > > > > > non-transient data as well once we're freeing the underlying pages.
> > > > > > > > > > > > > > So how do we do that?  It's not clear to me how the changes below
> > > > > > > > > > > > > > are accomplishing that.  Is there a way to explicitly request a
> > > > > > > > > > > > > > full device cache flush (ignoring the transient vs non-transient
> > > > > > > > > > > > > > tagging)?  Since the GuC handles the optimization in the next patch
> > > > > > > > > > > > > > (toggling whether flushes are full flushes vs non-transient flushes
> > > > > > > > > > > > > > depending on whether media is active), I thought there might be
> > > > > > > > > > > > > > some kind of GuC interface to request "please do one full flush
> > > > > > > > > > > > > > now, even if media is idle."
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I’m not an expert here by any means, but everything above from Matt
> > > > > > > > > > > > > seems like valid concerns. Thomas also raised some concerns in the
> > > > > > > > > > > > > two previous revisions; again I’m not an expert, but reading through
> > > > > > > > > > > > > those, it doesn’t really seem like he received proper answers to his
> > > > > > > > > > > > > questions.
> > > > > > > > > > > > 
> > > > > > > > > > > > It's forcing a flush via the TLB invalidation PPC flag under
> > > > > > > > > > > > xe_invalidate_vma().
> > > > > > > > > > > 
> > > > > > > > > > > By the way, what is "PPC?"  It seems like it's another new
> > > > > > > > > > > synonym for the device cache?  It's already really confusing that
> > > > > > > > > > > some of our hardware docs use a mix of both "L2" and "L3" to
> > > > > > > > > > > refer to the same device cache for historical reasons...
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Matt
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > A couple of comments below.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > V2(MattA): Expand commit description
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
> > > > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
> > > > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_device.h  |  1 +
> > > > > > > > > > > > > > >  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
> > > > > > > > > > > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > > > index e9180b01a4e4..4455886b211e 100644
> > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > > > > @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  		if (!xe_vm_in_fault_mode(vm)) {
> > > > > > > > > > > > > > >  			drm_gpuvm_bo_evict(vm_bo, true);
> > > > > > > > > > > > > > > -			continue;
> > > > > > > > > > > > > > > +			if (!xe_device_needs_cache_flush(xe))
> > > > > > > > > > > > > > > +				continue;
> > > > > > > > > > 
> > > > > > > > > > Matt R,
> > > > > > > > > > This flush will still be needed, as there can be non-XA buffers
> > > > > > > > > > which can be evicted while media was off, and the stale entries can
> > > > > > > > > > be flushed when media comes back on. That was not the case earlier,
> > > > > > > > > > as a full flush was happening at regular sync points, and that's
> > > > > > > > > > where this feature brings the optimization now.
> > > > > > > > > > 
> > > > > > > > > > Tejas
> > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This will trigger a TLB invalidation (and I assume a cache
> > > > > > > > > > > > > flush) every time we move or free memory in the 3D stack if it
> > > > > > > > > > > > > has a binding. It also performs a synchronous wait on the BO
> > > > > > > > > > > > > being idle. Both of these are very expensive operations. I
> > > > > > > > > > > > > can’t imagine the granularity we want here is to do this on
> > > > > > > > > > > > > every move/free with bindings.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Also, for LR compute with preempt fences, we would trigger the
> > > > > > > > > > > > > preempt fences during the wait, so a TLB invalidation after
> > > > > > > > > > > > > this seems unnecessary, though perhaps the cache flush is
> > > > > > > > > > > > > still required?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think this needs a bit more explanation, because without
> > > > > > > > > > > > > knowing a lot about the exact requirements, the implementation
> > > > > > > > > > > > > does not look correct.
> > > > > > > > > > > > 
> > > > > > > > > > > > The thing is that we are trying to solve a problem with userptr
> > > > > > > > > > > > with a non-XA PAT; consider if that BO got moved while media is
> > > > > > > > > > > > not active. As soon as media comes back active, stale cached
> > > > > > > > > > > > entries of that object will be flushed as part of a full flush,
> > > > > > > > > > > > which may corrupt things.
> > > > > > > > > > > > The thinking was that with this patch we would at least solve
> > > > > > > > > > > > the corruption problem, and later, when the page_reclamation
> > > > > > > > > > > > feature comes in, it will help performance as well. But now
> > > > > > > > > > > > that the page reclamation feature is merged earlier and is
> > > > > > > > > > > > tightly coupled with bind/unbind, some cases like those
> > > > > > > > > > > > discussed above (which are not doing an unbind immediately on
> > > > > > > > > > > > move/free) are missed in reclamation.
> > > > > > > > > > > > 
> > > > > > > > > > > > So the thought was to let this solution go in with a little
> > > > > > > > > > > > perf hit, and discuss with the page reclamation owner to come
> > > > > > > > > > > > up with a cleaner solution together.
> > > > > > > > > > > > 
> > > > > > > > > > > > Tejas
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > >      		}
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >      		if (!idle) {
> > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > > > index 743c18e0c580..da2abed94bc0 100644
> > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > > > > > > > > > > @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
> > > > > > > > > > > > > > >  	}
> > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
> > > > > > > > > > > > > > > + * @xe: The device to check.
> > > > > > > > > > > > > > > + *
> > > > > > > > > > > > > > > + * Return: true if the device needs cache flush, false otherwise.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe)
> > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > +	/* XA is *always* flushed, like at the end-of-submission (and maybe
> > > > > > > > > > > > > > > +	 * other places), just that internally as an optimisation hw doesn't
> > > > > > > > > > > > > > > +	 * need to make that a full flush (which will also include XA) when
> > > > > > > > > > > > > > > +	 * Media is off/powergated, since it doesn't need to worry about GT
> > > > > > > > > > > > > > > +	 * caches vs Media coherency, and only CPU vs GPU coherency, so can
> > > > > > > > > > > > > > > +	 * make that flush a targeted XA flush, since stuff tagged with XA
> > > > > > > > > > > > > > > +	 * now means it's shared with the CPU. The main implication is that
> > > > > > > > > > > > > > > +	 * we now need to somehow flush non-XA before freeing system memory
> > > > > > > > > > > > > > > +	 * pages, otherwise dirty cachelines could be flushed after the
> > > > > > > > > > > > > > > +	 * free (like if Media suddenly turns on and does a full flush)
> > > > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > > > +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
> > > > > > > > > > > > > > > +		return true;
> > > > > > > > > > > > > > > +	return false;
> > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  void xe_device_l2_flush(struct xe_device *xe)
> > > > > > > > > > > > > > >  {
> > > > > > > > > > > > > > >  	struct xe_gt *gt;
> > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > > > index 39464650533b..baf386e0e037 100644
> > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > > > > > > > > > > @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > > > > > > > > > > > > > >  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > > > > > > >  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +bool xe_device_needs_cache_flush(struct xe_device *xe);
> > > > > > > > > > > > > > >  void xe_device_td_flush(struct xe_device *xe);
> > > > > > > > > > > > > > >  void xe_device_l2_flush(struct xe_device *xe);
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > > > index e120323c43bc..b435ea7f9b66 100644
> > > > > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_userptr.c
> > > > > > > > > > > > > > > @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
> > > > > > > > > > > > > > >  				   false, MAX_SCHEDULE_TIMEOUT);
> > > > > > > > > > > > > > >  	XE_WARN_ON(err <= 0);
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
> > > > > > > > > > > > > > > +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
> > > > > > > > > > > > > > > +	    userptr->initial_bind) {
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Same concern with the LR preempt fence as above — the hardware
> > > > > > > > > > > > > will be interrupted via preempt fences, so it doesn’t seem
> > > > > > > > > > > > > necessary to invalidate the TLBs, but perhaps we need a cflush
> > > > > > > > > > > > > and TLB invalidation is the mechanism for that too?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Matt
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  		err = xe_vm_invalidate_vma(vma);
> > > > > > > > > > > > > > >  		XE_WARN_ON(err);
> > > > > > > > > > > > > > >  	}
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.52.0
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Matt Roper
> > > > > > > > > > > > > > Graphics Software Engineer
> > > > > > > > > > > > > > Linux GPU Platform Enablement
> > > > > > > > > > > > > > Intel Corporation
> > > > > > > > > > > 
> > > > > > > > > > > --
> > > > > > > > > > > Matt Roper
> > > > > > > > > > > Graphics Software Engineer
> > > > > > > > > > > Linux GPU Platform Enablement
> > > > > > > > > > > Intel Corporation
> > > > > > > > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually
  2026-02-17 17:04                               ` Thomas Hellström
@ 2026-02-17 18:41                                 ` Matthew Auld
  0 siblings, 0 replies; 39+ messages in thread
From: Matthew Auld @ 2026-02-17 18:41 UTC (permalink / raw)
  To: Thomas Hellström, Matt Roper, Souza, Jose
  Cc: Upadhyay, Tejas, Mrozek, Michal, intel-xe@lists.freedesktop.org,
	Brost, Matthew

On 17/02/2026 17:04, Thomas Hellström wrote:
> On Mon, 2026-02-16 at 16:41 +0000, Matthew Auld wrote:
>> On 16/02/2026 15:38, Thomas Hellström wrote:
>>> On Mon, 2026-02-16 at 14:55 +0000, Matthew Auld wrote:
>>>> On 16/02/2026 12:07, Thomas Hellström wrote:
>>>>> On Mon, 2026-02-16 at 10:58 +0000, Matthew Auld wrote:
>>>>>> On 16/02/2026 10:23, Thomas Hellström wrote:
>>>>>>> On Fri, 2026-02-13 at 17:31 +0000, Matthew Auld wrote:
>>>>>>>> On 13/02/2026 17:16, Matt Roper wrote:
>>>>>>>>> On Fri, Feb 13, 2026 at 04:48:39PM +0000, Souza, Jose wrote:
>>>>>>>>>> On Fri, 2026-02-13 at 16:23 +0000, Upadhyay, Tejas wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>>>> Sent: 12 February 2026 02:41
>>>>>>>>>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
>>>>>>>>>>>> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-xe@lists.freedesktop.org;
>>>>>>>>>>>> Auld, Matthew <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com
>>>>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>>>>>>>>>> cachelines manually
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 11, 2026 at 07:06:05PM +0000, Upadhyay, Tejas wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>>>>>>>>>>>> Sent: 11 February 2026 05:32
>>>>>>>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-xe@lists.freedesktop.org;
>>>>>>>>>>>>>> Auld, Matthew <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com
>>>>>>>>>>>>>> Subject: Re: [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo
>>>>>>>>>>>>>> cachelines manually
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 10, 2026 at 01:05:25PM -0800, Matt Roper wrote:
>>>>>>>>>>>>>>> On Tue, Feb 10, 2026 at 06:21:22PM +0530, Tejas Upadhyay wrote:
>>>>>>>>>>>>>>>> "eXtended Architecture" (XA) tagged memory—memory shared between
>>>>>>>>>>>>>>>> the CPU and GPU
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm pretty sure this expansion of "XA" is wrong; where are you
>>>>>>>>>>>>>>> seeing this definition?  Everything in the bspec indicates that
>>>>>>>>>>>>>>> XA means "wb - transient app" (similar to how "XD" is "wb -
>>>>>>>>>>>>>>> transient display").  I'm not sure why exactly they picked "X"
>>>>>>>>>>>>>>> to refer to transient in both of these cases, but I've never
>>>>>>>>>>>>>>> seen any documentation that refers to it as "extended."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> is treated differently from other GPU memory when the Media
>>>>>>>>>>>>>>>> engine is power-gated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> XA is *always* flushed, like at the end-of-submission (and
>>>>>>>>>>>>>>>> maybe other
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I assume you're referring to the fact that the driver performs
>>>>>>>>>>>>>>> flushes at the end of submission (via PIPE_CONTROL or
>>>>>>>>>>>>>>> MI_FLUSH_DW), and that depending on other state/optimizations
>>>>>>>>>>>>>>> in the system, those flushes may flush the entire device cache,
>>>>>>>>>>>>>>> or may only flush the subset of cache data that is not marked
>>>>>>>>>>>>>>> as transient.  The way you worded this was confusing since it
>>>>>>>>>>>>>>> makes it sound like cache flushes happen automatically
>>>>>>>>>>>>>>> somewhere in hardware/firmware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> places), just that internally as an optimisation hw doesn't
>>>>>>>>>>>>>>>> need to make that a full flush (which will also include XA)
>>>>>>>>>>>>>>>> when Media is off/powergated, since it doesn't need to worry
>>>>>>>>>>>>>>>> about GT caches vs Media coherency, and only CPU vs GPU
>>>>>>>>>>>>>>>> coherency, so can make that flush a targeted XA flush, since
>>>>>>>>>>>>>>>> stuff tagged with XA now means it's shared with the CPU. The
>>>>>>>>>>>>>>>> main implication is that we now need to somehow flush non-XA
>>>>>>>>>>>>>>>> before freeing system memory pages, otherwise dirty cachelines
>>>>>>>>>>>>>>>> could be flushed after the free (like if Media suddenly turns
>>>>>>>>>>>>>>>> on and does a full flush)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This description seems really confusing.  My understanding is
>>>>>>>>>>>>>>> that marking something as wb-transient-app indicates that it
>>>>>>>>>>>>>>> might be accessed by something other than our graphics/media IP
>>>>>>>>>>>>>>> (i.e., accessed from the CPU, exported to another device,
>>>>>>>>>>>>>>> etc.), so transient data truly does need to be flushed at the
>>>>>>>>>>>>>>> points in the driver where a flush typically happens.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However when something is _not_ transient, then either:
>>>>>>>>>>>>>>>   - it's "private" to the GPU and only our graphics/media IP
>>>>>>>>>>>>>>>     will be accessing it
>>>>>>>>>>>>>>>   - it's bound with a coherent PAT index so that outside
>>>>>>>>>>>>>>>     observers like the CPU can snoop the device cache, even
>>>>>>>>>>>>>>>     when the cache hasn't been flushed
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If media is not active, then there's really no need to include
>>>>>>>>>>>>>>> non-transient data when a device cache flush happens since
>>>>>>>>>>>>>>> there's no real need for the data to get to RAM.  So that
>>>>>>>>>>>>>>> enables an optimization (which comes in your next patch), that
>>>>>>>>>>>>>>> allows flushes to only operate on the subset of the device
>>>>>>>>>>>>>>> cache tagged as "transient" if media is idle.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But what if we have stale non-XA marked pages for userptr, and
>>>>>>>>>>>>> that object moves out at the same time media comes back?  We will
>>>>>>>>>>>>> end up in a full flush and flush the stale entries to RAM.
>>>>>>>>>>>>
>>>>>>>>>>>> What makes userptr special here?  During general, active usage,
>>>>>>>>>>>> userptr would be data that's accessible by the CPU, so it needs to
>>>>>>>>>>>> either be transient (so the CPU can see the data in RAM after
>>>>>>>>>>>> explicit flushes) or it needs to be using a coherent PAT (so that
>>>>>>>>>>>> the CPU can just snoop the GPU cache).  If you marked userptr as
>>>>>>>>>>>> both non-XA and non-coherent, then that sounds likely to be a
>>>>>>>>>>>> userspace bug (and probably something we can catch and reject as
>>>>>>>>>>>> an invalid case on any Xe3p or later platforms that support this)
>>>>>>>>>>>> since the CPU wouldn't have any reliable way of seeing GPU
>>>>>>>>>>>> updates.
>>>>>>>>>>>
>>>>>>>>>>> Right. FYI @Mrozek, Michal @Souza, Jose
>>>>>>>>>>> For userptr, as explained above, it needs to be either a coherent
>>>>>>>>>>> or an XA PAT index, or else the KMD will reject it as an invalid
>>>>>>>>>>> case.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> The coherency restriction is already in the uAPI:
>>>>>>>>>>
>>>>>>>>>> "Note: For userptr and externally imported dma-buf the kernel
>>>>>>>>>> expects either 1WAY or 2WAY for the @pat_index."
>>>>>>>>>>
>>>>>>>>>> Using 1-way is enough, as the Xe KMD does a PIPE_CONTROL flushing
>>>>>>>>>> GPU caches at the end of batch buffers.
>>>>>>>>>
>>>>>>>>> But isn't that what we're discussing here?  1-way *won't*
>>>>>>>>> necessarily be enough anymore because PIPE_CONTROL instructions
>>>>>>>>> don't flush the entire cache anymore.  Whenever the GuC determines
>>>>>>>>> that media is inactive and activates the optimization, PIPE_CONTROL,
>>>>>>>>> MI_FLUSH_DW, etc. change behavior to only flush out the subset of
>>>>>>>>> data that was marked as app-transient; anything not marked that way
>>>>>>>>> doesn't get flushed now.  So there's a new requirement here that you
>>>>>>>>> ensure you're using an XA PAT index, or you switch to use 2-way
>>>>>>>>> coherency, which will allow the CPU to snoop the GPU's caches.
>>>>>>>>
>>>>>>>> That exactly matches my understanding also.
>>>>>>>
>>>>>>> This only ever affects IGFX, right?  Since AFAIU we don't have 2-way
>>>>>>> coherency with DGFX?
>>>>>>
>>>>>> Yeah, this should be igpu only. I seem to also recall that on dgpu,
>>>>>> Media is coherent with l2/l3, but also I don't think system memory can
>>>>>> be cached in l2/l3 (only VRAM), which I assume is why there is the
>>>>>> special SMRO (system-memory-read-only) cache only on dgpu, which is
>>>>>> flushed when the fence signals, unlike the l2/l3.
>>>>>
>>>>> Yes that sounds reasonable.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> It sounds like the same PAT restriction is also needed for imported
>>>>>>> dma-buf, right?
>>>>>>
>>>>>> Good point. Looks like we are still missing that. Otherwise we can run
>>>>>> into the same issues with stale l2/l3/ppc.
>>>>>
>>>>> So if this affects only system memory, could we, instead of relying on
>>>>> 2-way coherency or XA, just flush at dma-unmap time, because that's
>>>>> typically just before releasing the pages?
>>>>
>>>> Yeah, I think we could make it work, from a security pov, similar to
>>>> userptr, with the right manual flushes in the KMD. Maybe it's just a
>>>> question of whether userspace wants such a model? Anything cached in
>>>> l2/l3 might require manual flushing by userspace (if that is even
>>>> possible)?
>>>
>>> So that would mean if user-space wants gpu-cpu coherency at fence
>>> synchronization points, they'd have to use either 2-way or XA pat
>>> indices, but not enforced by KMD.
>>
>> Yeah, looking at BSpec 74635 (Media off case), I'm only really seeing
>> MEM_SET, which userspace could potentially use by itself? But then it's
>> unclear if they mean to actually clear the memory (which is not what we
>> want) or use the special evict mode, but that seems to be talking more
>> about flushing to local memory, so I'm not completely sure what that
>> does on igpu. If it's the evict mode, then it should in theory be
>> possible for userspace to do a manual flush, but that would have to be
>> done per-bo/vma?
>>
>>>
>>> For imported dma-buf kernel requires 2-way or XA for security due
>>> to
>>> the relaxed dma-buf unmap.
>>>
>>> For SVM/System allocator we'd require 2-way or XA.
>>>
>>> Otherwise KMD security is enforced by flush at dma-unmap time?
>>
>> Yeah, that is my understanding. Otherwise I don't currently see what
>> prevents the dirty non-XA cache lines from being flushed at some random
>> point later, after we have already freed the corresponding system
>> memory, potentially nuking the next user who allocates those pages.
> 
> So I've discussed a bit more with Tejas and since the virtual addresses
> are needed for the flush, flushing at dma-unmap time doesn't really
> work. And since this is IGFX only, where we sync on moves, a flush in
> xe_bo_trigger_rebind() should be completely ok, at least until affected
> DGFX occurs, where we might want to look at async TLB flushes.
> 
> And for simplicity, then, go for the PAT restriction also for userptr,
> svm and imported dma-buf.
> 
> Thoughts?

Yeah agreed, I think that should be good enough and will hopefully cover 
all the missing cases.

Otherwise maybe we could somehow use a known dummy address range from 
the migrate vm or something, and use that to flush the PPC from unmap? I 
assume the address range doesn't actually matter if we just care about 
flushing the entire PPC?

There is also the xe_page_reclaim stuff for a targeted flush instead of 
nuking the entire PPC, which doesn't seem to need a ppgtt virtual 
address, just a list of physical page addresses, which will be a lot 
better for the smaller BOs. But maybe this is more follow-up stuff.

> 
> Thomas
> 
>>
>>>
>>> /Thomas
>>>
>>>>
>>>>>
>>>>> The exception, though, is dma-buf where the exporter can
>>>>> actually
>>>>> release memory before all importers have given up their dma-
>>>>> mappings.
>>>>>
>>>>> /Thomas
>>>>>
>>>>>>
>>>>>>>
>>>>>>> /Thomas
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> If something happens that changes the GTT mapping of an object,
>>>>>>>>>>>> then doesn't that already trigger a TLB invalidation when
>>>>>>>>>>>> necessary in the driver today?  It was my understanding that
>>>>>>>>>>>> "heavy" TLB invalidations wait for data values to be globally
>>>>>>>>>>>> observable before starting, so I think that would ensure that any
>>>>>>>>>>>> non-XA data makes it to RAM before any binding changes, object
>>>>>>>>>>>> destruction, etc.?  Is there something special about userptr that
>>>>>>>>>>>> makes that case more of a problem?
>>>>>>>>>>>>
>>>>>>>>>>>> I just found bspec page 74635 which gives an overview of the
>>>>>>>>>>>> various flush and invalidate cases, and I don't see anything
>>>>>>>>>>>> there that makes it obvious to me that userptr would be special.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As you said, we eventually do want to force a flush of the
>>>>>>>>>>>>>>> non-transient data as well once we're freeing the underlying
>>>>>>>>>>>>>>> pages.  So how do we do that?  It's not clear to me how the
>>>>>>>>>>>>>>> changes below are accomplishing that.  Is there a way to
>>>>>>>>>>>>>>> explicitly request a full device cache flush (ignoring the
>>>>>>>>>>>>>>> transient vs non-transient tagging)?  Since the GuC handles
>>>>>>>>>>>>>>> the optimization in the next patch (toggling whether flushes
>>>>>>>>>>>>>>> are full flushes vs non-transient flushes depending on whether
>>>>>>>>>>>>>>> media is active), I thought there might be some kind of GuC
>>>>>>>>>>>>>>> interface to request "please do one full flush now, even if
>>>>>>>>>>>>>>> media is idle."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’m not an expert here by any means, but everything above from
>>>>>>>>>>>>>> Matt seems like valid concerns. Thomas also raised some
>>>>>>>>>>>>>> concerns in the two previous revisions; again I’m not an
>>>>>>>>>>>>>> expert, but reading through those, it doesn’t really seem like
>>>>>>>>>>>>>> he received proper answers to his questions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's forcing a flush via the TLB invalidation PPC flag under
>>>>>>>>>>>>> xe_vm_invalidate_vma().
>>>>>>>>>>>>
>>>>>>>>>>>> By the way, what is "PPC?"  It seems like it's another new
>>>>>>>>>>>> synonym for the device cache?  It's already really confusing that
>>>>>>>>>>>> some of our hardware docs use a mix of both "L2" and "L3" to
>>>>>>>>>>>> refer to the same device cache for historical reasons...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Matt
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A couple of comments below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Matt
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> V2(MattA): Expand commit description
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_bo.c      |  3 ++-
>>>>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.c  | 23 +++++++++++++++++++++++
>>>>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_device.h  |  1 +
>>>>>>>>>>>>>>>>  drivers/gpu/drm/xe/xe_userptr.c |  3 ++-
>>>>>>>>>>>>>>>>  4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>>>>>> index e9180b01a4e4..4455886b211e 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>>>>>>>>>>> @@ -689,7 +689,8 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  		if (!xe_vm_in_fault_mode(vm)) {
>>>>>>>>>>>>>>>>  			drm_gpuvm_bo_evict(vm_bo, true);
>>>>>>>>>>>>>>>> -			continue;
>>>>>>>>>>>>>>>> +			if (!xe_device_needs_cache_flush(xe))
>>>>>>>>>>>>>>>> +				continue;
>>>>>>>>>>>
>>>>>>>>>>> Matt R,
>>>>>>>>>>> This flush will still be needed, as there can be non-XA buffers
>>>>>>>>>>> which can be evicted while media is off, and their stale cache
>>>>>>>>>>> entries can then be flushed when media comes back on. This was not
>>>>>>>>>>> the case earlier, as a full flush was happening at regular sync
>>>>>>>>>>> points; that is exactly where this feature brings the optimization
>>>>>>>>>>> now.
>>>>>>>>>>>
>>>>>>>>>>> Tejas
>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This will trigger a TLB invalidation (and I assume a cache
>>>>>>>>>>>>>> flush) every time we move or free memory in the 3D stack if it
>>>>>>>>>>>>>> has a binding. It also performs a synchronous wait on the BO
>>>>>>>>>>>>>> being idle. Both of these are very expensive operations. I
>>>>>>>>>>>>>> can’t imagine the granularity we want here is to do this on
>>>>>>>>>>>>>> every move/free with bindings.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, for LR compute with preempt fences, we would trigger the
>>>>>>>>>>>>>> preempt fences during the wait, so a TLB invalidation after
>>>>>>>>>>>>>> this seems unnecessary, though perhaps the cache flush is still
>>>>>>>>>>>>>> required?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think this needs a bit more explanation, because without
>>>>>>>>>>>>>> knowing a lot about the exact requirements, the implementation
>>>>>>>>>>>>>> does not look correct.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The thing is that we are trying to solve a problem with userptr
>>>>>>>>>>>>> using a non-XA PAT: consider a BO that got moved while media is
>>>>>>>>>>>>> not active. As soon as media comes back on, stale cached entries
>>>>>>>>>>>>> of that object will be flushed as part of the full flush, which
>>>>>>>>>>>>> may corrupt things.
>>>>>>>>>>>>> The thinking was that with this patch we would at least solve
>>>>>>>>>>>>> the corruption problem, and later, when the page reclamation
>>>>>>>>>>>>> feature comes in, it would help performance as well. But now
>>>>>>>>>>>>> that the page reclamation feature has been merged earlier and is
>>>>>>>>>>>>> tightly coupled with bind/unbind, some cases like the one
>>>>>>>>>>>>> discussed above (which do not unbind immediately on move/free)
>>>>>>>>>>>>> are missed by reclamation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the thought was to let this solution go in with a small perf
>>>>>>>>>>>>> hit, and to discuss with the page reclamation owner to come up
>>>>>>>>>>>>> with a cleaner solution together.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tejas
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  		}
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  		if (!idle) {
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>>>>>> index 743c18e0c580..da2abed94bc0 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>>>>>>>>>> @@ -1097,6 +1097,29 @@ static void tdf_request_sync(struct xe_device *xe)
>>>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * xe_device_needs_cache_flush - Whether the cache needs to be flushed
>>>>>>>>>>>>>>>> + * @xe: The device to check.
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * Return: true if the device needs cache flush, false otherwise.
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +	/* XA is *always* flushed, like at the end-of-submission (and
>>>>>>>>>>>>>>>> +	 * maybe other places), just that internally as an optimisation
>>>>>>>>>>>>>>>> +	 * hw doesn't need to make that a full flush (which will also
>>>>>>>>>>>>>>>> +	 * include XA) when Media is off/powergated, since it doesn't
>>>>>>>>>>>>>>>> +	 * need to worry about GT caches vs Media coherency, and only
>>>>>>>>>>>>>>>> +	 * CPU vs GPU coherency, so can make that flush a targeted XA
>>>>>>>>>>>>>>>> +	 * flush, since stuff tagged with XA now means it's shared with
>>>>>>>>>>>>>>>> +	 * the CPU. The main implication is that we now need to somehow
>>>>>>>>>>>>>>>> +	 * flush non-XA before freeing system memory pages, otherwise
>>>>>>>>>>>>>>>> +	 * dirty cachelines could be flushed after the free (like if
>>>>>>>>>>>>>>>> +	 * Media suddenly turns on and does a full flush)
>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>> +	if (GRAPHICS_VER(xe) >= 35 && !IS_DGFX(xe))
>>>>>>>>>>>>>>>> +		return true;
>>>>>>>>>>>>>>>> +	return false;
>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>  	struct xe_gt *gt;
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>>>>>> index 39464650533b..baf386e0e037 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>>>>>>>>>>>> @@ -184,6 +184,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
>>>>>>>>>>>>>>>>  u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>>>>>>>  u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +bool xe_device_needs_cache_flush(struct xe_device *xe);
>>>>>>>>>>>>>>>>  void xe_device_td_flush(struct xe_device *xe);
>>>>>>>>>>>>>>>>  void xe_device_l2_flush(struct xe_device *xe);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_userptr.c b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>>>>>> index e120323c43bc..b435ea7f9b66 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_userptr.c
>>>>>>>>>>>>>>>> @@ -114,7 +114,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
>>>>>>>>>>>>>>>>  				false, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>>>>>>>>>  	XE_WARN_ON(err <= 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -	if (xe_vm_in_fault_mode(vm) && userptr->initial_bind) {
>>>>>>>>>>>>>>>> +	if ((xe_vm_in_fault_mode(vm) || xe_device_needs_cache_flush(vm->xe)) &&
>>>>>>>>>>>>>>>> +	    userptr->initial_bind) {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Same concern with the LR preempt fence as above: the hardware
>>>>>>>>>>>>>> will be interrupted via preempt fences, so it doesn’t seem
>>>>>>>>>>>>>> necessary to invalidate the TLBs, but perhaps we need a cache
>>>>>>>>>>>>>> flush and TLB invalidation is the mechanism for that too?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Matt
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  		err = xe_vm_invalidate_vma(vma);
>>>>>>>>>>>>>>>>  		XE_WARN_ON(err);
>>>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 2.52.0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Matt Roper
>>>>>>>>>>>>>>> Graphics Software Engineer
>>>>>>>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>>>>>>>> Intel Corporation
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Matt Roper
>>>>>>>>>>>> Graphics Software Engineer
>>>>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>>>>> Intel Corporation
>>>>>>>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2026-02-17 18:41 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-25  9:43 [PATCH 0/3] drm/xe/xe3p_lpg: L2 flush optimization Tejas Upadhyay
2025-11-25  9:43 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
2025-11-25 10:17   ` Matthew Auld
2025-11-25 13:39     ` Souza, Jose
2025-11-25 15:06   ` Thomas Hellström
2025-11-25 15:31     ` Upadhyay, Tejas
2025-11-26 10:26       ` Thomas Hellström
2025-11-25  9:43 ` [PATCH 2/3] drm/xe/xe3p_lpg: Enable L2 flush optimization feature Tejas Upadhyay
2025-11-25  9:43 ` [PATCH 3/3] drm/xe/xe3p: Skip TD flush Tejas Upadhyay
2025-11-25 13:20 ` ✓ CI.KUnit: success for drm/xe/xe3p_lpg: L2 flush optimization Patchwork
2025-11-25 14:47 ` ✓ Xe.CI.BAT: " Patchwork
2025-11-25 17:42 ` ✓ Xe.CI.Full: " Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2026-02-10 12:51 [PATCH 0/3] " Tejas Upadhyay
2026-02-10 12:51 ` [PATCH 1/3] drm/xe/xe3p_lpg: flush userptr/shrinker bo cachelines manually Tejas Upadhyay
2026-02-10 21:05   ` Matt Roper
2026-02-11  0:02     ` Matthew Brost
2026-02-11 19:06       ` Upadhyay, Tejas
2026-02-11 21:11         ` Matt Roper
2026-02-12  9:53           ` Matthew Auld
2026-02-13 11:17             ` Upadhyay, Tejas
2026-02-13 13:27               ` Matthew Auld
2026-02-13 13:30                 ` Souza, Jose
2026-02-13 16:23           ` Upadhyay, Tejas
2026-02-13 16:48             ` Souza, Jose
2026-02-13 17:16               ` Matt Roper
2026-02-13 17:31                 ` Souza, Jose
2026-02-13 17:31                 ` Matthew Auld
2026-02-16 10:23                   ` Thomas Hellström
2026-02-16 10:58                     ` Matthew Auld
2026-02-16 12:07                       ` Thomas Hellström
2026-02-16 14:55                         ` Matthew Auld
2026-02-16 15:38                           ` Thomas Hellström
2026-02-16 16:41                             ` Matthew Auld
2026-02-17  6:19                               ` Upadhyay, Tejas
2026-02-17  9:53                                 ` Thomas Hellström
2026-02-17 17:04                               ` Thomas Hellström
2026-02-17 18:41                                 ` Matthew Auld
2026-02-16 10:56             ` Thomas Hellström
2026-02-16 11:26               ` Upadhyay, Tejas
2026-02-13 17:29           ` Matthew Auld
