[Intel-xe] [RFC 0/5] PAT and cache coherency support

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [Intel-xe] [RFC 0/5] PAT and cache coherency support
@ 2023-08-29 16:28 Matthew Auld
  2023-08-29 16:28 ` [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode Matthew Auld
                   ` (6 more replies)
  0 siblings, 7 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-29 16:28 UTC (permalink / raw)
  To: intel-xe

Series still needs bunch of testing, but sending as RFC to get early feedback
while the IGT bits are being worked on. Note that the series also directly
depends on the WIP series here: https://patchwork.freedesktop.org/series/122708/

Goal here is to allow userspace to directly control the pat_index when mapping
memory via the ppGTT, in addtion to the CPU caching mode for system memory. This
is very much needed on newer igpu platforms which allow incoherent GT access,
where the choice over the cache level and expected coherency is best left to
userspace depending on their usecase.  In the future there may also be other
stuff encoded in the pat_index, so giving userspace direct control will also be
needed there.

To support this we added new gem_create uAPI for selecting the CPU cache
mode to use for system memory, including the expected GPU coherency mode. There
are various restrictions here for the selected coherency mode and compatible CPU
cache modes.  With that in place the actual pat_index can now be provided as
part of vm_bind. The only restriction is that the coherency mode of the
pat_index must match the gem_create coherency mode. There are also some special
cases like with userptr and dma-buf.

-- 
2.41.0

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode
  2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
@ 2023-08-29 16:28 ` Matthew Auld
  2023-08-29 18:09   ` Matt Roper
  2023-09-04 20:00   ` Souza, Jose
  2023-08-29 16:28 ` [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl Matthew Auld
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-29 16:28 UTC (permalink / raw)
  To: intel-xe
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, Matt Roper

From: Pallavi Mishra <pallavi.mishra@intel.com>

Allow userspace to specify the CPU caching mode to use for system memory
in addition to coherency modes during object creation. Modify gem create
handler and introduce xe_bo_create_user to replace xe_bo_create. In a
later patch we will support setting the pat_index as part of vm_bind,
where expectation is that the coherency mode extracted from the
pat_index must match the one set at object creation.

Co-authored-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Pallavi Mishra <pallavi.mishra@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Filip Hazubski <filip.hazubski@intel.com>
Cc: Carl Zhang <carl.zhang@intel.com>
Cc: Effie Yu <effie.yu@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c       | 95 +++++++++++++++++++++++++++-----
 drivers/gpu/drm/xe/xe_bo.h       |  3 +-
 drivers/gpu/drm/xe/xe_bo_types.h | 10 ++++
 drivers/gpu/drm/xe/xe_dma_buf.c  |  5 +-
 include/uapi/drm/xe_drm.h        | 47 +++++++++++++++-
 5 files changed, 140 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 1ab682d61e3c..f60090fe8cd2 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -339,6 +339,15 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 		extra_pages = DIV_ROUND_UP(xe_device_ccs_bytes(xe, bo->size),
 					   PAGE_SIZE);
 
+	if (bo->smem_caching) {
+		if (bo->smem_caching == XE_GEM_CACHING_WB)
+			caching = ttm_cached;
+		else if (bo->smem_caching == XE_GEM_CACHING_WC)
+			caching = ttm_write_combined;
+		else if (bo->smem_caching == XE_GEM_CACHING_UC)
+			caching = ttm_uncached;
+	}
+
 	/*
 	 * Display scanout is always non-coherent with the CPU cache.
 	 *
@@ -1183,9 +1192,10 @@ void xe_bo_free(struct xe_bo *bo)
 	kfree(bo);
 }
 
-struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
+struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
 				    struct xe_tile *tile, struct dma_resv *resv,
 				    struct ttm_lru_bulk_move *bulk, size_t size,
+				    u16 smem_caching, u16 coh_mode,
 				    enum ttm_bo_type type, u32 flags)
 {
 	struct ttm_operation_ctx ctx = {
@@ -1223,6 +1233,8 @@ struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
 	bo->tile = tile;
 	bo->size = size;
 	bo->flags = flags;
+	bo->smem_caching = smem_caching;
+	bo->coh_mode = coh_mode;
 	bo->ttm.base.funcs = &xe_gem_object_funcs;
 	bo->props.preferred_mem_class = XE_BO_PROPS_INVALID;
 	bo->props.preferred_gt = XE_BO_PROPS_INVALID;
@@ -1306,10 +1318,11 @@ static int __xe_bo_fixed_placement(struct xe_device *xe,
 }
 
 struct xe_bo *
-xe_bo_create_locked_range(struct xe_device *xe,
-			  struct xe_tile *tile, struct xe_vm *vm,
-			  size_t size, u64 start, u64 end,
-			  enum ttm_bo_type type, u32 flags)
+__xe_bo_create_locked(struct xe_device *xe,
+		      struct xe_tile *tile, struct xe_vm *vm,
+		      size_t size, u64 start, u64 end,
+		      u16 smem_caching, u16 coh_mode,
+		      enum ttm_bo_type type, u32 flags)
 {
 	struct xe_bo *bo = NULL;
 	int err;
@@ -1330,10 +1343,11 @@ xe_bo_create_locked_range(struct xe_device *xe,
 		}
 	}
 
-	bo = __xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
+	bo = ___xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
 				   vm && !xe_vm_in_fault_mode(vm) &&
 				   flags & XE_BO_CREATE_USER_BIT ?
 				   &vm->lru_bulk_move : NULL, size,
+				   smem_caching, coh_mode,
 				   type, flags);
 	if (IS_ERR(bo))
 		return bo;
@@ -1367,11 +1381,31 @@ xe_bo_create_locked_range(struct xe_device *xe,
 	return ERR_PTR(err);
 }
 
+struct xe_bo *
+xe_bo_create_locked_range(struct xe_device *xe,
+			  struct xe_tile *tile, struct xe_vm *vm,
+			  size_t size, u64 start, u64 end,
+			  enum ttm_bo_type type, u32 flags)
+{
+	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
+}
+
 struct xe_bo *xe_bo_create_locked(struct xe_device *xe, struct xe_tile *tile,
 				  struct xe_vm *vm, size_t size,
 				  enum ttm_bo_type type, u32 flags)
 {
-	return xe_bo_create_locked_range(xe, tile, vm, size, 0, ~0ULL, type, flags);
+	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
+}
+
+struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
+				struct xe_vm *vm, size_t size,
+				enum ttm_bo_type type,
+				u16 smem_caching, u16 coh_mode,
+				u32 flags)
+{
+	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
+				     smem_caching, coh_mode, type,
+				     flags | XE_BO_CREATE_USER_BIT);
 }
 
 struct xe_bo *xe_bo_create(struct xe_device *xe, struct xe_tile *tile,
@@ -1754,11 +1788,11 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 	struct ww_acquire_ctx ww;
 	struct xe_vm *vm = NULL;
 	struct xe_bo *bo;
-	unsigned int bo_flags = XE_BO_CREATE_USER_BIT;
+	unsigned int bo_flags;
 	u32 handle;
 	int err;
 
-	if (XE_IOCTL_DBG(xe, args->extensions) || XE_IOCTL_DBG(xe, args->pad) ||
+	if (XE_IOCTL_DBG(xe, args->extensions) ||
 	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
 		return -EINVAL;
 
@@ -1811,8 +1845,38 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 		bo_flags |= XE_BO_NEEDS_CPU_ACCESS;
 	}
 
-	bo = xe_bo_create(xe, NULL, vm, args->size, ttm_bo_type_device,
-			  bo_flags);
+	if (XE_IOCTL_DBG(xe, args->coh_mode > 2))
+		return -EINVAL;
+
+	if (XE_IOCTL_DBG(xe, args->smem_caching > 2))
+		return -EINVAL;
+
+        if (bo_flags & XE_BO_CREATE_SYSTEM_BIT) {
+		if (XE_IOCTL_DBG(xe, !args->coh_mode))
+			return -EINVAL;
+
+		if (XE_IOCTL_DBG(xe, !args->smem_caching))
+			return -EINVAL;
+
+		if (XE_IOCTL_DBG(xe, !IS_DGFX(xe) &&
+				 bo_flags & XE_BO_SCANOUT_BIT &&
+				 args->smem_caching == XE_GEM_CACHING_WB))
+			return -EINVAL;
+
+		if (args->coh_mode == XE_GEM_COHERENCY_NONE) {
+			if (XE_IOCTL_DBG(xe, args->smem_caching == XE_GEM_CACHING_WB))
+				return -EINVAL;
+		} else if (args->coh_mode == XE_GEM_COHERENCY_2WAY) {
+			if (XE_IOCTL_DBG(xe, args->smem_caching != XE_GEM_CACHING_WB))
+				return -EINVAL;
+		}
+	} else if (XE_IOCTL_DBG(xe, args->smem_caching)) {
+		return -EINVAL;
+	}
+
+	bo = xe_bo_create_user(xe, NULL, vm, args->size,
+			       args->smem_caching, args->coh_mode,
+			       ttm_bo_type_device, bo_flags);
 	if (IS_ERR(bo)) {
 		err = PTR_ERR(bo);
 		goto out_vm;
@@ -2093,10 +2157,11 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
 	args->size = ALIGN(mul_u32_u32(args->pitch, args->height),
 			   page_size);
 
-	bo = xe_bo_create(xe, NULL, NULL, args->size, ttm_bo_type_device,
-			  XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
-			  XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
-			  XE_BO_NEEDS_CPU_ACCESS);
+	bo = xe_bo_create_user(xe, NULL, NULL, args->size, ttm_bo_type_device,
+			       XE_GEM_CACHING_WC, XE_GEM_COHERENCY_NONE,
+			       XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
+			       XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
+			       XE_BO_NEEDS_CPU_ACCESS);
 	if (IS_ERR(bo))
 		return PTR_ERR(bo);
 
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 0823dda0f31b..2311ef3ffaf1 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -81,9 +81,10 @@ struct sg_table;
 struct xe_bo *xe_bo_alloc(void);
 void xe_bo_free(struct xe_bo *bo);
 
-struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
+struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
 				    struct xe_tile *tile, struct dma_resv *resv,
 				    struct ttm_lru_bulk_move *bulk, size_t size,
+				    u16 smem_caching, u16 coh_mode,
 				    enum ttm_bo_type type, u32 flags);
 struct xe_bo *
 xe_bo_create_locked_range(struct xe_device *xe,
diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
index f6ee920303af..a98ba5bed499 100644
--- a/drivers/gpu/drm/xe/xe_bo_types.h
+++ b/drivers/gpu/drm/xe/xe_bo_types.h
@@ -68,6 +68,16 @@ struct xe_bo {
 	struct llist_node freed;
 	/** @created: Whether the bo has passed initial creation */
 	bool created;
+	/**
+	 * @coh_mode: Coherency setting. Currently only used for userspace
+	 * objects.
+	 */
+	u16 coh_mode;
+	/**
+	 * @smem_caching: Caching mode for smem. Currently only used for
+	 * userspace objects.
+	 */
+	u16 smem_caching;
 };
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
index 975dee1f770f..8ba7daf011bc 100644
--- a/drivers/gpu/drm/xe/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/xe_dma_buf.c
@@ -199,8 +199,9 @@ xe_dma_buf_init_obj(struct drm_device *dev, struct xe_bo *storage,
 	int ret;
 
 	dma_resv_lock(resv, NULL);
-	bo = __xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
-				   ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
+	bo = ___xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
+				    0, 0, /* Will require 1way or 2way for vm_bind */
+				    ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
 	if (IS_ERR(bo)) {
 		ret = PTR_ERR(bo);
 		goto error;
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 86f16d50e9cc..64bc66d4b550 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -461,8 +461,51 @@ struct drm_xe_gem_create {
 	 */
 	__u32 handle;
 
-	/** @pad: MBZ */
-	__u32 pad;
+	/**
+	 * @coh_mode: The coherency mode for this object. This will limit the
+	 * possible @smem_caching values.
+	 *
+	 * Supported values:
+	 *
+	 * XE_GEM_COHERENCY_NONE: GPU access is assumed to be not coherent with
+	 * CPU. CPU caches are not snooped.
+	 *
+	 * XE_GEM_COHERENCY_1WAY: GPU access is coherent with CPU (CPU caches
+	 * are snooped) until GPU acquires. The acquire by the GPU is not
+	 * tracked by CPU caches.
+	 *
+	 * XE_GEM_COHERENCY_2WAY: Fully coherent between GPU and CPU. Fully
+	 * tracked by CPU caches. Both CPU and GPU caches are snooped.
+	 */
+#define XE_GEM_COHERENCY_NONE                  1
+#define XE_GEM_COHERENCY_1WAY                  2
+#define XE_GEM_COHERENCY_2WAY                  3
+	__u16 coh_mode;
+
+	/**
+	 * @smem_caching: The CPU caching mode to select for system memory.
+	 *
+	 * Supported values:
+	 *
+	 * XE_GEM_CACHING_WB: Allocate the pages with write-back caching.  On
+	 * iGPU this can't be used for scanout surfaces. The @coh_mode must
+	 * either be XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY.
+	 *
+	 * XE_GEM_CACHING_WC: Allocate the pages as write-combined. This is
+	 * uncached. The @coh_mode must be XE_GEM_COHERENCY_NONE or
+	 * XE_GEM_COHERENCY_1WAY. Scanout surfaces should likely use this on
+	 * igpu.
+	 *
+	 * XE_GEM_CACHING_UC: Allocate the pages as uncached. The @coh_mode must
+	 * be XE_GEM_COHERENCY_NONE or XE_GEM_COHERENCY_1WAY. Scanout surfaces
+	 * are permitted to use this.
+	 *
+	 * MUST be left as zero for VRAM-only objects.
+	 */
+#define XE_GEM_CACHING_WB                      1
+#define XE_GEM_CACHING_WC                      2
+#define XE_GEM_CACHING_UC                      3
+	__u16 smem_caching;
 
 	/** @reserved: Reserved */
 	__u64 reserved[2];
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode
  2023-08-29 16:28 ` [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode Matthew Auld
@ 2023-08-29 18:09   ` Matt Roper
  2023-08-30 11:13     ` Matthew Auld
  2023-09-04 20:00   ` Souza, Jose
  1 sibling, 1 reply; 47+ messages in thread
From: Matt Roper @ 2023-08-29 18:09 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe

On Tue, Aug 29, 2023 at 05:28:42PM +0100, Matthew Auld wrote:
> From: Pallavi Mishra <pallavi.mishra@intel.com>
> 
> Allow userspace to specify the CPU caching mode to use for system memory
> in addition to coherency modes during object creation. Modify gem create
> handler and introduce xe_bo_create_user to replace xe_bo_create. In a
> later patch we will support setting the pat_index as part of vm_bind,
> where expectation is that the coherency mode extracted from the
> pat_index must match the one set at object creation.

Minor request:  can we add the word "cpu" to some of the fields and
definitions?  Terms like WB and UC and such get used all over the place
and it can get confusing when exactly we're talking about the the CPU
cache behavior and when we're talking about the GPU cache behavior.
Since this gets exposed via the uapi, it's best to be as explicit as
possible.  E.g.,  "smem_cpu_caching" and "XE_GEM_CPU_CACHING_WB."


It might be good to explain why we're defining coherency mode at object
creation time instead of solely at vm_bind.  My understanding is that
it's mainly so that we know whether it will be possible for userspace to
pick a PAT that will bypass the CPU cache.  If they're able to do that,
we need to clflush after clearing the buffer contents so that they can't
potentially read old, stale data from memory.

> 
> Co-authored-by: Matthew Auld <matthew.auld@intel.com>
> Signed-off-by: Pallavi Mishra <pallavi.mishra@intel.com>

Nitpick:  As original author, Pallavi's s-o-b should come first, then
your C-a-b, and then your s-o-b (which seems to be missing) right after
it.

> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_bo.c       | 95 +++++++++++++++++++++++++++-----
>  drivers/gpu/drm/xe/xe_bo.h       |  3 +-
>  drivers/gpu/drm/xe/xe_bo_types.h | 10 ++++
>  drivers/gpu/drm/xe/xe_dma_buf.c  |  5 +-
>  include/uapi/drm/xe_drm.h        | 47 +++++++++++++++-
>  5 files changed, 140 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 1ab682d61e3c..f60090fe8cd2 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -339,6 +339,15 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>  		extra_pages = DIV_ROUND_UP(xe_device_ccs_bytes(xe, bo->size),
>  					   PAGE_SIZE);
>  
> +	if (bo->smem_caching) {
> +		if (bo->smem_caching == XE_GEM_CACHING_WB)
> +			caching = ttm_cached;
> +		else if (bo->smem_caching == XE_GEM_CACHING_WC)
> +			caching = ttm_write_combined;
> +		else if (bo->smem_caching == XE_GEM_CACHING_UC)
> +			caching = ttm_uncached;
> +	}
> +
>  	/*
>  	 * Display scanout is always non-coherent with the CPU cache.
>  	 *
> @@ -1183,9 +1192,10 @@ void xe_bo_free(struct xe_bo *bo)
>  	kfree(bo);
>  }
>  
> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>  				    struct xe_tile *tile, struct dma_resv *resv,
>  				    struct ttm_lru_bulk_move *bulk, size_t size,
> +				    u16 smem_caching, u16 coh_mode,
>  				    enum ttm_bo_type type, u32 flags)
>  {
>  	struct ttm_operation_ctx ctx = {
> @@ -1223,6 +1233,8 @@ struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>  	bo->tile = tile;
>  	bo->size = size;
>  	bo->flags = flags;
> +	bo->smem_caching = smem_caching;
> +	bo->coh_mode = coh_mode;
>  	bo->ttm.base.funcs = &xe_gem_object_funcs;
>  	bo->props.preferred_mem_class = XE_BO_PROPS_INVALID;
>  	bo->props.preferred_gt = XE_BO_PROPS_INVALID;
> @@ -1306,10 +1318,11 @@ static int __xe_bo_fixed_placement(struct xe_device *xe,
>  }
>  
>  struct xe_bo *
> -xe_bo_create_locked_range(struct xe_device *xe,
> -			  struct xe_tile *tile, struct xe_vm *vm,
> -			  size_t size, u64 start, u64 end,
> -			  enum ttm_bo_type type, u32 flags)
> +__xe_bo_create_locked(struct xe_device *xe,
> +		      struct xe_tile *tile, struct xe_vm *vm,
> +		      size_t size, u64 start, u64 end,
> +		      u16 smem_caching, u16 coh_mode,
> +		      enum ttm_bo_type type, u32 flags)
>  {
>  	struct xe_bo *bo = NULL;
>  	int err;
> @@ -1330,10 +1343,11 @@ xe_bo_create_locked_range(struct xe_device *xe,
>  		}
>  	}
>  
> -	bo = __xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
> +	bo = ___xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
>  				   vm && !xe_vm_in_fault_mode(vm) &&
>  				   flags & XE_BO_CREATE_USER_BIT ?
>  				   &vm->lru_bulk_move : NULL, size,
> +				   smem_caching, coh_mode,
>  				   type, flags);
>  	if (IS_ERR(bo))
>  		return bo;
> @@ -1367,11 +1381,31 @@ xe_bo_create_locked_range(struct xe_device *xe,
>  	return ERR_PTR(err);
>  }
>  
> +struct xe_bo *
> +xe_bo_create_locked_range(struct xe_device *xe,
> +			  struct xe_tile *tile, struct xe_vm *vm,
> +			  size_t size, u64 start, u64 end,
> +			  enum ttm_bo_type type, u32 flags)
> +{
> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
> +}
> +
>  struct xe_bo *xe_bo_create_locked(struct xe_device *xe, struct xe_tile *tile,
>  				  struct xe_vm *vm, size_t size,
>  				  enum ttm_bo_type type, u32 flags)
>  {
> -	return xe_bo_create_locked_range(xe, tile, vm, size, 0, ~0ULL, type, flags);
> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);

It looks like xe_bo_create_locked() is only called from two tests now,
right?  It's not obvious to me why caching=0, coh=0 is what we want for
those tests?

> +}
> +
> +struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
> +				struct xe_vm *vm, size_t size,
> +				enum ttm_bo_type type,
> +				u16 smem_caching, u16 coh_mode,
> +				u32 flags)
> +{
> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
> +				     smem_caching, coh_mode, type,
> +				     flags | XE_BO_CREATE_USER_BIT);
>  }
>  
>  struct xe_bo *xe_bo_create(struct xe_device *xe, struct xe_tile *tile,
> @@ -1754,11 +1788,11 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>  	struct ww_acquire_ctx ww;
>  	struct xe_vm *vm = NULL;
>  	struct xe_bo *bo;
> -	unsigned int bo_flags = XE_BO_CREATE_USER_BIT;
> +	unsigned int bo_flags;
>  	u32 handle;
>  	int err;
>  
> -	if (XE_IOCTL_DBG(xe, args->extensions) || XE_IOCTL_DBG(xe, args->pad) ||
> +	if (XE_IOCTL_DBG(xe, args->extensions) ||
>  	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
>  		return -EINVAL;
>  
> @@ -1811,8 +1845,38 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>  		bo_flags |= XE_BO_NEEDS_CPU_ACCESS;
>  	}
>  
> -	bo = xe_bo_create(xe, NULL, vm, args->size, ttm_bo_type_device,
> -			  bo_flags);
> +	if (XE_IOCTL_DBG(xe, args->coh_mode > 2))
> +		return -EINVAL;
> +
> +	if (XE_IOCTL_DBG(xe, args->smem_caching > 2))
> +		return -EINVAL;
> +
> +        if (bo_flags & XE_BO_CREATE_SYSTEM_BIT) {

Indentation whitespace looks wrong here (spaces where it should be
tabs).


Matt

> +		if (XE_IOCTL_DBG(xe, !args->coh_mode))
> +			return -EINVAL;
> +
> +		if (XE_IOCTL_DBG(xe, !args->smem_caching))
> +			return -EINVAL;
> +
> +		if (XE_IOCTL_DBG(xe, !IS_DGFX(xe) &&
> +				 bo_flags & XE_BO_SCANOUT_BIT &&
> +				 args->smem_caching == XE_GEM_CACHING_WB))
> +			return -EINVAL;
> +
> +		if (args->coh_mode == XE_GEM_COHERENCY_NONE) {
> +			if (XE_IOCTL_DBG(xe, args->smem_caching == XE_GEM_CACHING_WB))
> +				return -EINVAL;
> +		} else if (args->coh_mode == XE_GEM_COHERENCY_2WAY) {
> +			if (XE_IOCTL_DBG(xe, args->smem_caching != XE_GEM_CACHING_WB))
> +				return -EINVAL;
> +		}
> +	} else if (XE_IOCTL_DBG(xe, args->smem_caching)) {
> +		return -EINVAL;
> +	}
> +
> +	bo = xe_bo_create_user(xe, NULL, vm, args->size,
> +			       args->smem_caching, args->coh_mode,
> +			       ttm_bo_type_device, bo_flags);
>  	if (IS_ERR(bo)) {
>  		err = PTR_ERR(bo);
>  		goto out_vm;
> @@ -2093,10 +2157,11 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
>  	args->size = ALIGN(mul_u32_u32(args->pitch, args->height),
>  			   page_size);
>  
> -	bo = xe_bo_create(xe, NULL, NULL, args->size, ttm_bo_type_device,
> -			  XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
> -			  XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
> -			  XE_BO_NEEDS_CPU_ACCESS);
> +	bo = xe_bo_create_user(xe, NULL, NULL, args->size, ttm_bo_type_device,
> +			       XE_GEM_CACHING_WC, XE_GEM_COHERENCY_NONE,
> +			       XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
> +			       XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
> +			       XE_BO_NEEDS_CPU_ACCESS);
>  	if (IS_ERR(bo))
>  		return PTR_ERR(bo);
>  
> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> index 0823dda0f31b..2311ef3ffaf1 100644
> --- a/drivers/gpu/drm/xe/xe_bo.h
> +++ b/drivers/gpu/drm/xe/xe_bo.h
> @@ -81,9 +81,10 @@ struct sg_table;
>  struct xe_bo *xe_bo_alloc(void);
>  void xe_bo_free(struct xe_bo *bo);
>  
> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>  				    struct xe_tile *tile, struct dma_resv *resv,
>  				    struct ttm_lru_bulk_move *bulk, size_t size,
> +				    u16 smem_caching, u16 coh_mode,
>  				    enum ttm_bo_type type, u32 flags);
>  struct xe_bo *
>  xe_bo_create_locked_range(struct xe_device *xe,
> diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
> index f6ee920303af..a98ba5bed499 100644
> --- a/drivers/gpu/drm/xe/xe_bo_types.h
> +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> @@ -68,6 +68,16 @@ struct xe_bo {
>  	struct llist_node freed;
>  	/** @created: Whether the bo has passed initial creation */
>  	bool created;
> +	/**
> +	 * @coh_mode: Coherency setting. Currently only used for userspace
> +	 * objects.
> +	 */
> +	u16 coh_mode;
> +	/**
> +	 * @smem_caching: Caching mode for smem. Currently only used for
> +	 * userspace objects.
> +	 */
> +	u16 smem_caching;
>  };
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
> index 975dee1f770f..8ba7daf011bc 100644
> --- a/drivers/gpu/drm/xe/xe_dma_buf.c
> +++ b/drivers/gpu/drm/xe/xe_dma_buf.c
> @@ -199,8 +199,9 @@ xe_dma_buf_init_obj(struct drm_device *dev, struct xe_bo *storage,
>  	int ret;
>  
>  	dma_resv_lock(resv, NULL);
> -	bo = __xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
> -				   ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
> +	bo = ___xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
> +				    0, 0, /* Will require 1way or 2way for vm_bind */
> +				    ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
>  	if (IS_ERR(bo)) {
>  		ret = PTR_ERR(bo);
>  		goto error;
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 86f16d50e9cc..64bc66d4b550 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -461,8 +461,51 @@ struct drm_xe_gem_create {
>  	 */
>  	__u32 handle;
>  
> -	/** @pad: MBZ */
> -	__u32 pad;
> +	/**
> +	 * @coh_mode: The coherency mode for this object. This will limit the
> +	 * possible @smem_caching values.
> +	 *
> +	 * Supported values:
> +	 *
> +	 * XE_GEM_COHERENCY_NONE: GPU access is assumed to be not coherent with
> +	 * CPU. CPU caches are not snooped.
> +	 *
> +	 * XE_GEM_COHERENCY_1WAY: GPU access is coherent with CPU (CPU caches
> +	 * are snooped) until GPU acquires. The acquire by the GPU is not
> +	 * tracked by CPU caches.
> +	 *
> +	 * XE_GEM_COHERENCY_2WAY: Fully coherent between GPU and CPU. Fully
> +	 * tracked by CPU caches. Both CPU and GPU caches are snooped.
> +	 */
> +#define XE_GEM_COHERENCY_NONE                  1
> +#define XE_GEM_COHERENCY_1WAY                  2
> +#define XE_GEM_COHERENCY_2WAY                  3
> +	__u16 coh_mode;
> +
> +	/**
> +	 * @smem_caching: The CPU caching mode to select for system memory.
> +	 *
> +	 * Supported values:
> +	 *
> +	 * XE_GEM_CACHING_WB: Allocate the pages with write-back caching.  On
> +	 * iGPU this can't be used for scanout surfaces. The @coh_mode must
> +	 * either be XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY.
> +	 *
> +	 * XE_GEM_CACHING_WC: Allocate the pages as write-combined. This is
> +	 * uncached. The @coh_mode must be XE_GEM_COHERENCY_NONE or
> +	 * XE_GEM_COHERENCY_1WAY. Scanout surfaces should likely use this on
> +	 * igpu.
> +	 *
> +	 * XE_GEM_CACHING_UC: Allocate the pages as uncached. The @coh_mode must
> +	 * be XE_GEM_COHERENCY_NONE or XE_GEM_COHERENCY_1WAY. Scanout surfaces
> +	 * are permitted to use this.
> +	 *
> +	 * MUST be left as zero for VRAM-only objects.
> +	 */
> +#define XE_GEM_CACHING_WB                      1
> +#define XE_GEM_CACHING_WC                      2
> +#define XE_GEM_CACHING_UC                      3
> +	__u16 smem_caching;
>  
>  	/** @reserved: Reserved */
>  	__u64 reserved[2];
> -- 
> 2.41.0
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode
  2023-08-29 18:09   ` Matt Roper
@ 2023-08-30 11:13     ` Matthew Auld
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-30 11:13 UTC (permalink / raw)
  To: Matt Roper
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe

On 29/08/2023 19:09, Matt Roper wrote:
> On Tue, Aug 29, 2023 at 05:28:42PM +0100, Matthew Auld wrote:
>> From: Pallavi Mishra <pallavi.mishra@intel.com>
>>
>> Allow userspace to specify the CPU caching mode to use for system memory
>> in addition to coherency modes during object creation. Modify gem create
>> handler and introduce xe_bo_create_user to replace xe_bo_create. In a
>> later patch we will support setting the pat_index as part of vm_bind,
>> where expectation is that the coherency mode extracted from the
>> pat_index must match the one set at object creation.
> 
> Minor request:  can we add the word "cpu" to some of the fields and
> definitions?  Terms like WB and UC and such get used all over the place
> and it can get confusing when exactly we're talking about the the CPU
> cache behavior and when we're talking about the GPU cache behavior.
> Since this gets exposed via the uapi, it's best to be as explicit as
> possible.  E.g.,  "smem_cpu_caching" and "XE_GEM_CPU_CACHING_WB."

Ok, that seems fine to me. If no objections will tweak.

> 
> 
> It might be good to explain why we're defining coherency mode at object
> creation time instead of solely at vm_bind.  My understanding is that
> it's mainly so that we know whether it will be possible for userspace to
> pick a PAT that will bypass the CPU cache.  If they're able to do that,
> we need to clflush after clearing the buffer contents so that they can't
> potentially read old, stale data from memory.

AFAIK it mostly just prevents multiple bindings from mixing coherency 
modes for the same object. AFAIK that scenario was deemed to be avoided, 
and rather have it immutable at creation time. I think the flush will 
always happen if the pages are allocated on the CPU as WC/UC on x86.

> 
>>
>> Co-authored-by: Matthew Auld <matthew.auld@intel.com>
>> Signed-off-by: Pallavi Mishra <pallavi.mishra@intel.com>
> 
> Nitpick:  As original author, Pallavi's s-o-b should come first, then
> your C-a-b, and then your s-o-b (which seems to be missing) right after
> it.

Oops, missed that. Will fix.

> 
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_bo.c       | 95 +++++++++++++++++++++++++++-----
>>   drivers/gpu/drm/xe/xe_bo.h       |  3 +-
>>   drivers/gpu/drm/xe/xe_bo_types.h | 10 ++++
>>   drivers/gpu/drm/xe/xe_dma_buf.c  |  5 +-
>>   include/uapi/drm/xe_drm.h        | 47 +++++++++++++++-
>>   5 files changed, 140 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>> index 1ab682d61e3c..f60090fe8cd2 100644
>> --- a/drivers/gpu/drm/xe/xe_bo.c
>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>> @@ -339,6 +339,15 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>>   		extra_pages = DIV_ROUND_UP(xe_device_ccs_bytes(xe, bo->size),
>>   					   PAGE_SIZE);
>>   
>> +	if (bo->smem_caching) {
>> +		if (bo->smem_caching == XE_GEM_CACHING_WB)
>> +			caching = ttm_cached;
>> +		else if (bo->smem_caching == XE_GEM_CACHING_WC)
>> +			caching = ttm_write_combined;
>> +		else if (bo->smem_caching == XE_GEM_CACHING_UC)
>> +			caching = ttm_uncached;
>> +	}
>> +
>>   	/*
>>   	 * Display scanout is always non-coherent with the CPU cache.
>>   	 *
>> @@ -1183,9 +1192,10 @@ void xe_bo_free(struct xe_bo *bo)
>>   	kfree(bo);
>>   }
>>   
>> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>>   				    struct xe_tile *tile, struct dma_resv *resv,
>>   				    struct ttm_lru_bulk_move *bulk, size_t size,
>> +				    u16 smem_caching, u16 coh_mode,
>>   				    enum ttm_bo_type type, u32 flags)
>>   {
>>   	struct ttm_operation_ctx ctx = {
>> @@ -1223,6 +1233,8 @@ struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>>   	bo->tile = tile;
>>   	bo->size = size;
>>   	bo->flags = flags;
>> +	bo->smem_caching = smem_caching;
>> +	bo->coh_mode = coh_mode;
>>   	bo->ttm.base.funcs = &xe_gem_object_funcs;
>>   	bo->props.preferred_mem_class = XE_BO_PROPS_INVALID;
>>   	bo->props.preferred_gt = XE_BO_PROPS_INVALID;
>> @@ -1306,10 +1318,11 @@ static int __xe_bo_fixed_placement(struct xe_device *xe,
>>   }
>>   
>>   struct xe_bo *
>> -xe_bo_create_locked_range(struct xe_device *xe,
>> -			  struct xe_tile *tile, struct xe_vm *vm,
>> -			  size_t size, u64 start, u64 end,
>> -			  enum ttm_bo_type type, u32 flags)
>> +__xe_bo_create_locked(struct xe_device *xe,
>> +		      struct xe_tile *tile, struct xe_vm *vm,
>> +		      size_t size, u64 start, u64 end,
>> +		      u16 smem_caching, u16 coh_mode,
>> +		      enum ttm_bo_type type, u32 flags)
>>   {
>>   	struct xe_bo *bo = NULL;
>>   	int err;
>> @@ -1330,10 +1343,11 @@ xe_bo_create_locked_range(struct xe_device *xe,
>>   		}
>>   	}
>>   
>> -	bo = __xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
>> +	bo = ___xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
>>   				   vm && !xe_vm_in_fault_mode(vm) &&
>>   				   flags & XE_BO_CREATE_USER_BIT ?
>>   				   &vm->lru_bulk_move : NULL, size,
>> +				   smem_caching, coh_mode,
>>   				   type, flags);
>>   	if (IS_ERR(bo))
>>   		return bo;
>> @@ -1367,11 +1381,31 @@ xe_bo_create_locked_range(struct xe_device *xe,
>>   	return ERR_PTR(err);
>>   }
>>   
>> +struct xe_bo *
>> +xe_bo_create_locked_range(struct xe_device *xe,
>> +			  struct xe_tile *tile, struct xe_vm *vm,
>> +			  size_t size, u64 start, u64 end,
>> +			  enum ttm_bo_type type, u32 flags)
>> +{
>> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
>> +}
>> +
>>   struct xe_bo *xe_bo_create_locked(struct xe_device *xe, struct xe_tile *tile,
>>   				  struct xe_vm *vm, size_t size,
>>   				  enum ttm_bo_type type, u32 flags)
>>   {
>> -	return xe_bo_create_locked_range(xe, tile, vm, size, 0, ~0ULL, type, flags);
>> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
> 
> It looks like xe_bo_create_locked() is only called from two tests now,
> right?  It's not obvious to me why caching=0, coh=0 is what we want for
> those tests?

AFAICT the bo test only applies to dgpu, it does use the ppGTT but that 
is via migrate vm stuff which I assume just uses 1way. I need to double 
check.

The other one also only uses the migration stuff and it should use 1way 
when encoding the PTEs. Hmm, actually emit_pte() looks like it now needs 
to use pat_index and the ->pte_encode().

> 
>> +}
>> +
>> +struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
>> +				struct xe_vm *vm, size_t size,
>> +				enum ttm_bo_type type,
>> +				u16 smem_caching, u16 coh_mode,
>> +				u32 flags)
>> +{
>> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
>> +				     smem_caching, coh_mode, type,
>> +				     flags | XE_BO_CREATE_USER_BIT);
>>   }
>>   
>>   struct xe_bo *xe_bo_create(struct xe_device *xe, struct xe_tile *tile,
>> @@ -1754,11 +1788,11 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>>   	struct ww_acquire_ctx ww;
>>   	struct xe_vm *vm = NULL;
>>   	struct xe_bo *bo;
>> -	unsigned int bo_flags = XE_BO_CREATE_USER_BIT;
>> +	unsigned int bo_flags;
>>   	u32 handle;
>>   	int err;
>>   
>> -	if (XE_IOCTL_DBG(xe, args->extensions) || XE_IOCTL_DBG(xe, args->pad) ||
>> +	if (XE_IOCTL_DBG(xe, args->extensions) ||
>>   	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
>>   		return -EINVAL;
>>   
>> @@ -1811,8 +1845,38 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>>   		bo_flags |= XE_BO_NEEDS_CPU_ACCESS;
>>   	}
>>   
>> -	bo = xe_bo_create(xe, NULL, vm, args->size, ttm_bo_type_device,
>> -			  bo_flags);
>> +	if (XE_IOCTL_DBG(xe, args->coh_mode > 2))
>> +		return -EINVAL;
>> +
>> +	if (XE_IOCTL_DBG(xe, args->smem_caching > 2))
>> +		return -EINVAL;
>> +
>> +        if (bo_flags & XE_BO_CREATE_SYSTEM_BIT) {
> 
> Indentation whitespace looks wrong here (spaces where it should be
> tabs).

Yeah, not sure what happened here. Will fix.

> 
> 
> Matt
> 
>> +		if (XE_IOCTL_DBG(xe, !args->coh_mode))
>> +			return -EINVAL;
>> +
>> +		if (XE_IOCTL_DBG(xe, !args->smem_caching))
>> +			return -EINVAL;
>> +
>> +		if (XE_IOCTL_DBG(xe, !IS_DGFX(xe) &&
>> +				 bo_flags & XE_BO_SCANOUT_BIT &&
>> +				 args->smem_caching == XE_GEM_CACHING_WB))
>> +			return -EINVAL;
>> +
>> +		if (args->coh_mode == XE_GEM_COHERENCY_NONE) {
>> +			if (XE_IOCTL_DBG(xe, args->smem_caching == XE_GEM_CACHING_WB))
>> +				return -EINVAL;
>> +		} else if (args->coh_mode == XE_GEM_COHERENCY_2WAY) {
>> +			if (XE_IOCTL_DBG(xe, args->smem_caching != XE_GEM_CACHING_WB))
>> +				return -EINVAL;
>> +		}
>> +	} else if (XE_IOCTL_DBG(xe, args->smem_caching)) {
>> +		return -EINVAL;
>> +	}
>> +
>> +	bo = xe_bo_create_user(xe, NULL, vm, args->size,
>> +			       args->smem_caching, args->coh_mode,
>> +			       ttm_bo_type_device, bo_flags);
>>   	if (IS_ERR(bo)) {
>>   		err = PTR_ERR(bo);
>>   		goto out_vm;
>> @@ -2093,10 +2157,11 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
>>   	args->size = ALIGN(mul_u32_u32(args->pitch, args->height),
>>   			   page_size);
>>   
>> -	bo = xe_bo_create(xe, NULL, NULL, args->size, ttm_bo_type_device,
>> -			  XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
>> -			  XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
>> -			  XE_BO_NEEDS_CPU_ACCESS);
>> +	bo = xe_bo_create_user(xe, NULL, NULL, args->size, ttm_bo_type_device,
>> +			       XE_GEM_CACHING_WC, XE_GEM_COHERENCY_NONE,
>> +			       XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
>> +			       XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
>> +			       XE_BO_NEEDS_CPU_ACCESS);
>>   	if (IS_ERR(bo))
>>   		return PTR_ERR(bo);
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
>> index 0823dda0f31b..2311ef3ffaf1 100644
>> --- a/drivers/gpu/drm/xe/xe_bo.h
>> +++ b/drivers/gpu/drm/xe/xe_bo.h
>> @@ -81,9 +81,10 @@ struct sg_table;
>>   struct xe_bo *xe_bo_alloc(void);
>>   void xe_bo_free(struct xe_bo *bo);
>>   
>> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>>   				    struct xe_tile *tile, struct dma_resv *resv,
>>   				    struct ttm_lru_bulk_move *bulk, size_t size,
>> +				    u16 smem_caching, u16 coh_mode,
>>   				    enum ttm_bo_type type, u32 flags);
>>   struct xe_bo *
>>   xe_bo_create_locked_range(struct xe_device *xe,
>> diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
>> index f6ee920303af..a98ba5bed499 100644
>> --- a/drivers/gpu/drm/xe/xe_bo_types.h
>> +++ b/drivers/gpu/drm/xe/xe_bo_types.h
>> @@ -68,6 +68,16 @@ struct xe_bo {
>>   	struct llist_node freed;
>>   	/** @created: Whether the bo has passed initial creation */
>>   	bool created;
>> +	/**
>> +	 * @coh_mode: Coherency setting. Currently only used for userspace
>> +	 * objects.
>> +	 */
>> +	u16 coh_mode;
>> +	/**
>> +	 * @smem_caching: Caching mode for smem. Currently only used for
>> +	 * userspace objects.
>> +	 */
>> +	u16 smem_caching;
>>   };
>>   
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
>> index 975dee1f770f..8ba7daf011bc 100644
>> --- a/drivers/gpu/drm/xe/xe_dma_buf.c
>> +++ b/drivers/gpu/drm/xe/xe_dma_buf.c
>> @@ -199,8 +199,9 @@ xe_dma_buf_init_obj(struct drm_device *dev, struct xe_bo *storage,
>>   	int ret;
>>   
>>   	dma_resv_lock(resv, NULL);
>> -	bo = __xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
>> -				   ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
>> +	bo = ___xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
>> +				    0, 0, /* Will require 1way or 2way for vm_bind */
>> +				    ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
>>   	if (IS_ERR(bo)) {
>>   		ret = PTR_ERR(bo);
>>   		goto error;
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 86f16d50e9cc..64bc66d4b550 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -461,8 +461,51 @@ struct drm_xe_gem_create {
>>   	 */
>>   	__u32 handle;
>>   
>> -	/** @pad: MBZ */
>> -	__u32 pad;
>> +	/**
>> +	 * @coh_mode: The coherency mode for this object. This will limit the
>> +	 * possible @smem_caching values.
>> +	 *
>> +	 * Supported values:
>> +	 *
>> +	 * XE_GEM_COHERENCY_NONE: GPU access is assumed to be not coherent with
>> +	 * CPU. CPU caches are not snooped.
>> +	 *
>> +	 * XE_GEM_COHERENCY_1WAY: GPU access is coherent with CPU (CPU caches
>> +	 * are snooped) until GPU acquires. The acquire by the GPU is not
>> +	 * tracked by CPU caches.
>> +	 *
>> +	 * XE_GEM_COHERENCY_2WAY: Fully coherent between GPU and CPU. Fully
>> +	 * tracked by CPU caches. Both CPU and GPU caches are snooped.
>> +	 */
>> +#define XE_GEM_COHERENCY_NONE                  1
>> +#define XE_GEM_COHERENCY_1WAY                  2
>> +#define XE_GEM_COHERENCY_2WAY                  3
>> +	__u16 coh_mode;
>> +
>> +	/**
>> +	 * @smem_caching: The CPU caching mode to select for system memory.
>> +	 *
>> +	 * Supported values:
>> +	 *
>> +	 * XE_GEM_CACHING_WB: Allocate the pages with write-back caching.  On
>> +	 * iGPU this can't be used for scanout surfaces. The @coh_mode must
>> +	 * either be XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY.
>> +	 *
>> +	 * XE_GEM_CACHING_WC: Allocate the pages as write-combined. This is
>> +	 * uncached. The @coh_mode must be XE_GEM_COHERENCY_NONE or
>> +	 * XE_GEM_COHERENCY_1WAY. Scanout surfaces should likely use this on
>> +	 * igpu.
>> +	 *
>> +	 * XE_GEM_CACHING_UC: Allocate the pages as uncached. The @coh_mode must
>> +	 * be XE_GEM_COHERENCY_NONE or XE_GEM_COHERENCY_1WAY. Scanout surfaces
>> +	 * are permitted to use this.
>> +	 *
>> +	 * MUST be left as zero for VRAM-only objects.
>> +	 */
>> +#define XE_GEM_CACHING_WB                      1
>> +#define XE_GEM_CACHING_WC                      2
>> +#define XE_GEM_CACHING_UC                      3
>> +	__u16 smem_caching;
>>   
>>   	/** @reserved: Reserved */
>>   	__u64 reserved[2];
>> -- 
>> 2.41.0
>>
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode
  2023-08-29 16:28 ` [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode Matthew Auld
  2023-08-29 18:09   ` Matt Roper
@ 2023-09-04 20:00   ` Souza, Jose
  2023-09-05  9:04     ` Matthew Auld
  1 sibling, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-04 20:00 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Mishra, Pallavi, Auld, Matthew

On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> From: Pallavi Mishra <pallavi.mishra@intel.com>
> 
> Allow userspace to specify the CPU caching mode to use for system memory
> in addition to coherency modes during object creation. Modify gem create
> handler and introduce xe_bo_create_user to replace xe_bo_create. In a
> later patch we will support setting the pat_index as part of vm_bind,
> where expectation is that the coherency mode extracted from the
> pat_index must match the one set at object creation.
> 
> Co-authored-by: Matthew Auld <matthew.auld@intel.com>
> Signed-off-by: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_bo.c       | 95 +++++++++++++++++++++++++++-----
>  drivers/gpu/drm/xe/xe_bo.h       |  3 +-
>  drivers/gpu/drm/xe/xe_bo_types.h | 10 ++++
>  drivers/gpu/drm/xe/xe_dma_buf.c  |  5 +-
>  include/uapi/drm/xe_drm.h        | 47 +++++++++++++++-
>  5 files changed, 140 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 1ab682d61e3c..f60090fe8cd2 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -339,6 +339,15 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>  		extra_pages = DIV_ROUND_UP(xe_device_ccs_bytes(xe, bo->size),
>  					   PAGE_SIZE);
>  
> +	if (bo->smem_caching) {
> +		if (bo->smem_caching == XE_GEM_CACHING_WB)
> +			caching = ttm_cached;
> +		else if (bo->smem_caching == XE_GEM_CACHING_WC)
> +			caching = ttm_write_combined;
> +		else if (bo->smem_caching == XE_GEM_CACHING_UC)
> +			caching = ttm_uncached;
> +	}

why not a switch/case?

> +
>  	/*
>  	 * Display scanout is always non-coherent with the CPU cache.
>  	 *
> @@ -1183,9 +1192,10 @@ void xe_bo_free(struct xe_bo *bo)
>  	kfree(bo);
>  }
>  
> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>  				    struct xe_tile *tile, struct dma_resv *resv,
>  				    struct ttm_lru_bulk_move *bulk, size_t size,
> +				    u16 smem_caching, u16 coh_mode,
>  				    enum ttm_bo_type type, u32 flags)
>  {
>  	struct ttm_operation_ctx ctx = {
> @@ -1223,6 +1233,8 @@ struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>  	bo->tile = tile;
>  	bo->size = size;
>  	bo->flags = flags;
> +	bo->smem_caching = smem_caching;
> +	bo->coh_mode = coh_mode;
>  	bo->ttm.base.funcs = &xe_gem_object_funcs;
>  	bo->props.preferred_mem_class = XE_BO_PROPS_INVALID;
>  	bo->props.preferred_gt = XE_BO_PROPS_INVALID;
> @@ -1306,10 +1318,11 @@ static int __xe_bo_fixed_placement(struct xe_device *xe,
>  }
>  
>  struct xe_bo *
> -xe_bo_create_locked_range(struct xe_device *xe,
> -			  struct xe_tile *tile, struct xe_vm *vm,
> -			  size_t size, u64 start, u64 end,
> -			  enum ttm_bo_type type, u32 flags)
> +__xe_bo_create_locked(struct xe_device *xe,
> +		      struct xe_tile *tile, struct xe_vm *vm,
> +		      size_t size, u64 start, u64 end,
> +		      u16 smem_caching, u16 coh_mode,
> +		      enum ttm_bo_type type, u32 flags)
>  {
>  	struct xe_bo *bo = NULL;
>  	int err;
> @@ -1330,10 +1343,11 @@ xe_bo_create_locked_range(struct xe_device *xe,
>  		}
>  	}
>  
> -	bo = __xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
> +	bo = ___xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
>  				   vm && !xe_vm_in_fault_mode(vm) &&
>  				   flags & XE_BO_CREATE_USER_BIT ?
>  				   &vm->lru_bulk_move : NULL, size,
> +				   smem_caching, coh_mode,
>  				   type, flags);
>  	if (IS_ERR(bo))
>  		return bo;
> @@ -1367,11 +1381,31 @@ xe_bo_create_locked_range(struct xe_device *xe,
>  	return ERR_PTR(err);
>  }
>  
> +struct xe_bo *
> +xe_bo_create_locked_range(struct xe_device *xe,
> +			  struct xe_tile *tile, struct xe_vm *vm,
> +			  size_t size, u64 start, u64 end,
> +			  enum ttm_bo_type type, u32 flags)
> +{
> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
> +}
> +
>  struct xe_bo *xe_bo_create_locked(struct xe_device *xe, struct xe_tile *tile,
>  				  struct xe_vm *vm, size_t size,
>  				  enum ttm_bo_type type, u32 flags)
>  {
> -	return xe_bo_create_locked_range(xe, tile, vm, size, 0, ~0ULL, type, flags);
> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
> +}
> +
> +struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
> +				struct xe_vm *vm, size_t size,
> +				enum ttm_bo_type type,
> +				u16 smem_caching, u16 coh_mode,
> +				u32 flags)
> +{
> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
> +				     smem_caching, coh_mode, type,
> +				     flags | XE_BO_CREATE_USER_BIT);
>  }
>  
>  struct xe_bo *xe_bo_create(struct xe_device *xe, struct xe_tile *tile,
> @@ -1754,11 +1788,11 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>  	struct ww_acquire_ctx ww;
>  	struct xe_vm *vm = NULL;
>  	struct xe_bo *bo;
> -	unsigned int bo_flags = XE_BO_CREATE_USER_BIT;
> +	unsigned int bo_flags;
>  	u32 handle;
>  	int err;
>  
> -	if (XE_IOCTL_DBG(xe, args->extensions) || XE_IOCTL_DBG(xe, args->pad) ||
> +	if (XE_IOCTL_DBG(xe, args->extensions) ||
>  	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
>  		return -EINVAL;
>  
> @@ -1811,8 +1845,38 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>  		bo_flags |= XE_BO_NEEDS_CPU_ACCESS;
>  	}
>  
> -	bo = xe_bo_create(xe, NULL, vm, args->size, ttm_bo_type_device,
> -			  bo_flags);
> +	if (XE_IOCTL_DBG(xe, args->coh_mode > 2))

This will not allow XE_GEM_COHERENCY_2WAY, also would be better to use XE_GEM_COHERENCY_XX instead of a magic number.

> +		return -EINVAL;
> +
> +	if (XE_IOCTL_DBG(xe, args->smem_caching > 2))
> +		return -EINVAL;

Same here, this will not allow XE_GEM_CACHING_UC

> +
> +        if (bo_flags & XE_BO_CREATE_SYSTEM_BIT) {
> +		if (XE_IOCTL_DBG(xe, !args->coh_mode))
> +			return -EINVAL;
> +
> +		if (XE_IOCTL_DBG(xe, !args->smem_caching))
> +			return -EINVAL;
> +
> +		if (XE_IOCTL_DBG(xe, !IS_DGFX(xe) &&
> +				 bo_flags & XE_BO_SCANOUT_BIT &&
> +				 args->smem_caching == XE_GEM_CACHING_WB))
> +			return -EINVAL;
> +
> +		if (args->coh_mode == XE_GEM_COHERENCY_NONE) {
> +			if (XE_IOCTL_DBG(xe, args->smem_caching == XE_GEM_CACHING_WB))
> +				return -EINVAL;
> +		} else if (args->coh_mode == XE_GEM_COHERENCY_2WAY) {
> +			if (XE_IOCTL_DBG(xe, args->smem_caching != XE_GEM_CACHING_WB))
> +				return -EINVAL;
> +		}
> +	} else if (XE_IOCTL_DBG(xe, args->smem_caching)) {
> +		return -EINVAL;
> +	}
> +
> +	bo = xe_bo_create_user(xe, NULL, vm, args->size,
> +			       args->smem_caching, args->coh_mode,
> +			       ttm_bo_type_device, bo_flags);
>  	if (IS_ERR(bo)) {
>  		err = PTR_ERR(bo);
>  		goto out_vm;
> @@ -2093,10 +2157,11 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
>  	args->size = ALIGN(mul_u32_u32(args->pitch, args->height),
>  			   page_size);
>  
> -	bo = xe_bo_create(xe, NULL, NULL, args->size, ttm_bo_type_device,
> -			  XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
> -			  XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
> -			  XE_BO_NEEDS_CPU_ACCESS);
> +	bo = xe_bo_create_user(xe, NULL, NULL, args->size, ttm_bo_type_device,
> +			       XE_GEM_CACHING_WC, XE_GEM_COHERENCY_NONE,
> +			       XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
> +			       XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
> +			       XE_BO_NEEDS_CPU_ACCESS);
>  	if (IS_ERR(bo))
>  		return PTR_ERR(bo);
>  
> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> index 0823dda0f31b..2311ef3ffaf1 100644
> --- a/drivers/gpu/drm/xe/xe_bo.h
> +++ b/drivers/gpu/drm/xe/xe_bo.h
> @@ -81,9 +81,10 @@ struct sg_table;
>  struct xe_bo *xe_bo_alloc(void);
>  void xe_bo_free(struct xe_bo *bo);
>  
> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>  				    struct xe_tile *tile, struct dma_resv *resv,
>  				    struct ttm_lru_bulk_move *bulk, size_t size,
> +				    u16 smem_caching, u16 coh_mode,
>  				    enum ttm_bo_type type, u32 flags);
>  struct xe_bo *
>  xe_bo_create_locked_range(struct xe_device *xe,
> diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
> index f6ee920303af..a98ba5bed499 100644
> --- a/drivers/gpu/drm/xe/xe_bo_types.h
> +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> @@ -68,6 +68,16 @@ struct xe_bo {
>  	struct llist_node freed;
>  	/** @created: Whether the bo has passed initial creation */
>  	bool created;
> +	/**
> +	 * @coh_mode: Coherency setting. Currently only used for userspace
> +	 * objects.
> +	 */
> +	u16 coh_mode;
> +	/**
> +	 * @smem_caching: Caching mode for smem. Currently only used for
> +	 * userspace objects.
> +	 */
> +	u16 smem_caching;
>  };
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
> index 975dee1f770f..8ba7daf011bc 100644
> --- a/drivers/gpu/drm/xe/xe_dma_buf.c
> +++ b/drivers/gpu/drm/xe/xe_dma_buf.c
> @@ -199,8 +199,9 @@ xe_dma_buf_init_obj(struct drm_device *dev, struct xe_bo *storage,
>  	int ret;
>  
>  	dma_resv_lock(resv, NULL);
> -	bo = __xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
> -				   ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
> +	bo = ___xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
> +				    0, 0, /* Will require 1way or 2way for vm_bind */
> +				    ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
>  	if (IS_ERR(bo)) {
>  		ret = PTR_ERR(bo);
>  		goto error;
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 86f16d50e9cc..64bc66d4b550 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -461,8 +461,51 @@ struct drm_xe_gem_create {
>  	 */
>  	__u32 handle;
>  
> -	/** @pad: MBZ */
> -	__u32 pad;
> +	/**
> +	 * @coh_mode: The coherency mode for this object. This will limit the
> +	 * possible @smem_caching values.
> +	 *
> +	 * Supported values:
> +	 *
> +	 * XE_GEM_COHERENCY_NONE: GPU access is assumed to be not coherent with
> +	 * CPU. CPU caches are not snooped.
> +	 *
> +	 * XE_GEM_COHERENCY_1WAY: GPU access is coherent with CPU (CPU caches
> +	 * are snooped) until GPU acquires. The acquire by the GPU is not
> +	 * tracked by CPU caches.
> +	 *
> +	 * XE_GEM_COHERENCY_2WAY: Fully coherent between GPU and CPU. Fully
> +	 * tracked by CPU caches. Both CPU and GPU caches are snooped.
> +	 */
> +#define XE_GEM_COHERENCY_NONE                  1
> +#define XE_GEM_COHERENCY_1WAY                  2
> +#define XE_GEM_COHERENCY_2WAY                  3
> +	__u16 coh_mode;

Why coh_mode is necessary in the create uAPI?
It is not actually used in this patch and at the end the series it is only used to check if the coh_mode of the pat_index in the vm_bind is equal to
coh_mode set in gem create.
I guess it is not allowed to bind the same bo in 2 different addresses with different coh_mode, is that right?

If so we still could remove it from the gem create uAPI, set coh_mode in gem_bo in the first bind of this bo and then check if equal in the subsequent
binds.

> +
> +	/**
> +	 * @smem_caching: The CPU caching mode to select for system memory.
> +	 *
> +	 * Supported values:
> +	 *
> +	 * XE_GEM_CACHING_WB: Allocate the pages with write-back caching.  On
> +	 * iGPU this can't be used for scanout surfaces. The @coh_mode must
> +	 * either be XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY.
> +	 *
> +	 * XE_GEM_CACHING_WC: Allocate the pages as write-combined. This is
> +	 * uncached. The @coh_mode must be XE_GEM_COHERENCY_NONE or
> +	 * XE_GEM_COHERENCY_1WAY. Scanout surfaces should likely use this on
> +	 * igpu.
> +	 *
> +	 * XE_GEM_CACHING_UC: Allocate the pages as uncached. The @coh_mode must
> +	 * be XE_GEM_COHERENCY_NONE or XE_GEM_COHERENCY_1WAY. Scanout surfaces
> +	 * are permitted to use this.
> +	 *
> +	 * MUST be left as zero for VRAM-only objects.
> +	 */
> +#define XE_GEM_CACHING_WB                      1
> +#define XE_GEM_CACHING_WC                      2
> +#define XE_GEM_CACHING_UC                      3
> +	__u16 smem_caching;
>  
>  	/** @reserved: Reserved */
>  	__u64 reserved[2];


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode
  2023-09-04 20:00   ` Souza, Jose
@ 2023-09-05  9:04     ` Matthew Auld
  2023-09-05 15:30       ` Souza, Jose
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-09-05  9:04 UTC (permalink / raw)
  To: Souza, Jose, intel-xe@lists.freedesktop.org, Mishra, Pallavi,
	joonas Lahtinen, Thomas Hellström, Zhang, Carl,
	Roper, Matthew D, Pallavi Mishra

On 04/09/2023 21:00, Souza, Jose wrote:
> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>> From: Pallavi Mishra <pallavi.mishra@intel.com>
>>
>> Allow userspace to specify the CPU caching mode to use for system memory
>> in addition to coherency modes during object creation. Modify gem create
>> handler and introduce xe_bo_create_user to replace xe_bo_create. In a
>> later patch we will support setting the pat_index as part of vm_bind,
>> where expectation is that the coherency mode extracted from the
>> pat_index must match the one set at object creation.
>>
>> Co-authored-by: Matthew Auld <matthew.auld@intel.com>
>> Signed-off-by: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_bo.c       | 95 +++++++++++++++++++++++++++-----
>>   drivers/gpu/drm/xe/xe_bo.h       |  3 +-
>>   drivers/gpu/drm/xe/xe_bo_types.h | 10 ++++
>>   drivers/gpu/drm/xe/xe_dma_buf.c  |  5 +-
>>   include/uapi/drm/xe_drm.h        | 47 +++++++++++++++-
>>   5 files changed, 140 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>> index 1ab682d61e3c..f60090fe8cd2 100644
>> --- a/drivers/gpu/drm/xe/xe_bo.c
>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>> @@ -339,6 +339,15 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>>   		extra_pages = DIV_ROUND_UP(xe_device_ccs_bytes(xe, bo->size),
>>   					   PAGE_SIZE);
>>   
>> +	if (bo->smem_caching) {
>> +		if (bo->smem_caching == XE_GEM_CACHING_WB)
>> +			caching = ttm_cached;
>> +		else if (bo->smem_caching == XE_GEM_CACHING_WC)
>> +			caching = ttm_write_combined;
>> +		else if (bo->smem_caching == XE_GEM_CACHING_UC)
>> +			caching = ttm_uncached;
>> +	}
> 
> why not a switch/case?

Will fix.

> 
>> +
>>   	/*
>>   	 * Display scanout is always non-coherent with the CPU cache.
>>   	 *
>> @@ -1183,9 +1192,10 @@ void xe_bo_free(struct xe_bo *bo)
>>   	kfree(bo);
>>   }
>>   
>> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>>   				    struct xe_tile *tile, struct dma_resv *resv,
>>   				    struct ttm_lru_bulk_move *bulk, size_t size,
>> +				    u16 smem_caching, u16 coh_mode,
>>   				    enum ttm_bo_type type, u32 flags)
>>   {
>>   	struct ttm_operation_ctx ctx = {
>> @@ -1223,6 +1233,8 @@ struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>>   	bo->tile = tile;
>>   	bo->size = size;
>>   	bo->flags = flags;
>> +	bo->smem_caching = smem_caching;
>> +	bo->coh_mode = coh_mode;
>>   	bo->ttm.base.funcs = &xe_gem_object_funcs;
>>   	bo->props.preferred_mem_class = XE_BO_PROPS_INVALID;
>>   	bo->props.preferred_gt = XE_BO_PROPS_INVALID;
>> @@ -1306,10 +1318,11 @@ static int __xe_bo_fixed_placement(struct xe_device *xe,
>>   }
>>   
>>   struct xe_bo *
>> -xe_bo_create_locked_range(struct xe_device *xe,
>> -			  struct xe_tile *tile, struct xe_vm *vm,
>> -			  size_t size, u64 start, u64 end,
>> -			  enum ttm_bo_type type, u32 flags)
>> +__xe_bo_create_locked(struct xe_device *xe,
>> +		      struct xe_tile *tile, struct xe_vm *vm,
>> +		      size_t size, u64 start, u64 end,
>> +		      u16 smem_caching, u16 coh_mode,
>> +		      enum ttm_bo_type type, u32 flags)
>>   {
>>   	struct xe_bo *bo = NULL;
>>   	int err;
>> @@ -1330,10 +1343,11 @@ xe_bo_create_locked_range(struct xe_device *xe,
>>   		}
>>   	}
>>   
>> -	bo = __xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
>> +	bo = ___xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
>>   				   vm && !xe_vm_in_fault_mode(vm) &&
>>   				   flags & XE_BO_CREATE_USER_BIT ?
>>   				   &vm->lru_bulk_move : NULL, size,
>> +				   smem_caching, coh_mode,
>>   				   type, flags);
>>   	if (IS_ERR(bo))
>>   		return bo;
>> @@ -1367,11 +1381,31 @@ xe_bo_create_locked_range(struct xe_device *xe,
>>   	return ERR_PTR(err);
>>   }
>>   
>> +struct xe_bo *
>> +xe_bo_create_locked_range(struct xe_device *xe,
>> +			  struct xe_tile *tile, struct xe_vm *vm,
>> +			  size_t size, u64 start, u64 end,
>> +			  enum ttm_bo_type type, u32 flags)
>> +{
>> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
>> +}
>> +
>>   struct xe_bo *xe_bo_create_locked(struct xe_device *xe, struct xe_tile *tile,
>>   				  struct xe_vm *vm, size_t size,
>>   				  enum ttm_bo_type type, u32 flags)
>>   {
>> -	return xe_bo_create_locked_range(xe, tile, vm, size, 0, ~0ULL, type, flags);
>> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
>> +}
>> +
>> +struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
>> +				struct xe_vm *vm, size_t size,
>> +				enum ttm_bo_type type,
>> +				u16 smem_caching, u16 coh_mode,
>> +				u32 flags)
>> +{
>> +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
>> +				     smem_caching, coh_mode, type,
>> +				     flags | XE_BO_CREATE_USER_BIT);
>>   }
>>   
>>   struct xe_bo *xe_bo_create(struct xe_device *xe, struct xe_tile *tile,
>> @@ -1754,11 +1788,11 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>>   	struct ww_acquire_ctx ww;
>>   	struct xe_vm *vm = NULL;
>>   	struct xe_bo *bo;
>> -	unsigned int bo_flags = XE_BO_CREATE_USER_BIT;
>> +	unsigned int bo_flags;
>>   	u32 handle;
>>   	int err;
>>   
>> -	if (XE_IOCTL_DBG(xe, args->extensions) || XE_IOCTL_DBG(xe, args->pad) ||
>> +	if (XE_IOCTL_DBG(xe, args->extensions) ||
>>   	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
>>   		return -EINVAL;
>>   
>> @@ -1811,8 +1845,38 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>>   		bo_flags |= XE_BO_NEEDS_CPU_ACCESS;
>>   	}
>>   
>> -	bo = xe_bo_create(xe, NULL, vm, args->size, ttm_bo_type_device,
>> -			  bo_flags);
>> +	if (XE_IOCTL_DBG(xe, args->coh_mode > 2))
> 
> This will not allow XE_GEM_COHERENCY_2WAY, also would be better to use XE_GEM_COHERENCY_XX instead of a magic number.

Indeed.

> 
>> +		return -EINVAL;
>> +
>> +	if (XE_IOCTL_DBG(xe, args->smem_caching > 2))
>> +		return -EINVAL;
> 
> Same here, this will not allow XE_GEM_CACHING_UC

Pallavi also noticed this. Fixed locally.

> 
>> +
>> +        if (bo_flags & XE_BO_CREATE_SYSTEM_BIT) {
>> +		if (XE_IOCTL_DBG(xe, !args->coh_mode))
>> +			return -EINVAL;
>> +
>> +		if (XE_IOCTL_DBG(xe, !args->smem_caching))
>> +			return -EINVAL;
>> +
>> +		if (XE_IOCTL_DBG(xe, !IS_DGFX(xe) &&
>> +				 bo_flags & XE_BO_SCANOUT_BIT &&
>> +				 args->smem_caching == XE_GEM_CACHING_WB))
>> +			return -EINVAL;
>> +
>> +		if (args->coh_mode == XE_GEM_COHERENCY_NONE) {
>> +			if (XE_IOCTL_DBG(xe, args->smem_caching == XE_GEM_CACHING_WB))
>> +				return -EINVAL;
>> +		} else if (args->coh_mode == XE_GEM_COHERENCY_2WAY) {
>> +			if (XE_IOCTL_DBG(xe, args->smem_caching != XE_GEM_CACHING_WB))
>> +				return -EINVAL;
>> +		}
>> +	} else if (XE_IOCTL_DBG(xe, args->smem_caching)) {
>> +		return -EINVAL;
>> +	}
>> +
>> +	bo = xe_bo_create_user(xe, NULL, vm, args->size,
>> +			       args->smem_caching, args->coh_mode,
>> +			       ttm_bo_type_device, bo_flags);
>>   	if (IS_ERR(bo)) {
>>   		err = PTR_ERR(bo);
>>   		goto out_vm;
>> @@ -2093,10 +2157,11 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
>>   	args->size = ALIGN(mul_u32_u32(args->pitch, args->height),
>>   			   page_size);
>>   
>> -	bo = xe_bo_create(xe, NULL, NULL, args->size, ttm_bo_type_device,
>> -			  XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
>> -			  XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
>> -			  XE_BO_NEEDS_CPU_ACCESS);
>> +	bo = xe_bo_create_user(xe, NULL, NULL, args->size, ttm_bo_type_device,
>> +			       XE_GEM_CACHING_WC, XE_GEM_COHERENCY_NONE,
>> +			       XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
>> +			       XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
>> +			       XE_BO_NEEDS_CPU_ACCESS);
>>   	if (IS_ERR(bo))
>>   		return PTR_ERR(bo);
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
>> index 0823dda0f31b..2311ef3ffaf1 100644
>> --- a/drivers/gpu/drm/xe/xe_bo.h
>> +++ b/drivers/gpu/drm/xe/xe_bo.h
>> @@ -81,9 +81,10 @@ struct sg_table;
>>   struct xe_bo *xe_bo_alloc(void);
>>   void xe_bo_free(struct xe_bo *bo);
>>   
>> -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>> +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>>   				    struct xe_tile *tile, struct dma_resv *resv,
>>   				    struct ttm_lru_bulk_move *bulk, size_t size,
>> +				    u16 smem_caching, u16 coh_mode,
>>   				    enum ttm_bo_type type, u32 flags);
>>   struct xe_bo *
>>   xe_bo_create_locked_range(struct xe_device *xe,
>> diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
>> index f6ee920303af..a98ba5bed499 100644
>> --- a/drivers/gpu/drm/xe/xe_bo_types.h
>> +++ b/drivers/gpu/drm/xe/xe_bo_types.h
>> @@ -68,6 +68,16 @@ struct xe_bo {
>>   	struct llist_node freed;
>>   	/** @created: Whether the bo has passed initial creation */
>>   	bool created;
>> +	/**
>> +	 * @coh_mode: Coherency setting. Currently only used for userspace
>> +	 * objects.
>> +	 */
>> +	u16 coh_mode;
>> +	/**
>> +	 * @smem_caching: Caching mode for smem. Currently only used for
>> +	 * userspace objects.
>> +	 */
>> +	u16 smem_caching;
>>   };
>>   
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
>> index 975dee1f770f..8ba7daf011bc 100644
>> --- a/drivers/gpu/drm/xe/xe_dma_buf.c
>> +++ b/drivers/gpu/drm/xe/xe_dma_buf.c
>> @@ -199,8 +199,9 @@ xe_dma_buf_init_obj(struct drm_device *dev, struct xe_bo *storage,
>>   	int ret;
>>   
>>   	dma_resv_lock(resv, NULL);
>> -	bo = __xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
>> -				   ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
>> +	bo = ___xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
>> +				    0, 0, /* Will require 1way or 2way for vm_bind */
>> +				    ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
>>   	if (IS_ERR(bo)) {
>>   		ret = PTR_ERR(bo);
>>   		goto error;
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 86f16d50e9cc..64bc66d4b550 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -461,8 +461,51 @@ struct drm_xe_gem_create {
>>   	 */
>>   	__u32 handle;
>>   
>> -	/** @pad: MBZ */
>> -	__u32 pad;
>> +	/**
>> +	 * @coh_mode: The coherency mode for this object. This will limit the
>> +	 * possible @smem_caching values.
>> +	 *
>> +	 * Supported values:
>> +	 *
>> +	 * XE_GEM_COHERENCY_NONE: GPU access is assumed to be not coherent with
>> +	 * CPU. CPU caches are not snooped.
>> +	 *
>> +	 * XE_GEM_COHERENCY_1WAY: GPU access is coherent with CPU (CPU caches
>> +	 * are snooped) until GPU acquires. The acquire by the GPU is not
>> +	 * tracked by CPU caches.
>> +	 *
>> +	 * XE_GEM_COHERENCY_2WAY: Fully coherent between GPU and CPU. Fully
>> +	 * tracked by CPU caches. Both CPU and GPU caches are snooped.
>> +	 */
>> +#define XE_GEM_COHERENCY_NONE                  1
>> +#define XE_GEM_COHERENCY_1WAY                  2
>> +#define XE_GEM_COHERENCY_2WAY                  3
>> +	__u16 coh_mode;
> 
> Why coh_mode is necessary in the create uAPI?

With this it becomes an immutable creation time property of the BO. At 
creation time this limits the values userspace can choose for 
smem_caching (and anything invalid can already be rejected here). Since 
this is a property of the BO it then becomes natural that vm_binding it 
will need to use compatible coh_mode for the pat_index, which then also 
prevents mixing modes.

> It is not actually used in this patch and at the end the series it is only used to check if the coh_mode of the pat_index in the vm_bind is equal to
> coh_mode set in gem create.

In this patch it is also used to restrict the possible smem_caching 
values depending on the coh_mode and at creation time rejects anything 
that KMD considers invalid, for example NONE + WB.

> I guess it is not allowed to bind the same bo in 2 different addresses with different coh_mode, is that right?

Correct, the coh_mode must be compatible. Although if this become NONE 
and AT_LEAST_1WAY this might change slightly.

> 
> If so we still could remove it from the gem create uAPI, set coh_mode in gem_bo in the first bind of this bo and then check if equal in the subsequent
> binds.

Yeah, I think that would also work. Although the KMD would now need to 
consider what locking is needed to protect coh_mode, previously we had 
zero locking as per the api. Also given some object, the KMD in some 
cases no longer knows what the coh_mode will be (currently doesn't 
matter but maybe it does in the future?). Also with such an API should 
the unmap also reset things, assuming there are no remaining binds?

Assumption here is that UMD knows at BO creation time the suitable 
coh_mode and doesn't need to mix modes when binding, so locking it down 
by making it an immutable creation time property seems reasonable. I 
think the API and implementation will also be slightly simpler in the end.

Does Mesa need more flexibility here?

> 
>> +
>> +	/**
>> +	 * @smem_caching: The CPU caching mode to select for system memory.
>> +	 *
>> +	 * Supported values:
>> +	 *
>> +	 * XE_GEM_CACHING_WB: Allocate the pages with write-back caching.  On
>> +	 * iGPU this can't be used for scanout surfaces. The @coh_mode must
>> +	 * either be XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY.
>> +	 *
>> +	 * XE_GEM_CACHING_WC: Allocate the pages as write-combined. This is
>> +	 * uncached. The @coh_mode must be XE_GEM_COHERENCY_NONE or
>> +	 * XE_GEM_COHERENCY_1WAY. Scanout surfaces should likely use this on
>> +	 * igpu.
>> +	 *
>> +	 * XE_GEM_CACHING_UC: Allocate the pages as uncached. The @coh_mode must
>> +	 * be XE_GEM_COHERENCY_NONE or XE_GEM_COHERENCY_1WAY. Scanout surfaces
>> +	 * are permitted to use this.
>> +	 *
>> +	 * MUST be left as zero for VRAM-only objects.
>> +	 */
>> +#define XE_GEM_CACHING_WB                      1
>> +#define XE_GEM_CACHING_WC                      2
>> +#define XE_GEM_CACHING_UC                      3
>> +	__u16 smem_caching;
>>   
>>   	/** @reserved: Reserved */
>>   	__u64 reserved[2];
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode
  2023-09-05  9:04     ` Matthew Auld
@ 2023-09-05 15:30       ` Souza, Jose
  0 siblings, 0 replies; 47+ messages in thread
From: Souza, Jose @ 2023-09-05 15:30 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, joonas.lahtinen@linux.intel.com,
	Mishra,  Pallavi, Zhang, Carl, Auld, Matthew,
	thomas.hellstrom@linux.intel.com, Roper, Matthew D

On Tue, 2023-09-05 at 10:04 +0100, Matthew Auld wrote:
> On 04/09/2023 21:00, Souza, Jose wrote:
> > On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> > > From: Pallavi Mishra <pallavi.mishra@intel.com>
> > > 
> > > Allow userspace to specify the CPU caching mode to use for system memory
> > > in addition to coherency modes during object creation. Modify gem create
> > > handler and introduce xe_bo_create_user to replace xe_bo_create. In a
> > > later patch we will support setting the pat_index as part of vm_bind,
> > > where expectation is that the coherency mode extracted from the
> > > pat_index must match the one set at object creation.
> > > 
> > > Co-authored-by: Matthew Auld <matthew.auld@intel.com>
> > > Signed-off-by: Pallavi Mishra <pallavi.mishra@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: José Roberto de Souza <jose.souza@intel.com>
> > > Cc: Filip Hazubski <filip.hazubski@intel.com>
> > > Cc: Carl Zhang <carl.zhang@intel.com>
> > > Cc: Effie Yu <effie.yu@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_bo.c       | 95 +++++++++++++++++++++++++++-----
> > >   drivers/gpu/drm/xe/xe_bo.h       |  3 +-
> > >   drivers/gpu/drm/xe/xe_bo_types.h | 10 ++++
> > >   drivers/gpu/drm/xe/xe_dma_buf.c  |  5 +-
> > >   include/uapi/drm/xe_drm.h        | 47 +++++++++++++++-
> > >   5 files changed, 140 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > index 1ab682d61e3c..f60090fe8cd2 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > @@ -339,6 +339,15 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
> > >   		extra_pages = DIV_ROUND_UP(xe_device_ccs_bytes(xe, bo->size),
> > >   					   PAGE_SIZE);
> > >   
> > > +	if (bo->smem_caching) {
> > > +		if (bo->smem_caching == XE_GEM_CACHING_WB)
> > > +			caching = ttm_cached;
> > > +		else if (bo->smem_caching == XE_GEM_CACHING_WC)
> > > +			caching = ttm_write_combined;
> > > +		else if (bo->smem_caching == XE_GEM_CACHING_UC)
> > > +			caching = ttm_uncached;
> > > +	}
> > 
> > why not a switch/case?
> 
> Will fix.
> 
> > 
> > > +
> > >   	/*
> > >   	 * Display scanout is always non-coherent with the CPU cache.
> > >   	 *
> > > @@ -1183,9 +1192,10 @@ void xe_bo_free(struct xe_bo *bo)
> > >   	kfree(bo);
> > >   }
> > >   
> > > -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> > > +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> > >   				    struct xe_tile *tile, struct dma_resv *resv,
> > >   				    struct ttm_lru_bulk_move *bulk, size_t size,
> > > +				    u16 smem_caching, u16 coh_mode,
> > >   				    enum ttm_bo_type type, u32 flags)
> > >   {
> > >   	struct ttm_operation_ctx ctx = {
> > > @@ -1223,6 +1233,8 @@ struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> > >   	bo->tile = tile;
> > >   	bo->size = size;
> > >   	bo->flags = flags;
> > > +	bo->smem_caching = smem_caching;
> > > +	bo->coh_mode = coh_mode;
> > >   	bo->ttm.base.funcs = &xe_gem_object_funcs;
> > >   	bo->props.preferred_mem_class = XE_BO_PROPS_INVALID;
> > >   	bo->props.preferred_gt = XE_BO_PROPS_INVALID;
> > > @@ -1306,10 +1318,11 @@ static int __xe_bo_fixed_placement(struct xe_device *xe,
> > >   }
> > >   
> > >   struct xe_bo *
> > > -xe_bo_create_locked_range(struct xe_device *xe,
> > > -			  struct xe_tile *tile, struct xe_vm *vm,
> > > -			  size_t size, u64 start, u64 end,
> > > -			  enum ttm_bo_type type, u32 flags)
> > > +__xe_bo_create_locked(struct xe_device *xe,
> > > +		      struct xe_tile *tile, struct xe_vm *vm,
> > > +		      size_t size, u64 start, u64 end,
> > > +		      u16 smem_caching, u16 coh_mode,
> > > +		      enum ttm_bo_type type, u32 flags)
> > >   {
> > >   	struct xe_bo *bo = NULL;
> > >   	int err;
> > > @@ -1330,10 +1343,11 @@ xe_bo_create_locked_range(struct xe_device *xe,
> > >   		}
> > >   	}
> > >   
> > > -	bo = __xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
> > > +	bo = ___xe_bo_create_locked(xe, bo, tile, vm ? &vm->resv : NULL,
> > >   				   vm && !xe_vm_in_fault_mode(vm) &&
> > >   				   flags & XE_BO_CREATE_USER_BIT ?
> > >   				   &vm->lru_bulk_move : NULL, size,
> > > +				   smem_caching, coh_mode,
> > >   				   type, flags);
> > >   	if (IS_ERR(bo))
> > >   		return bo;
> > > @@ -1367,11 +1381,31 @@ xe_bo_create_locked_range(struct xe_device *xe,
> > >   	return ERR_PTR(err);
> > >   }
> > >   
> > > +struct xe_bo *
> > > +xe_bo_create_locked_range(struct xe_device *xe,
> > > +			  struct xe_tile *tile, struct xe_vm *vm,
> > > +			  size_t size, u64 start, u64 end,
> > > +			  enum ttm_bo_type type, u32 flags)
> > > +{
> > > +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
> > > +}
> > > +
> > >   struct xe_bo *xe_bo_create_locked(struct xe_device *xe, struct xe_tile *tile,
> > >   				  struct xe_vm *vm, size_t size,
> > >   				  enum ttm_bo_type type, u32 flags)
> > >   {
> > > -	return xe_bo_create_locked_range(xe, tile, vm, size, 0, ~0ULL, type, flags);
> > > +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL, 0, 0, type, flags);
> > > +}
> > > +
> > > +struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
> > > +				struct xe_vm *vm, size_t size,
> > > +				enum ttm_bo_type type,
> > > +				u16 smem_caching, u16 coh_mode,
> > > +				u32 flags)
> > > +{
> > > +	return __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
> > > +				     smem_caching, coh_mode, type,
> > > +				     flags | XE_BO_CREATE_USER_BIT);
> > >   }
> > >   
> > >   struct xe_bo *xe_bo_create(struct xe_device *xe, struct xe_tile *tile,
> > > @@ -1754,11 +1788,11 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
> > >   	struct ww_acquire_ctx ww;
> > >   	struct xe_vm *vm = NULL;
> > >   	struct xe_bo *bo;
> > > -	unsigned int bo_flags = XE_BO_CREATE_USER_BIT;
> > > +	unsigned int bo_flags;
> > >   	u32 handle;
> > >   	int err;
> > >   
> > > -	if (XE_IOCTL_DBG(xe, args->extensions) || XE_IOCTL_DBG(xe, args->pad) ||
> > > +	if (XE_IOCTL_DBG(xe, args->extensions) ||
> > >   	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
> > >   		return -EINVAL;
> > >   
> > > @@ -1811,8 +1845,38 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
> > >   		bo_flags |= XE_BO_NEEDS_CPU_ACCESS;
> > >   	}
> > >   
> > > -	bo = xe_bo_create(xe, NULL, vm, args->size, ttm_bo_type_device,
> > > -			  bo_flags);
> > > +	if (XE_IOCTL_DBG(xe, args->coh_mode > 2))
> > 
> > This will not allow XE_GEM_COHERENCY_2WAY, also would be better to use XE_GEM_COHERENCY_XX instead of a magic number.
> 
> Indeed.
> 
> > 
> > > +		return -EINVAL;
> > > +
> > > +	if (XE_IOCTL_DBG(xe, args->smem_caching > 2))
> > > +		return -EINVAL;
> > 
> > Same here, this will not allow XE_GEM_CACHING_UC
> 
> Pallavi also noticed this. Fixed locally.
> 
> > 
> > > +
> > > +        if (bo_flags & XE_BO_CREATE_SYSTEM_BIT) {
> > > +		if (XE_IOCTL_DBG(xe, !args->coh_mode))
> > > +			return -EINVAL;
> > > +
> > > +		if (XE_IOCTL_DBG(xe, !args->smem_caching))
> > > +			return -EINVAL;
> > > +
> > > +		if (XE_IOCTL_DBG(xe, !IS_DGFX(xe) &&
> > > +				 bo_flags & XE_BO_SCANOUT_BIT &&
> > > +				 args->smem_caching == XE_GEM_CACHING_WB))
> > > +			return -EINVAL;
> > > +
> > > +		if (args->coh_mode == XE_GEM_COHERENCY_NONE) {
> > > +			if (XE_IOCTL_DBG(xe, args->smem_caching == XE_GEM_CACHING_WB))
> > > +				return -EINVAL;
> > > +		} else if (args->coh_mode == XE_GEM_COHERENCY_2WAY) {
> > > +			if (XE_IOCTL_DBG(xe, args->smem_caching != XE_GEM_CACHING_WB))
> > > +				return -EINVAL;
> > > +		}
> > > +	} else if (XE_IOCTL_DBG(xe, args->smem_caching)) {
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	bo = xe_bo_create_user(xe, NULL, vm, args->size,
> > > +			       args->smem_caching, args->coh_mode,
> > > +			       ttm_bo_type_device, bo_flags);
> > >   	if (IS_ERR(bo)) {
> > >   		err = PTR_ERR(bo);
> > >   		goto out_vm;
> > > @@ -2093,10 +2157,11 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
> > >   	args->size = ALIGN(mul_u32_u32(args->pitch, args->height),
> > >   			   page_size);
> > >   
> > > -	bo = xe_bo_create(xe, NULL, NULL, args->size, ttm_bo_type_device,
> > > -			  XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
> > > -			  XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
> > > -			  XE_BO_NEEDS_CPU_ACCESS);
> > > +	bo = xe_bo_create_user(xe, NULL, NULL, args->size, ttm_bo_type_device,
> > > +			       XE_GEM_CACHING_WC, XE_GEM_COHERENCY_NONE,
> > > +			       XE_BO_CREATE_VRAM_IF_DGFX(xe_device_get_root_tile(xe)) |
> > > +			       XE_BO_CREATE_USER_BIT | XE_BO_SCANOUT_BIT |
> > > +			       XE_BO_NEEDS_CPU_ACCESS);
> > >   	if (IS_ERR(bo))
> > >   		return PTR_ERR(bo);
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> > > index 0823dda0f31b..2311ef3ffaf1 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.h
> > > +++ b/drivers/gpu/drm/xe/xe_bo.h
> > > @@ -81,9 +81,10 @@ struct sg_table;
> > >   struct xe_bo *xe_bo_alloc(void);
> > >   void xe_bo_free(struct xe_bo *bo);
> > >   
> > > -struct xe_bo *__xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> > > +struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
> > >   				    struct xe_tile *tile, struct dma_resv *resv,
> > >   				    struct ttm_lru_bulk_move *bulk, size_t size,
> > > +				    u16 smem_caching, u16 coh_mode,
> > >   				    enum ttm_bo_type type, u32 flags);
> > >   struct xe_bo *
> > >   xe_bo_create_locked_range(struct xe_device *xe,
> > > diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
> > > index f6ee920303af..a98ba5bed499 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> > > @@ -68,6 +68,16 @@ struct xe_bo {
> > >   	struct llist_node freed;
> > >   	/** @created: Whether the bo has passed initial creation */
> > >   	bool created;
> > > +	/**
> > > +	 * @coh_mode: Coherency setting. Currently only used for userspace
> > > +	 * objects.
> > > +	 */
> > > +	u16 coh_mode;
> > > +	/**
> > > +	 * @smem_caching: Caching mode for smem. Currently only used for
> > > +	 * userspace objects.
> > > +	 */
> > > +	u16 smem_caching;
> > >   };
> > >   
> > >   #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
> > > index 975dee1f770f..8ba7daf011bc 100644
> > > --- a/drivers/gpu/drm/xe/xe_dma_buf.c
> > > +++ b/drivers/gpu/drm/xe/xe_dma_buf.c
> > > @@ -199,8 +199,9 @@ xe_dma_buf_init_obj(struct drm_device *dev, struct xe_bo *storage,
> > >   	int ret;
> > >   
> > >   	dma_resv_lock(resv, NULL);
> > > -	bo = __xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
> > > -				   ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
> > > +	bo = ___xe_bo_create_locked(xe, storage, NULL, resv, NULL, dma_buf->size,
> > > +				    0, 0, /* Will require 1way or 2way for vm_bind */
> > > +				    ttm_bo_type_sg, XE_BO_CREATE_SYSTEM_BIT);
> > >   	if (IS_ERR(bo)) {
> > >   		ret = PTR_ERR(bo);
> > >   		goto error;
> > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > index 86f16d50e9cc..64bc66d4b550 100644
> > > --- a/include/uapi/drm/xe_drm.h
> > > +++ b/include/uapi/drm/xe_drm.h
> > > @@ -461,8 +461,51 @@ struct drm_xe_gem_create {
> > >   	 */
> > >   	__u32 handle;
> > >   
> > > -	/** @pad: MBZ */
> > > -	__u32 pad;
> > > +	/**
> > > +	 * @coh_mode: The coherency mode for this object. This will limit the
> > > +	 * possible @smem_caching values.
> > > +	 *
> > > +	 * Supported values:
> > > +	 *
> > > +	 * XE_GEM_COHERENCY_NONE: GPU access is assumed to be not coherent with
> > > +	 * CPU. CPU caches are not snooped.
> > > +	 *
> > > +	 * XE_GEM_COHERENCY_1WAY: GPU access is coherent with CPU (CPU caches
> > > +	 * are snooped) until GPU acquires. The acquire by the GPU is not
> > > +	 * tracked by CPU caches.
> > > +	 *
> > > +	 * XE_GEM_COHERENCY_2WAY: Fully coherent between GPU and CPU. Fully
> > > +	 * tracked by CPU caches. Both CPU and GPU caches are snooped.
> > > +	 */
> > > +#define XE_GEM_COHERENCY_NONE                  1
> > > +#define XE_GEM_COHERENCY_1WAY                  2
> > > +#define XE_GEM_COHERENCY_2WAY                  3
> > > +	__u16 coh_mode;
> > 
> > Why coh_mode is necessary in the create uAPI?
> 
> With this it becomes an immutable creation time property of the BO. At 
> creation time this limits the values userspace can choose for 
> smem_caching (and anything invalid can already be rejected here). Since 
> this is a property of the BO it then becomes natural that vm_binding it 
> will need to use compatible coh_mode for the pat_index, which then also 
> prevents mixing modes.
> 
> > It is not actually used in this patch and at the end the series it is only used to check if the coh_mode of the pat_index in the vm_bind is equal to
> > coh_mode set in gem create.
> 
> In this patch it is also used to restrict the possible smem_caching 
> values depending on the coh_mode and at creation time rejects anything 
> that KMD considers invalid, for example NONE + WB.
> 
> > I guess it is not allowed to bind the same bo in 2 different addresses with different coh_mode, is that right?
> 
> Correct, the coh_mode must be compatible. Although if this become NONE 
> and AT_LEAST_1WAY this might change slightly.
> 
> > 
> > If so we still could remove it from the gem create uAPI, set coh_mode in gem_bo in the first bind of this bo and then check if equal in the subsequent
> > binds.
> 
> Yeah, I think that would also work. Although the KMD would now need to 
> consider what locking is needed to protect coh_mode, previously we had 
> zero locking as per the api. Also given some object, the KMD in some 
> cases no longer knows what the coh_mode will be (currently doesn't 
> matter but maybe it does in the future?). Also with such an API should 
> the unmap also reset things, assuming there are no remaining binds?
> 
> Assumption here is that UMD knows at BO creation time the suitable 
> coh_mode and doesn't need to mix modes when binding, so locking it down 
> by making it an immutable creation time property seems reasonable. I 
> think the API and implementation will also be slightly simpler in the end.
> 
> Does Mesa need more flexibility here?

Okay, if it involves locking and more complexity leave as is.

> 
> > 
> > > +
> > > +	/**
> > > +	 * @smem_caching: The CPU caching mode to select for system memory.
> > > +	 *
> > > +	 * Supported values:
> > > +	 *
> > > +	 * XE_GEM_CACHING_WB: Allocate the pages with write-back caching.  On
> > > +	 * iGPU this can't be used for scanout surfaces. The @coh_mode must
> > > +	 * either be XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY.
> > > +	 *
> > > +	 * XE_GEM_CACHING_WC: Allocate the pages as write-combined. This is
> > > +	 * uncached. The @coh_mode must be XE_GEM_COHERENCY_NONE or
> > > +	 * XE_GEM_COHERENCY_1WAY. Scanout surfaces should likely use this on
> > > +	 * igpu.
> > > +	 *
> > > +	 * XE_GEM_CACHING_UC: Allocate the pages as uncached. The @coh_mode must
> > > +	 * be XE_GEM_COHERENCY_NONE or XE_GEM_COHERENCY_1WAY. Scanout surfaces
> > > +	 * are permitted to use this.
> > > +	 *
> > > +	 * MUST be left as zero for VRAM-only objects.
> > > +	 */
> > > +#define XE_GEM_CACHING_WB                      1
> > > +#define XE_GEM_CACHING_WC                      2
> > > +#define XE_GEM_CACHING_UC                      3
> > > +	__u16 smem_caching;
> > >   
> > >   	/** @reserved: Reserved */
> > >   	__u64 reserved[2];
> > 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl
  2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
  2023-08-29 16:28 ` [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode Matthew Auld
@ 2023-08-29 16:28 ` Matthew Auld
  2023-08-29 18:46   ` Matt Roper
  2023-08-30  1:20   ` Mishra, Pallavi
  2023-08-29 16:28 ` [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info Matthew Auld
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-29 16:28 UTC (permalink / raw)
  To: intel-xe; +Cc: Matt Roper, Lucas De Marchi

Matches i915. Assumption going forward is that non-llc + igpu is only a
thing on MTL+ which should have explicit coherency pat_index settings
for COH_NONE, 1WAY and 2WAY.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Pallavi Mishra <pallavi.mishra@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
 drivers/gpu/drm/xe/xe_pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 7fb00ea410a6..791107dec045 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -240,6 +240,7 @@ static const struct xe_device_desc rkl_desc = {
 	.media = &media_xem,
 	PLATFORM(XE_ROCKETLAKE),
 	.has_display = true,
+	.has_llc = true,
 	.require_force_probe = true,
 };
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl
  2023-08-29 16:28 ` [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl Matthew Auld
@ 2023-08-29 18:46   ` Matt Roper
  2023-08-30  1:20   ` Mishra, Pallavi
  1 sibling, 0 replies; 47+ messages in thread
From: Matt Roper @ 2023-08-29 18:46 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Lucas De Marchi, intel-xe

On Tue, Aug 29, 2023 at 05:28:43PM +0100, Matthew Auld wrote:
> Matches i915. Assumption going forward is that non-llc + igpu is only a
> thing on MTL+ which should have explicit coherency pat_index settings
> for COH_NONE, 1WAY and 2WAY.
> 
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>

Reviewed-by: Matt Roper <matthew.d.roper@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_pci.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index 7fb00ea410a6..791107dec045 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -240,6 +240,7 @@ static const struct xe_device_desc rkl_desc = {
>  	.media = &media_xem,
>  	PLATFORM(XE_ROCKETLAKE),
>  	.has_display = true,
> +	.has_llc = true,
>  	.require_force_probe = true,
>  };
>  
> -- 
> 2.41.0
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl
  2023-08-29 16:28 ` [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl Matthew Auld
  2023-08-29 18:46   ` Matt Roper
@ 2023-08-30  1:20   ` Mishra, Pallavi
  1 sibling, 0 replies; 47+ messages in thread
From: Mishra, Pallavi @ 2023-08-30  1:20 UTC (permalink / raw)
  To: Auld, Matthew, intel-xe@lists.freedesktop.org
  Cc: Roper, Matthew D, De Marchi, Lucas



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Tuesday, August 29, 2023 9:29 AM
> To: intel-xe@lists.freedesktop.org
> Cc: Mishra, Pallavi <pallavi.mishra@intel.com>; De Marchi, Lucas
> <lucas.demarchi@intel.com>; Roper, Matthew D
> <matthew.d.roper@intel.com>
> Subject: [RFC 2/5] drm/xe: fix has_llc on rkl
> 
> Matches i915. Assumption going forward is that non-llc + igpu is only a thing
> on MTL+ which should have explicit coherency pat_index settings for
> COH_NONE, 1WAY and 2WAY.
> 
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pci.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index
> 7fb00ea410a6..791107dec045 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -240,6 +240,7 @@ static const struct xe_device_desc rkl_desc = {
>  	.media = &media_xem,
>  	PLATFORM(XE_ROCKETLAKE),
>  	.has_display = true,
> +	.has_llc = true,
>  	.require_force_probe = true,
>  };
> 

Reviewed-by: Pallavi Mishra <pallavi.mishra@intel.com>

> --
> 2.41.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
  2023-08-29 16:28 ` [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode Matthew Auld
  2023-08-29 16:28 ` [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl Matthew Auld
@ 2023-08-29 16:28 ` Matthew Auld
  2023-08-29 19:14   ` Matt Roper
                     ` (3 more replies)
  2023-08-29 16:28 ` [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode Matthew Auld
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-29 16:28 UTC (permalink / raw)
  To: intel-xe; +Cc: Matt Roper, Lucas De Marchi

We need to able to know the max pat_index range for a given platform, as
well being able to lookup the pat_index for a given platform in upcoming
vm_bind uapi, where userspace can directly provide the pat_index. Move
the platform definition of the pat_table into the device info with the
idea of encoding more information about each pat_index in a future
patch.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Pallavi Mishra <pallavi.mishra@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h |  3 +++
 drivers/gpu/drm/xe/xe_pat.c          | 39 ++++++++++++++++++----------
 drivers/gpu/drm/xe/xe_pat.h          |  3 ++-
 drivers/gpu/drm/xe/xe_pci.c          |  6 +++++
 4 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 5037b8c180b8..06235da647bb 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -238,6 +238,9 @@ struct xe_device {
 		/** @enable_display: display enabled */
 		u8 enable_display:1;
 
+		const u32 *pat_table;
+		int pat_table_n_entries;
+
 #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
 		const struct intel_display_device_info *display;
 		struct intel_display_runtime_info display_runtime;
diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
index a468655db956..f19f5d8dcd94 100644
--- a/drivers/gpu/drm/xe/xe_pat.c
+++ b/drivers/gpu/drm/xe/xe_pat.c
@@ -106,24 +106,17 @@ static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
 	}
 }
 
-void xe_pat_init(struct xe_gt *gt)
+int xe_pat_fill_info(struct xe_device *xe)
 {
-	struct xe_device *xe = gt_to_xe(gt);
-
 	if (xe->info.platform == XE_METEORLAKE) {
-		/*
-		 * SAMedia register offsets are adjusted by the write methods
-		 * and they target registers that are not MCR, while for normal
-		 * GT they are MCR
-		 */
-		if (xe_gt_is_media_type(gt))
-			program_pat(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
-		else
-			program_pat_mcr(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
+		xe->info.pat_table = mtl_pat_table;
+		xe->info.pat_table_n_entries = ARRAY_SIZE(mtl_pat_table);
 	} else if (xe->info.platform == XE_PVC || xe->info.platform == XE_DG2) {
-		program_pat_mcr(gt, pvc_pat_table, ARRAY_SIZE(pvc_pat_table));
+		xe->info.pat_table = pvc_pat_table;
+		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
 	} else if (GRAPHICS_VERx100(xe) <= 1210) {
-		program_pat(gt, tgl_pat_table, ARRAY_SIZE(tgl_pat_table));
+		xe->info.pat_table = tgl_pat_table;
+		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
 	} else {
 		/*
 		 * Going forward we expect to need new PAT settings for most
@@ -135,7 +128,25 @@ void xe_pat_init(struct xe_gt *gt)
 		 */
 		drm_err(&xe->drm, "Missing PAT table for platform with graphics version %d.%2d!\n",
 			GRAPHICS_VER(xe), GRAPHICS_VERx100(xe) % 100);
+		return -ENODEV;
 	}
+
+	return 0;
+}
+
+void xe_pat_init(struct xe_gt *gt)
+{
+	struct xe_device *xe = gt_to_xe(gt);
+
+	/*
+	 * SAMedia register offsets are adjusted by the write methods
+	 * and they target registers that are not MCR, while for normal
+	 * GT they are MCR.
+	 */
+	if (xe_gt_is_media_type(gt) || GRAPHICS_VERx100(xe) < 1255)
+		program_pat(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
+	else
+		program_pat_mcr(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
 }
 
 void xe_pte_pat_init(struct xe_device *xe)
diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
index 54022f591621..9ab059758ad1 100644
--- a/drivers/gpu/drm/xe/xe_pat.h
+++ b/drivers/gpu/drm/xe/xe_pat.h
@@ -26,8 +26,9 @@
 #define XELPG_PAT_WB_CACHE_1_WAY       3
 
 struct xe_gt;
-extern struct xe_device *xe;
+struct xe_device;
 
+int xe_pat_fill_info(struct xe_device *xe);
 void xe_pat_init(struct xe_gt *gt);
 void xe_pte_pat_init(struct xe_device *xe);
 unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 791107dec045..24f2021aae22 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -22,6 +22,7 @@
 #include "xe_gt.h"
 #include "xe_macros.h"
 #include "xe_module.h"
+#include "xe_pat.h"
 #include "xe_pci_types.h"
 #include "xe_pm.h"
 #include "xe_step.h"
@@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
 	struct xe_tile *tile;
 	struct xe_gt *gt;
 	u8 id;
+	int err;
 
 	xe->info.platform = desc->platform;
 	xe->info.subplatform = subplatform_desc ?
@@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
 	xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
 				  enable_display &&
 				  desc->has_display;
+
+	err = xe_pat_fill_info(xe);
+	if (err)
+		return err;
 	/*
 	 * All platforms have at least one primary GT.  Any platform with media
 	 * version 13 or higher has an additional dedicated media GT.  And
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 16:28 ` [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info Matthew Auld
@ 2023-08-29 19:14   ` Matt Roper
  2023-08-29 21:49   ` Lucas De Marchi
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Matt Roper @ 2023-08-29 19:14 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Lucas De Marchi, intel-xe

On Tue, Aug 29, 2023 at 05:28:44PM +0100, Matthew Auld wrote:
> We need to able to know the max pat_index range for a given platform, as
> well being able to lookup the pat_index for a given platform in upcoming
> vm_bind uapi, where userspace can directly provide the pat_index. Move
> the platform definition of the pat_table into the device info with the
> idea of encoding more information about each pat_index in a future
> patch.
> 
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device_types.h |  3 +++
>  drivers/gpu/drm/xe/xe_pat.c          | 39 ++++++++++++++++++----------
>  drivers/gpu/drm/xe/xe_pat.h          |  3 ++-
>  drivers/gpu/drm/xe/xe_pci.c          |  6 +++++
>  4 files changed, 36 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 5037b8c180b8..06235da647bb 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -238,6 +238,9 @@ struct xe_device {
>  		/** @enable_display: display enabled */
>  		u8 enable_display:1;
>  
> +		const u32 *pat_table;
> +		int pat_table_n_entries;

This is okay for our current platforms, but we might need to revisit the
placement in the future (potentially moving these to xe_gt instead).
Although MTL and LNL happen to use the same table for both the primary
GT and the media GT, the table is technically a per-GT concept that gets
programmed into GT registers.  On a future platform, there could
theoretically be some graphics-specific and/or media-specific fields in
each GT's PAT table encoding that make the tables diverge, even though
the high-level meaning of each PAT index remains the same.  And then the
proper PAT tables could probably be selected from the IP descriptors in
xe_pci.c (e.g., xe2_graphics and such).  But we can put that off until
we have a platform that actually needs it.

> +
>  #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>  		const struct intel_display_device_info *display;
>  		struct intel_display_runtime_info display_runtime;
> diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
> index a468655db956..f19f5d8dcd94 100644
> --- a/drivers/gpu/drm/xe/xe_pat.c
> +++ b/drivers/gpu/drm/xe/xe_pat.c
> @@ -106,24 +106,17 @@ static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
>  	}
>  }
>  
> -void xe_pat_init(struct xe_gt *gt)
> +int xe_pat_fill_info(struct xe_device *xe)
>  {
> -	struct xe_device *xe = gt_to_xe(gt);
> -
>  	if (xe->info.platform == XE_METEORLAKE) {
> -		/*
> -		 * SAMedia register offsets are adjusted by the write methods
> -		 * and they target registers that are not MCR, while for normal
> -		 * GT they are MCR
> -		 */
> -		if (xe_gt_is_media_type(gt))
> -			program_pat(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
> -		else
> -			program_pat_mcr(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
> +		xe->info.pat_table = mtl_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(mtl_pat_table);
>  	} else if (xe->info.platform == XE_PVC || xe->info.platform == XE_DG2) {
> -		program_pat_mcr(gt, pvc_pat_table, ARRAY_SIZE(pvc_pat_table));
> +		xe->info.pat_table = pvc_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
>  	} else if (GRAPHICS_VERx100(xe) <= 1210) {
> -		program_pat(gt, tgl_pat_table, ARRAY_SIZE(tgl_pat_table));
> +		xe->info.pat_table = tgl_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
>  	} else {
>  		/*
>  		 * Going forward we expect to need new PAT settings for most
> @@ -135,7 +128,25 @@ void xe_pat_init(struct xe_gt *gt)
>  		 */
>  		drm_err(&xe->drm, "Missing PAT table for platform with graphics version %d.%2d!\n",
>  			GRAPHICS_VER(xe), GRAPHICS_VERx100(xe) % 100);
> +		return -ENODEV;
>  	}
> +
> +	return 0;
> +}
> +
> +void xe_pat_init(struct xe_gt *gt)
> +{
> +	struct xe_device *xe = gt_to_xe(gt);
> +
> +	/*
> +	 * SAMedia register offsets are adjusted by the write methods
> +	 * and they target registers that are not MCR, while for normal
> +	 * GT they are MCR.
> +	 */
> +	if (xe_gt_is_media_type(gt) || GRAPHICS_VERx100(xe) < 1255)
> +		program_pat(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
> +	else
> +		program_pat_mcr(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
>  }
>  
>  void xe_pte_pat_init(struct xe_device *xe)
> diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
> index 54022f591621..9ab059758ad1 100644
> --- a/drivers/gpu/drm/xe/xe_pat.h
> +++ b/drivers/gpu/drm/xe/xe_pat.h
> @@ -26,8 +26,9 @@
>  #define XELPG_PAT_WB_CACHE_1_WAY       3
>  
>  struct xe_gt;
> -extern struct xe_device *xe;
> +struct xe_device;
>  
> +int xe_pat_fill_info(struct xe_device *xe);
>  void xe_pat_init(struct xe_gt *gt);
>  void xe_pte_pat_init(struct xe_device *xe);
>  unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index 791107dec045..24f2021aae22 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -22,6 +22,7 @@
>  #include "xe_gt.h"
>  #include "xe_macros.h"
>  #include "xe_module.h"
> +#include "xe_pat.h"
>  #include "xe_pci_types.h"
>  #include "xe_pm.h"
>  #include "xe_step.h"
> @@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
>  	struct xe_tile *tile;
>  	struct xe_gt *gt;
>  	u8 id;
> +	int err;
>  
>  	xe->info.platform = desc->platform;
>  	xe->info.subplatform = subplatform_desc ?
> @@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
>  	xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
>  				  enable_display &&
>  				  desc->has_display;
> +
> +	err = xe_pat_fill_info(xe);
> +	if (err)
> +		return err;

Nitpick:  should there be another blank line here?

Otherwise,

Reviewed-by: Matt Roper <matthew.d.roper@intel.com>

>  	/*
>  	 * All platforms have at least one primary GT.  Any platform with media
>  	 * version 13 or higher has an additional dedicated media GT.  And
> -- 
> 2.41.0
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 16:28 ` [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info Matthew Auld
  2023-08-29 19:14   ` Matt Roper
@ 2023-08-29 21:49   ` Lucas De Marchi
  2023-08-29 22:20     ` Matt Roper
  2023-08-30  9:43     ` Matthew Auld
  2023-08-30  5:14   ` Mishra, Pallavi
  2023-09-05 20:50   ` Souza, Jose
  3 siblings, 2 replies; 47+ messages in thread
From: Lucas De Marchi @ 2023-08-29 21:49 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Matt Roper, intel-xe

On Tue, Aug 29, 2023 at 05:28:44PM +0100, Matthew Auld wrote:
>We need to able to know the max pat_index range for a given platform, as
>well being able to lookup the pat_index for a given platform in upcoming
>vm_bind uapi, where userspace can directly provide the pat_index. Move
>the platform definition of the pat_table into the device info with the
>idea of encoding more information about each pat_index in a future
>patch.
>
>Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>Cc: Matt Roper <matthew.d.roper@intel.com>
>---
> drivers/gpu/drm/xe/xe_device_types.h |  3 +++
> drivers/gpu/drm/xe/xe_pat.c          | 39 ++++++++++++++++++----------
> drivers/gpu/drm/xe/xe_pat.h          |  3 ++-
> drivers/gpu/drm/xe/xe_pci.c          |  6 +++++
> 4 files changed, 36 insertions(+), 15 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index 5037b8c180b8..06235da647bb 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -238,6 +238,9 @@ struct xe_device {
> 		/** @enable_display: display enabled */
> 		u8 enable_display:1;
>
>+		const u32 *pat_table;
>+		int pat_table_n_entries;

as it's expected to have this "owned" by xe_pat, I'd
use a new substruct "pat" rather than abusing the info.

	struct {
		const u32 *table;
		unsigned int n_entries;
	} pat;

Then we can add more fields as we need them (example for annotating the
idx we want to use for kernel ops)

>+
> #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> 		const struct intel_display_device_info *display;
> 		struct intel_display_runtime_info display_runtime;
>diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
>index a468655db956..f19f5d8dcd94 100644
>--- a/drivers/gpu/drm/xe/xe_pat.c
>+++ b/drivers/gpu/drm/xe/xe_pat.c
>@@ -106,24 +106,17 @@ static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
> 	}
> }
>
>-void xe_pat_init(struct xe_gt *gt)
>+int xe_pat_fill_info(struct xe_device *xe)

xe_pat_init_early() to follow what is done in other places as being a
"sw initialization only" that only depends on some other fields being
filled


> {
>-	struct xe_device *xe = gt_to_xe(gt);
>-
> 	if (xe->info.platform == XE_METEORLAKE) {
>-		/*
>-		 * SAMedia register offsets are adjusted by the write methods
>-		 * and they target registers that are not MCR, while for normal
>-		 * GT they are MCR
>-		 */
>-		if (xe_gt_is_media_type(gt))
>-			program_pat(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
>-		else
>-			program_pat_mcr(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
>+		xe->info.pat_table = mtl_pat_table;
>+		xe->info.pat_table_n_entries = ARRAY_SIZE(mtl_pat_table);
> 	} else if (xe->info.platform == XE_PVC || xe->info.platform == XE_DG2) {
>-		program_pat_mcr(gt, pvc_pat_table, ARRAY_SIZE(pvc_pat_table));
>+		xe->info.pat_table = pvc_pat_table;
>+		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
> 	} else if (GRAPHICS_VERx100(xe) <= 1210) {
>-		program_pat(gt, tgl_pat_table, ARRAY_SIZE(tgl_pat_table));
>+		xe->info.pat_table = tgl_pat_table;
>+		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
> 	} else {
> 		/*
> 		 * Going forward we expect to need new PAT settings for most
>@@ -135,7 +128,25 @@ void xe_pat_init(struct xe_gt *gt)
> 		 */
> 		drm_err(&xe->drm, "Missing PAT table for platform with graphics version %d.%2d!\n",
> 			GRAPHICS_VER(xe), GRAPHICS_VERx100(xe) % 100);
>+		return -ENODEV;

that would be an odd error. -ENOTSUPP maybe?


> 	}
>+
>+	return 0;
>+}
>+
>+void xe_pat_init(struct xe_gt *gt)
>+{
>+	struct xe_device *xe = gt_to_xe(gt);
>+
>+	/*
>+	 * SAMedia register offsets are adjusted by the write methods
>+	 * and they target registers that are not MCR, while for normal
>+	 * GT they are MCR.
>+	 */
>+	if (xe_gt_is_media_type(gt) || GRAPHICS_VERx100(xe) < 1255)
>+		program_pat(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
>+	else
>+		program_pat_mcr(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
> }
>
> void xe_pte_pat_init(struct xe_device *xe)
>diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
>index 54022f591621..9ab059758ad1 100644
>--- a/drivers/gpu/drm/xe/xe_pat.h
>+++ b/drivers/gpu/drm/xe/xe_pat.h
>@@ -26,8 +26,9 @@
> #define XELPG_PAT_WB_CACHE_1_WAY       3
>
> struct xe_gt;
>-extern struct xe_device *xe;
>+struct xe_device;

leftover from a previous patch? what was the base you used? I can't see
this on drm-xe-next.

Lucas De Marchi

>
>+int xe_pat_fill_info(struct xe_device *xe);
> void xe_pat_init(struct xe_gt *gt);
> void xe_pte_pat_init(struct xe_device *xe);
> unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
>diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>index 791107dec045..24f2021aae22 100644
>--- a/drivers/gpu/drm/xe/xe_pci.c
>+++ b/drivers/gpu/drm/xe/xe_pci.c
>@@ -22,6 +22,7 @@
> #include "xe_gt.h"
> #include "xe_macros.h"
> #include "xe_module.h"
>+#include "xe_pat.h"
> #include "xe_pci_types.h"
> #include "xe_pm.h"
> #include "xe_step.h"
>@@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
> 	struct xe_tile *tile;
> 	struct xe_gt *gt;
> 	u8 id;
>+	int err;
>
> 	xe->info.platform = desc->platform;
> 	xe->info.subplatform = subplatform_desc ?
>@@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
> 	xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
> 				  enable_display &&
> 				  desc->has_display;
>+
>+	err = xe_pat_fill_info(xe);
>+	if (err)
>+		return err;
> 	/*
> 	 * All platforms have at least one primary GT.  Any platform with media
> 	 * version 13 or higher has an additional dedicated media GT.  And
>-- 
>2.41.0
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 21:49   ` Lucas De Marchi
@ 2023-08-29 22:20     ` Matt Roper
  2023-08-30  9:34       ` Matthew Auld
  2023-08-30  9:43     ` Matthew Auld
  1 sibling, 1 reply; 47+ messages in thread
From: Matt Roper @ 2023-08-29 22:20 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: Matthew Auld, intel-xe

On Tue, Aug 29, 2023 at 02:49:59PM -0700, Lucas De Marchi wrote:
> On Tue, Aug 29, 2023 at 05:28:44PM +0100, Matthew Auld wrote:
> > We need to able to know the max pat_index range for a given platform, as
> > well being able to lookup the pat_index for a given platform in upcoming
> > vm_bind uapi, where userspace can directly provide the pat_index. Move
> > the platform definition of the pat_table into the device info with the
> > idea of encoding more information about each pat_index in a future
> > patch.
> > 
> > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > Cc: Matt Roper <matthew.d.roper@intel.com>
> > ---
...
> > diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
> > index 54022f591621..9ab059758ad1 100644
> > --- a/drivers/gpu/drm/xe/xe_pat.h
> > +++ b/drivers/gpu/drm/xe/xe_pat.h
> > @@ -26,8 +26,9 @@
> > #define XELPG_PAT_WB_CACHE_1_WAY       3
> > 
> > struct xe_gt;
> > -extern struct xe_device *xe;
> > +struct xe_device;
> 
> leftover from a previous patch? what was the base you used? I can't see
> this on drm-xe-next.

I think his series is based on top of the latest copy of Ravi's PTE
encoding series.


Matt

> 
> Lucas De Marchi
> 
> > 
> > +int xe_pat_fill_info(struct xe_device *xe);
> > void xe_pat_init(struct xe_gt *gt);
> > void xe_pte_pat_init(struct xe_device *xe);
> > unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
> > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > index 791107dec045..24f2021aae22 100644
> > --- a/drivers/gpu/drm/xe/xe_pci.c
> > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > @@ -22,6 +22,7 @@
> > #include "xe_gt.h"
> > #include "xe_macros.h"
> > #include "xe_module.h"
> > +#include "xe_pat.h"
> > #include "xe_pci_types.h"
> > #include "xe_pm.h"
> > #include "xe_step.h"
> > @@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
> > 	struct xe_tile *tile;
> > 	struct xe_gt *gt;
> > 	u8 id;
> > +	int err;
> > 
> > 	xe->info.platform = desc->platform;
> > 	xe->info.subplatform = subplatform_desc ?
> > @@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
> > 	xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
> > 				  enable_display &&
> > 				  desc->has_display;
> > +
> > +	err = xe_pat_fill_info(xe);
> > +	if (err)
> > +		return err;
> > 	/*
> > 	 * All platforms have at least one primary GT.  Any platform with media
> > 	 * version 13 or higher has an additional dedicated media GT.  And
> > -- 
> > 2.41.0
> > 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 22:20     ` Matt Roper
@ 2023-08-30  9:34       ` Matthew Auld
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-30  9:34 UTC (permalink / raw)
  To: Matt Roper, Lucas De Marchi; +Cc: intel-xe

On 29/08/2023 23:20, Matt Roper wrote:
> On Tue, Aug 29, 2023 at 02:49:59PM -0700, Lucas De Marchi wrote:
>> On Tue, Aug 29, 2023 at 05:28:44PM +0100, Matthew Auld wrote:
>>> We need to able to know the max pat_index range for a given platform, as
>>> well being able to lookup the pat_index for a given platform in upcoming
>>> vm_bind uapi, where userspace can directly provide the pat_index. Move
>>> the platform definition of the pat_table into the device info with the
>>> idea of encoding more information about each pat_index in a future
>>> patch.
>>>
>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>> ---
> ...
>>> diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
>>> index 54022f591621..9ab059758ad1 100644
>>> --- a/drivers/gpu/drm/xe/xe_pat.h
>>> +++ b/drivers/gpu/drm/xe/xe_pat.h
>>> @@ -26,8 +26,9 @@
>>> #define XELPG_PAT_WB_CACHE_1_WAY       3
>>>
>>> struct xe_gt;
>>> -extern struct xe_device *xe;
>>> +struct xe_device;
>>
>> leftover from a previous patch? what was the base you used? I can't see
>> this on drm-xe-next.
> 
> I think his series is based on top of the latest copy of Ravi's PTE
> encoding series.

Yes, series is based on top of that patch (mentioned somewhere in cover 
letter).

> 
> 
> Matt
> 
>>
>> Lucas De Marchi
>>
>>>
>>> +int xe_pat_fill_info(struct xe_device *xe);
>>> void xe_pat_init(struct xe_gt *gt);
>>> void xe_pte_pat_init(struct xe_device *xe);
>>> unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
>>> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>>> index 791107dec045..24f2021aae22 100644
>>> --- a/drivers/gpu/drm/xe/xe_pci.c
>>> +++ b/drivers/gpu/drm/xe/xe_pci.c
>>> @@ -22,6 +22,7 @@
>>> #include "xe_gt.h"
>>> #include "xe_macros.h"
>>> #include "xe_module.h"
>>> +#include "xe_pat.h"
>>> #include "xe_pci_types.h"
>>> #include "xe_pm.h"
>>> #include "xe_step.h"
>>> @@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
>>> 	struct xe_tile *tile;
>>> 	struct xe_gt *gt;
>>> 	u8 id;
>>> +	int err;
>>>
>>> 	xe->info.platform = desc->platform;
>>> 	xe->info.subplatform = subplatform_desc ?
>>> @@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
>>> 	xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
>>> 				  enable_display &&
>>> 				  desc->has_display;
>>> +
>>> +	err = xe_pat_fill_info(xe);
>>> +	if (err)
>>> +		return err;
>>> 	/*
>>> 	 * All platforms have at least one primary GT.  Any platform with media
>>> 	 * version 13 or higher has an additional dedicated media GT.  And
>>> -- 
>>> 2.41.0
>>>
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 21:49   ` Lucas De Marchi
  2023-08-29 22:20     ` Matt Roper
@ 2023-08-30  9:43     ` Matthew Auld
  1 sibling, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-30  9:43 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: Matt Roper, intel-xe

On 29/08/2023 22:49, Lucas De Marchi wrote:
> On Tue, Aug 29, 2023 at 05:28:44PM +0100, Matthew Auld wrote:
>> We need to able to know the max pat_index range for a given platform, as
>> well being able to lookup the pat_index for a given platform in upcoming
>> vm_bind uapi, where userspace can directly provide the pat_index. Move
>> the platform definition of the pat_table into the device info with the
>> idea of encoding more information about each pat_index in a future
>> patch.
>>
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_device_types.h |  3 +++
>> drivers/gpu/drm/xe/xe_pat.c          | 39 ++++++++++++++++++----------
>> drivers/gpu/drm/xe/xe_pat.h          |  3 ++-
>> drivers/gpu/drm/xe/xe_pci.c          |  6 +++++
>> 4 files changed, 36 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
>> b/drivers/gpu/drm/xe/xe_device_types.h
>> index 5037b8c180b8..06235da647bb 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -238,6 +238,9 @@ struct xe_device {
>>         /** @enable_display: display enabled */
>>         u8 enable_display:1;
>>
>> +        const u32 *pat_table;
>> +        int pat_table_n_entries;
> 
> as it's expected to have this "owned" by xe_pat, I'd
> use a new substruct "pat" rather than abusing the info.
> 
>      struct {
>          const u32 *table;
>          unsigned int n_entries;
>      } pat;
> 
> Then we can add more fields as we need them (example for annotating the
> idx we want to use for kernel ops)

Ok. That sounds better.

> 
>> +
>> #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>>         const struct intel_display_device_info *display;
>>         struct intel_display_runtime_info display_runtime;
>> diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
>> index a468655db956..f19f5d8dcd94 100644
>> --- a/drivers/gpu/drm/xe/xe_pat.c
>> +++ b/drivers/gpu/drm/xe/xe_pat.c
>> @@ -106,24 +106,17 @@ static void program_pat_mcr(struct xe_gt *gt, 
>> const u32 table[], int n_entries)
>>     }
>> }
>>
>> -void xe_pat_init(struct xe_gt *gt)
>> +int xe_pat_fill_info(struct xe_device *xe)
> 
> xe_pat_init_early() to follow what is done in other places as being a
> "sw initialization only" that only depends on some other fields being
> filled

Ok, will fix.

> 
> 
>> {
>> -    struct xe_device *xe = gt_to_xe(gt);
>> -
>>     if (xe->info.platform == XE_METEORLAKE) {
>> -        /*
>> -         * SAMedia register offsets are adjusted by the write methods
>> -         * and they target registers that are not MCR, while for normal
>> -         * GT they are MCR
>> -         */
>> -        if (xe_gt_is_media_type(gt))
>> -            program_pat(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
>> -        else
>> -            program_pat_mcr(gt, mtl_pat_table, 
>> ARRAY_SIZE(mtl_pat_table));
>> +        xe->info.pat_table = mtl_pat_table;
>> +        xe->info.pat_table_n_entries = ARRAY_SIZE(mtl_pat_table);
>>     } else if (xe->info.platform == XE_PVC || xe->info.platform == 
>> XE_DG2) {
>> -        program_pat_mcr(gt, pvc_pat_table, ARRAY_SIZE(pvc_pat_table));
>> +        xe->info.pat_table = pvc_pat_table;
>> +        xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
>>     } else if (GRAPHICS_VERx100(xe) <= 1210) {
>> -        program_pat(gt, tgl_pat_table, ARRAY_SIZE(tgl_pat_table));
>> +        xe->info.pat_table = tgl_pat_table;
>> +        xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
>>     } else {
>>         /*
>>          * Going forward we expect to need new PAT settings for most
>> @@ -135,7 +128,25 @@ void xe_pat_init(struct xe_gt *gt)
>>          */
>>         drm_err(&xe->drm, "Missing PAT table for platform with 
>> graphics version %d.%2d!\n",
>>             GRAPHICS_VER(xe), GRAPHICS_VERx100(xe) % 100);
>> +        return -ENODEV;
> 
> that would be an odd error. -ENOTSUPP maybe?

I just take it to mean that something about the device is missing/broken 
(I'm pretty sure we have used it like that before). ENOTSUPP would also 
work. Will change.

> 
> 
>>     }
>> +
>> +    return 0;
>> +}
>> +
>> +void xe_pat_init(struct xe_gt *gt)
>> +{
>> +    struct xe_device *xe = gt_to_xe(gt);
>> +
>> +    /*
>> +     * SAMedia register offsets are adjusted by the write methods
>> +     * and they target registers that are not MCR, while for normal
>> +     * GT they are MCR.
>> +     */
>> +    if (xe_gt_is_media_type(gt) || GRAPHICS_VERx100(xe) < 1255)
>> +        program_pat(gt, xe->info.pat_table, 
>> xe->info.pat_table_n_entries);
>> +    else
>> +        program_pat_mcr(gt, xe->info.pat_table, 
>> xe->info.pat_table_n_entries);
>> }
>>
>> void xe_pte_pat_init(struct xe_device *xe)
>> diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
>> index 54022f591621..9ab059758ad1 100644
>> --- a/drivers/gpu/drm/xe/xe_pat.h
>> +++ b/drivers/gpu/drm/xe/xe_pat.h
>> @@ -26,8 +26,9 @@
>> #define XELPG_PAT_WB_CACHE_1_WAY       3
>>
>> struct xe_gt;
>> -extern struct xe_device *xe;
>> +struct xe_device;
> 
> leftover from a previous patch? what was the base you used? I can't see
> this on drm-xe-next.
> 
> Lucas De Marchi
> 
>>
>> +int xe_pat_fill_info(struct xe_device *xe);
>> void xe_pat_init(struct xe_gt *gt);
>> void xe_pte_pat_init(struct xe_device *xe);
>> unsigned int xe_pat_get_index(struct xe_device *xe, enum 
>> xe_cache_level cache);
>> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>> index 791107dec045..24f2021aae22 100644
>> --- a/drivers/gpu/drm/xe/xe_pci.c
>> +++ b/drivers/gpu/drm/xe/xe_pci.c
>> @@ -22,6 +22,7 @@
>> #include "xe_gt.h"
>> #include "xe_macros.h"
>> #include "xe_module.h"
>> +#include "xe_pat.h"
>> #include "xe_pci_types.h"
>> #include "xe_pm.h"
>> #include "xe_step.h"
>> @@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
>>     struct xe_tile *tile;
>>     struct xe_gt *gt;
>>     u8 id;
>> +    int err;
>>
>>     xe->info.platform = desc->platform;
>>     xe->info.subplatform = subplatform_desc ?
>> @@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
>>     xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
>>                   enable_display &&
>>                   desc->has_display;
>> +
>> +    err = xe_pat_fill_info(xe);
>> +    if (err)
>> +        return err;
>>     /*
>>      * All platforms have at least one primary GT.  Any platform with 
>> media
>>      * version 13 or higher has an additional dedicated media GT.  And
>> -- 
>> 2.41.0
>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 16:28 ` [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info Matthew Auld
  2023-08-29 19:14   ` Matt Roper
  2023-08-29 21:49   ` Lucas De Marchi
@ 2023-08-30  5:14   ` Mishra, Pallavi
  2023-09-05 20:50   ` Souza, Jose
  3 siblings, 0 replies; 47+ messages in thread
From: Mishra, Pallavi @ 2023-08-30  5:14 UTC (permalink / raw)
  To: Auld, Matthew
  Cc: Roper, Matthew D, De Marchi, Lucas,
	intel-xe@lists.freedesktop.org



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Tuesday, August 29, 2023 9:29 AM
> To: intel-xe@lists.freedesktop.org
> Cc: Mishra, Pallavi <pallavi.mishra@intel.com>; De Marchi, Lucas
> <lucas.demarchi@intel.com>; Roper, Matthew D
> <matthew.d.roper@intel.com>
> Subject: [RFC 3/5] drm/xe: move pat_table into device info
> 
> We need to able to know the max pat_index range for a given platform, as well
> being able to lookup the pat_index for a given platform in upcoming vm_bind
> uapi, where userspace can directly provide the pat_index. Move the platform
> definition of the pat_table into the device info with the idea of encoding more
> information about each pat_index in a future patch.
> 
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>

Reviewed-by: Pallavi Mishra <pallavi.mishra@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_device_types.h |  3 +++
>  drivers/gpu/drm/xe/xe_pat.c          | 39 ++++++++++++++++++----------
>  drivers/gpu/drm/xe/xe_pat.h          |  3 ++-
>  drivers/gpu/drm/xe/xe_pci.c          |  6 +++++
>  4 files changed, 36 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> index 5037b8c180b8..06235da647bb 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -238,6 +238,9 @@ struct xe_device {
>  		/** @enable_display: display enabled */
>  		u8 enable_display:1;
> 
> +		const u32 *pat_table;
> +		int pat_table_n_entries;
> +
>  #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>  		const struct intel_display_device_info *display;
>  		struct intel_display_runtime_info display_runtime; diff --git
> a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c index
> a468655db956..f19f5d8dcd94 100644
> --- a/drivers/gpu/drm/xe/xe_pat.c
> +++ b/drivers/gpu/drm/xe/xe_pat.c
> @@ -106,24 +106,17 @@ static void program_pat_mcr(struct xe_gt *gt,
> const u32 table[], int n_entries)
>  	}
>  }
> 
> -void xe_pat_init(struct xe_gt *gt)
> +int xe_pat_fill_info(struct xe_device *xe)
>  {
> -	struct xe_device *xe = gt_to_xe(gt);
> -
>  	if (xe->info.platform == XE_METEORLAKE) {
> -		/*
> -		 * SAMedia register offsets are adjusted by the write methods
> -		 * and they target registers that are not MCR, while for normal
> -		 * GT they are MCR
> -		 */
> -		if (xe_gt_is_media_type(gt))
> -			program_pat(gt, mtl_pat_table,
> ARRAY_SIZE(mtl_pat_table));
> -		else
> -			program_pat_mcr(gt, mtl_pat_table,
> ARRAY_SIZE(mtl_pat_table));
> +		xe->info.pat_table = mtl_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(mtl_pat_table);
>  	} else if (xe->info.platform == XE_PVC || xe->info.platform == XE_DG2)
> {
> -		program_pat_mcr(gt, pvc_pat_table,
> ARRAY_SIZE(pvc_pat_table));
> +		xe->info.pat_table = pvc_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
>  	} else if (GRAPHICS_VERx100(xe) <= 1210) {
> -		program_pat(gt, tgl_pat_table, ARRAY_SIZE(tgl_pat_table));
> +		xe->info.pat_table = tgl_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
>  	} else {
>  		/*
>  		 * Going forward we expect to need new PAT settings for most
> @@ -135,7 +128,25 @@ void xe_pat_init(struct xe_gt *gt)
>  		 */
>  		drm_err(&xe->drm, "Missing PAT table for platform with
> graphics version %d.%2d!\n",
>  			GRAPHICS_VER(xe), GRAPHICS_VERx100(xe) % 100);
> +		return -ENODEV;
>  	}
> +
> +	return 0;
> +}
> +
> +void xe_pat_init(struct xe_gt *gt)
> +{
> +	struct xe_device *xe = gt_to_xe(gt);
> +
> +	/*
> +	 * SAMedia register offsets are adjusted by the write methods
> +	 * and they target registers that are not MCR, while for normal
> +	 * GT they are MCR.
> +	 */
> +	if (xe_gt_is_media_type(gt) || GRAPHICS_VERx100(xe) < 1255)
> +		program_pat(gt, xe->info.pat_table, xe-
> >info.pat_table_n_entries);
> +	else
> +		program_pat_mcr(gt, xe->info.pat_table,
> +xe->info.pat_table_n_entries);
>  }
> 
>  void xe_pte_pat_init(struct xe_device *xe) diff --git
> a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h index
> 54022f591621..9ab059758ad1 100644
> --- a/drivers/gpu/drm/xe/xe_pat.h
> +++ b/drivers/gpu/drm/xe/xe_pat.h
> @@ -26,8 +26,9 @@
>  #define XELPG_PAT_WB_CACHE_1_WAY       3
> 
>  struct xe_gt;
> -extern struct xe_device *xe;
> +struct xe_device;
> 
> +int xe_pat_fill_info(struct xe_device *xe);
>  void xe_pat_init(struct xe_gt *gt);
>  void xe_pte_pat_init(struct xe_device *xe);  unsigned int
> xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache); diff --git
> a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index
> 791107dec045..24f2021aae22 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -22,6 +22,7 @@
>  #include "xe_gt.h"
>  #include "xe_macros.h"
>  #include "xe_module.h"
> +#include "xe_pat.h"
>  #include "xe_pci_types.h"
>  #include "xe_pm.h"
>  #include "xe_step.h"
> @@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
>  	struct xe_tile *tile;
>  	struct xe_gt *gt;
>  	u8 id;
> +	int err;
> 
>  	xe->info.platform = desc->platform;
>  	xe->info.subplatform = subplatform_desc ?
> @@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
>  	xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> &&
>  				  enable_display &&
>  				  desc->has_display;
> +
> +	err = xe_pat_fill_info(xe);
> +	if (err)
> +		return err;
>  	/*
>  	 * All platforms have at least one primary GT.  Any platform with media
>  	 * version 13 or higher has an additional dedicated media GT.  And
> --
> 2.41.0
 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info
  2023-08-29 16:28 ` [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info Matthew Auld
                     ` (2 preceding siblings ...)
  2023-08-30  5:14   ` Mishra, Pallavi
@ 2023-09-05 20:50   ` Souza, Jose
  3 siblings, 0 replies; 47+ messages in thread
From: Souza, Jose @ 2023-09-05 20:50 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Auld,  Matthew
  Cc: De Marchi, Lucas, Roper, Matthew D

On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> We need to able to know the max pat_index range for a given platform, as
> well being able to lookup the pat_index for a given platform in upcoming
> vm_bind uapi, where userspace can directly provide the pat_index. Move
> the platform definition of the pat_table into the device info with the
> idea of encoding more information about each pat_index in a future
> patch.
> 
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device_types.h |  3 +++
>  drivers/gpu/drm/xe/xe_pat.c          | 39 ++++++++++++++++++----------
>  drivers/gpu/drm/xe/xe_pat.h          |  3 ++-
>  drivers/gpu/drm/xe/xe_pci.c          |  6 +++++
>  4 files changed, 36 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 5037b8c180b8..06235da647bb 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -238,6 +238,9 @@ struct xe_device {
>  		/** @enable_display: display enabled */
>  		u8 enable_display:1;
>  
> +		const u32 *pat_table;
> +		int pat_table_n_entries;
> +
>  #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>  		const struct intel_display_device_info *display;
>  		struct intel_display_runtime_info display_runtime;
> diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
> index a468655db956..f19f5d8dcd94 100644
> --- a/drivers/gpu/drm/xe/xe_pat.c
> +++ b/drivers/gpu/drm/xe/xe_pat.c
> @@ -106,24 +106,17 @@ static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
>  	}
>  }
>  
> -void xe_pat_init(struct xe_gt *gt)
> +int xe_pat_fill_info(struct xe_device *xe)
>  {
> -	struct xe_device *xe = gt_to_xe(gt);
> -
>  	if (xe->info.platform == XE_METEORLAKE) {
> -		/*
> -		 * SAMedia register offsets are adjusted by the write methods
> -		 * and they target registers that are not MCR, while for normal
> -		 * GT they are MCR
> -		 */
> -		if (xe_gt_is_media_type(gt))
> -			program_pat(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
> -		else
> -			program_pat_mcr(gt, mtl_pat_table, ARRAY_SIZE(mtl_pat_table));
> +		xe->info.pat_table = mtl_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(mtl_pat_table);
>  	} else if (xe->info.platform == XE_PVC || xe->info.platform == XE_DG2) {
> -		program_pat_mcr(gt, pvc_pat_table, ARRAY_SIZE(pvc_pat_table));
> +		xe->info.pat_table = pvc_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);

I guess DG2 should use PVC entries with CLOS0?

What about DG1? TGL table is based around LLC.

>  	} else if (GRAPHICS_VERx100(xe) <= 1210) {
> -		program_pat(gt, tgl_pat_table, ARRAY_SIZE(tgl_pat_table));
> +		xe->info.pat_table = tgl_pat_table;
> +		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
>  	} else {
>  		/*
>  		 * Going forward we expect to need new PAT settings for most
> @@ -135,7 +128,25 @@ void xe_pat_init(struct xe_gt *gt)
>  		 */
>  		drm_err(&xe->drm, "Missing PAT table for platform with graphics version %d.%2d!\n",
>  			GRAPHICS_VER(xe), GRAPHICS_VERx100(xe) % 100);
> +		return -ENODEV;
>  	}
> +
> +	return 0;
> +}
> +
> +void xe_pat_init(struct xe_gt *gt)
> +{
> +	struct xe_device *xe = gt_to_xe(gt);
> +
> +	/*
> +	 * SAMedia register offsets are adjusted by the write methods
> +	 * and they target registers that are not MCR, while for normal
> +	 * GT they are MCR.
> +	 */
> +	if (xe_gt_is_media_type(gt) || GRAPHICS_VERx100(xe) < 1255)
> +		program_pat(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
> +	else
> +		program_pat_mcr(gt, xe->info.pat_table, xe->info.pat_table_n_entries);
>  }
>  
>  void xe_pte_pat_init(struct xe_device *xe)
> diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
> index 54022f591621..9ab059758ad1 100644
> --- a/drivers/gpu/drm/xe/xe_pat.h
> +++ b/drivers/gpu/drm/xe/xe_pat.h
> @@ -26,8 +26,9 @@
>  #define XELPG_PAT_WB_CACHE_1_WAY       3
>  
>  struct xe_gt;
> -extern struct xe_device *xe;
> +struct xe_device;
>  
> +int xe_pat_fill_info(struct xe_device *xe);
>  void xe_pat_init(struct xe_gt *gt);
>  void xe_pte_pat_init(struct xe_device *xe);
>  unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index 791107dec045..24f2021aae22 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -22,6 +22,7 @@
>  #include "xe_gt.h"
>  #include "xe_macros.h"
>  #include "xe_module.h"
> +#include "xe_pat.h"
>  #include "xe_pci_types.h"
>  #include "xe_pm.h"
>  #include "xe_step.h"
> @@ -553,6 +554,7 @@ static int xe_info_init(struct xe_device *xe,
>  	struct xe_tile *tile;
>  	struct xe_gt *gt;
>  	u8 id;
> +	int err;
>  
>  	xe->info.platform = desc->platform;
>  	xe->info.subplatform = subplatform_desc ?
> @@ -601,6 +603,10 @@ static int xe_info_init(struct xe_device *xe,
>  	xe->info.enable_display = IS_ENABLED(CONFIG_DRM_XE_DISPLAY) &&
>  				  enable_display &&
>  				  desc->has_display;
> +
> +	err = xe_pat_fill_info(xe);
> +	if (err)
> +		return err;
>  	/*
>  	 * All platforms have at least one primary GT.  Any platform with media
>  	 * version 13 or higher has an additional dedicated media GT.  And


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode
  2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
                   ` (2 preceding siblings ...)
  2023-08-29 16:28 ` [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info Matthew Auld
@ 2023-08-29 16:28 ` Matthew Auld
  2023-08-29 21:08   ` Matt Roper
  2023-08-29 22:02   ` Lucas De Marchi
  2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-29 16:28 UTC (permalink / raw)
  To: intel-xe
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, Matt Roper

Future uapi needs to give userspace the ability to select the pat_index
for a given vm_bind. However we need to be able to extract the coherency
mode from the provided pat_index to ensure it matches the coherency mode
set at object creation. There are various security reasons for why this
matters.  However the pat_index itself is very platform specific, so
seems reasonable to annotate each platform definition of the pat table.
On some older platforms there is no explicit coherency mode, so we just
pick whatever makes sense.

Bspec: 45101, 44235 #xe
Bspec: 70552, 71582, 59400 #xe2
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Pallavi Mishra <pallavi.mishra@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Filip Hazubski <filip.hazubski@intel.com>
Cc: Carl Zhang <carl.zhang@intel.com>
Cc: Effie Yu <effie.yu@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h |  2 +-
 drivers/gpu/drm/xe/xe_pat.c          | 67 ++++++++++++++++------------
 drivers/gpu/drm/xe/xe_pat.h          |  6 +++
 3 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 06235da647bb..53520ae30b33 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -238,7 +238,7 @@ struct xe_device {
 		/** @enable_display: display enabled */
 		u8 enable_display:1;
 
-		const u32 *pat_table;
+		const struct xe_pat_table_entry *pat_table;
 		int pat_table_n_entries;
 
 #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
index f19f5d8dcd94..9e72c1b4b41f 100644
--- a/drivers/gpu/drm/xe/xe_pat.c
+++ b/drivers/gpu/drm/xe/xe_pat.c
@@ -4,6 +4,8 @@
  */
 
 
+#include <drm/xe_drm.h>
+
 #include "regs/xe_reg_defs.h"
 #include "xe_gt.h"
 #include "xe_gt_mcr.h"
@@ -33,34 +35,34 @@
 #define TGL_PAT_WC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 1)
 #define TGL_PAT_UC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 0)
 
-static const u32 tgl_pat_table[] = {
-	[0] = TGL_PAT_WB,
-	[1] = TGL_PAT_WC,
-	[2] = TGL_PAT_WT,
-	[3] = TGL_PAT_UC,
-	[4] = TGL_PAT_WB,
-	[5] = TGL_PAT_WB,
-	[6] = TGL_PAT_WB,
-	[7] = TGL_PAT_WB,
+static const struct xe_pat_table_entry tgl_pat_table[] = {
+	[0] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
+	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
+	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
+	[3] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
+	[4] = { TGL_PAT_WB }, /* zero coh_mode to indicate invalid from userspace */
+	[5] = { TGL_PAT_WB },
+	[6] = { TGL_PAT_WB },
+	[7] = { TGL_PAT_WB },
 };
 
-static const u32 pvc_pat_table[] = {
-	[0] = TGL_PAT_UC,
-	[1] = TGL_PAT_WC,
-	[2] = TGL_PAT_WT,
-	[3] = TGL_PAT_WB,
-	[4] = PVC_PAT_CLOS(1) | TGL_PAT_WT,
-	[5] = PVC_PAT_CLOS(1) | TGL_PAT_WB,
-	[6] = PVC_PAT_CLOS(2) | TGL_PAT_WT,
-	[7] = PVC_PAT_CLOS(2) | TGL_PAT_WB,
+static const struct xe_pat_table_entry pvc_pat_table[] = {
+	[0] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
+	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
+	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
+	[3] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
+	[4] = { PVC_PAT_CLOS(1) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
+	[5] = { PVC_PAT_CLOS(1) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
+	[6] = { PVC_PAT_CLOS(2) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
+	[7] = { PVC_PAT_CLOS(2) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
 };
 
-static const u32 mtl_pat_table[] = {
-	[0] = MTL_PAT_0_WB,
-	[1] = MTL_PAT_1_WT,
-	[2] = MTL_PAT_3_UC,
-	[3] = MTL_PAT_0_WB | MTL_2_COH_1W,
-	[4] = MTL_PAT_0_WB | MTL_3_COH_2W,
+static const struct xe_pat_table_entry mtl_pat_table[] = {
+	[0] = { MTL_PAT_0_WB, XE_GEM_COHERENCY_NONE },
+	[1] = { MTL_PAT_1_WT, XE_GEM_COHERENCY_NONE },
+	[2] = { MTL_PAT_3_UC, XE_GEM_COHERENCY_NONE },
+	[3] = { MTL_PAT_0_WB | MTL_2_COH_1W, XE_GEM_COHERENCY_1WAY },
+	[4] = { MTL_PAT_0_WB | MTL_3_COH_2W, XE_GEM_COHERENCY_2WAY },
 };
 
 static const u32 xelp_pte_pat_table[XE_CACHE_LAST] = {
@@ -82,27 +84,35 @@ static const u32 xelpg_pte_pat_table[XE_CACHE_LAST] = {
 	[XE_CACHE_WB_1_WAY] = XELPG_PAT_WB_CACHE_1_WAY,
 };
 
+u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index)
+{
+	WARN_ON(pat_index >= xe->info.pat_table_n_entries);
+	return xe->info.pat_table[pat_index].coh_mode;
+}
+
 unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache)
 {
 	WARN_ON(cache >= XE_CACHE_LAST);
 	return (xe->pat_table).pte_pat_table[cache];
 }
 
-static void program_pat(struct xe_gt *gt, const u32 table[], int n_entries)
+static void program_pat(struct xe_gt *gt, const struct xe_pat_table_entry table[],
+			int n_entries)
 {
 	for (int i = 0; i < n_entries; i++) {
 		struct xe_reg reg = XE_REG(_PAT_INDEX(i));
 
-		xe_mmio_write32(gt, reg, table[i]);
+		xe_mmio_write32(gt, reg, table[i].value);
 	}
 }
 
-static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
+static void program_pat_mcr(struct xe_gt *gt, const struct xe_pat_table_entry table[],
+			    int n_entries)
 {
 	for (int i = 0; i < n_entries; i++) {
 		struct xe_reg_mcr reg_mcr = XE_REG_MCR(_PAT_INDEX(i));
 
-		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i]);
+		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i].value);
 	}
 }
 
@@ -115,6 +125,7 @@ int xe_pat_fill_info(struct xe_device *xe)
 		xe->info.pat_table = pvc_pat_table;
 		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
 	} else if (GRAPHICS_VERx100(xe) <= 1210) {
+		WARN_ON_ONCE(!IS_DGFX(xe) && !xe->info.has_llc);
 		xe->info.pat_table = tgl_pat_table;
 		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
 	} else {
diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
index 9ab059758ad1..e2fabde2c730 100644
--- a/drivers/gpu/drm/xe/xe_pat.h
+++ b/drivers/gpu/drm/xe/xe_pat.h
@@ -28,9 +28,15 @@
 struct xe_gt;
 struct xe_device;
 
+struct xe_pat_table_entry {
+	u32 value;
+	u16 coh_mode;
+};
+
 int xe_pat_fill_info(struct xe_device *xe);
 void xe_pat_init(struct xe_gt *gt);
 void xe_pte_pat_init(struct xe_device *xe);
 unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
+u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index);
 
 #endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode
  2023-08-29 16:28 ` [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode Matthew Auld
@ 2023-08-29 21:08   ` Matt Roper
  2023-08-30  9:32     ` Matthew Auld
  2023-08-29 22:02   ` Lucas De Marchi
  1 sibling, 1 reply; 47+ messages in thread
From: Matt Roper @ 2023-08-29 21:08 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe

On Tue, Aug 29, 2023 at 05:28:45PM +0100, Matthew Auld wrote:
> Future uapi needs to give userspace the ability to select the pat_index
> for a given vm_bind. However we need to be able to extract the coherency
> mode from the provided pat_index to ensure it matches the coherency mode
> set at object creation. There are various security reasons for why this
> matters.  However the pat_index itself is very platform specific, so
> seems reasonable to annotate each platform definition of the pat table.
> On some older platforms there is no explicit coherency mode, so we just
> pick whatever makes sense.
> 
> Bspec: 45101, 44235 #xe
> Bspec: 70552, 71582, 59400 #xe2
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device_types.h |  2 +-
>  drivers/gpu/drm/xe/xe_pat.c          | 67 ++++++++++++++++------------
>  drivers/gpu/drm/xe/xe_pat.h          |  6 +++
>  3 files changed, 46 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 06235da647bb..53520ae30b33 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -238,7 +238,7 @@ struct xe_device {
>  		/** @enable_display: display enabled */
>  		u8 enable_display:1;
>  
> -		const u32 *pat_table;
> +		const struct xe_pat_table_entry *pat_table;
>  		int pat_table_n_entries;
>  
>  #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
> index f19f5d8dcd94..9e72c1b4b41f 100644
> --- a/drivers/gpu/drm/xe/xe_pat.c
> +++ b/drivers/gpu/drm/xe/xe_pat.c
> @@ -4,6 +4,8 @@
>   */
>  
>  
> +#include <drm/xe_drm.h>
> +
>  #include "regs/xe_reg_defs.h"
>  #include "xe_gt.h"
>  #include "xe_gt_mcr.h"
> @@ -33,34 +35,34 @@
>  #define TGL_PAT_WC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 1)
>  #define TGL_PAT_UC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 0)
>  
> -static const u32 tgl_pat_table[] = {
> -	[0] = TGL_PAT_WB,
> -	[1] = TGL_PAT_WC,
> -	[2] = TGL_PAT_WT,
> -	[3] = TGL_PAT_UC,
> -	[4] = TGL_PAT_WB,
> -	[5] = TGL_PAT_WB,
> -	[6] = TGL_PAT_WB,
> -	[7] = TGL_PAT_WB,
> +static const struct xe_pat_table_entry tgl_pat_table[] = {
> +	[0] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> +	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
> +	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> +	[3] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
> +	[4] = { TGL_PAT_WB }, /* zero coh_mode to indicate invalid from userspace */
> +	[5] = { TGL_PAT_WB },
> +	[6] = { TGL_PAT_WB },
> +	[7] = { TGL_PAT_WB },

Should we just not even include 4-7?  That's basically the approach
we've taken with MTL (which has 32 entries in hardware, but we only
provide a table of the first 5).

>  };
>  
> -static const u32 pvc_pat_table[] = {
> -	[0] = TGL_PAT_UC,
> -	[1] = TGL_PAT_WC,
> -	[2] = TGL_PAT_WT,
> -	[3] = TGL_PAT_WB,
> -	[4] = PVC_PAT_CLOS(1) | TGL_PAT_WT,
> -	[5] = PVC_PAT_CLOS(1) | TGL_PAT_WB,
> -	[6] = PVC_PAT_CLOS(2) | TGL_PAT_WT,
> -	[7] = PVC_PAT_CLOS(2) | TGL_PAT_WB,
> +static const struct xe_pat_table_entry pvc_pat_table[] = {
> +	[0] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
> +	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
> +	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> +	[3] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },

Is 2-way correct here?  Although GPU reads snoop the CPU cache and GPU
writes invalidate the CPU cache, I don't think the reverse is true (CPU
operations don't automatically snoop/invalidate the GPU cache).  So
wouldn't this be just 1-way coherent?

By the same logic, once we add in coherency the TGL table above probably
isn't correct anymore for DG1/DG2 where we only have PCI snooping
instead of a truly shared LLC cache.

> +	[4] = { PVC_PAT_CLOS(1) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> +	[5] = { PVC_PAT_CLOS(1) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> +	[6] = { PVC_PAT_CLOS(2) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> +	[7] = { PVC_PAT_CLOS(2) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
>  };
>  
> -static const u32 mtl_pat_table[] = {
> -	[0] = MTL_PAT_0_WB,
> -	[1] = MTL_PAT_1_WT,
> -	[2] = MTL_PAT_3_UC,
> -	[3] = MTL_PAT_0_WB | MTL_2_COH_1W,
> -	[4] = MTL_PAT_0_WB | MTL_3_COH_2W,
> +static const struct xe_pat_table_entry mtl_pat_table[] = {
> +	[0] = { MTL_PAT_0_WB, XE_GEM_COHERENCY_NONE },
> +	[1] = { MTL_PAT_1_WT, XE_GEM_COHERENCY_NONE },
> +	[2] = { MTL_PAT_3_UC, XE_GEM_COHERENCY_NONE },
> +	[3] = { MTL_PAT_0_WB | MTL_2_COH_1W, XE_GEM_COHERENCY_1WAY },
> +	[4] = { MTL_PAT_0_WB | MTL_3_COH_2W, XE_GEM_COHERENCY_2WAY },

Although this is labelled "2 way" in the bspec, the reality is that it's
a bit of a lie.  From bspec 63884:

        "2-way Coherent (GPU <-> CPU Snooping) honored only for atomics,
        else 1-way coh"

and also:

        "Except for system atomics, setting Coherency Mode to 10 or 11
        results in this same one-way coherenct behavior.  Full CPU-GPU
        coherency is maintained for system atomics with Coherency Mode =
        11, with no caching."


Matt

>  };
>  
>  static const u32 xelp_pte_pat_table[XE_CACHE_LAST] = {
> @@ -82,27 +84,35 @@ static const u32 xelpg_pte_pat_table[XE_CACHE_LAST] = {
>  	[XE_CACHE_WB_1_WAY] = XELPG_PAT_WB_CACHE_1_WAY,
>  };
>  
> +u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index)
> +{
> +	WARN_ON(pat_index >= xe->info.pat_table_n_entries);
> +	return xe->info.pat_table[pat_index].coh_mode;
> +}
> +
>  unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache)
>  {
>  	WARN_ON(cache >= XE_CACHE_LAST);
>  	return (xe->pat_table).pte_pat_table[cache];
>  }
>  
> -static void program_pat(struct xe_gt *gt, const u32 table[], int n_entries)
> +static void program_pat(struct xe_gt *gt, const struct xe_pat_table_entry table[],
> +			int n_entries)
>  {
>  	for (int i = 0; i < n_entries; i++) {
>  		struct xe_reg reg = XE_REG(_PAT_INDEX(i));
>  
> -		xe_mmio_write32(gt, reg, table[i]);
> +		xe_mmio_write32(gt, reg, table[i].value);
>  	}
>  }
>  
> -static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
> +static void program_pat_mcr(struct xe_gt *gt, const struct xe_pat_table_entry table[],
> +			    int n_entries)
>  {
>  	for (int i = 0; i < n_entries; i++) {
>  		struct xe_reg_mcr reg_mcr = XE_REG_MCR(_PAT_INDEX(i));
>  
> -		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i]);
> +		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i].value);
>  	}
>  }
>  
> @@ -115,6 +125,7 @@ int xe_pat_fill_info(struct xe_device *xe)
>  		xe->info.pat_table = pvc_pat_table;
>  		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
>  	} else if (GRAPHICS_VERx100(xe) <= 1210) {
> +		WARN_ON_ONCE(!IS_DGFX(xe) && !xe->info.has_llc);
>  		xe->info.pat_table = tgl_pat_table;
>  		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
>  	} else {
> diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
> index 9ab059758ad1..e2fabde2c730 100644
> --- a/drivers/gpu/drm/xe/xe_pat.h
> +++ b/drivers/gpu/drm/xe/xe_pat.h
> @@ -28,9 +28,15 @@
>  struct xe_gt;
>  struct xe_device;
>  
> +struct xe_pat_table_entry {
> +	u32 value;
> +	u16 coh_mode;
> +};
> +
>  int xe_pat_fill_info(struct xe_device *xe);
>  void xe_pat_init(struct xe_gt *gt);
>  void xe_pte_pat_init(struct xe_device *xe);
>  unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
> +u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index);
>  
>  #endif
> -- 
> 2.41.0
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode
  2023-08-29 21:08   ` Matt Roper
@ 2023-08-30  9:32     ` Matthew Auld
  2023-08-30 19:40       ` Matt Roper
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-08-30  9:32 UTC (permalink / raw)
  To: Matt Roper
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe

On 29/08/2023 22:08, Matt Roper wrote:
> On Tue, Aug 29, 2023 at 05:28:45PM +0100, Matthew Auld wrote:
>> Future uapi needs to give userspace the ability to select the pat_index
>> for a given vm_bind. However we need to be able to extract the coherency
>> mode from the provided pat_index to ensure it matches the coherency mode
>> set at object creation. There are various security reasons for why this
>> matters.  However the pat_index itself is very platform specific, so
>> seems reasonable to annotate each platform definition of the pat table.
>> On some older platforms there is no explicit coherency mode, so we just
>> pick whatever makes sense.
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device_types.h |  2 +-
>>   drivers/gpu/drm/xe/xe_pat.c          | 67 ++++++++++++++++------------
>>   drivers/gpu/drm/xe/xe_pat.h          |  6 +++
>>   3 files changed, 46 insertions(+), 29 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index 06235da647bb..53520ae30b33 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -238,7 +238,7 @@ struct xe_device {
>>   		/** @enable_display: display enabled */
>>   		u8 enable_display:1;
>>   
>> -		const u32 *pat_table;
>> +		const struct xe_pat_table_entry *pat_table;
>>   		int pat_table_n_entries;
>>   
>>   #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>> diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
>> index f19f5d8dcd94..9e72c1b4b41f 100644
>> --- a/drivers/gpu/drm/xe/xe_pat.c
>> +++ b/drivers/gpu/drm/xe/xe_pat.c
>> @@ -4,6 +4,8 @@
>>    */
>>   
>>   
>> +#include <drm/xe_drm.h>
>> +
>>   #include "regs/xe_reg_defs.h"
>>   #include "xe_gt.h"
>>   #include "xe_gt_mcr.h"
>> @@ -33,34 +35,34 @@
>>   #define TGL_PAT_WC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 1)
>>   #define TGL_PAT_UC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 0)
>>   
>> -static const u32 tgl_pat_table[] = {
>> -	[0] = TGL_PAT_WB,
>> -	[1] = TGL_PAT_WC,
>> -	[2] = TGL_PAT_WT,
>> -	[3] = TGL_PAT_UC,
>> -	[4] = TGL_PAT_WB,
>> -	[5] = TGL_PAT_WB,
>> -	[6] = TGL_PAT_WB,
>> -	[7] = TGL_PAT_WB,
>> +static const struct xe_pat_table_entry tgl_pat_table[] = {
>> +	[0] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
>> +	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
>> +	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>> +	[3] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
>> +	[4] = { TGL_PAT_WB }, /* zero coh_mode to indicate invalid from userspace */
>> +	[5] = { TGL_PAT_WB },
>> +	[6] = { TGL_PAT_WB },
>> +	[7] = { TGL_PAT_WB },
> 
> Should we just not even include 4-7?  That's basically the approach
> we've taken with MTL (which has 32 entries in hardware, but we only
> provide a table of the first 5).
> 
>>   };
>>   
>> -static const u32 pvc_pat_table[] = {
>> -	[0] = TGL_PAT_UC,
>> -	[1] = TGL_PAT_WC,
>> -	[2] = TGL_PAT_WT,
>> -	[3] = TGL_PAT_WB,
>> -	[4] = PVC_PAT_CLOS(1) | TGL_PAT_WT,
>> -	[5] = PVC_PAT_CLOS(1) | TGL_PAT_WB,
>> -	[6] = PVC_PAT_CLOS(2) | TGL_PAT_WT,
>> -	[7] = PVC_PAT_CLOS(2) | TGL_PAT_WB,
>> +static const struct xe_pat_table_entry pvc_pat_table[] = {
>> +	[0] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
>> +	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
>> +	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>> +	[3] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> 
> Is 2-way correct here?  Although GPU reads snoop the CPU cache and GPU
> writes invalidate the CPU cache, I don't think the reverse is true (CPU
> operations don't automatically snoop/invalidate the GPU cache).  So
> wouldn't this be just 1-way coherent?

Yeah, I wasn't too sure about this. The Bspec says something like:

Discrete GPUs do not support caching of system memory in the device. 
Therefore, the coherency mode that is supported in discrete GPUs are the 
one-way coherency mode with no caching of system memory inside the 
discrete GPU.  Setting Coherency Mode to 10 or 11 in discrete GPUs 
results in this same behaviour.

My interpretation is that 1way and 2way do the same thing on dgpu, so I 
figured I could cheat here and claim it's 2way and then we don't need 
separate tables or special logic for dgpu vs igpu. I can change it to 
1way if you prefer though?

> 
> By the same logic, once we add in coherency the TGL table above probably
> isn't correct anymore for DG1/DG2 where we only have PCI snooping
> instead of a truly shared LLC cache.
> 
>> +	[4] = { PVC_PAT_CLOS(1) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>> +	[5] = { PVC_PAT_CLOS(1) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
>> +	[6] = { PVC_PAT_CLOS(2) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>> +	[7] = { PVC_PAT_CLOS(2) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
>>   };
>>   
>> -static const u32 mtl_pat_table[] = {
>> -	[0] = MTL_PAT_0_WB,
>> -	[1] = MTL_PAT_1_WT,
>> -	[2] = MTL_PAT_3_UC,
>> -	[3] = MTL_PAT_0_WB | MTL_2_COH_1W,
>> -	[4] = MTL_PAT_0_WB | MTL_3_COH_2W,
>> +static const struct xe_pat_table_entry mtl_pat_table[] = {
>> +	[0] = { MTL_PAT_0_WB, XE_GEM_COHERENCY_NONE },
>> +	[1] = { MTL_PAT_1_WT, XE_GEM_COHERENCY_NONE },
>> +	[2] = { MTL_PAT_3_UC, XE_GEM_COHERENCY_NONE },
>> +	[3] = { MTL_PAT_0_WB | MTL_2_COH_1W, XE_GEM_COHERENCY_1WAY },
>> +	[4] = { MTL_PAT_0_WB | MTL_3_COH_2W, XE_GEM_COHERENCY_2WAY },
> 
> Although this is labelled "2 way" in the bspec, the reality is that it's
> a bit of a lie.  From bspec 63884:
> 
>          "2-way Coherent (GPU <-> CPU Snooping) honored only for atomics,
>          else 1-way coh"
> 
> and also:
> 
>          "Except for system atomics, setting Coherency Mode to 10 or 11
>          results in this same one-way coherenct behavior.  Full CPU-GPU
>          coherency is maintained for system atomics with Coherency Mode =
>          11, with no caching."

Ok, I didn't know that. So what should we put here for MTL_3_COH_2W? 
Just mark it as 1way? AFAICT 1way vs 2way doesn't really matter all that 
much from KMD pov, just so long as the coherency level ensures that 
swap-in and clearing can't ever be bypassed. So maybe all that matters 
here is that this matches the bspec, and exactly what 2way means for the 
platform we don't really care? Or perhaps we don't even need the 2way 
and can just mark stuff as NONE or AT_LEAST_1WAY?

> 
> 
> Matt
> 
>>   };
>>   
>>   static const u32 xelp_pte_pat_table[XE_CACHE_LAST] = {
>> @@ -82,27 +84,35 @@ static const u32 xelpg_pte_pat_table[XE_CACHE_LAST] = {
>>   	[XE_CACHE_WB_1_WAY] = XELPG_PAT_WB_CACHE_1_WAY,
>>   };
>>   
>> +u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index)
>> +{
>> +	WARN_ON(pat_index >= xe->info.pat_table_n_entries);
>> +	return xe->info.pat_table[pat_index].coh_mode;
>> +}
>> +
>>   unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache)
>>   {
>>   	WARN_ON(cache >= XE_CACHE_LAST);
>>   	return (xe->pat_table).pte_pat_table[cache];
>>   }
>>   
>> -static void program_pat(struct xe_gt *gt, const u32 table[], int n_entries)
>> +static void program_pat(struct xe_gt *gt, const struct xe_pat_table_entry table[],
>> +			int n_entries)
>>   {
>>   	for (int i = 0; i < n_entries; i++) {
>>   		struct xe_reg reg = XE_REG(_PAT_INDEX(i));
>>   
>> -		xe_mmio_write32(gt, reg, table[i]);
>> +		xe_mmio_write32(gt, reg, table[i].value);
>>   	}
>>   }
>>   
>> -static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
>> +static void program_pat_mcr(struct xe_gt *gt, const struct xe_pat_table_entry table[],
>> +			    int n_entries)
>>   {
>>   	for (int i = 0; i < n_entries; i++) {
>>   		struct xe_reg_mcr reg_mcr = XE_REG_MCR(_PAT_INDEX(i));
>>   
>> -		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i]);
>> +		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i].value);
>>   	}
>>   }
>>   
>> @@ -115,6 +125,7 @@ int xe_pat_fill_info(struct xe_device *xe)
>>   		xe->info.pat_table = pvc_pat_table;
>>   		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
>>   	} else if (GRAPHICS_VERx100(xe) <= 1210) {
>> +		WARN_ON_ONCE(!IS_DGFX(xe) && !xe->info.has_llc);
>>   		xe->info.pat_table = tgl_pat_table;
>>   		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
>>   	} else {
>> diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
>> index 9ab059758ad1..e2fabde2c730 100644
>> --- a/drivers/gpu/drm/xe/xe_pat.h
>> +++ b/drivers/gpu/drm/xe/xe_pat.h
>> @@ -28,9 +28,15 @@
>>   struct xe_gt;
>>   struct xe_device;
>>   
>> +struct xe_pat_table_entry {
>> +	u32 value;
>> +	u16 coh_mode;
>> +};
>> +
>>   int xe_pat_fill_info(struct xe_device *xe);
>>   void xe_pat_init(struct xe_gt *gt);
>>   void xe_pte_pat_init(struct xe_device *xe);
>>   unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
>> +u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index);
>>   
>>   #endif
>> -- 
>> 2.41.0
>>
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode
  2023-08-30  9:32     ` Matthew Auld
@ 2023-08-30 19:40       ` Matt Roper
  0 siblings, 0 replies; 47+ messages in thread
From: Matt Roper @ 2023-08-30 19:40 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe

On Wed, Aug 30, 2023 at 10:32:30AM +0100, Matthew Auld wrote:
> On 29/08/2023 22:08, Matt Roper wrote:
> > On Tue, Aug 29, 2023 at 05:28:45PM +0100, Matthew Auld wrote:
> > > Future uapi needs to give userspace the ability to select the pat_index
> > > for a given vm_bind. However we need to be able to extract the coherency
> > > mode from the provided pat_index to ensure it matches the coherency mode
> > > set at object creation. There are various security reasons for why this
> > > matters.  However the pat_index itself is very platform specific, so
> > > seems reasonable to annotate each platform definition of the pat table.
> > > On some older platforms there is no explicit coherency mode, so we just
> > > pick whatever makes sense.
> > > 
> > > Bspec: 45101, 44235 #xe
> > > Bspec: 70552, 71582, 59400 #xe2
> > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: José Roberto de Souza <jose.souza@intel.com>
> > > Cc: Filip Hazubski <filip.hazubski@intel.com>
> > > Cc: Carl Zhang <carl.zhang@intel.com>
> > > Cc: Effie Yu <effie.yu@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_device_types.h |  2 +-
> > >   drivers/gpu/drm/xe/xe_pat.c          | 67 ++++++++++++++++------------
> > >   drivers/gpu/drm/xe/xe_pat.h          |  6 +++
> > >   3 files changed, 46 insertions(+), 29 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > index 06235da647bb..53520ae30b33 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -238,7 +238,7 @@ struct xe_device {
> > >   		/** @enable_display: display enabled */
> > >   		u8 enable_display:1;
> > > -		const u32 *pat_table;
> > > +		const struct xe_pat_table_entry *pat_table;
> > >   		int pat_table_n_entries;
> > >   #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> > > diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
> > > index f19f5d8dcd94..9e72c1b4b41f 100644
> > > --- a/drivers/gpu/drm/xe/xe_pat.c
> > > +++ b/drivers/gpu/drm/xe/xe_pat.c
> > > @@ -4,6 +4,8 @@
> > >    */
> > > +#include <drm/xe_drm.h>
> > > +
> > >   #include "regs/xe_reg_defs.h"
> > >   #include "xe_gt.h"
> > >   #include "xe_gt_mcr.h"
> > > @@ -33,34 +35,34 @@
> > >   #define TGL_PAT_WC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 1)
> > >   #define TGL_PAT_UC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 0)
> > > -static const u32 tgl_pat_table[] = {
> > > -	[0] = TGL_PAT_WB,
> > > -	[1] = TGL_PAT_WC,
> > > -	[2] = TGL_PAT_WT,
> > > -	[3] = TGL_PAT_UC,
> > > -	[4] = TGL_PAT_WB,
> > > -	[5] = TGL_PAT_WB,
> > > -	[6] = TGL_PAT_WB,
> > > -	[7] = TGL_PAT_WB,
> > > +static const struct xe_pat_table_entry tgl_pat_table[] = {
> > > +	[0] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> > > +	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
> > > +	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> > > +	[3] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
> > > +	[4] = { TGL_PAT_WB }, /* zero coh_mode to indicate invalid from userspace */
> > > +	[5] = { TGL_PAT_WB },
> > > +	[6] = { TGL_PAT_WB },
> > > +	[7] = { TGL_PAT_WB },
> > 
> > Should we just not even include 4-7?  That's basically the approach
> > we've taken with MTL (which has 32 entries in hardware, but we only
> > provide a table of the first 5).
> > 
> > >   };
> > > -static const u32 pvc_pat_table[] = {
> > > -	[0] = TGL_PAT_UC,
> > > -	[1] = TGL_PAT_WC,
> > > -	[2] = TGL_PAT_WT,
> > > -	[3] = TGL_PAT_WB,
> > > -	[4] = PVC_PAT_CLOS(1) | TGL_PAT_WT,
> > > -	[5] = PVC_PAT_CLOS(1) | TGL_PAT_WB,
> > > -	[6] = PVC_PAT_CLOS(2) | TGL_PAT_WT,
> > > -	[7] = PVC_PAT_CLOS(2) | TGL_PAT_WB,
> > > +static const struct xe_pat_table_entry pvc_pat_table[] = {
> > > +	[0] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
> > > +	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
> > > +	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> > > +	[3] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> > 
> > Is 2-way correct here?  Although GPU reads snoop the CPU cache and GPU
> > writes invalidate the CPU cache, I don't think the reverse is true (CPU
> > operations don't automatically snoop/invalidate the GPU cache).  So
> > wouldn't this be just 1-way coherent?
> 
> Yeah, I wasn't too sure about this. The Bspec says something like:
> 
> Discrete GPUs do not support caching of system memory in the device.
> Therefore, the coherency mode that is supported in discrete GPUs are the
> one-way coherency mode with no caching of system memory inside the discrete
> GPU.  Setting Coherency Mode to 10 or 11 in discrete GPUs results in this
> same behaviour.
> 
> My interpretation is that 1way and 2way do the same thing on dgpu, so I
> figured I could cheat here and claim it's 2way and then we don't need
> separate tables or special logic for dgpu vs igpu. I can change it to 1way
> if you prefer though?
> 
> > 
> > By the same logic, once we add in coherency the TGL table above probably
> > isn't correct anymore for DG1/DG2 where we only have PCI snooping
> > instead of a truly shared LLC cache.
> > 
> > > +	[4] = { PVC_PAT_CLOS(1) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> > > +	[5] = { PVC_PAT_CLOS(1) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> > > +	[6] = { PVC_PAT_CLOS(2) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
> > > +	[7] = { PVC_PAT_CLOS(2) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> > >   };
> > > -static const u32 mtl_pat_table[] = {
> > > -	[0] = MTL_PAT_0_WB,
> > > -	[1] = MTL_PAT_1_WT,
> > > -	[2] = MTL_PAT_3_UC,
> > > -	[3] = MTL_PAT_0_WB | MTL_2_COH_1W,
> > > -	[4] = MTL_PAT_0_WB | MTL_3_COH_2W,
> > > +static const struct xe_pat_table_entry mtl_pat_table[] = {
> > > +	[0] = { MTL_PAT_0_WB, XE_GEM_COHERENCY_NONE },
> > > +	[1] = { MTL_PAT_1_WT, XE_GEM_COHERENCY_NONE },
> > > +	[2] = { MTL_PAT_3_UC, XE_GEM_COHERENCY_NONE },
> > > +	[3] = { MTL_PAT_0_WB | MTL_2_COH_1W, XE_GEM_COHERENCY_1WAY },
> > > +	[4] = { MTL_PAT_0_WB | MTL_3_COH_2W, XE_GEM_COHERENCY_2WAY },
> > 
> > Although this is labelled "2 way" in the bspec, the reality is that it's
> > a bit of a lie.  From bspec 63884:
> > 
> >          "2-way Coherent (GPU <-> CPU Snooping) honored only for atomics,
> >          else 1-way coh"
> > 
> > and also:
> > 
> >          "Except for system atomics, setting Coherency Mode to 10 or 11
> >          results in this same one-way coherenct behavior.  Full CPU-GPU
> >          coherency is maintained for system atomics with Coherency Mode =
> >          11, with no caching."
> 
> Ok, I didn't know that. So what should we put here for MTL_3_COH_2W? Just
> mark it as 1way? AFAICT 1way vs 2way doesn't really matter all that much
> from KMD pov, just so long as the coherency level ensures that swap-in and
> clearing can't ever be bypassed. So maybe all that matters here is that this
> matches the bspec, and exactly what 2way means for the platform we don't
> really care? Or perhaps we don't even need the 2way and can just mark stuff
> as NONE or AT_LEAST_1WAY?

Yeah, I like the idea of values "non-coherent" and "at least 1-way
coherent" for now.  We can potentially add an explicit "2 way coherent"
farther down the road if there ever winds up being a need for it.


Matt

> 
> > 
> > 
> > Matt
> > 
> > >   };
> > >   static const u32 xelp_pte_pat_table[XE_CACHE_LAST] = {
> > > @@ -82,27 +84,35 @@ static const u32 xelpg_pte_pat_table[XE_CACHE_LAST] = {
> > >   	[XE_CACHE_WB_1_WAY] = XELPG_PAT_WB_CACHE_1_WAY,
> > >   };
> > > +u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index)
> > > +{
> > > +	WARN_ON(pat_index >= xe->info.pat_table_n_entries);
> > > +	return xe->info.pat_table[pat_index].coh_mode;
> > > +}
> > > +
> > >   unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache)
> > >   {
> > >   	WARN_ON(cache >= XE_CACHE_LAST);
> > >   	return (xe->pat_table).pte_pat_table[cache];
> > >   }
> > > -static void program_pat(struct xe_gt *gt, const u32 table[], int n_entries)
> > > +static void program_pat(struct xe_gt *gt, const struct xe_pat_table_entry table[],
> > > +			int n_entries)
> > >   {
> > >   	for (int i = 0; i < n_entries; i++) {
> > >   		struct xe_reg reg = XE_REG(_PAT_INDEX(i));
> > > -		xe_mmio_write32(gt, reg, table[i]);
> > > +		xe_mmio_write32(gt, reg, table[i].value);
> > >   	}
> > >   }
> > > -static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
> > > +static void program_pat_mcr(struct xe_gt *gt, const struct xe_pat_table_entry table[],
> > > +			    int n_entries)
> > >   {
> > >   	for (int i = 0; i < n_entries; i++) {
> > >   		struct xe_reg_mcr reg_mcr = XE_REG_MCR(_PAT_INDEX(i));
> > > -		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i]);
> > > +		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i].value);
> > >   	}
> > >   }
> > > @@ -115,6 +125,7 @@ int xe_pat_fill_info(struct xe_device *xe)
> > >   		xe->info.pat_table = pvc_pat_table;
> > >   		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
> > >   	} else if (GRAPHICS_VERx100(xe) <= 1210) {
> > > +		WARN_ON_ONCE(!IS_DGFX(xe) && !xe->info.has_llc);
> > >   		xe->info.pat_table = tgl_pat_table;
> > >   		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
> > >   	} else {
> > > diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
> > > index 9ab059758ad1..e2fabde2c730 100644
> > > --- a/drivers/gpu/drm/xe/xe_pat.h
> > > +++ b/drivers/gpu/drm/xe/xe_pat.h
> > > @@ -28,9 +28,15 @@
> > >   struct xe_gt;
> > >   struct xe_device;
> > > +struct xe_pat_table_entry {
> > > +	u32 value;
> > > +	u16 coh_mode;
> > > +};
> > > +
> > >   int xe_pat_fill_info(struct xe_device *xe);
> > >   void xe_pat_init(struct xe_gt *gt);
> > >   void xe_pte_pat_init(struct xe_device *xe);
> > >   unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
> > > +u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index);
> > >   #endif
> > > -- 
> > > 2.41.0
> > > 
> > 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode
  2023-08-29 16:28 ` [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode Matthew Auld
  2023-08-29 21:08   ` Matt Roper
@ 2023-08-29 22:02   ` Lucas De Marchi
  1 sibling, 0 replies; 47+ messages in thread
From: Lucas De Marchi @ 2023-08-29 22:02 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Filip Hazubski, Joonas Lahtinen, Carl Zhang, Effie Yu, Matt Roper,
	intel-xe

On Tue, Aug 29, 2023 at 05:28:45PM +0100, Matthew Auld wrote:
>Future uapi needs to give userspace the ability to select the pat_index
>for a given vm_bind. However we need to be able to extract the coherency
>mode from the provided pat_index to ensure it matches the coherency mode
>set at object creation. There are various security reasons for why this
>matters.  However the pat_index itself is very platform specific, so
>seems reasonable to annotate each platform definition of the pat table.
>On some older platforms there is no explicit coherency mode, so we just
>pick whatever makes sense.
>
>Bspec: 45101, 44235 #xe
>Bspec: 70552, 71582, 59400 #xe2
>Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>Cc: Matt Roper <matthew.d.roper@intel.com>
>Cc: José Roberto de Souza <jose.souza@intel.com>
>Cc: Filip Hazubski <filip.hazubski@intel.com>
>Cc: Carl Zhang <carl.zhang@intel.com>
>Cc: Effie Yu <effie.yu@intel.com>
>---
> drivers/gpu/drm/xe/xe_device_types.h |  2 +-
> drivers/gpu/drm/xe/xe_pat.c          | 67 ++++++++++++++++------------
> drivers/gpu/drm/xe/xe_pat.h          |  6 +++
> 3 files changed, 46 insertions(+), 29 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index 06235da647bb..53520ae30b33 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -238,7 +238,7 @@ struct xe_device {
> 		/** @enable_display: display enabled */
> 		u8 enable_display:1;
>
>-		const u32 *pat_table;
>+		const struct xe_pat_table_entry *pat_table;

if following my comment on previous patch, we'd have

	struct {
		const struct xe_pat_table_entry *entries;
		unsigned int n_entries;
	} pat;

Lucas De Marchi

> 		int pat_table_n_entries;
>
> #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>diff --git a/drivers/gpu/drm/xe/xe_pat.c b/drivers/gpu/drm/xe/xe_pat.c
>index f19f5d8dcd94..9e72c1b4b41f 100644
>--- a/drivers/gpu/drm/xe/xe_pat.c
>+++ b/drivers/gpu/drm/xe/xe_pat.c
>@@ -4,6 +4,8 @@
>  */
>
>
>+#include <drm/xe_drm.h>
>+
> #include "regs/xe_reg_defs.h"
> #include "xe_gt.h"
> #include "xe_gt_mcr.h"
>@@ -33,34 +35,34 @@
> #define TGL_PAT_WC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 1)
> #define TGL_PAT_UC				REG_FIELD_PREP(TGL_MEM_TYPE_MASK, 0)
>
>-static const u32 tgl_pat_table[] = {
>-	[0] = TGL_PAT_WB,
>-	[1] = TGL_PAT_WC,
>-	[2] = TGL_PAT_WT,
>-	[3] = TGL_PAT_UC,
>-	[4] = TGL_PAT_WB,
>-	[5] = TGL_PAT_WB,
>-	[6] = TGL_PAT_WB,
>-	[7] = TGL_PAT_WB,
>+static const struct xe_pat_table_entry tgl_pat_table[] = {
>+	[0] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
>+	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
>+	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>+	[3] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
>+	[4] = { TGL_PAT_WB }, /* zero coh_mode to indicate invalid from userspace */
>+	[5] = { TGL_PAT_WB },
>+	[6] = { TGL_PAT_WB },
>+	[7] = { TGL_PAT_WB },
> };
>
>-static const u32 pvc_pat_table[] = {
>-	[0] = TGL_PAT_UC,
>-	[1] = TGL_PAT_WC,
>-	[2] = TGL_PAT_WT,
>-	[3] = TGL_PAT_WB,
>-	[4] = PVC_PAT_CLOS(1) | TGL_PAT_WT,
>-	[5] = PVC_PAT_CLOS(1) | TGL_PAT_WB,
>-	[6] = PVC_PAT_CLOS(2) | TGL_PAT_WT,
>-	[7] = PVC_PAT_CLOS(2) | TGL_PAT_WB,
>+static const struct xe_pat_table_entry pvc_pat_table[] = {
>+	[0] = { TGL_PAT_UC, XE_GEM_COHERENCY_NONE },
>+	[1] = { TGL_PAT_WC, XE_GEM_COHERENCY_NONE },
>+	[2] = { TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>+	[3] = { TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
>+	[4] = { PVC_PAT_CLOS(1) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>+	[5] = { PVC_PAT_CLOS(1) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
>+	[6] = { PVC_PAT_CLOS(2) | TGL_PAT_WT, XE_GEM_COHERENCY_NONE },
>+	[7] = { PVC_PAT_CLOS(2) | TGL_PAT_WB, XE_GEM_COHERENCY_2WAY },
> };
>
>-static const u32 mtl_pat_table[] = {
>-	[0] = MTL_PAT_0_WB,
>-	[1] = MTL_PAT_1_WT,
>-	[2] = MTL_PAT_3_UC,
>-	[3] = MTL_PAT_0_WB | MTL_2_COH_1W,
>-	[4] = MTL_PAT_0_WB | MTL_3_COH_2W,
>+static const struct xe_pat_table_entry mtl_pat_table[] = {
>+	[0] = { MTL_PAT_0_WB, XE_GEM_COHERENCY_NONE },
>+	[1] = { MTL_PAT_1_WT, XE_GEM_COHERENCY_NONE },
>+	[2] = { MTL_PAT_3_UC, XE_GEM_COHERENCY_NONE },
>+	[3] = { MTL_PAT_0_WB | MTL_2_COH_1W, XE_GEM_COHERENCY_1WAY },
>+	[4] = { MTL_PAT_0_WB | MTL_3_COH_2W, XE_GEM_COHERENCY_2WAY },
> };
>
> static const u32 xelp_pte_pat_table[XE_CACHE_LAST] = {
>@@ -82,27 +84,35 @@ static const u32 xelpg_pte_pat_table[XE_CACHE_LAST] = {
> 	[XE_CACHE_WB_1_WAY] = XELPG_PAT_WB_CACHE_1_WAY,
> };
>
>+u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index)
>+{
>+	WARN_ON(pat_index >= xe->info.pat_table_n_entries);
>+	return xe->info.pat_table[pat_index].coh_mode;
>+}
>+
> unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache)
> {
> 	WARN_ON(cache >= XE_CACHE_LAST);
> 	return (xe->pat_table).pte_pat_table[cache];
> }
>
>-static void program_pat(struct xe_gt *gt, const u32 table[], int n_entries)
>+static void program_pat(struct xe_gt *gt, const struct xe_pat_table_entry table[],
>+			int n_entries)
> {
> 	for (int i = 0; i < n_entries; i++) {
> 		struct xe_reg reg = XE_REG(_PAT_INDEX(i));
>
>-		xe_mmio_write32(gt, reg, table[i]);
>+		xe_mmio_write32(gt, reg, table[i].value);
> 	}
> }
>
>-static void program_pat_mcr(struct xe_gt *gt, const u32 table[], int n_entries)
>+static void program_pat_mcr(struct xe_gt *gt, const struct xe_pat_table_entry table[],
>+			    int n_entries)
> {
> 	for (int i = 0; i < n_entries; i++) {
> 		struct xe_reg_mcr reg_mcr = XE_REG_MCR(_PAT_INDEX(i));
>
>-		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i]);
>+		xe_gt_mcr_multicast_write(gt, reg_mcr, table[i].value);
> 	}
> }
>
>@@ -115,6 +125,7 @@ int xe_pat_fill_info(struct xe_device *xe)
> 		xe->info.pat_table = pvc_pat_table;
> 		xe->info.pat_table_n_entries = ARRAY_SIZE(pvc_pat_table);
> 	} else if (GRAPHICS_VERx100(xe) <= 1210) {
>+		WARN_ON_ONCE(!IS_DGFX(xe) && !xe->info.has_llc);
> 		xe->info.pat_table = tgl_pat_table;
> 		xe->info.pat_table_n_entries = ARRAY_SIZE(tgl_pat_table);
> 	} else {
>diff --git a/drivers/gpu/drm/xe/xe_pat.h b/drivers/gpu/drm/xe/xe_pat.h
>index 9ab059758ad1..e2fabde2c730 100644
>--- a/drivers/gpu/drm/xe/xe_pat.h
>+++ b/drivers/gpu/drm/xe/xe_pat.h
>@@ -28,9 +28,15 @@
> struct xe_gt;
> struct xe_device;
>
>+struct xe_pat_table_entry {
>+	u32 value;
>+	u16 coh_mode;
>+};
>+
> int xe_pat_fill_info(struct xe_device *xe);
> void xe_pat_init(struct xe_gt *gt);
> void xe_pte_pat_init(struct xe_device *xe);
> unsigned int xe_pat_get_index(struct xe_device *xe, enum xe_cache_level cache);
>+u16 xe_pat_index_get_coh_mode(struct xe_device *xe, u32 pat_index);
>
> #endif
>-- 
>2.41.0
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
                   ` (3 preceding siblings ...)
  2023-08-29 16:28 ` [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode Matthew Auld
@ 2023-08-29 16:28 ` Matthew Auld
  2023-08-29 21:36   ` Matt Roper
                     ` (2 more replies)
  2023-08-29 16:40 ` [Intel-xe] ✗ CI.Patch_applied: failure for PAT and cache coherency support Patchwork
  2023-09-04 20:25 ` [Intel-xe] [RFC 0/5] " Souza, Jose
  6 siblings, 3 replies; 47+ messages in thread
From: Matthew Auld @ 2023-08-29 16:28 UTC (permalink / raw)
  To: intel-xe
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, Matt Roper

Allow userspace to directly control the pat_index for a given vm
binding. This should allow directly controlling the coherency, caching
and potentially other stuff in the future for the ppGTT binding.

The exact meaning behind the pat_index is very platform specific (see
BSpec or PRMs) but effectively maps to some predefined memory
attributes. From the KMD pov we only care about the coherency that is
provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
The vm_bind coherency mode for the given pat_index needs to match the
given coh_mode that was set at object creation. For platforms that lack
the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.

For userptr mappings we lack a corresponding gem object, so the expected
coherency mode is instead implicit and must fall into either 1WAY or
2WAY. Trying to use NONE will be rejected by the kernel. For imported
dma-buf (from a different device) the coherency mode is also implicit
and must also be either 1WAY or 2WAY.

As part of adding pat_index support with vm_bind we also need stop using
xe_cache_level and instead use the pat_index in various places. We still
make use of xe_cache_level, but only as a convenience for kernel
internal objectsi (internally it maps to some reasonable pat_index). For
now this is just a 1:1 conversion of the existing code, however for
platforms like MTL+ we might need to give more control through bo_create
or stop using WB on the CPU side if we need CPU access.

Bspec: 45101, 44235 #xe
Bspec: 70552, 71582, 59400 #xe2
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Pallavi Mishra <pallavi.mishra@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Filip Hazubski <filip.hazubski@intel.com>
Cc: Carl Zhang <carl.zhang@intel.com>
Cc: Effie Yu <effie.yu@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
 drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
 drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
 drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
 drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
 drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
 drivers/gpu/drm/xe/xe_pt.h            |  6 +--
 drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
 drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
 include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
 10 files changed, 134 insertions(+), 58 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index 4d6296cdbcfd..bb4c182048c3 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -302,7 +302,7 @@ struct i915_address_space {
 		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
 
 	u64 (*pte_encode)(dma_addr_t addr,
-			  unsigned int pat_index,
+			  u32 pat_index,
 			  u32 flags); /* Create a valid PTE */
 #define PTE_READ_ONLY	BIT(0)
 #define PTE_LM		BIT(1)
diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
index 5c8d5e78d9bc..7a128fd20a29 100644
--- a/drivers/gpu/drm/xe/tests/xe_migrate.c
+++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
@@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
 	/* First part of the test, are we updating our pagetable bo with a new entry? */
 	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
 		  0xdeaddeadbeefbeef);
-	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
+	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
 	if (m->q->vm->flags & XE_VM_FLAG_64K)
 		expected |= XE_PTE_PS64;
 	if (xe_bo_is_vram(pt))
diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
index 209fa053d9fb..4134c26150a5 100644
--- a/drivers/gpu/drm/xe/xe_ggtt.c
+++ b/drivers/gpu/drm/xe/xe_ggtt.c
@@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
 		pte |= XE_GGTT_PTE_DM;
 
 	if ((ggtt->pat_encode).pte_encode)
-		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
+		pte = (ggtt->pat_encode).pte_encode(xe, pte,
+						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
 
 	return pte;
 }
@@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
 }
 
 static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
-						enum xe_cache_level cache)
+				     u32 pat_index)
 {
-	u32 pat_index = xe_pat_get_index(xe, cache);
-
 	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
 
 	if (pat_index & BIT(0))
diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
index 7e55fac1a8a9..0bc40cb072e3 100644
--- a/drivers/gpu/drm/xe/xe_ggtt_types.h
+++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
@@ -31,7 +31,7 @@ struct xe_ggtt {
 
 	struct {
 		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
-						enum xe_cache_level cache);
+				  u32 pat_index);
 	} pat_encode;
 };
 
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index a782ea282cb6..54585e98452a 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -24,6 +24,7 @@
 #include "xe_lrc.h"
 #include "xe_map.h"
 #include "xe_mocs.h"
+#include "xe_pat.h"
 #include "xe_pt.h"
 #include "xe_res_cursor.h"
 #include "xe_sched_job.h"
@@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
 	u32 map_ofs, level, i;
 	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
+	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
 	u64 entry;
 	int ret;
 
@@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 		return ret;
 	}
 
-	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
+	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
 	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
 
 	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
 
 	/* Map the entire BO in our level 0 pt */
 	for (i = 0, level = 0; i < num_entries; level++) {
-		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
+		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
 
 		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
 
@@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 		for (i = 0; i < batch->size;
 		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
 		     XE_PAGE_SIZE) {
-			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
+			entry = xe_pte_encode(batch, i, pat_index, 0);
 
 			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
 				  entry);
@@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 			flags = XE_PDE_64K;
 
 		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
-					XE_PAGE_SIZE, XE_CACHE_WB);
+					XE_PAGE_SIZE, pat_index);
 		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
 			  entry | flags);
 	}
@@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 	/* Write PDE's that point to our BO. */
 	for (i = 0; i < num_entries - num_level; i++) {
 		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
-				      XE_CACHE_WB);
+				      pat_index);
 
 		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
 			  (i + 1) * 8, u64, entry);
@@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 
 			XE_WARN_ON(pt_bo->size != SZ_4K);
 
-			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
+			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
 			bb->cs[bb->len++] = lower_32_bits(addr);
 			bb->cs[bb->len++] = upper_32_bits(addr);
 		}
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 64713f400d94..019af2920078 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -10,6 +10,7 @@
 #include "xe_gt.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_migrate.h"
+#include "xe_pat.h"
 #include "xe_pt_types.h"
 #include "xe_pt_walk.h"
 #include "xe_res_cursor.h"
@@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
  *
  * Return: An encoded page directory entry. No errors.
  */
-u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
-		  const enum xe_cache_level cache)
+u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
 {
 	u64 pde;
 	struct xe_vm *vm = bo->vm;
 	struct xe_device *xe = vm->xe;
 
-
 	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
 	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
 
 	if ((vm->pat_encode).pde_encode)
-		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
+		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
 
 	return pde;
 }
 
-static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
+static u64 __pte_encode(u64 pte, u32 pat_index,
 			struct xe_vma *vma, u32 pt_level)
 {
 	struct xe_vm *vm = xe_vma_vm(vma);
@@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
 		pte |= XE_PTE_NULL;
 
 	if ((vm->pat_encode).pte_encode)
-		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
+		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
 
 	if (pt_level == 1)
 		pte |= XE_PDE_PS_2M;
@@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
  *
  * Return: An encoded page-table entry. No errors.
  */
-u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
+u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
 		  u32 pt_level)
 {
 	u64 pte;
@@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
 	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
 		pte |= XE_PPGTT_PTE_DM;
 
-	return __pte_encode(pte, cache, NULL, pt_level);
+	return __pte_encode(pte, pat_index, NULL, pt_level);
 }
 
 static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
@@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
 
 	if (level == 0) {
 		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
-					  XE_CACHE_WB, 0);
+					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
 
 		return empty;
 	} else {
 		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
-				     XE_CACHE_WB);
+				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
 	}
 }
 
@@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
 	struct xe_vm *vm;
 	/** @tile: The tile we're building for. */
 	struct xe_tile *tile;
-	/** @cache: Desired cache level for the ptes */
-	enum xe_cache_level cache;
 	/** @default_pte: PTE flag only template. No address is associated */
 	u64 default_pte;
 	/** @dma_offset: DMA offset to add to the PTE. */
@@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
 
 		pte = __pte_encode(is_null ? 0 :
 				   xe_res_dma(curs) + xe_walk->dma_offset,
-				   xe_walk->cache, xe_walk->vma, level);
+				   xe_walk->vma->pat_index, xe_walk->vma, level);
 		pte |= xe_walk->default_pte;
 
 		/*
@@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
 			xe_child->is_compact = true;
 		}
 
-		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
+		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
 		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
 					 pte);
 	}
@@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
 			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
 		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
-		xe_walk.cache = XE_CACHE_WB;
-	} else {
-		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
-			xe_walk.cache = XE_CACHE_WT;
-		else
-			xe_walk.cache = XE_CACHE_WB;
 	}
+
 	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
 		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
 
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 01be7ab08f87..1d433a5a96b4 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
 
 bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
 
-u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
-		  const enum xe_cache_level level);
+u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
 
-u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
-		  u32 pt_level);
+u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 7eeeed0411f3..34603a7e84b0 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -6,6 +6,7 @@
 #include "xe_vm.h"
 
 #include <linux/dma-fence-array.h>
+#include <linux/nospec.h>
 
 #include <drm/drm_print.h>
 #include <drm/ttm/ttm_execbuf_util.h>
@@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 				    u64 start, u64 end,
 				    bool read_only,
 				    bool is_null,
-				    u8 tile_mask)
+				    u8 tile_mask,
+				    u32 pat_index)
 {
 	struct xe_vma *vma;
 	struct xe_tile *tile;
@@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 			vma->tile_mask |= 0x1 << id;
 	}
 
+	vma->pat_index = pat_index;
+
 	if (vm->xe->info.platform == XE_PVC)
 		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
 
@@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
 static void vm_destroy_work_func(struct work_struct *w);
 
 static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
-						enum xe_cache_level cache)
+				     u32 pat_index)
 {
-	u32 pat_index = xe_pat_get_index(xe, cache);
-
 	pde_pat &= ~(XELP_PDE_PAT_MASK);
 
 	if (pat_index & BIT(0))
@@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
 }
 
 static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
-						enum xe_cache_level cache)
+				     u32 pat_index)
 {
-	u32 pat_index = xe_pat_get_index(xe, cache);
-
 	pte_pat &= ~(XELP_PTE_PAT_MASK);
 
 	if (pat_index & BIT(0))
@@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
 u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
 {
 	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
-			     XE_CACHE_WB);
+			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
 }
 
 static struct dma_fence *
@@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
 static struct drm_gpuva_ops *
 vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 			 u64 bo_offset_or_userptr, u64 addr, u64 range,
-			 u32 operation, u8 tile_mask, u32 region)
+			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
 {
 	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
 	struct ww_acquire_ctx ww;
@@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
 
 			op->tile_mask = tile_mask;
+			op->pat_index = pat_index;
 			op->map.immediate =
 				operation & XE_VM_BIND_FLAG_IMMEDIATE;
 			op->map.read_only =
@@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
 
 			op->tile_mask = tile_mask;
+			op->pat_index = pat_index;
 			op->prefetch.region = region;
 		}
 		break;
@@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 }
 
 static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
-			      u8 tile_mask, bool read_only, bool is_null)
+			      u8 tile_mask, bool read_only, bool is_null,
+			      u32 pat_index)
 {
 	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
 	struct xe_vma *vma;
@@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
 	vma = xe_vma_create(vm, bo, op->gem.offset,
 			    op->va.addr, op->va.addr +
 			    op->va.range - 1, read_only, is_null,
-			    tile_mask);
+			    tile_mask, pat_index);
 	if (bo)
 		xe_bo_unlock(bo, &ww);
 
@@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 
 				vma = new_vma(vm, &op->base.map,
 					      op->tile_mask, op->map.read_only,
-					      op->map.is_null);
+					      op->map.is_null, op->pat_index);
 				if (IS_ERR(vma)) {
 					err = PTR_ERR(vma);
 					goto free_fence;
@@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 
 					vma = new_vma(vm, op->base.remap.prev,
 						      op->tile_mask, read_only,
-						      is_null);
+						      is_null, op->pat_index);
 					if (IS_ERR(vma)) {
 						err = PTR_ERR(vma);
 						goto free_fence;
@@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 
 					vma = new_vma(vm, op->base.remap.next,
 						      op->tile_mask, read_only,
-						      is_null);
+						      is_null, op->pat_index);
 					if (IS_ERR(vma)) {
 						err = PTR_ERR(vma);
 						goto free_fence;
@@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 		u32 obj = (*bind_ops)[i].obj;
 		u64 obj_offset = (*bind_ops)[i].obj_offset;
 		u32 region = (*bind_ops)[i].region;
+		u32 pat_index = (*bind_ops)[i].pat_index;
 		bool is_null = op & XE_VM_BIND_FLAG_NULL;
 
+		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
+			err = -EINVAL;
+			goto free_bind_ops;
+		}
+
 		if (i == 0) {
 			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
 		} else if (XE_IOCTL_DBG(xe, !*async) ||
@@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		struct drm_gem_object *gem_obj;
 		u64 range = bind_ops[i].range;
 		u64 addr = bind_ops[i].addr;
+		u32 op = bind_ops[i].op;
 		u32 obj = bind_ops[i].obj;
 		u64 obj_offset = bind_ops[i].obj_offset;
+		u32 pat_index = bind_ops[i].pat_index;
+		u16 coh_mode;
+
+		pat_index = array_index_nospec(pat_index,
+					       xe->info.pat_table_n_entries);
+		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
+		if (XE_IOCTL_DBG(xe, !coh_mode)) {
+			err = -EINVAL;
+			goto put_obj;
+		}
+
+		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
+				 coh_mode == XE_GEM_COHERENCY_NONE)) {
+			err = -EINVAL;
+			goto put_obj;
+		}
 
 		if (!obj)
 			continue;
@@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 				goto put_obj;
 			}
 		}
+
+		if (bos[i]->coh_mode) {
+			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
+				err = -EINVAL;
+				goto put_obj;
+			}
+		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
+			/*
+			 * Imported dma-buf from a different device should
+			 * require 1way or 2way coherency since we don't know
+			 * how it was mapped on CPU. Just assume is it
+			 * potentially cached on CPU side.
+			 */
+			err = -EINVAL;
+			goto put_obj;
+		}
 	}
 
 	if (args->num_syncs) {
@@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		u64 obj_offset = bind_ops[i].obj_offset;
 		u8 tile_mask = bind_ops[i].tile_mask;
 		u32 region = bind_ops[i].region;
+		u32 pat_index = bind_ops[i].pat_index;
 
 		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
 						  addr, range, op, tile_mask,
-						  region);
+						  region, pat_index);
 		if (IS_ERR(ops[i])) {
 			err = PTR_ERR(ops[i]);
 			ops[i] = NULL;
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 83a1f87b6537..508679ed3c74 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -111,6 +111,11 @@ struct xe_vma {
 	 */
 	u8 tile_present;
 
+	/**
+	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
+	 */
+	u32 pat_index;
+
 	struct {
 		struct list_head rebind_link;
 	} notifier;
@@ -338,10 +343,8 @@ struct xe_vm {
 	bool batch_invalidate_tlb;
 
 	struct {
-		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
-						enum xe_cache_level cache);
-		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
-						enum xe_cache_level cache);
+		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
+		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
 	} pat_encode;
 };
 
@@ -417,6 +420,8 @@ struct xe_vma_op {
 	struct async_op_fence *fence;
 	/** @tile_mask: gt mask for this operation */
 	u8 tile_mask;
+	/** @pat_index: The pat index to use for this operation. */
+	u32 pat_index;
 	/** @flags: operation flags */
 	enum xe_vma_op_flags flags;
 
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 64bc66d4b550..0c15b6f32447 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
 	 */
 	__u32 obj;
 
-	/** @pad: MBZ */
-	__u32 pad;
+	/**
+	 * @pat_index: The platform defined @pat_index to use for this mapping.
+	 * The index basically maps to some predefined memory attributes,
+	 * including things like caching, coherency and likely other stuff in
+	 * the future.  The exact meaning of the pat_index is platform specific
+	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
+	 * the index here is encoded into the ppGTT PTE.
+	 *
+	 * For coherency the @pat_index needs to match the
+	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
+	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
+	 * the coherency mode from the @pat_index and reject if there is a
+	 * mismatch (see note below for pre-MTL platforms).
+	 *
+	 * Note: On pre-MTL platforms there is only a caching mode and no
+	 * explicit coherency mode, but on such hardware there is always a
+	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
+	 * CPU caches even with the caching mode set as uncached.  It's only the
+	 * display engine that is incoherent (on dgpu it must be in VRAM which
+	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
+	 * consistent with newer platforms the KMD groups the different cache
+	 * levels into the following coherency buckets on all pre-MTL platforms:
+	 *
+	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
+	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
+	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
+	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
+	 *
+	 * In practice UC/WC/WT should only ever used for scanout surfaces on
+	 * such platforms since it is only the display engine that is actually
+	 * incoherent. Everything else should typically use WB given that we
+	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
+	 * no shared-LLC) and the HW defines the coherency mode as part of the
+	 * @pat_index.
+	 *
+	 * Note: For userptr and externally imported dma-buf the kernel expects
+	 * either 1WAY or 2WAY for the @pat_index.
+	 */
+	__u32 pat_index;
 
 	union {
 		/**
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
@ 2023-08-29 21:36   ` Matt Roper
  2023-08-30  6:38     ` Thomas Hellström
  2023-08-30 11:28     ` Matthew Auld
  2023-09-04 20:21   ` Souza, Jose
  2023-09-07 18:56   ` Souza, Jose
  2 siblings, 2 replies; 47+ messages in thread
From: Matt Roper @ 2023-08-29 21:36 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe

On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> Allow userspace to directly control the pat_index for a given vm
> binding. This should allow directly controlling the coherency, caching
> and potentially other stuff in the future for the ppGTT binding.
> 
> The exact meaning behind the pat_index is very platform specific (see
> BSpec or PRMs) but effectively maps to some predefined memory
> attributes. From the KMD pov we only care about the coherency that is
> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> The vm_bind coherency mode for the given pat_index needs to match the
> given coh_mode that was set at object creation. For platforms that lack

Is it actually important to match the coherency mode?  I think one of
the main goals was to know up front if userspace might be using a
non-snooping PAT setting that would let it bypass the CPU cache (and
potentially read old, stale data from a different process if the
buffer's clear value is still sitting in cache and hasn't landed in
memory yet).

If that's the only concern, then I think it should still be fine to map
with a non-matching PAT as long as it's more coherent than the one
specified at creation, right?  E.g., if the buffer was created with
1-way coherency, it would be fine to map it with 2-way because userspace
still can't use that to observe the previous contents of the buffer.  Or
if the buffer was created with "non-coherent" then we've already done
the necessary clflushing in kernel before handing to buffer over to
userspace to ensure the clear value landed in memory, so any valid PAT
index should be safe (from a security POV) after that, right?  Any other
problems that arise from mismatched coherency would just be contained to
the app possibly shooting itself in the foot, which isn't really our
concern.


> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> 
> For userptr mappings we lack a corresponding gem object, so the expected
> coherency mode is instead implicit and must fall into either 1WAY or
> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY.
> 
> As part of adding pat_index support with vm_bind we also need stop using
> xe_cache_level and instead use the pat_index in various places. We still
> make use of xe_cache_level, but only as a convenience for kernel
> internal objectsi (internally it maps to some reasonable pat_index). For

Maybe we should kill xe_cache_level completely and just assign
xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
gt->mocs.wb_index)?

> now this is just a 1:1 conversion of the existing code, however for
> platforms like MTL+ we might need to give more control through bo_create
> or stop using WB on the CPU side if we need CPU access.
> 
> Bspec: 45101, 44235 #xe
> Bspec: 70552, 71582, 59400 #xe2
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>  drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>  drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>  drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>  drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>  drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>  drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>  drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>  drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>  include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>  10 files changed, 134 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 4d6296cdbcfd..bb4c182048c3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -302,7 +302,7 @@ struct i915_address_space {
>  		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>  
>  	u64 (*pte_encode)(dma_addr_t addr,
> -			  unsigned int pat_index,
> +			  u32 pat_index,
>  			  u32 flags); /* Create a valid PTE */
>  #define PTE_READ_ONLY	BIT(0)
>  #define PTE_LM		BIT(1)
> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> index 5c8d5e78d9bc..7a128fd20a29 100644
> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>  	/* First part of the test, are we updating our pagetable bo with a new entry? */
>  	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>  		  0xdeaddeadbeefbeef);
> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>  	if (m->q->vm->flags & XE_VM_FLAG_64K)
>  		expected |= XE_PTE_PS64;
>  	if (xe_bo_is_vram(pt))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> index 209fa053d9fb..4134c26150a5 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>  		pte |= XE_GGTT_PTE_DM;
>  
>  	if ((ggtt->pat_encode).pte_encode)
> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>  
>  	return pte;
>  }
> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>  }
>  
>  static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> index 7e55fac1a8a9..0bc40cb072e3 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> @@ -31,7 +31,7 @@ struct xe_ggtt {
>  
>  	struct {
>  		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache);
> +				  u32 pat_index);
>  	} pat_encode;
>  };
>  
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index a782ea282cb6..54585e98452a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -24,6 +24,7 @@
>  #include "xe_lrc.h"
>  #include "xe_map.h"
>  #include "xe_mocs.h"
> +#include "xe_pat.h"
>  #include "xe_pt.h"
>  #include "xe_res_cursor.h"
>  #include "xe_sched_job.h"
> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>  	u32 map_ofs, level, i;
>  	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>  	u64 entry;
>  	int ret;
>  
> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  		return ret;
>  	}
>  
> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>  	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>  
>  	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>  
>  	/* Map the entire BO in our level 0 pt */
>  	for (i = 0, level = 0; i < num_entries; level++) {
> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>  
>  		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>  
> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  		for (i = 0; i < batch->size;
>  		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>  		     XE_PAGE_SIZE) {
> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>  
>  			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>  				  entry);
> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  			flags = XE_PDE_64K;
>  
>  		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> -					XE_PAGE_SIZE, XE_CACHE_WB);
> +					XE_PAGE_SIZE, pat_index);
>  		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>  			  entry | flags);
>  	}
> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	/* Write PDE's that point to our BO. */
>  	for (i = 0; i < num_entries - num_level; i++) {
>  		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> -				      XE_CACHE_WB);
> +				      pat_index);
>  
>  		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>  			  (i + 1) * 8, u64, entry);
> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>  
>  			XE_WARN_ON(pt_bo->size != SZ_4K);
>  
> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>  			bb->cs[bb->len++] = lower_32_bits(addr);
>  			bb->cs[bb->len++] = upper_32_bits(addr);
>  		}
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 64713f400d94..019af2920078 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -10,6 +10,7 @@
>  #include "xe_gt.h"
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_migrate.h"
> +#include "xe_pat.h"
>  #include "xe_pt_types.h"
>  #include "xe_pt_walk.h"
>  #include "xe_res_cursor.h"
> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>   *
>   * Return: An encoded page directory entry. No errors.
>   */
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> -		  const enum xe_cache_level cache)
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>  {
>  	u64 pde;
>  	struct xe_vm *vm = bo->vm;
>  	struct xe_device *xe = vm->xe;
>  
> -
>  	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>  	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>  
>  	if ((vm->pat_encode).pde_encode)
> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>  
>  	return pde;
>  }
>  
> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> +static u64 __pte_encode(u64 pte, u32 pat_index,
>  			struct xe_vma *vma, u32 pt_level)
>  {
>  	struct xe_vm *vm = xe_vma_vm(vma);
> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>  		pte |= XE_PTE_NULL;
>  
>  	if ((vm->pat_encode).pte_encode)
> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>  
>  	if (pt_level == 1)
>  		pte |= XE_PDE_PS_2M;
> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>   *
>   * Return: An encoded page-table entry. No errors.
>   */
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>  		  u32 pt_level)
>  {
>  	u64 pte;
> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>  	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>  		pte |= XE_PPGTT_PTE_DM;
>  
> -	return __pte_encode(pte, cache, NULL, pt_level);
> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>  }
>  
>  static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>  
>  	if (level == 0) {
>  		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> -					  XE_CACHE_WB, 0);
> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>  
>  		return empty;
>  	} else {
>  		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> -				     XE_CACHE_WB);
> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>  	}
>  }
>  
> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>  	struct xe_vm *vm;
>  	/** @tile: The tile we're building for. */
>  	struct xe_tile *tile;
> -	/** @cache: Desired cache level for the ptes */
> -	enum xe_cache_level cache;
>  	/** @default_pte: PTE flag only template. No address is associated */
>  	u64 default_pte;
>  	/** @dma_offset: DMA offset to add to the PTE. */
> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>  
>  		pte = __pte_encode(is_null ? 0 :
>  				   xe_res_dma(curs) + xe_walk->dma_offset,
> -				   xe_walk->cache, xe_walk->vma, level);
> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>  		pte |= xe_walk->default_pte;
>  
>  		/*
> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>  			xe_child->is_compact = true;
>  		}
>  
> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>  		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>  					 pte);
>  	}
> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>  		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>  			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>  		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> -		xe_walk.cache = XE_CACHE_WB;
> -	} else {
> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> -			xe_walk.cache = XE_CACHE_WT;
> -		else
> -			xe_walk.cache = XE_CACHE_WB;
>  	}
> +
>  	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>  		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>  
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 01be7ab08f87..1d433a5a96b4 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>  
>  bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>  
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> -		  const enum xe_cache_level level);
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>  
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> -		  u32 pt_level);
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 7eeeed0411f3..34603a7e84b0 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -6,6 +6,7 @@
>  #include "xe_vm.h"
>  
>  #include <linux/dma-fence-array.h>
> +#include <linux/nospec.h>
>  
>  #include <drm/drm_print.h>
>  #include <drm/ttm/ttm_execbuf_util.h>
> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>  				    u64 start, u64 end,
>  				    bool read_only,
>  				    bool is_null,
> -				    u8 tile_mask)
> +				    u8 tile_mask,
> +				    u32 pat_index)
>  {
>  	struct xe_vma *vma;
>  	struct xe_tile *tile;
> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>  			vma->tile_mask |= 0x1 << id;
>  	}
>  
> +	vma->pat_index = pat_index;
> +
>  	if (vm->xe->info.platform == XE_PVC)
>  		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>  
> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>  static void vm_destroy_work_func(struct work_struct *w);
>  
>  static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pde_pat &= ~(XELP_PDE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>  }
>  
>  static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pte_pat &= ~(XELP_PTE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>  u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>  {
>  	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> -			     XE_CACHE_WB);
> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>  }
>  
>  static struct dma_fence *
> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>  static struct drm_gpuva_ops *
>  vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> -			 u32 operation, u8 tile_mask, u32 region)
> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>  {
>  	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>  	struct ww_acquire_ctx ww;
> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>  
>  			op->tile_mask = tile_mask;
> +			op->pat_index = pat_index;
>  			op->map.immediate =
>  				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>  			op->map.read_only =
> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>  
>  			op->tile_mask = tile_mask;
> +			op->pat_index = pat_index;
>  			op->prefetch.region = region;
>  		}
>  		break;
> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  }
>  
>  static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> -			      u8 tile_mask, bool read_only, bool is_null)
> +			      u8 tile_mask, bool read_only, bool is_null,
> +			      u32 pat_index)
>  {
>  	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>  	struct xe_vma *vma;
> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>  	vma = xe_vma_create(vm, bo, op->gem.offset,
>  			    op->va.addr, op->va.addr +
>  			    op->va.range - 1, read_only, is_null,
> -			    tile_mask);
> +			    tile_mask, pat_index);
>  	if (bo)
>  		xe_bo_unlock(bo, &ww);
>  
> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  				vma = new_vma(vm, &op->base.map,
>  					      op->tile_mask, op->map.read_only,
> -					      op->map.is_null);
> +					      op->map.is_null, op->pat_index);
>  				if (IS_ERR(vma)) {
>  					err = PTR_ERR(vma);
>  					goto free_fence;
> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  					vma = new_vma(vm, op->base.remap.prev,
>  						      op->tile_mask, read_only,
> -						      is_null);
> +						      is_null, op->pat_index);
>  					if (IS_ERR(vma)) {
>  						err = PTR_ERR(vma);
>  						goto free_fence;
> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  					vma = new_vma(vm, op->base.remap.next,
>  						      op->tile_mask, read_only,
> -						      is_null);
> +						      is_null, op->pat_index);
>  					if (IS_ERR(vma)) {
>  						err = PTR_ERR(vma);
>  						goto free_fence;
> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>  		u32 obj = (*bind_ops)[i].obj;
>  		u64 obj_offset = (*bind_ops)[i].obj_offset;
>  		u32 region = (*bind_ops)[i].region;
> +		u32 pat_index = (*bind_ops)[i].pat_index;
>  		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>  
> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> +			err = -EINVAL;
> +			goto free_bind_ops;
> +		}
> +
>  		if (i == 0) {
>  			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>  		} else if (XE_IOCTL_DBG(xe, !*async) ||
> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  		struct drm_gem_object *gem_obj;
>  		u64 range = bind_ops[i].range;
>  		u64 addr = bind_ops[i].addr;
> +		u32 op = bind_ops[i].op;
>  		u32 obj = bind_ops[i].obj;
>  		u64 obj_offset = bind_ops[i].obj_offset;
> +		u32 pat_index = bind_ops[i].pat_index;
> +		u16 coh_mode;
> +
> +		pat_index = array_index_nospec(pat_index,
> +					       xe->info.pat_table_n_entries);
> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {

Assuming we drop the unusable entries from the TGL table, this should be
impossible, right?  Any index that makes it past the n_entries check at
the top of the function should have a valid, non-zero coh_mode value.
So this should probably be an assertion (to highlight a KMD bug) rather
than just a silent uapi failure return.

> +			err = -EINVAL;
> +			goto put_obj;
> +		}
> +
> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
> +			err = -EINVAL;
> +			goto put_obj;
> +		}
>  
>  		if (!obj)
>  			continue;
> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  				goto put_obj;
>  			}
>  		}
> +
> +		if (bos[i]->coh_mode) {
> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> +				err = -EINVAL;
> +				goto put_obj;
> +			}
> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> +			/*
> +			 * Imported dma-buf from a different device should
> +			 * require 1way or 2way coherency since we don't know
> +			 * how it was mapped on CPU. Just assume is it
> +			 * potentially cached on CPU side.
> +			 */
> +			err = -EINVAL;
> +			goto put_obj;
> +		}
>  	}
>  
>  	if (args->num_syncs) {
> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  		u64 obj_offset = bind_ops[i].obj_offset;
>  		u8 tile_mask = bind_ops[i].tile_mask;
>  		u32 region = bind_ops[i].region;
> +		u32 pat_index = bind_ops[i].pat_index;
>  
>  		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>  						  addr, range, op, tile_mask,
> -						  region);
> +						  region, pat_index);
>  		if (IS_ERR(ops[i])) {
>  			err = PTR_ERR(ops[i]);
>  			ops[i] = NULL;
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 83a1f87b6537..508679ed3c74 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -111,6 +111,11 @@ struct xe_vma {
>  	 */
>  	u8 tile_present;
>  
> +	/**
> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
> +	 */
> +	u32 pat_index;
> +
>  	struct {
>  		struct list_head rebind_link;
>  	} notifier;
> @@ -338,10 +343,8 @@ struct xe_vm {
>  	bool batch_invalidate_tlb;
>  
>  	struct {
> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache);
> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> -						enum xe_cache_level cache);
> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>  	} pat_encode;
>  };
>  
> @@ -417,6 +420,8 @@ struct xe_vma_op {
>  	struct async_op_fence *fence;
>  	/** @tile_mask: gt mask for this operation */
>  	u8 tile_mask;
> +	/** @pat_index: The pat index to use for this operation. */
> +	u32 pat_index;
>  	/** @flags: operation flags */
>  	enum xe_vma_op_flags flags;
>  
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 64bc66d4b550..0c15b6f32447 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>  	 */
>  	__u32 obj;
>  
> -	/** @pad: MBZ */
> -	__u32 pad;
> +	/**
> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
> +	 * The index basically maps to some predefined memory attributes,
> +	 * including things like caching, coherency and likely other stuff in
> +	 * the future.  The exact meaning of the pat_index is platform specific

BTW, "other stuff in the future" already includes compression on Xe2, we
just haven't landed the patches for the Xe2 table yet.

> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
> +	 * the index here is encoded into the ppGTT PTE.
> +	 *
> +	 * For coherency the @pat_index needs to match the
> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> +	 * the coherency mode from the @pat_index and reject if there is a
> +	 * mismatch (see note below for pre-MTL platforms).
> +	 *
> +	 * Note: On pre-MTL platforms there is only a caching mode and no
> +	 * explicit coherency mode, but on such hardware there is always a
> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> +	 * CPU caches even with the caching mode set as uncached.  It's only the
> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
> +	 * consistent with newer platforms the KMD groups the different cache
> +	 * levels into the following coherency buckets on all pre-MTL platforms:
> +	 *
> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY

As noted on the previous patch, it seems like 2-way is appropriate for
LLC platforms, but 1-way might be a more accurate description of dGPU
behavior.

> +	 *
> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
> +	 * such platforms since it is only the display engine that is actually
> +	 * incoherent. Everything else should typically use WB given that we

What if we're sharing our buffers with some other (non-GPU) device?  Are
there cases where that other device wouldn't be coherent with the LLC,
so we'd want to use one of these?


Matt

> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
> +	 * @pat_index.
> +	 *
> +	 * Note: For userptr and externally imported dma-buf the kernel expects
> +	 * either 1WAY or 2WAY for the @pat_index.
> +	 */
> +	__u32 pat_index;
>  
>  	union {
>  		/**
> -- 
> 2.41.0
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-29 21:36   ` Matt Roper
@ 2023-08-30  6:38     ` Thomas Hellström
  2023-08-30 19:28       ` Matt Roper
  2023-08-30 11:28     ` Matthew Auld
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Hellström @ 2023-08-30  6:38 UTC (permalink / raw)
  To: Matt Roper, Matthew Auld
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe


On 8/29/23 23:36, Matt Roper wrote:
> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
> Is it actually important to match the coherency mode?  I think one of
> the main goals was to know up front if userspace might be using a
> non-snooping PAT setting that would let it bypass the CPU cache (and
> potentially read old, stale data from a different process if the
> buffer's clear value is still sitting in cache and hasn't landed in
> memory yet).
>
> If that's the only concern, then I think it should still be fine to map
> with a non-matching PAT as long as it's more coherent than the one
> specified at creation, right?  E.g., if the buffer was created with
> 1-way coherency, it would be fine to map it with 2-way because userspace
> still can't use that to observe the previous contents of the buffer.  Or
> if the buffer was created with "non-coherent" then we've already done
> the necessary clflushing in kernel before handing to buffer over to
> userspace to ensure the clear value landed in memory, so any valid PAT
> index should be safe (from a security POV) after that, right?  Any other
> problems that arise from mismatched coherency would just be contained to
> the app possibly shooting itself in the foot, which isn't really our
> concern.

We also have shrinking to keep in mind: If content was generated by the 
GPU using 2-way coherency, and then a copy from the backing page is done 
with the page mapped WC, wouldn't that blow up? I think requiring 
consistency here is reasonable also keeping unforeseen side-effects of 
future HW in mind. Or do you see it as blocking an important use-case?

Thanks,

Thomas


>
>
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
> Maybe we should kill xe_cache_level completely and just assign
> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> gt->mocs.wb_index)?
>
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>>   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>   10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>   
>>   	u64 (*pte_encode)(dma_addr_t addr,
>> -			  unsigned int pat_index,
>> +			  u32 pat_index,
>>   			  u32 flags); /* Create a valid PTE */
>>   #define PTE_READ_ONLY	BIT(0)
>>   #define PTE_LM		BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>>   	/* First part of the test, are we updating our pagetable bo with a new entry? */
>>   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>>   		  0xdeaddeadbeefbeef);
>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>   		expected |= XE_PTE_PS64;
>>   	if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>>   		pte |= XE_GGTT_PTE_DM;
>>   
>>   	if ((ggtt->pat_encode).pte_encode)
>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>   
>>   	return pte;
>>   }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>   }
>>   
>>   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>   
>>   	struct {
>>   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> +				  u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>>   #include "xe_lrc.h"
>>   #include "xe_map.h"
>>   #include "xe_mocs.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt.h"
>>   #include "xe_res_cursor.h"
>>   #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>>   	u32 map_ofs, level, i;
>>   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>   	u64 entry;
>>   	int ret;
>>   
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		return ret;
>>   	}
>>   
>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>   
>>   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>   
>>   	/* Map the entire BO in our level 0 pt */
>>   	for (i = 0, level = 0; i < num_entries; level++) {
>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>   
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		for (i = 0; i < batch->size;
>>   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>   		     XE_PAGE_SIZE) {
>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>   
>>   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>   				  entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   			flags = XE_PDE_64K;
>>   
>>   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> -					XE_PAGE_SIZE, XE_CACHE_WB);
>> +					XE_PAGE_SIZE, pat_index);
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>>   			  entry | flags);
>>   	}
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	/* Write PDE's that point to our BO. */
>>   	for (i = 0; i < num_entries - num_level; i++) {
>>   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> -				      XE_CACHE_WB);
>> +				      pat_index);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>   			  (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>   
>>   			XE_WARN_ON(pt_bo->size != SZ_4K);
>>   
>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   			bb->cs[bb->len++] = lower_32_bits(addr);
>>   			bb->cs[bb->len++] = upper_32_bits(addr);
>>   		}
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>>   #include "xe_gt.h"
>>   #include "xe_gt_tlb_invalidation.h"
>>   #include "xe_migrate.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt_types.h"
>>   #include "xe_pt_walk.h"
>>   #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>>    *
>>    * Return: An encoded page directory entry. No errors.
>>    */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>   {
>>   	u64 pde;
>>   	struct xe_vm *vm = bo->vm;
>>   	struct xe_device *xe = vm->xe;
>>   
>> -
>>   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>   
>>   	if ((vm->pat_encode).pde_encode)
>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>   
>>   	return pde;
>>   }
>>   
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>   			struct xe_vma *vma, u32 pt_level)
>>   {
>>   	struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>   		pte |= XE_PTE_NULL;
>>   
>>   	if ((vm->pat_encode).pte_encode)
>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>   
>>   	if (pt_level == 1)
>>   		pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>    *
>>    * Return: An encoded page-table entry. No errors.
>>    */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>   		  u32 pt_level)
>>   {
>>   	u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>   		pte |= XE_PPGTT_PTE_DM;
>>   
>> -	return __pte_encode(pte, cache, NULL, pt_level);
>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>   }
>>   
>>   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>   
>>   	if (level == 0) {
>>   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> -					  XE_CACHE_WB, 0);
>> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>   
>>   		return empty;
>>   	} else {
>>   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> -				     XE_CACHE_WB);
>> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   	}
>>   }
>>   
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>   	struct xe_vm *vm;
>>   	/** @tile: The tile we're building for. */
>>   	struct xe_tile *tile;
>> -	/** @cache: Desired cache level for the ptes */
>> -	enum xe_cache_level cache;
>>   	/** @default_pte: PTE flag only template. No address is associated */
>>   	u64 default_pte;
>>   	/** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   
>>   		pte = __pte_encode(is_null ? 0 :
>>   				   xe_res_dma(curs) + xe_walk->dma_offset,
>> -				   xe_walk->cache, xe_walk->vma, level);
>> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>>   		pte |= xe_walk->default_pte;
>>   
>>   		/*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   			xe_child->is_compact = true;
>>   		}
>>   
>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>>   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>   					 pte);
>>   	}
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>>   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>   		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> -		xe_walk.cache = XE_CACHE_WB;
>> -	} else {
>> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> -			xe_walk.cache = XE_CACHE_WT;
>> -		else
>> -			xe_walk.cache = XE_CACHE_WB;
>>   	}
>> +
>>   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>   		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>   
>>   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>   
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>   
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> -		  u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>   
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>>   #include "xe_vm.h"
>>   
>>   #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>   
>>   #include <drm/drm_print.h>
>>   #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   				    u64 start, u64 end,
>>   				    bool read_only,
>>   				    bool is_null,
>> -				    u8 tile_mask)
>> +				    u8 tile_mask,
>> +				    u32 pat_index)
>>   {
>>   	struct xe_vma *vma;
>>   	struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   			vma->tile_mask |= 0x1 << id;
>>   	}
>>   
>> +	vma->pat_index = pat_index;
>> +
>>   	if (vm->xe->info.platform == XE_PVC)
>>   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>   
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>>   static void vm_destroy_work_func(struct work_struct *w);
>>   
>>   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>   }
>>   
>>   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>>   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>   {
>>   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> -			     XE_CACHE_WB);
>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   }
>>   
>>   static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>>   static struct drm_gpuva_ops *
>>   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>> -			 u32 operation, u8 tile_mask, u32 region)
>> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>>   {
>>   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>   	struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->map.immediate =
>>   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>   			op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->prefetch.region = region;
>>   		}
>>   		break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   }
>>   
>>   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> -			      u8 tile_mask, bool read_only, bool is_null)
>> +			      u8 tile_mask, bool read_only, bool is_null,
>> +			      u32 pat_index)
>>   {
>>   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>   	struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>   	vma = xe_vma_create(vm, bo, op->gem.offset,
>>   			    op->va.addr, op->va.addr +
>>   			    op->va.range - 1, read_only, is_null,
>> -			    tile_mask);
>> +			    tile_mask, pat_index);
>>   	if (bo)
>>   		xe_bo_unlock(bo, &ww);
>>   
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   				vma = new_vma(vm, &op->base.map,
>>   					      op->tile_mask, op->map.read_only,
>> -					      op->map.is_null);
>> +					      op->map.is_null, op->pat_index);
>>   				if (IS_ERR(vma)) {
>>   					err = PTR_ERR(vma);
>>   					goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.prev,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.next,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>>   		u32 obj = (*bind_ops)[i].obj;
>>   		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>   		u32 region = (*bind_ops)[i].region;
>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>   
>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> +			err = -EINVAL;
>> +			goto free_bind_ops;
>> +		}
>> +
>>   		if (i == 0) {
>>   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>   		} else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		struct drm_gem_object *gem_obj;
>>   		u64 range = bind_ops[i].range;
>>   		u64 addr = bind_ops[i].addr;
>> +		u32 op = bind_ops[i].op;
>>   		u32 obj = bind_ops[i].obj;
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>> +		u32 pat_index = bind_ops[i].pat_index;
>> +		u16 coh_mode;
>> +
>> +		pat_index = array_index_nospec(pat_index,
>> +					       xe->info.pat_table_n_entries);
>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> Assuming we drop the unusable entries from the TGL table, this should be
> impossible, right?  Any index that makes it past the n_entries check at
> the top of the function should have a valid, non-zero coh_mode value.
> So this should probably be an assertion (to highlight a KMD bug) rather
> than just a silent uapi failure return.
>
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>> +
>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   
>>   		if (!obj)
>>   			continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   				goto put_obj;
>>   			}
>>   		}
>> +
>> +		if (bos[i]->coh_mode) {
>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> +				err = -EINVAL;
>> +				goto put_obj;
>> +			}
>> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			/*
>> +			 * Imported dma-buf from a different device should
>> +			 * require 1way or 2way coherency since we don't know
>> +			 * how it was mapped on CPU. Just assume is it
>> +			 * potentially cached on CPU side.
>> +			 */
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   	}
>>   
>>   	if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>>   		u8 tile_mask = bind_ops[i].tile_mask;
>>   		u32 region = bind_ops[i].region;
>> +		u32 pat_index = bind_ops[i].pat_index;
>>   
>>   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>   						  addr, range, op, tile_mask,
>> -						  region);
>> +						  region, pat_index);
>>   		if (IS_ERR(ops[i])) {
>>   			err = PTR_ERR(ops[i]);
>>   			ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>>   	 */
>>   	u8 tile_present;
>>   
>> +	/**
>> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> +	 */
>> +	u32 pat_index;
>> +
>>   	struct {
>>   		struct list_head rebind_link;
>>   	} notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>>   	bool batch_invalidate_tlb;
>>   
>>   	struct {
>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache);
>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>   	struct async_op_fence *fence;
>>   	/** @tile_mask: gt mask for this operation */
>>   	u8 tile_mask;
>> +	/** @pat_index: The pat index to use for this operation. */
>> +	u32 pat_index;
>>   	/** @flags: operation flags */
>>   	enum xe_vma_op_flags flags;
>>   
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>   	 */
>>   	__u32 obj;
>>   
>> -	/** @pad: MBZ */
>> -	__u32 pad;
>> +	/**
>> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
>> +	 * The index basically maps to some predefined memory attributes,
>> +	 * including things like caching, coherency and likely other stuff in
>> +	 * the future.  The exact meaning of the pat_index is platform specific
> BTW, "other stuff in the future" already includes compression on Xe2, we
> just haven't landed the patches for the Xe2 table yet.
>
>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
>> +	 * the index here is encoded into the ppGTT PTE.
>> +	 *
>> +	 * For coherency the @pat_index needs to match the
>> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> +	 * the coherency mode from the @pat_index and reject if there is a
>> +	 * mismatch (see note below for pre-MTL platforms).
>> +	 *
>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>> +	 * explicit coherency mode, but on such hardware there is always a
>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> +	 * CPU caches even with the caching mode set as uncached.  It's only the
>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> +	 * consistent with newer platforms the KMD groups the different cache
>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>> +	 *
>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> As noted on the previous patch, it seems like 2-way is appropriate for
> LLC platforms, but 1-way might be a more accurate description of dGPU
> behavior.
>
>> +	 *
>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>> +	 * such platforms since it is only the display engine that is actually
>> +	 * incoherent. Everything else should typically use WB given that we
> What if we're sharing our buffers with some other (non-GPU) device?  Are
> there cases where that other device wouldn't be coherent with the LLC,
> so we'd want to use one of these?
>
>
> Matt
>
>> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
>> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
>> +	 * @pat_index.
>> +	 *
>> +	 * Note: For userptr and externally imported dma-buf the kernel expects
>> +	 * either 1WAY or 2WAY for the @pat_index.
>> +	 */
>> +	__u32 pat_index;
>>   
>>   	union {
>>   		/**
>> -- 
>> 2.41.0
>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-30  6:38     ` Thomas Hellström
@ 2023-08-30 19:28       ` Matt Roper
  0 siblings, 0 replies; 47+ messages in thread
From: Matt Roper @ 2023-08-30 19:28 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, Matthew Auld, intel-xe

On Wed, Aug 30, 2023 at 08:38:13AM +0200, Thomas Hellström wrote:
> 
> On 8/29/23 23:36, Matt Roper wrote:
> > On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> > > Allow userspace to directly control the pat_index for a given vm
> > > binding. This should allow directly controlling the coherency, caching
> > > and potentially other stuff in the future for the ppGTT binding.
> > > 
> > > The exact meaning behind the pat_index is very platform specific (see
> > > BSpec or PRMs) but effectively maps to some predefined memory
> > > attributes. From the KMD pov we only care about the coherency that is
> > > provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> > > The vm_bind coherency mode for the given pat_index needs to match the
> > > given coh_mode that was set at object creation. For platforms that lack
> > Is it actually important to match the coherency mode?  I think one of
> > the main goals was to know up front if userspace might be using a
> > non-snooping PAT setting that would let it bypass the CPU cache (and
> > potentially read old, stale data from a different process if the
> > buffer's clear value is still sitting in cache and hasn't landed in
> > memory yet).
> > 
> > If that's the only concern, then I think it should still be fine to map
> > with a non-matching PAT as long as it's more coherent than the one
> > specified at creation, right?  E.g., if the buffer was created with
> > 1-way coherency, it would be fine to map it with 2-way because userspace
> > still can't use that to observe the previous contents of the buffer.  Or
> > if the buffer was created with "non-coherent" then we've already done
> > the necessary clflushing in kernel before handing to buffer over to
> > userspace to ensure the clear value landed in memory, so any valid PAT
> > index should be safe (from a security POV) after that, right?  Any other
> > problems that arise from mismatched coherency would just be contained to
> > the app possibly shooting itself in the foot, which isn't really our
> > concern.
> 
> We also have shrinking to keep in mind: If content was generated by the GPU
> using 2-way coherency, and then a copy from the backing page is done with
> the page mapped WC, wouldn't that blow up? I think requiring consistency

My understanding (which could easily be wrong since this isn't an area
I've done much work in) is that on dGPU and non-LLC igpu (MTL):
 - A GPU update done with a 1-way coherent PAT setting will invalidate
   the CPU cache.  Any type of CPU mapping should read the correct
   content at that point, as long as the GPU's caches have been flushed.
 - A GPU update done with a non-coherent PAT setting won't invalidate
   the CPU cache, so CPU reads will need to be CPU:UC or CPU:WC to read
   the latest content from memory (assuming the GPU caches were
   flushed and/or the update bypassed the GPU caches).
 - 2-way coherency isn't actually supported (even though there's a PAT
   index labelled as 2-way on the MTL table, it's misleading and still
   behaves as 1-way for everything except atomic operations).  Maybe
   this changes on LNL; I don't remember off the top of my head right
   now.

I believe we always take care of the necessary GPU flushing
(PIPE_CONTROL or MI_FLUSH_DW) as part of our post-bb handling, so it
seems to me like using a CPU:WC mapping would always be safe on these
platforms, regardless of which PAT setting is selected at vm_bind.

I'm less sure about the exact semantics of the older LLC platforms; it's
a bit hard to find the details for that in the bspec.  But I'm not sure
the PAT really matters much at all on those platforms since the
important cache behavior was mostly defined in the MOCS on those
platforms, and the MOCS would usually override the PAT in cases where
there was overlap.  No matter what vm_bind PAT is chosen, the CPU
handling needs to be conservative enough to work properly no matter what
the MOCS might have done.

> here is reasonable also keeping unforeseen side-effects of future HW in
> mind. Or do you see it as blocking an important use-case?

I don't know of any specific use cases; the userspace guys would
probably have a much better feel for that.  I think the current proposal
here at least avoids the biggest problems we had in the past with i915
(where 3D/media content generation into a buffer needed very different
cache behavior [in the PPGTT] than the display scanout wanted [in the
GGTT] since the two now remain unrelated).

I suppose even if we don't need to be so strict about restricting the
vm_bind-time PAT selection, it doesn't hurt anything as long as the
userspace guys can live with it.  We can always loosen the restrictions
down the road if/when we find it necessary.


Matt

> 
> Thanks,
> 
> Thomas
> 
> 
> > 
> > 
> > > the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> > > 
> > > For userptr mappings we lack a corresponding gem object, so the expected
> > > coherency mode is instead implicit and must fall into either 1WAY or
> > > 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> > > dma-buf (from a different device) the coherency mode is also implicit
> > > and must also be either 1WAY or 2WAY.
> > > 
> > > As part of adding pat_index support with vm_bind we also need stop using
> > > xe_cache_level and instead use the pat_index in various places. We still
> > > make use of xe_cache_level, but only as a convenience for kernel
> > > internal objectsi (internally it maps to some reasonable pat_index). For
> > Maybe we should kill xe_cache_level completely and just assign
> > xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> > appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> > gt->mocs.wb_index)?
> > 
> > > now this is just a 1:1 conversion of the existing code, however for
> > > platforms like MTL+ we might need to give more control through bo_create
> > > or stop using WB on the CPU side if we need CPU access.
> > > 
> > > Bspec: 45101, 44235 #xe
> > > Bspec: 70552, 71582, 59400 #xe2
> > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: José Roberto de Souza <jose.souza@intel.com>
> > > Cc: Filip Hazubski <filip.hazubski@intel.com>
> > > Cc: Carl Zhang <carl.zhang@intel.com>
> > > Cc: Effie Yu <effie.yu@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> > >   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> > >   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> > >   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> > >   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> > >   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> > >   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> > >   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
> > >   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> > >   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> > >   10 files changed, 134 insertions(+), 58 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > index 4d6296cdbcfd..bb4c182048c3 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > @@ -302,7 +302,7 @@ struct i915_address_space {
> > >   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> > >   	u64 (*pte_encode)(dma_addr_t addr,
> > > -			  unsigned int pat_index,
> > > +			  u32 pat_index,
> > >   			  u32 flags); /* Create a valid PTE */
> > >   #define PTE_READ_ONLY	BIT(0)
> > >   #define PTE_LM		BIT(1)
> > > diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > index 5c8d5e78d9bc..7a128fd20a29 100644
> > > --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> > >   	/* First part of the test, are we updating our pagetable bo with a new entry? */
> > >   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> > >   		  0xdeaddeadbeefbeef);
> > > -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> > > +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > >   	if (m->q->vm->flags & XE_VM_FLAG_64K)
> > >   		expected |= XE_PTE_PS64;
> > >   	if (xe_bo_is_vram(pt))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> > > index 209fa053d9fb..4134c26150a5 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt.c
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> > > @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> > >   		pte |= XE_GGTT_PTE_DM;
> > >   	if ((ggtt->pat_encode).pte_encode)
> > > -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> > > +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> > > +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
> > >   	return pte;
> > >   }
> > > @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> > >   }
> > >   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache)
> > > +				     u32 pat_index)
> > >   {
> > > -	u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > >   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> > >   	if (pat_index & BIT(0))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > index 7e55fac1a8a9..0bc40cb072e3 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > @@ -31,7 +31,7 @@ struct xe_ggtt {
> > >   	struct {
> > >   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache);
> > > +				  u32 pat_index);
> > >   	} pat_encode;
> > >   };
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > index a782ea282cb6..54585e98452a 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > @@ -24,6 +24,7 @@
> > >   #include "xe_lrc.h"
> > >   #include "xe_map.h"
> > >   #include "xe_mocs.h"
> > > +#include "xe_pat.h"
> > >   #include "xe_pt.h"
> > >   #include "xe_res_cursor.h"
> > >   #include "xe_sched_job.h"
> > > @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> > >   	u32 map_ofs, level, i;
> > >   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> > > +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> > >   	u64 entry;
> > >   	int ret;
> > > @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   		return ret;
> > >   	}
> > > -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> > > +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> > >   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> > >   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> > >   	/* Map the entire BO in our level 0 pt */
> > >   	for (i = 0, level = 0; i < num_entries; level++) {
> > > -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> > > +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> > >   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> > > @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   		for (i = 0; i < batch->size;
> > >   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> > >   		     XE_PAGE_SIZE) {
> > > -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> > > +			entry = xe_pte_encode(batch, i, pat_index, 0);
> > >   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> > >   				  entry);
> > > @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   			flags = XE_PDE_64K;
> > >   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> > > -					XE_PAGE_SIZE, XE_CACHE_WB);
> > > +					XE_PAGE_SIZE, pat_index);
> > >   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> > >   			  entry | flags);
> > >   	}
> > > @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   	/* Write PDE's that point to our BO. */
> > >   	for (i = 0; i < num_entries - num_level; i++) {
> > >   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> > > -				      XE_CACHE_WB);
> > > +				      pat_index);
> > >   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> > >   			  (i + 1) * 8, u64, entry);
> > > @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
> > >   			XE_WARN_ON(pt_bo->size != SZ_4K);
> > > -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> > > +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > >   			bb->cs[bb->len++] = lower_32_bits(addr);
> > >   			bb->cs[bb->len++] = upper_32_bits(addr);
> > >   		}
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > index 64713f400d94..019af2920078 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -10,6 +10,7 @@
> > >   #include "xe_gt.h"
> > >   #include "xe_gt_tlb_invalidation.h"
> > >   #include "xe_migrate.h"
> > > +#include "xe_pat.h"
> > >   #include "xe_pt_types.h"
> > >   #include "xe_pt_walk.h"
> > >   #include "xe_res_cursor.h"
> > > @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> > >    *
> > >    * Return: An encoded page directory entry. No errors.
> > >    */
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > -		  const enum xe_cache_level cache)
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> > >   {
> > >   	u64 pde;
> > >   	struct xe_vm *vm = bo->vm;
> > >   	struct xe_device *xe = vm->xe;
> > > -
> > >   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> > >   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> > >   	if ((vm->pat_encode).pde_encode)
> > > -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> > > +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> > >   	return pde;
> > >   }
> > > -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > +static u64 __pte_encode(u64 pte, u32 pat_index,
> > >   			struct xe_vma *vma, u32 pt_level)
> > >   {
> > >   	struct xe_vm *vm = xe_vma_vm(vma);
> > > @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > >   		pte |= XE_PTE_NULL;
> > >   	if ((vm->pat_encode).pte_encode)
> > > -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> > > +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> > >   	if (pt_level == 1)
> > >   		pte |= XE_PDE_PS_2M;
> > > @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > >    *
> > >    * Return: An encoded page-table entry. No errors.
> > >    */
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> > >   		  u32 pt_level)
> > >   {
> > >   	u64 pte;
> > > @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > >   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> > >   		pte |= XE_PPGTT_PTE_DM;
> > > -	return __pte_encode(pte, cache, NULL, pt_level);
> > > +	return __pte_encode(pte, pat_index, NULL, pt_level);
> > >   }
> > >   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > > @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > >   	if (level == 0) {
> > >   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> > > -					  XE_CACHE_WB, 0);
> > > +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
> > >   		return empty;
> > >   	} else {
> > >   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> > > -				     XE_CACHE_WB);
> > > +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > >   	}
> > >   }
> > > @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> > >   	struct xe_vm *vm;
> > >   	/** @tile: The tile we're building for. */
> > >   	struct xe_tile *tile;
> > > -	/** @cache: Desired cache level for the ptes */
> > > -	enum xe_cache_level cache;
> > >   	/** @default_pte: PTE flag only template. No address is associated */
> > >   	u64 default_pte;
> > >   	/** @dma_offset: DMA offset to add to the PTE. */
> > > @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > >   		pte = __pte_encode(is_null ? 0 :
> > >   				   xe_res_dma(curs) + xe_walk->dma_offset,
> > > -				   xe_walk->cache, xe_walk->vma, level);
> > > +				   xe_walk->vma->pat_index, xe_walk->vma, level);
> > >   		pte |= xe_walk->default_pte;
> > >   		/*
> > > @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > >   			xe_child->is_compact = true;
> > >   		}
> > > -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> > > +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> > >   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> > >   					 pte);
> > >   	}
> > > @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > >   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> > >   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> > >   		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> > > -		xe_walk.cache = XE_CACHE_WB;
> > > -	} else {
> > > -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> > > -			xe_walk.cache = XE_CACHE_WT;
> > > -		else
> > > -			xe_walk.cache = XE_CACHE_WB;
> > >   	}
> > > +
> > >   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> > >   		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> > > index 01be7ab08f87..1d433a5a96b4 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
> > >   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > -		  const enum xe_cache_level level);
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > -		  u32 pt_level);
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
> > >   #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > index 7eeeed0411f3..34603a7e84b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -6,6 +6,7 @@
> > >   #include "xe_vm.h"
> > >   #include <linux/dma-fence-array.h>
> > > +#include <linux/nospec.h>
> > >   #include <drm/drm_print.h>
> > >   #include <drm/ttm/ttm_execbuf_util.h>
> > > @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > >   				    u64 start, u64 end,
> > >   				    bool read_only,
> > >   				    bool is_null,
> > > -				    u8 tile_mask)
> > > +				    u8 tile_mask,
> > > +				    u32 pat_index)
> > >   {
> > >   	struct xe_vma *vma;
> > >   	struct xe_tile *tile;
> > > @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > >   			vma->tile_mask |= 0x1 << id;
> > >   	}
> > > +	vma->pat_index = pat_index;
> > > +
> > >   	if (vm->xe->info.platform == XE_PVC)
> > >   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> > > @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> > >   static void vm_destroy_work_func(struct work_struct *w);
> > >   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > > -						enum xe_cache_level cache)
> > > +				     u32 pat_index)
> > >   {
> > > -	u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > >   	pde_pat &= ~(XELP_PDE_PAT_MASK);
> > >   	if (pat_index & BIT(0))
> > > @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > >   }
> > >   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache)
> > > +				     u32 pat_index)
> > >   {
> > > -	u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > >   	pte_pat &= ~(XELP_PTE_PAT_MASK);
> > >   	if (pat_index & BIT(0))
> > > @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> > >   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> > >   {
> > >   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> > > -			     XE_CACHE_WB);
> > > +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > >   }
> > >   static struct dma_fence *
> > > @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> > >   static struct drm_gpuva_ops *
> > >   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> > > -			 u32 operation, u8 tile_mask, u32 region)
> > > +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> > >   {
> > >   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> > >   	struct ww_acquire_ctx ww;
> > > @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > >   			op->tile_mask = tile_mask;
> > > +			op->pat_index = pat_index;
> > >   			op->map.immediate =
> > >   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
> > >   			op->map.read_only =
> > > @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > >   			op->tile_mask = tile_mask;
> > > +			op->pat_index = pat_index;
> > >   			op->prefetch.region = region;
> > >   		}
> > >   		break;
> > > @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   }
> > >   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > > -			      u8 tile_mask, bool read_only, bool is_null)
> > > +			      u8 tile_mask, bool read_only, bool is_null,
> > > +			      u32 pat_index)
> > >   {
> > >   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> > >   	struct xe_vma *vma;
> > > @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > >   	vma = xe_vma_create(vm, bo, op->gem.offset,
> > >   			    op->va.addr, op->va.addr +
> > >   			    op->va.range - 1, read_only, is_null,
> > > -			    tile_mask);
> > > +			    tile_mask, pat_index);
> > >   	if (bo)
> > >   		xe_bo_unlock(bo, &ww);
> > > @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >   				vma = new_vma(vm, &op->base.map,
> > >   					      op->tile_mask, op->map.read_only,
> > > -					      op->map.is_null);
> > > +					      op->map.is_null, op->pat_index);
> > >   				if (IS_ERR(vma)) {
> > >   					err = PTR_ERR(vma);
> > >   					goto free_fence;
> > > @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >   					vma = new_vma(vm, op->base.remap.prev,
> > >   						      op->tile_mask, read_only,
> > > -						      is_null);
> > > +						      is_null, op->pat_index);
> > >   					if (IS_ERR(vma)) {
> > >   						err = PTR_ERR(vma);
> > >   						goto free_fence;
> > > @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >   					vma = new_vma(vm, op->base.remap.next,
> > >   						      op->tile_mask, read_only,
> > > -						      is_null);
> > > +						      is_null, op->pat_index);
> > >   					if (IS_ERR(vma)) {
> > >   						err = PTR_ERR(vma);
> > >   						goto free_fence;
> > > @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> > >   		u32 obj = (*bind_ops)[i].obj;
> > >   		u64 obj_offset = (*bind_ops)[i].obj_offset;
> > >   		u32 region = (*bind_ops)[i].region;
> > > +		u32 pat_index = (*bind_ops)[i].pat_index;
> > >   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> > > +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> > > +			err = -EINVAL;
> > > +			goto free_bind_ops;
> > > +		}
> > > +
> > >   		if (i == 0) {
> > >   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> > >   		} else if (XE_IOCTL_DBG(xe, !*async) ||
> > > @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > >   		struct drm_gem_object *gem_obj;
> > >   		u64 range = bind_ops[i].range;
> > >   		u64 addr = bind_ops[i].addr;
> > > +		u32 op = bind_ops[i].op;
> > >   		u32 obj = bind_ops[i].obj;
> > >   		u64 obj_offset = bind_ops[i].obj_offset;
> > > +		u32 pat_index = bind_ops[i].pat_index;
> > > +		u16 coh_mode;
> > > +
> > > +		pat_index = array_index_nospec(pat_index,
> > > +					       xe->info.pat_table_n_entries);
> > > +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> > > +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> > Assuming we drop the unusable entries from the TGL table, this should be
> > impossible, right?  Any index that makes it past the n_entries check at
> > the top of the function should have a valid, non-zero coh_mode value.
> > So this should probably be an assertion (to highlight a KMD bug) rather
> > than just a silent uapi failure return.
> > 
> > > +			err = -EINVAL;
> > > +			goto put_obj;
> > > +		}
> > > +
> > > +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> > > +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > +			err = -EINVAL;
> > > +			goto put_obj;
> > > +		}
> > >   		if (!obj)
> > >   			continue;
> > > @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > >   				goto put_obj;
> > >   			}
> > >   		}
> > > +
> > > +		if (bos[i]->coh_mode) {
> > > +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> > > +				err = -EINVAL;
> > > +				goto put_obj;
> > > +			}
> > > +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > +			/*
> > > +			 * Imported dma-buf from a different device should
> > > +			 * require 1way or 2way coherency since we don't know
> > > +			 * how it was mapped on CPU. Just assume is it
> > > +			 * potentially cached on CPU side.
> > > +			 */
> > > +			err = -EINVAL;
> > > +			goto put_obj;
> > > +		}
> > >   	}
> > >   	if (args->num_syncs) {
> > > @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > >   		u64 obj_offset = bind_ops[i].obj_offset;
> > >   		u8 tile_mask = bind_ops[i].tile_mask;
> > >   		u32 region = bind_ops[i].region;
> > > +		u32 pat_index = bind_ops[i].pat_index;
> > >   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> > >   						  addr, range, op, tile_mask,
> > > -						  region);
> > > +						  region, pat_index);
> > >   		if (IS_ERR(ops[i])) {
> > >   			err = PTR_ERR(ops[i]);
> > >   			ops[i] = NULL;
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index 83a1f87b6537..508679ed3c74 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -111,6 +111,11 @@ struct xe_vma {
> > >   	 */
> > >   	u8 tile_present;
> > > +	/**
> > > +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
> > > +	 */
> > > +	u32 pat_index;
> > > +
> > >   	struct {
> > >   		struct list_head rebind_link;
> > >   	} notifier;
> > > @@ -338,10 +343,8 @@ struct xe_vm {
> > >   	bool batch_invalidate_tlb;
> > >   	struct {
> > > -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache);
> > > -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> > > -						enum xe_cache_level cache);
> > > +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> > > +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> > >   	} pat_encode;
> > >   };
> > > @@ -417,6 +420,8 @@ struct xe_vma_op {
> > >   	struct async_op_fence *fence;
> > >   	/** @tile_mask: gt mask for this operation */
> > >   	u8 tile_mask;
> > > +	/** @pat_index: The pat index to use for this operation. */
> > > +	u32 pat_index;
> > >   	/** @flags: operation flags */
> > >   	enum xe_vma_op_flags flags;
> > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > index 64bc66d4b550..0c15b6f32447 100644
> > > --- a/include/uapi/drm/xe_drm.h
> > > +++ b/include/uapi/drm/xe_drm.h
> > > @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> > >   	 */
> > >   	__u32 obj;
> > > -	/** @pad: MBZ */
> > > -	__u32 pad;
> > > +	/**
> > > +	 * @pat_index: The platform defined @pat_index to use for this mapping.
> > > +	 * The index basically maps to some predefined memory attributes,
> > > +	 * including things like caching, coherency and likely other stuff in
> > > +	 * the future.  The exact meaning of the pat_index is platform specific
> > BTW, "other stuff in the future" already includes compression on Xe2, we
> > just haven't landed the patches for the Xe2 table yet.
> > 
> > > +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
> > > +	 * the index here is encoded into the ppGTT PTE.
> > > +	 *
> > > +	 * For coherency the @pat_index needs to match the
> > > +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> > > +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> > > +	 * the coherency mode from the @pat_index and reject if there is a
> > > +	 * mismatch (see note below for pre-MTL platforms).
> > > +	 *
> > > +	 * Note: On pre-MTL platforms there is only a caching mode and no
> > > +	 * explicit coherency mode, but on such hardware there is always a
> > > +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> > > +	 * CPU caches even with the caching mode set as uncached.  It's only the
> > > +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> > > +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
> > > +	 * consistent with newer platforms the KMD groups the different cache
> > > +	 * levels into the following coherency buckets on all pre-MTL platforms:
> > > +	 *
> > > +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> > > +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> > > +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> > > +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> > As noted on the previous patch, it seems like 2-way is appropriate for
> > LLC platforms, but 1-way might be a more accurate description of dGPU
> > behavior.
> > 
> > > +	 *
> > > +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
> > > +	 * such platforms since it is only the display engine that is actually
> > > +	 * incoherent. Everything else should typically use WB given that we
> > What if we're sharing our buffers with some other (non-GPU) device?  Are
> > there cases where that other device wouldn't be coherent with the LLC,
> > so we'd want to use one of these?
> > 
> > 
> > Matt
> > 
> > > +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
> > > +	 * no shared-LLC) and the HW defines the coherency mode as part of the
> > > +	 * @pat_index.
> > > +	 *
> > > +	 * Note: For userptr and externally imported dma-buf the kernel expects
> > > +	 * either 1WAY or 2WAY for the @pat_index.
> > > +	 */
> > > +	__u32 pat_index;
> > >   	union {
> > >   		/**
> > > -- 
> > > 2.41.0
> > > 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-29 21:36   ` Matt Roper
  2023-08-30  6:38     ` Thomas Hellström
@ 2023-08-30 11:28     ` Matthew Auld
  2023-08-30 15:27       ` Zhang, Carl
  1 sibling, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-08-30 11:28 UTC (permalink / raw)
  To: Matt Roper
  Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
	Effie Yu, intel-xe

On 29/08/2023 22:36, Matt Roper wrote:
> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
> 
> Is it actually important to match the coherency mode?  I think one of
> the main goals was to know up front if userspace might be using a
> non-snooping PAT setting that would let it bypass the CPU cache (and
> potentially read old, stale data from a different process if the
> buffer's clear value is still sitting in cache and hasn't landed in
> memory yet).
> 
> If that's the only concern, then I think it should still be fine to map
> with a non-matching PAT as long as it's more coherent than the one
> specified at creation, right?  E.g., if the buffer was created with
> 1-way coherency, it would be fine to map it with 2-way because userspace
> still can't use that to observe the previous contents of the buffer. Or

Yeah, I guess we could in theory do something that.

> if the buffer was created with "non-coherent" then we've already done
> the necessary clflushing in kernel before handing to buffer over to
> userspace to ensure the clear value landed in memory, so any valid PAT
> index should be safe (from a security POV) after that, right?  Any other
> problems that arise from mismatched coherency would just be contained to
> the app possibly shooting itself in the foot, which isn't really our
> concern.

That is also my understanding, at least from the KMD security pov. If 
you allocate as wb then you must use at least 1way, since there is no 
flushing for clearing or swap-in. For uc/wc you could in theory use 
whatever you want.

> 
> 
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
> 
> Maybe we should kill xe_cache_level completely and just assign
> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> gt->mocs.wb_index)?

OK, seems reasonable to me.

> 
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>>   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>   10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>   
>>   	u64 (*pte_encode)(dma_addr_t addr,
>> -			  unsigned int pat_index,
>> +			  u32 pat_index,
>>   			  u32 flags); /* Create a valid PTE */
>>   #define PTE_READ_ONLY	BIT(0)
>>   #define PTE_LM		BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>>   	/* First part of the test, are we updating our pagetable bo with a new entry? */
>>   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>>   		  0xdeaddeadbeefbeef);
>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>   		expected |= XE_PTE_PS64;
>>   	if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>>   		pte |= XE_GGTT_PTE_DM;
>>   
>>   	if ((ggtt->pat_encode).pte_encode)
>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>   
>>   	return pte;
>>   }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>   }
>>   
>>   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>   
>>   	struct {
>>   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> +				  u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>>   #include "xe_lrc.h"
>>   #include "xe_map.h"
>>   #include "xe_mocs.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt.h"
>>   #include "xe_res_cursor.h"
>>   #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>>   	u32 map_ofs, level, i;
>>   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>   	u64 entry;
>>   	int ret;
>>   
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		return ret;
>>   	}
>>   
>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>   
>>   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>   
>>   	/* Map the entire BO in our level 0 pt */
>>   	for (i = 0, level = 0; i < num_entries; level++) {
>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>   
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		for (i = 0; i < batch->size;
>>   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>   		     XE_PAGE_SIZE) {
>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>   
>>   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>   				  entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   			flags = XE_PDE_64K;
>>   
>>   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> -					XE_PAGE_SIZE, XE_CACHE_WB);
>> +					XE_PAGE_SIZE, pat_index);
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>>   			  entry | flags);
>>   	}
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	/* Write PDE's that point to our BO. */
>>   	for (i = 0; i < num_entries - num_level; i++) {
>>   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> -				      XE_CACHE_WB);
>> +				      pat_index);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>   			  (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>   
>>   			XE_WARN_ON(pt_bo->size != SZ_4K);
>>   
>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   			bb->cs[bb->len++] = lower_32_bits(addr);
>>   			bb->cs[bb->len++] = upper_32_bits(addr);
>>   		}
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>>   #include "xe_gt.h"
>>   #include "xe_gt_tlb_invalidation.h"
>>   #include "xe_migrate.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt_types.h"
>>   #include "xe_pt_walk.h"
>>   #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>>    *
>>    * Return: An encoded page directory entry. No errors.
>>    */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>   {
>>   	u64 pde;
>>   	struct xe_vm *vm = bo->vm;
>>   	struct xe_device *xe = vm->xe;
>>   
>> -
>>   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>   
>>   	if ((vm->pat_encode).pde_encode)
>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>   
>>   	return pde;
>>   }
>>   
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>   			struct xe_vma *vma, u32 pt_level)
>>   {
>>   	struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>   		pte |= XE_PTE_NULL;
>>   
>>   	if ((vm->pat_encode).pte_encode)
>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>   
>>   	if (pt_level == 1)
>>   		pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>    *
>>    * Return: An encoded page-table entry. No errors.
>>    */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>   		  u32 pt_level)
>>   {
>>   	u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>   		pte |= XE_PPGTT_PTE_DM;
>>   
>> -	return __pte_encode(pte, cache, NULL, pt_level);
>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>   }
>>   
>>   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>   
>>   	if (level == 0) {
>>   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> -					  XE_CACHE_WB, 0);
>> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>   
>>   		return empty;
>>   	} else {
>>   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> -				     XE_CACHE_WB);
>> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   	}
>>   }
>>   
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>   	struct xe_vm *vm;
>>   	/** @tile: The tile we're building for. */
>>   	struct xe_tile *tile;
>> -	/** @cache: Desired cache level for the ptes */
>> -	enum xe_cache_level cache;
>>   	/** @default_pte: PTE flag only template. No address is associated */
>>   	u64 default_pte;
>>   	/** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   
>>   		pte = __pte_encode(is_null ? 0 :
>>   				   xe_res_dma(curs) + xe_walk->dma_offset,
>> -				   xe_walk->cache, xe_walk->vma, level);
>> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>>   		pte |= xe_walk->default_pte;
>>   
>>   		/*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   			xe_child->is_compact = true;
>>   		}
>>   
>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>>   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>   					 pte);
>>   	}
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>>   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>   		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> -		xe_walk.cache = XE_CACHE_WB;
>> -	} else {
>> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> -			xe_walk.cache = XE_CACHE_WT;
>> -		else
>> -			xe_walk.cache = XE_CACHE_WB;
>>   	}
>> +
>>   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>   		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>   
>>   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>   
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>   
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> -		  u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>   
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>>   #include "xe_vm.h"
>>   
>>   #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>   
>>   #include <drm/drm_print.h>
>>   #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   				    u64 start, u64 end,
>>   				    bool read_only,
>>   				    bool is_null,
>> -				    u8 tile_mask)
>> +				    u8 tile_mask,
>> +				    u32 pat_index)
>>   {
>>   	struct xe_vma *vma;
>>   	struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   			vma->tile_mask |= 0x1 << id;
>>   	}
>>   
>> +	vma->pat_index = pat_index;
>> +
>>   	if (vm->xe->info.platform == XE_PVC)
>>   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>   
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>>   static void vm_destroy_work_func(struct work_struct *w);
>>   
>>   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>   }
>>   
>>   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>>   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>   {
>>   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> -			     XE_CACHE_WB);
>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   }
>>   
>>   static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>>   static struct drm_gpuva_ops *
>>   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>> -			 u32 operation, u8 tile_mask, u32 region)
>> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>>   {
>>   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>   	struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->map.immediate =
>>   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>   			op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->prefetch.region = region;
>>   		}
>>   		break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   }
>>   
>>   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> -			      u8 tile_mask, bool read_only, bool is_null)
>> +			      u8 tile_mask, bool read_only, bool is_null,
>> +			      u32 pat_index)
>>   {
>>   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>   	struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>   	vma = xe_vma_create(vm, bo, op->gem.offset,
>>   			    op->va.addr, op->va.addr +
>>   			    op->va.range - 1, read_only, is_null,
>> -			    tile_mask);
>> +			    tile_mask, pat_index);
>>   	if (bo)
>>   		xe_bo_unlock(bo, &ww);
>>   
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   				vma = new_vma(vm, &op->base.map,
>>   					      op->tile_mask, op->map.read_only,
>> -					      op->map.is_null);
>> +					      op->map.is_null, op->pat_index);
>>   				if (IS_ERR(vma)) {
>>   					err = PTR_ERR(vma);
>>   					goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.prev,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.next,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>>   		u32 obj = (*bind_ops)[i].obj;
>>   		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>   		u32 region = (*bind_ops)[i].region;
>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>   
>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> +			err = -EINVAL;
>> +			goto free_bind_ops;
>> +		}
>> +
>>   		if (i == 0) {
>>   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>   		} else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		struct drm_gem_object *gem_obj;
>>   		u64 range = bind_ops[i].range;
>>   		u64 addr = bind_ops[i].addr;
>> +		u32 op = bind_ops[i].op;
>>   		u32 obj = bind_ops[i].obj;
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>> +		u32 pat_index = bind_ops[i].pat_index;
>> +		u16 coh_mode;
>> +
>> +		pat_index = array_index_nospec(pat_index,
>> +					       xe->info.pat_table_n_entries);
>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> 
> Assuming we drop the unusable entries from the TGL table, this should be
> impossible, right?  Any index that makes it past the n_entries check at
> the top of the function should have a valid, non-zero coh_mode value.
> So this should probably be an assertion (to highlight a KMD bug) rather
> than just a silent uapi failure return.

Makes sense.

> 
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>> +
>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   
>>   		if (!obj)
>>   			continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   				goto put_obj;
>>   			}
>>   		}
>> +
>> +		if (bos[i]->coh_mode) {
>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> +				err = -EINVAL;
>> +				goto put_obj;
>> +			}
>> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			/*
>> +			 * Imported dma-buf from a different device should
>> +			 * require 1way or 2way coherency since we don't know
>> +			 * how it was mapped on CPU. Just assume is it
>> +			 * potentially cached on CPU side.
>> +			 */
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   	}
>>   
>>   	if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>>   		u8 tile_mask = bind_ops[i].tile_mask;
>>   		u32 region = bind_ops[i].region;
>> +		u32 pat_index = bind_ops[i].pat_index;
>>   
>>   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>   						  addr, range, op, tile_mask,
>> -						  region);
>> +						  region, pat_index);
>>   		if (IS_ERR(ops[i])) {
>>   			err = PTR_ERR(ops[i]);
>>   			ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>>   	 */
>>   	u8 tile_present;
>>   
>> +	/**
>> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> +	 */
>> +	u32 pat_index;
>> +
>>   	struct {
>>   		struct list_head rebind_link;
>>   	} notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>>   	bool batch_invalidate_tlb;
>>   
>>   	struct {
>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache);
>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>   	struct async_op_fence *fence;
>>   	/** @tile_mask: gt mask for this operation */
>>   	u8 tile_mask;
>> +	/** @pat_index: The pat index to use for this operation. */
>> +	u32 pat_index;
>>   	/** @flags: operation flags */
>>   	enum xe_vma_op_flags flags;
>>   
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>   	 */
>>   	__u32 obj;
>>   
>> -	/** @pad: MBZ */
>> -	__u32 pad;
>> +	/**
>> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
>> +	 * The index basically maps to some predefined memory attributes,
>> +	 * including things like caching, coherency and likely other stuff in
>> +	 * the future.  The exact meaning of the pat_index is platform specific
> 
> BTW, "other stuff in the future" already includes compression on Xe2, we
> just haven't landed the patches for the Xe2 table yet.

Ok, good to know.

> 
>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
>> +	 * the index here is encoded into the ppGTT PTE.
>> +	 *
>> +	 * For coherency the @pat_index needs to match the
>> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> +	 * the coherency mode from the @pat_index and reject if there is a
>> +	 * mismatch (see note below for pre-MTL platforms).
>> +	 *
>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>> +	 * explicit coherency mode, but on such hardware there is always a
>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> +	 * CPU caches even with the caching mode set as uncached.  It's only the
>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> +	 * consistent with newer platforms the KMD groups the different cache
>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>> +	 *
>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> 
> As noted on the previous patch, it seems like 2-way is appropriate for
> LLC platforms, but 1-way might be a more accurate description of dGPU
> behavior.
> 
>> +	 *
>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>> +	 * such platforms since it is only the display engine that is actually
>> +	 * incoherent. Everything else should typically use WB given that we
> 
> What if we're sharing our buffers with some other (non-GPU) device?  Are
> there cases where that other device wouldn't be coherent with the LLC,
> so we'd want to use one of these?

Yeah, I guess there might be cases like that. I'll reword.

> 
> 
> Matt
> 
>> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
>> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
>> +	 * @pat_index.
>> +	 *
>> +	 * Note: For userptr and externally imported dma-buf the kernel expects
>> +	 * either 1WAY or 2WAY for the @pat_index.
>> +	 */
>> +	__u32 pat_index;
>>   
>>   	union {
>>   		/**
>> -- 
>> 2.41.0
>>
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-30 11:28     ` Matthew Auld
@ 2023-08-30 15:27       ` Zhang, Carl
  2023-08-30 16:02         ` Matthew Auld
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-08-30 15:27 UTC (permalink / raw)
  To: Auld, Matthew, Roper, Matthew D
  Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
	intel-xe@lists.freedesktop.org

Several questions:
1. the pat_index from vm_bind will override the setting from bo_create?
How to keep the value from bo_create unchanged? 
2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset, will it be handled by KMD automatically?
For example: if set 1-way, it means GPU could snoop CPU cache, we could use WB  in mmap offset
If it is COHERENCY_NONE, we could only use UC,  all these logic is handled by KMD automatically?
3.  about " For imported dma-buf (from a different device) the coherency mode is also implicit
and must also be either 1WAY or 2WAY" 
it means it must be 1way or 2way, and UMD need not to set it?


> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Wednesday, August 30, 2023 7:28 PM
> To: Roper, Matthew D <matthew.d.roper@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi <pallavi.mishra@intel.com>;
> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>; Hazubski,
> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu, Effie
> <effie.yu@intel.com>
> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
> 
> On 29/08/2023 22:36, Matt Roper wrote:
> > On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >> Allow userspace to directly control the pat_index for a given vm
> >> binding. This should allow directly controlling the coherency,
> >> caching and potentially other stuff in the future for the ppGTT binding.
> >>
> >> The exact meaning behind the pat_index is very platform specific (see
> >> BSpec or PRMs) but effectively maps to some predefined memory
> >> attributes. From the KMD pov we only care about the coherency that is
> >> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> >> The vm_bind coherency mode for the given pat_index needs to match the
> >> given coh_mode that was set at object creation. For platforms that
> >> lack
> >
> > Is it actually important to match the coherency mode?  I think one of
> > the main goals was to know up front if userspace might be using a
> > non-snooping PAT setting that would let it bypass the CPU cache (and
> > potentially read old, stale data from a different process if the
> > buffer's clear value is still sitting in cache and hasn't landed in
> > memory yet).
> >
> > If that's the only concern, then I think it should still be fine to
> > map with a non-matching PAT as long as it's more coherent than the one
> > specified at creation, right?  E.g., if the buffer was created with
> > 1-way coherency, it would be fine to map it with 2-way because
> > userspace still can't use that to observe the previous contents of the
> > buffer. Or
> 
> Yeah, I guess we could in theory do something that.
> 
> > if the buffer was created with "non-coherent" then we've already done
> > the necessary clflushing in kernel before handing to buffer over to
> > userspace to ensure the clear value landed in memory, so any valid PAT
> > index should be safe (from a security POV) after that, right?  Any
> > other problems that arise from mismatched coherency would just be
> > contained to the app possibly shooting itself in the foot, which isn't
> > really our concern.
> 
> That is also my understanding, at least from the KMD security pov. If you
> allocate as wb then you must use at least 1way, since there is no flushing for
> clearing or swap-in. For uc/wc you could in theory use whatever you want.
> 
> >
> >
> >> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
> 2WAY.
> >>
> >> For userptr mappings we lack a corresponding gem object, so the expected
> >> coherency mode is instead implicit and must fall into either 1WAY or
> >> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> >> dma-buf (from a different device) the coherency mode is also implicit
> >> and must also be either 1WAY or 2WAY.
> >>
> >> As part of adding pat_index support with vm_bind we also need stop using
> >> xe_cache_level and instead use the pat_index in various places. We still
> >> make use of xe_cache_level, but only as a convenience for kernel
> >> internal objectsi (internally it maps to some reasonable pat_index). For
> >
> > Maybe we should kill xe_cache_level completely and just assign
> > xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> > appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> > gt->mocs.wb_index)?
> 
> OK, seems reasonable to me.
> 
> >
> >> now this is just a 1:1 conversion of the existing code, however for
> >> platforms like MTL+ we might need to give more control through bo_create
> >> or stop using WB on the CPU side if we need CPU access.
> >>
> >> Bspec: 45101, 44235 #xe
> >> Bspec: 70552, 71582, 59400 #xe2
> >> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >> Cc: Matt Roper <matthew.d.roper@intel.com>
> >> Cc: José Roberto de Souza <jose.souza@intel.com>
> >> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >> Cc: Carl Zhang <carl.zhang@intel.com>
> >> Cc: Effie Yu <effie.yu@intel.com>
> >> ---
> >>   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> >>   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> >>   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> >>   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> >>   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> >>   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> >>   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> >>   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
> >>   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> >>   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> >>   10 files changed, 134 insertions(+), 58 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> index 4d6296cdbcfd..bb4c182048c3 100644
> >> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> >>
> >>   	u64 (*pte_encode)(dma_addr_t addr,
> >> -			  unsigned int pat_index,
> >> +			  u32 pat_index,
> >>   			  u32 flags); /* Create a valid PTE */
> >>   #define PTE_READ_ONLY	BIT(0)
> >>   #define PTE_LM		BIT(1)
> >> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> index 5c8d5e78d9bc..7a128fd20a29 100644
> >> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate
> *m, struct kunit *test)
> >>   	/* First part of the test, are we updating our pagetable bo with a new
> entry? */
> >>   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1),
> u64,
> >>   		  0xdeaddeadbeefbeef);
> >> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> XE_CACHE_WB), 0);
> >>   	if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>   		expected |= XE_PTE_PS64;
> >>   	if (xe_bo_is_vram(pt))
> >> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> >> index 209fa053d9fb..4134c26150a5 100644
> >> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
> bo_offset)
> >>   		pte |= XE_GGTT_PTE_DM;
> >>
> >>   	if ((ggtt->pat_encode).pte_encode)
> >> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> XE_CACHE_WB_1_WAY);
> >> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >> +						    xe_pat_get_index(xe,
> XE_CACHE_WB_1_WAY));
> >>
> >>   	return pte;
> >>   }
> >> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> >>   }
> >>
> >>   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache)
> >> +				     u32 pat_index)
> >>   {
> >> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >>   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>
> >>   	if (pat_index & BIT(0))
> >> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> index 7e55fac1a8a9..0bc40cb072e3 100644
> >> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>
> >>   	struct {
> >>   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache);
> >> +				  u32 pat_index);
> >>   	} pat_encode;
> >>   };
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> b/drivers/gpu/drm/xe/xe_migrate.c
> >> index a782ea282cb6..54585e98452a 100644
> >> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >> @@ -24,6 +24,7 @@
> >>   #include "xe_lrc.h"
> >>   #include "xe_map.h"
> >>   #include "xe_mocs.h"
> >> +#include "xe_pat.h"
> >>   #include "xe_pt.h"
> >>   #include "xe_res_cursor.h"
> >>   #include "xe_sched_job.h"
> >> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >>   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
> >level;
> >>   	u32 map_ofs, level, i;
> >>   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>   	u64 entry;
> >>   	int ret;
> >>
> >> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile, struct xe_migrate *m,
> >>   		return ret;
> >>   	}
> >>
> >> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> >> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> >>   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>
> >>   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>
> >>   	/* Map the entire BO in our level 0 pt */
> >>   	for (i = 0, level = 0; i < num_entries; level++) {
> >> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
> 0);
> >> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> >>
> >>   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> >>
> >> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >>   		for (i = 0; i < batch->size;
> >>   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> >>   		     XE_PAGE_SIZE) {
> >> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> >> +			entry = xe_pte_encode(batch, i, pat_index, 0);
> >>
> >>   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >>   				  entry);
> >> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >>   			flags = XE_PDE_64K;
> >>
> >>   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >> -					XE_PAGE_SIZE, XE_CACHE_WB);
> >> +					XE_PAGE_SIZE, pat_index);
> >>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
> u64,
> >>   			  entry | flags);
> >>   	}
> >> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >>   	/* Write PDE's that point to our BO. */
> >>   	for (i = 0; i < num_entries - num_level; i++) {
> >>   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >> -				      XE_CACHE_WB);
> >> +				      pat_index);
> >>
> >>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> >>   			  (i + 1) * 8, u64, entry);
> >> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate
> *m,
> >>
> >>   			XE_WARN_ON(pt_bo->size != SZ_4K);
> >>
> >> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> >> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
> XE_CACHE_WB), 0);
> >>   			bb->cs[bb->len++] = lower_32_bits(addr);
> >>   			bb->cs[bb->len++] = upper_32_bits(addr);
> >>   		}
> >> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> >> index 64713f400d94..019af2920078 100644
> >> --- a/drivers/gpu/drm/xe/xe_pt.c
> >> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >> @@ -10,6 +10,7 @@
> >>   #include "xe_gt.h"
> >>   #include "xe_gt_tlb_invalidation.h"
> >>   #include "xe_migrate.h"
> >> +#include "xe_pat.h"
> >>   #include "xe_pt_types.h"
> >>   #include "xe_pt_walk.h"
> >>   #include "xe_res_cursor.h"
> >> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
> *pt_dir, unsigned int index)
> >>    *
> >>    * Return: An encoded page directory entry. No errors.
> >>    */
> >> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >> -		  const enum xe_cache_level cache)
> >> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> >>   {
> >>   	u64 pde;
> >>   	struct xe_vm *vm = bo->vm;
> >>   	struct xe_device *xe = vm->xe;
> >>
> >> -
> >>   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>
> >>   	if ((vm->pat_encode).pde_encode)
> >> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> >>
> >>   	return pde;
> >>   }
> >>
> >> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>   			struct xe_vma *vma, u32 pt_level)
> >>   {
> >>   	struct xe_vm *vm = xe_vma_vm(vma);
> >> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
> xe_cache_level cache,
> >>   		pte |= XE_PTE_NULL;
> >>
> >>   	if ((vm->pat_encode).pte_encode)
> >> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> >>
> >>   	if (pt_level == 1)
> >>   		pte |= XE_PDE_PS_2M;
> >> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> xe_cache_level cache,
> >>    *
> >>    * Return: An encoded page-table entry. No errors.
> >>    */
> >> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> cache,
> >> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>   		  u32 pt_level)
> >>   {
> >>   	u64 pte;
> >> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset,
> enum xe_cache_level cache,
> >>   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>   		pte |= XE_PPGTT_PTE_DM;
> >>
> >> -	return __pte_encode(pte, cache, NULL, pt_level);
> >> +	return __pte_encode(pte, pat_index, NULL, pt_level);
> >>   }
> >>
> >>   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> >> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile,
> struct xe_vm *vm,
> >>
> >>   	if (level == 0) {
> >>   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >> -					  XE_CACHE_WB, 0);
> >> +					  xe_pat_get_index(vm->xe,
> XE_CACHE_WB), 0);
> >>
> >>   		return empty;
> >>   	} else {
> >>   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> >> -				     XE_CACHE_WB);
> >> +				     xe_pat_get_index(vm->xe,
> XE_CACHE_WB));
> >>   	}
> >>   }
> >>
> >> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>   	struct xe_vm *vm;
> >>   	/** @tile: The tile we're building for. */
> >>   	struct xe_tile *tile;
> >> -	/** @cache: Desired cache level for the ptes */
> >> -	enum xe_cache_level cache;
> >>   	/** @default_pte: PTE flag only template. No address is associated */
> >>   	u64 default_pte;
> >>   	/** @dma_offset: DMA offset to add to the PTE. */
> >> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> pgoff_t offset,
> >>
> >>   		pte = __pte_encode(is_null ? 0 :
> >>   				   xe_res_dma(curs) + xe_walk->dma_offset,
> >> -				   xe_walk->cache, xe_walk->vma, level);
> >> +				   xe_walk->vma->pat_index, xe_walk->vma,
> level);
> >>   		pte |= xe_walk->default_pte;
> >>
> >>   		/*
> >> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> pgoff_t offset,
> >>   			xe_child->is_compact = true;
> >>   		}
> >>
> >> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> >> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >pat_index) | flags;
> >>   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> >>   					 pte);
> >>   	}
> >> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma
> *vma,
> >>   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> >>   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> >>   		xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >ttm.resource);
> >> -		xe_walk.cache = XE_CACHE_WB;
> >> -	} else {
> >> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
> XE_BO_SCANOUT_BIT)
> >> -			xe_walk.cache = XE_CACHE_WT;
> >> -		else
> >> -			xe_walk.cache = XE_CACHE_WB;
> >>   	}
> >> +
> >>   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>   		xe_walk.dma_offset =
> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> >> index 01be7ab08f87..1d433a5a96b4 100644
> >> --- a/drivers/gpu/drm/xe/xe_pt.h
> >> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
> xe_vma *vma, struct xe_exec_queu
> >>
> >>   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> >>
> >> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >> -		  const enum xe_cache_level level);
> >> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> >>
> >> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> cache,
> >> -		  u32 pt_level);
> >> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
> pt_level);
> >>
> >>   #endif
> >> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> >> index 7eeeed0411f3..34603a7e84b0 100644
> >> --- a/drivers/gpu/drm/xe/xe_vm.c
> >> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >> @@ -6,6 +6,7 @@
> >>   #include "xe_vm.h"
> >>
> >>   #include <linux/dma-fence-array.h>
> >> +#include <linux/nospec.h>
> >>
> >>   #include <drm/drm_print.h>
> >>   #include <drm/ttm/ttm_execbuf_util.h>
> >> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> >>   				    u64 start, u64 end,
> >>   				    bool read_only,
> >>   				    bool is_null,
> >> -				    u8 tile_mask)
> >> +				    u8 tile_mask,
> >> +				    u32 pat_index)
> >>   {
> >>   	struct xe_vma *vma;
> >>   	struct xe_tile *tile;
> >> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> >>   			vma->tile_mask |= 0x1 << id;
> >>   	}
> >>
> >> +	vma->pat_index = pat_index;
> >> +
> >>   	if (vm->xe->info.platform == XE_PVC)
> >>   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>
> >> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
> work_struct *w);
> >>   static void vm_destroy_work_func(struct work_struct *w);
> >>
> >>   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> >> -						enum xe_cache_level cache)
> >> +				     u32 pat_index)
> >>   {
> >> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >>   	pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>
> >>   	if (pat_index & BIT(0))
> >> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct
> xe_device *xe, u64 pde_pat,
> >>   }
> >>
> >>   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache)
> >> +				     u32 pat_index)
> >>   {
> >> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >>   	pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>
> >>   	if (pat_index & BIT(0))
> >> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef,
> u32 id)
> >>   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> >>   {
> >>   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >> -			     XE_CACHE_WB);
> >> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> >>   }
> >>
> >>   static struct dma_fence *
> >> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct
> drm_gpuva_op *op)
> >>   static struct drm_gpuva_ops *
> >>   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> >> -			 u32 operation, u8 tile_mask, u32 region)
> >> +			 u32 operation, u8 tile_mask, u32 region, u32
> pat_index)
> >>   {
> >>   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>   	struct ww_acquire_ctx ww;
> >> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> struct xe_bo *bo,
> >>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>
> >>   			op->tile_mask = tile_mask;
> >> +			op->pat_index = pat_index;
> >>   			op->map.immediate =
> >>   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
> >>   			op->map.read_only =
> >> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> struct xe_bo *bo,
> >>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>
> >>   			op->tile_mask = tile_mask;
> >> +			op->pat_index = pat_index;
> >>   			op->prefetch.region = region;
> >>   		}
> >>   		break;
> >> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> struct xe_bo *bo,
> >>   }
> >>
> >>   static struct xe_vma *new_vma(struct xe_vm *vm, struct
> drm_gpuva_op_map *op,
> >> -			      u8 tile_mask, bool read_only, bool is_null)
> >> +			      u8 tile_mask, bool read_only, bool is_null,
> >> +			      u32 pat_index)
> >>   {
> >>   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> >>   	struct xe_vma *vma;
> >> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
> *vm, struct drm_gpuva_op_map *op,
> >>   	vma = xe_vma_create(vm, bo, op->gem.offset,
> >>   			    op->va.addr, op->va.addr +
> >>   			    op->va.range - 1, read_only, is_null,
> >> -			    tile_mask);
> >> +			    tile_mask, pat_index);
> >>   	if (bo)
> >>   		xe_bo_unlock(bo, &ww);
> >>
> >> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct xe_exec_queue *q,
> >>
> >>   				vma = new_vma(vm, &op->base.map,
> >>   					      op->tile_mask, op->map.read_only,
> >> -					      op->map.is_null);
> >> +					      op->map.is_null, op->pat_index);
> >>   				if (IS_ERR(vma)) {
> >>   					err = PTR_ERR(vma);
> >>   					goto free_fence;
> >> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct xe_exec_queue *q,
> >>
> >>   					vma = new_vma(vm, op-
> >base.remap.prev,
> >>   						      op->tile_mask, read_only,
> >> -						      is_null);
> >> +						      is_null, op->pat_index);
> >>   					if (IS_ERR(vma)) {
> >>   						err = PTR_ERR(vma);
> >>   						goto free_fence;
> >> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct xe_exec_queue *q,
> >>
> >>   					vma = new_vma(vm, op-
> >base.remap.next,
> >>   						      op->tile_mask, read_only,
> >> -						      is_null);
> >> +						      is_null, op->pat_index);
> >>   					if (IS_ERR(vma)) {
> >>   						err = PTR_ERR(vma);
> >>   						goto free_fence;
> >> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
> xe_device *xe,
> >>   		u32 obj = (*bind_ops)[i].obj;
> >>   		u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>   		u32 region = (*bind_ops)[i].region;
> >> +		u32 pat_index = (*bind_ops)[i].pat_index;
> >>   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>
> >> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >info.pat_table_n_entries)) {
> >> +			err = -EINVAL;
> >> +			goto free_bind_ops;
> >> +		}
> >> +
> >>   		if (i == 0) {
> >>   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> >>   		} else if (XE_IOCTL_DBG(xe, !*async) ||
> >> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> void *data, struct drm_file *file)
> >>   		struct drm_gem_object *gem_obj;
> >>   		u64 range = bind_ops[i].range;
> >>   		u64 addr = bind_ops[i].addr;
> >> +		u32 op = bind_ops[i].op;
> >>   		u32 obj = bind_ops[i].obj;
> >>   		u64 obj_offset = bind_ops[i].obj_offset;
> >> +		u32 pat_index = bind_ops[i].pat_index;
> >> +		u16 coh_mode;
> >> +
> >> +		pat_index = array_index_nospec(pat_index,
> >> +					       xe->info.pat_table_n_entries);
> >> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> >> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >
> > Assuming we drop the unusable entries from the TGL table, this should be
> > impossible, right?  Any index that makes it past the n_entries check at
> > the top of the function should have a valid, non-zero coh_mode value.
> > So this should probably be an assertion (to highlight a KMD bug) rather
> > than just a silent uapi failure return.
> 
> Makes sense.
> 
> >
> >> +			err = -EINVAL;
> >> +			goto put_obj;
> >> +		}
> >> +
> >> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> XE_VM_BIND_OP_MAP_USERPTR &&
> >> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
> >> +			err = -EINVAL;
> >> +			goto put_obj;
> >> +		}
> >>
> >>   		if (!obj)
> >>   			continue;
> >> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> void *data, struct drm_file *file)
> >>   				goto put_obj;
> >>   			}
> >>   		}
> >> +
> >> +		if (bos[i]->coh_mode) {
> >> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> coh_mode)) {
> >> +				err = -EINVAL;
> >> +				goto put_obj;
> >> +			}
> >> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
> XE_GEM_COHERENCY_NONE)) {
> >> +			/*
> >> +			 * Imported dma-buf from a different device should
> >> +			 * require 1way or 2way coherency since we don't
> know
> >> +			 * how it was mapped on CPU. Just assume is it
> >> +			 * potentially cached on CPU side.
> >> +			 */
> >> +			err = -EINVAL;
> >> +			goto put_obj;
> >> +		}
> >>   	}
> >>
> >>   	if (args->num_syncs) {
> >> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> void *data, struct drm_file *file)
> >>   		u64 obj_offset = bind_ops[i].obj_offset;
> >>   		u8 tile_mask = bind_ops[i].tile_mask;
> >>   		u32 region = bind_ops[i].region;
> >> +		u32 pat_index = bind_ops[i].pat_index;
> >>
> >>   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> >>   						  addr, range, op, tile_mask,
> >> -						  region);
> >> +						  region, pat_index);
> >>   		if (IS_ERR(ops[i])) {
> >>   			err = PTR_ERR(ops[i]);
> >>   			ops[i] = NULL;
> >> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> b/drivers/gpu/drm/xe/xe_vm_types.h
> >> index 83a1f87b6537..508679ed3c74 100644
> >> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >> @@ -111,6 +111,11 @@ struct xe_vma {
> >>   	 */
> >>   	u8 tile_present;
> >>
> >> +	/**
> >> +	 * @pat_index: The pat index to use when encoding the PTEs for this
> vma.
> >> +	 */
> >> +	u32 pat_index;
> >> +
> >>   	struct {
> >>   		struct list_head rebind_link;
> >>   	} notifier;
> >> @@ -338,10 +343,8 @@ struct xe_vm {
> >>   	bool batch_invalidate_tlb;
> >>
> >>   	struct {
> >> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache);
> >> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> >> -						enum xe_cache_level cache);
> >> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
> pat_index);
> >> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
> pat_index);
> >>   	} pat_encode;
> >>   };
> >>
> >> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>   	struct async_op_fence *fence;
> >>   	/** @tile_mask: gt mask for this operation */
> >>   	u8 tile_mask;
> >> +	/** @pat_index: The pat index to use for this operation. */
> >> +	u32 pat_index;
> >>   	/** @flags: operation flags */
> >>   	enum xe_vma_op_flags flags;
> >>
> >> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> >> index 64bc66d4b550..0c15b6f32447 100644
> >> --- a/include/uapi/drm/xe_drm.h
> >> +++ b/include/uapi/drm/xe_drm.h
> >> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>   	 */
> >>   	__u32 obj;
> >>
> >> -	/** @pad: MBZ */
> >> -	__u32 pad;
> >> +	/**
> >> +	 * @pat_index: The platform defined @pat_index to use for this
> mapping.
> >> +	 * The index basically maps to some predefined memory attributes,
> >> +	 * including things like caching, coherency and likely other stuff in
> >> +	 * the future.  The exact meaning of the pat_index is platform specific
> >
> > BTW, "other stuff in the future" already includes compression on Xe2, we
> > just haven't landed the patches for the Xe2 table yet.
> 
> Ok, good to know.
> 
> >
> >> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the
> binding
> >> +	 * the index here is encoded into the ppGTT PTE.
> >> +	 *
> >> +	 * For coherency the @pat_index needs to match the
> >> +	 * drm_xe_gem_create.coh_mode, so either
> XE_GEM_COHERENCY_NONE,
> >> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
> KMD will extract
> >> +	 * the coherency mode from the @pat_index and reject if there is a
> >> +	 * mismatch (see note below for pre-MTL platforms).
> >> +	 *
> >> +	 * Note: On pre-MTL platforms there is only a caching mode and no
> >> +	 * explicit coherency mode, but on such hardware there is always a
> >> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> >> +	 * CPU caches even with the caching mode set as uncached.  It's only
> the
> >> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> >> +	 * is always mapped as WC on the CPU). However to keep the uapi
> somewhat
> >> +	 * consistent with newer platforms the KMD groups the different cache
> >> +	 * levels into the following coherency buckets on all pre-MTL platforms:
> >> +	 *
> >> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> >> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> >> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> >> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >
> > As noted on the previous patch, it seems like 2-way is appropriate for
> > LLC platforms, but 1-way might be a more accurate description of dGPU
> > behavior.
> >
> >> +	 *
> >> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
> >> +	 * such platforms since it is only the display engine that is actually
> >> +	 * incoherent. Everything else should typically use WB given that we
> >
> > What if we're sharing our buffers with some other (non-GPU) device?  Are
> > there cases where that other device wouldn't be coherent with the LLC,
> > so we'd want to use one of these?
> 
> Yeah, I guess there might be cases like that. I'll reword.
> 
> >
> >
> > Matt
> >
> >> +	 * have a shared-LLC.  On MTL+ this completely changes (also
> potentially
> >> +	 * no shared-LLC) and the HW defines the coherency mode as part of
> the
> >> +	 * @pat_index.
> >> +	 *
> >> +	 * Note: For userptr and externally imported dma-buf the kernel
> expects
> >> +	 * either 1WAY or 2WAY for the @pat_index.
> >> +	 */
> >> +	__u32 pat_index;
> >>
> >>   	union {
> >>   		/**
> >> --
> >> 2.41.0
> >>
> >

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-30 15:27       ` Zhang, Carl
@ 2023-08-30 16:02         ` Matthew Auld
  2023-08-31  8:24           ` Zhang, Carl
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-08-30 16:02 UTC (permalink / raw)
  To: Zhang, Carl, Roper, Matthew D
  Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
	intel-xe@lists.freedesktop.org

On 30/08/2023 16:27, Zhang, Carl wrote:
> Several questions:
> 1. the pat_index from vm_bind will override the setting from bo_create?
> How to keep the value from bo_create unchanged?

You only set the pat_index during vm_bind. At gem_create you just need 
tell the kernel what CPU side caching mode to use for system memory 
(wb/wc/uc), plus the expected GPU coherency mode of the pat_index. The 
current expectation is that the coherency mode of the pat_index should 
match the one at gem_create.

> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset, will it be handled by KMD automatically?
> For example: if set 1-way, it means GPU could snoop CPU cache, we could use WB  in mmap offset
> If it is COHERENCY_NONE, we could only use UC,  all these logic is handled by KMD automatically?

Yes, the mmap will use the same CPU side caching mode that you set at 
gem_create with smem_caching (wb/wc/uc).

> 3.  about " For imported dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY"
> it means it must be 1way or 2way, and UMD need not to set it?

If you are going to vm_bind it you need to supply the pat_index and in 
this case it needs to be either 1way or 2way. The object comes from a 
different device so we might not have smem_caching/coh_mode like we do 
for native objects.

> 
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Wednesday, August 30, 2023 7:28 PM
>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi <pallavi.mishra@intel.com>;
>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
>> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>; Hazubski,
>> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu, Effie
>> <effie.yu@intel.com>
>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
>>
>> On 29/08/2023 22:36, Matt Roper wrote:
>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>> Allow userspace to directly control the pat_index for a given vm
>>>> binding. This should allow directly controlling the coherency,
>>>> caching and potentially other stuff in the future for the ppGTT binding.
>>>>
>>>> The exact meaning behind the pat_index is very platform specific (see
>>>> BSpec or PRMs) but effectively maps to some predefined memory
>>>> attributes. From the KMD pov we only care about the coherency that is
>>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>>>> The vm_bind coherency mode for the given pat_index needs to match the
>>>> given coh_mode that was set at object creation. For platforms that
>>>> lack
>>>
>>> Is it actually important to match the coherency mode?  I think one of
>>> the main goals was to know up front if userspace might be using a
>>> non-snooping PAT setting that would let it bypass the CPU cache (and
>>> potentially read old, stale data from a different process if the
>>> buffer's clear value is still sitting in cache and hasn't landed in
>>> memory yet).
>>>
>>> If that's the only concern, then I think it should still be fine to
>>> map with a non-matching PAT as long as it's more coherent than the one
>>> specified at creation, right?  E.g., if the buffer was created with
>>> 1-way coherency, it would be fine to map it with 2-way because
>>> userspace still can't use that to observe the previous contents of the
>>> buffer. Or
>>
>> Yeah, I guess we could in theory do something that.
>>
>>> if the buffer was created with "non-coherent" then we've already done
>>> the necessary clflushing in kernel before handing to buffer over to
>>> userspace to ensure the clear value landed in memory, so any valid PAT
>>> index should be safe (from a security POV) after that, right?  Any
>>> other problems that arise from mismatched coherency would just be
>>> contained to the app possibly shooting itself in the foot, which isn't
>>> really our concern.
>>
>> That is also my understanding, at least from the KMD security pov. If you
>> allocate as wb then you must use at least 1way, since there is no flushing for
>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
>>
>>>
>>>
>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
>> 2WAY.
>>>>
>>>> For userptr mappings we lack a corresponding gem object, so the expected
>>>> coherency mode is instead implicit and must fall into either 1WAY or
>>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>>>> dma-buf (from a different device) the coherency mode is also implicit
>>>> and must also be either 1WAY or 2WAY.
>>>>
>>>> As part of adding pat_index support with vm_bind we also need stop using
>>>> xe_cache_level and instead use the pat_index in various places. We still
>>>> make use of xe_cache_level, but only as a convenience for kernel
>>>> internal objectsi (internally it maps to some reasonable pat_index). For
>>>
>>> Maybe we should kill xe_cache_level completely and just assign
>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
>>> appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
>>> gt->mocs.wb_index)?
>>
>> OK, seems reasonable to me.
>>
>>>
>>>> now this is just a 1:1 conversion of the existing code, however for
>>>> platforms like MTL+ we might need to give more control through bo_create
>>>> or stop using WB on the CPU side if we need CPU access.
>>>>
>>>> Bspec: 45101, 44235 #xe
>>>> Bspec: 70552, 71582, 59400 #xe2
>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>>>    drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>>>    drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>>>    drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>>>    drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>>>    drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>>>    drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>>>    drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>>>>    drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>>>    include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>>>    10 files changed, 134 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>    		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>>>
>>>>    	u64 (*pte_encode)(dma_addr_t addr,
>>>> -			  unsigned int pat_index,
>>>> +			  u32 pat_index,
>>>>    			  u32 flags); /* Create a valid PTE */
>>>>    #define PTE_READ_ONLY	BIT(0)
>>>>    #define PTE_LM		BIT(1)
>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate
>> *m, struct kunit *test)
>>>>    	/* First part of the test, are we updating our pagetable bo with a new
>> entry? */
>>>>    	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1),
>> u64,
>>>>    		  0xdeaddeadbeefbeef);
>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>> XE_CACHE_WB), 0);
>>>>    	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>    		expected |= XE_PTE_PS64;
>>>>    	if (xe_bo_is_vram(pt))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> index 209fa053d9fb..4134c26150a5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
>> bo_offset)
>>>>    		pte |= XE_GGTT_PTE_DM;
>>>>
>>>>    	if ((ggtt->pat_encode).pte_encode)
>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> XE_CACHE_WB_1_WAY);
>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>> +						    xe_pat_get_index(xe,
>> XE_CACHE_WB_1_WAY));
>>>>
>>>>    	return pte;
>>>>    }
>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>>    }
>>>>
>>>>    static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache)
>>>> +				     u32 pat_index)
>>>>    {
>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>>    	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>
>>>>    	if (pat_index & BIT(0))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>
>>>>    	struct {
>>>>    		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache);
>>>> +				  u32 pat_index);
>>>>    	} pat_encode;
>>>>    };
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>> index a782ea282cb6..54585e98452a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>> @@ -24,6 +24,7 @@
>>>>    #include "xe_lrc.h"
>>>>    #include "xe_map.h"
>>>>    #include "xe_mocs.h"
>>>> +#include "xe_pat.h"
>>>>    #include "xe_pt.h"
>>>>    #include "xe_res_cursor.h"
>>>>    #include "xe_sched_job.h"
>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>>    	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
>>> level;
>>>>    	u32 map_ofs, level, i;
>>>>    	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>    	u64 entry;
>>>>    	int ret;
>>>>
>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile, struct xe_migrate *m,
>>>>    		return ret;
>>>>    	}
>>>>
>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>>>    	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>
>>>>    	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>
>>>>    	/* Map the entire BO in our level 0 pt */
>>>>    	for (i = 0, level = 0; i < num_entries; level++) {
>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
>> 0);
>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>>>
>>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>>>
>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>>    		for (i = 0; i < batch->size;
>>>>    		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>>>    		     XE_PAGE_SIZE) {
>>>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>
>>>>    			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>>    				  entry);
>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>>    			flags = XE_PDE_64K;
>>>>
>>>>    		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>> -					XE_PAGE_SIZE, XE_CACHE_WB);
>>>> +					XE_PAGE_SIZE, pat_index);
>>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
>> u64,
>>>>    			  entry | flags);
>>>>    	}
>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>>    	/* Write PDE's that point to our BO. */
>>>>    	for (i = 0; i < num_entries - num_level; i++) {
>>>>    		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>> -				      XE_CACHE_WB);
>>>> +				      pat_index);
>>>>
>>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>>>    			  (i + 1) * 8, u64, entry);
>>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate
>> *m,
>>>>
>>>>    			XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>
>>>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>>>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
>> XE_CACHE_WB), 0);
>>>>    			bb->cs[bb->len++] = lower_32_bits(addr);
>>>>    			bb->cs[bb->len++] = upper_32_bits(addr);
>>>>    		}
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>>>> index 64713f400d94..019af2920078 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>> @@ -10,6 +10,7 @@
>>>>    #include "xe_gt.h"
>>>>    #include "xe_gt_tlb_invalidation.h"
>>>>    #include "xe_migrate.h"
>>>> +#include "xe_pat.h"
>>>>    #include "xe_pt_types.h"
>>>>    #include "xe_pt_walk.h"
>>>>    #include "xe_res_cursor.h"
>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
>> *pt_dir, unsigned int index)
>>>>     *
>>>>     * Return: An encoded page directory entry. No errors.
>>>>     */
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> -		  const enum xe_cache_level cache)
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>>>    {
>>>>    	u64 pde;
>>>>    	struct xe_vm *vm = bo->vm;
>>>>    	struct xe_device *xe = vm->xe;
>>>>
>>>> -
>>>>    	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>    	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>
>>>>    	if ((vm->pat_encode).pde_encode)
>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>>>
>>>>    	return pde;
>>>>    }
>>>>
>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>    			struct xe_vma *vma, u32 pt_level)
>>>>    {
>>>>    	struct xe_vm *vm = xe_vma_vm(vma);
>>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
>> xe_cache_level cache,
>>>>    		pte |= XE_PTE_NULL;
>>>>
>>>>    	if ((vm->pat_encode).pte_encode)
>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>>>
>>>>    	if (pt_level == 1)
>>>>    		pte |= XE_PDE_PS_2M;
>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>> xe_cache_level cache,
>>>>     *
>>>>     * Return: An encoded page-table entry. No errors.
>>>>     */
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>> cache,
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>    		  u32 pt_level)
>>>>    {
>>>>    	u64 pte;
>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset,
>> enum xe_cache_level cache,
>>>>    	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>    		pte |= XE_PPGTT_PTE_DM;
>>>>
>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>    }
>>>>
>>>>    static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile,
>> struct xe_vm *vm,
>>>>
>>>>    	if (level == 0) {
>>>>    		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>> -					  XE_CACHE_WB, 0);
>>>> +					  xe_pat_get_index(vm->xe,
>> XE_CACHE_WB), 0);
>>>>
>>>>    		return empty;
>>>>    	} else {
>>>>    		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>>>> -				     XE_CACHE_WB);
>>>> +				     xe_pat_get_index(vm->xe,
>> XE_CACHE_WB));
>>>>    	}
>>>>    }
>>>>
>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>    	struct xe_vm *vm;
>>>>    	/** @tile: The tile we're building for. */
>>>>    	struct xe_tile *tile;
>>>> -	/** @cache: Desired cache level for the ptes */
>>>> -	enum xe_cache_level cache;
>>>>    	/** @default_pte: PTE flag only template. No address is associated */
>>>>    	u64 default_pte;
>>>>    	/** @dma_offset: DMA offset to add to the PTE. */
>>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>> pgoff_t offset,
>>>>
>>>>    		pte = __pte_encode(is_null ? 0 :
>>>>    				   xe_res_dma(curs) + xe_walk->dma_offset,
>>>> -				   xe_walk->cache, xe_walk->vma, level);
>>>> +				   xe_walk->vma->pat_index, xe_walk->vma,
>> level);
>>>>    		pte |= xe_walk->default_pte;
>>>>
>>>>    		/*
>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>> pgoff_t offset,
>>>>    			xe_child->is_compact = true;
>>>>    		}
>>>>
>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>> pat_index) | flags;
>>>>    		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>>>    					 pte);
>>>>    	}
>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma
>> *vma,
>>>>    		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>>>    			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>>>    		xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>> ttm.resource);
>>>> -		xe_walk.cache = XE_CACHE_WB;
>>>> -	} else {
>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
>> XE_BO_SCANOUT_BIT)
>>>> -			xe_walk.cache = XE_CACHE_WT;
>>>> -		else
>>>> -			xe_walk.cache = XE_CACHE_WB;
>>>>    	}
>>>> +
>>>>    	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>    		xe_walk.dma_offset =
>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>>>> index 01be7ab08f87..1d433a5a96b4 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
>> xe_vma *vma, struct xe_exec_queu
>>>>
>>>>    bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>>>
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> -		  const enum xe_cache_level level);
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>>>
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>> cache,
>>>> -		  u32 pt_level);
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
>> pt_level);
>>>>
>>>>    #endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>> index 7eeeed0411f3..34603a7e84b0 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>> @@ -6,6 +6,7 @@
>>>>    #include "xe_vm.h"
>>>>
>>>>    #include <linux/dma-fence-array.h>
>>>> +#include <linux/nospec.h>
>>>>
>>>>    #include <drm/drm_print.h>
>>>>    #include <drm/ttm/ttm_execbuf_util.h>
>>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
>> *vm,
>>>>    				    u64 start, u64 end,
>>>>    				    bool read_only,
>>>>    				    bool is_null,
>>>> -				    u8 tile_mask)
>>>> +				    u8 tile_mask,
>>>> +				    u32 pat_index)
>>>>    {
>>>>    	struct xe_vma *vma;
>>>>    	struct xe_tile *tile;
>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
>> *vm,
>>>>    			vma->tile_mask |= 0x1 << id;
>>>>    	}
>>>>
>>>> +	vma->pat_index = pat_index;
>>>> +
>>>>    	if (vm->xe->info.platform == XE_PVC)
>>>>    		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>
>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
>> work_struct *w);
>>>>    static void vm_destroy_work_func(struct work_struct *w);
>>>>
>>>>    static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>>> -						enum xe_cache_level cache)
>>>> +				     u32 pat_index)
>>>>    {
>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>>    	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>
>>>>    	if (pat_index & BIT(0))
>>>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct
>> xe_device *xe, u64 pde_pat,
>>>>    }
>>>>
>>>>    static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache)
>>>> +				     u32 pat_index)
>>>>    {
>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>>    	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>
>>>>    	if (pat_index & BIT(0))
>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef,
>> u32 id)
>>>>    u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>>    {
>>>>    	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>> -			     XE_CACHE_WB);
>>>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>>    }
>>>>
>>>>    static struct dma_fence *
>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct
>> drm_gpuva_op *op)
>>>>    static struct drm_gpuva_ops *
>>>>    vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>    			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>>>> -			 u32 operation, u8 tile_mask, u32 region)
>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
>> pat_index)
>>>>    {
>>>>    	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>    	struct ww_acquire_ctx ww;
>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>> struct xe_bo *bo,
>>>>    			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>
>>>>    			op->tile_mask = tile_mask;
>>>> +			op->pat_index = pat_index;
>>>>    			op->map.immediate =
>>>>    				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>>>    			op->map.read_only =
>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>> struct xe_bo *bo,
>>>>    			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>
>>>>    			op->tile_mask = tile_mask;
>>>> +			op->pat_index = pat_index;
>>>>    			op->prefetch.region = region;
>>>>    		}
>>>>    		break;
>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>> struct xe_bo *bo,
>>>>    }
>>>>
>>>>    static struct xe_vma *new_vma(struct xe_vm *vm, struct
>> drm_gpuva_op_map *op,
>>>> -			      u8 tile_mask, bool read_only, bool is_null)
>>>> +			      u8 tile_mask, bool read_only, bool is_null,
>>>> +			      u32 pat_index)
>>>>    {
>>>>    	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>>>    	struct xe_vma *vma;
>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
>> *vm, struct drm_gpuva_op_map *op,
>>>>    	vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>    			    op->va.addr, op->va.addr +
>>>>    			    op->va.range - 1, read_only, is_null,
>>>> -			    tile_mask);
>>>> +			    tile_mask, pat_index);
>>>>    	if (bo)
>>>>    		xe_bo_unlock(bo, &ww);
>>>>
>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>>>
>>>>    				vma = new_vma(vm, &op->base.map,
>>>>    					      op->tile_mask, op->map.read_only,
>>>> -					      op->map.is_null);
>>>> +					      op->map.is_null, op->pat_index);
>>>>    				if (IS_ERR(vma)) {
>>>>    					err = PTR_ERR(vma);
>>>>    					goto free_fence;
>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>>>
>>>>    					vma = new_vma(vm, op-
>>> base.remap.prev,
>>>>    						      op->tile_mask, read_only,
>>>> -						      is_null);
>>>> +						      is_null, op->pat_index);
>>>>    					if (IS_ERR(vma)) {
>>>>    						err = PTR_ERR(vma);
>>>>    						goto free_fence;
>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>>>
>>>>    					vma = new_vma(vm, op-
>>> base.remap.next,
>>>>    						      op->tile_mask, read_only,
>>>> -						      is_null);
>>>> +						      is_null, op->pat_index);
>>>>    					if (IS_ERR(vma)) {
>>>>    						err = PTR_ERR(vma);
>>>>    						goto free_fence;
>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
>> xe_device *xe,
>>>>    		u32 obj = (*bind_ops)[i].obj;
>>>>    		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>    		u32 region = (*bind_ops)[i].region;
>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>>>    		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>
>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>> info.pat_table_n_entries)) {
>>>> +			err = -EINVAL;
>>>> +			goto free_bind_ops;
>>>> +		}
>>>> +
>>>>    		if (i == 0) {
>>>>    			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>>    		} else if (XE_IOCTL_DBG(xe, !*async) ||
>>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
>> void *data, struct drm_file *file)
>>>>    		struct drm_gem_object *gem_obj;
>>>>    		u64 range = bind_ops[i].range;
>>>>    		u64 addr = bind_ops[i].addr;
>>>> +		u32 op = bind_ops[i].op;
>>>>    		u32 obj = bind_ops[i].obj;
>>>>    		u64 obj_offset = bind_ops[i].obj_offset;
>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>> +		u16 coh_mode;
>>>> +
>>>> +		pat_index = array_index_nospec(pat_index,
>>>> +					       xe->info.pat_table_n_entries);
>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>
>>> Assuming we drop the unusable entries from the TGL table, this should be
>>> impossible, right?  Any index that makes it past the n_entries check at
>>> the top of the function should have a valid, non-zero coh_mode value.
>>> So this should probably be an assertion (to highlight a KMD bug) rather
>>> than just a silent uapi failure return.
>>
>> Makes sense.
>>
>>>
>>>> +			err = -EINVAL;
>>>> +			goto put_obj;
>>>> +		}
>>>> +
>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
>>>> +			err = -EINVAL;
>>>> +			goto put_obj;
>>>> +		}
>>>>
>>>>    		if (!obj)
>>>>    			continue;
>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
>> void *data, struct drm_file *file)
>>>>    				goto put_obj;
>>>>    			}
>>>>    		}
>>>> +
>>>> +		if (bos[i]->coh_mode) {
>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>> coh_mode)) {
>>>> +				err = -EINVAL;
>>>> +				goto put_obj;
>>>> +			}
>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
>> XE_GEM_COHERENCY_NONE)) {
>>>> +			/*
>>>> +			 * Imported dma-buf from a different device should
>>>> +			 * require 1way or 2way coherency since we don't
>> know
>>>> +			 * how it was mapped on CPU. Just assume is it
>>>> +			 * potentially cached on CPU side.
>>>> +			 */
>>>> +			err = -EINVAL;
>>>> +			goto put_obj;
>>>> +		}
>>>>    	}
>>>>
>>>>    	if (args->num_syncs) {
>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
>> void *data, struct drm_file *file)
>>>>    		u64 obj_offset = bind_ops[i].obj_offset;
>>>>    		u8 tile_mask = bind_ops[i].tile_mask;
>>>>    		u32 region = bind_ops[i].region;
>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>
>>>>    		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>>>    						  addr, range, op, tile_mask,
>>>> -						  region);
>>>> +						  region, pat_index);
>>>>    		if (IS_ERR(ops[i])) {
>>>>    			err = PTR_ERR(ops[i]);
>>>>    			ops[i] = NULL;
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> index 83a1f87b6537..508679ed3c74 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>    	 */
>>>>    	u8 tile_present;
>>>>
>>>> +	/**
>>>> +	 * @pat_index: The pat index to use when encoding the PTEs for this
>> vma.
>>>> +	 */
>>>> +	u32 pat_index;
>>>> +
>>>>    	struct {
>>>>    		struct list_head rebind_link;
>>>>    	} notifier;
>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>    	bool batch_invalidate_tlb;
>>>>
>>>>    	struct {
>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache);
>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>> -						enum xe_cache_level cache);
>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
>> pat_index);
>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
>> pat_index);
>>>>    	} pat_encode;
>>>>    };
>>>>
>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>    	struct async_op_fence *fence;
>>>>    	/** @tile_mask: gt mask for this operation */
>>>>    	u8 tile_mask;
>>>> +	/** @pat_index: The pat index to use for this operation. */
>>>> +	u32 pat_index;
>>>>    	/** @flags: operation flags */
>>>>    	enum xe_vma_op_flags flags;
>>>>
>>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>> index 64bc66d4b550..0c15b6f32447 100644
>>>> --- a/include/uapi/drm/xe_drm.h
>>>> +++ b/include/uapi/drm/xe_drm.h
>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>    	 */
>>>>    	__u32 obj;
>>>>
>>>> -	/** @pad: MBZ */
>>>> -	__u32 pad;
>>>> +	/**
>>>> +	 * @pat_index: The platform defined @pat_index to use for this
>> mapping.
>>>> +	 * The index basically maps to some predefined memory attributes,
>>>> +	 * including things like caching, coherency and likely other stuff in
>>>> +	 * the future.  The exact meaning of the pat_index is platform specific
>>>
>>> BTW, "other stuff in the future" already includes compression on Xe2, we
>>> just haven't landed the patches for the Xe2 table yet.
>>
>> Ok, good to know.
>>
>>>
>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the
>> binding
>>>> +	 * the index here is encoded into the ppGTT PTE.
>>>> +	 *
>>>> +	 * For coherency the @pat_index needs to match the
>>>> +	 * drm_xe_gem_create.coh_mode, so either
>> XE_GEM_COHERENCY_NONE,
>>>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
>> KMD will extract
>>>> +	 * the coherency mode from the @pat_index and reject if there is a
>>>> +	 * mismatch (see note below for pre-MTL platforms).
>>>> +	 *
>>>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>>>> +	 * explicit coherency mode, but on such hardware there is always a
>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>>>> +	 * CPU caches even with the caching mode set as uncached.  It's only
>> the
>>>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>>>> +	 * is always mapped as WC on the CPU). However to keep the uapi
>> somewhat
>>>> +	 * consistent with newer platforms the KMD groups the different cache
>>>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>>>> +	 *
>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>
>>> As noted on the previous patch, it seems like 2-way is appropriate for
>>> LLC platforms, but 1-way might be a more accurate description of dGPU
>>> behavior.
>>>
>>>> +	 *
>>>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>>>> +	 * such platforms since it is only the display engine that is actually
>>>> +	 * incoherent. Everything else should typically use WB given that we
>>>
>>> What if we're sharing our buffers with some other (non-GPU) device?  Are
>>> there cases where that other device wouldn't be coherent with the LLC,
>>> so we'd want to use one of these?
>>
>> Yeah, I guess there might be cases like that. I'll reword.
>>
>>>
>>>
>>> Matt
>>>
>>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also
>> potentially
>>>> +	 * no shared-LLC) and the HW defines the coherency mode as part of
>> the
>>>> +	 * @pat_index.
>>>> +	 *
>>>> +	 * Note: For userptr and externally imported dma-buf the kernel
>> expects
>>>> +	 * either 1WAY or 2WAY for the @pat_index.
>>>> +	 */
>>>> +	__u32 pat_index;
>>>>
>>>>    	union {
>>>>    		/**
>>>> --
>>>> 2.41.0
>>>>
>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-30 16:02         ` Matthew Auld
@ 2023-08-31  8:24           ` Zhang, Carl
  2023-08-31 10:44             ` Matthew Auld
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-08-31  8:24 UTC (permalink / raw)
  To: Auld, Matthew, Roper, Matthew D
  Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
	intel-xe@lists.freedesktop.org



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Thursday, August 31, 2023 12:02 AM
> 
> On 30/08/2023 16:27, Zhang, Carl wrote:
> > Several questions:
> > 1. the pat_index from vm_bind will override the setting from bo_create?
> > How to keep the value from bo_create unchanged?
> 
> You only set the pat_index during vm_bind. At gem_create you just need tell
> the kernel what CPU side caching mode to use for system memory
> (wb/wc/uc), plus the expected GPU coherency mode of the pat_index. The
> current expectation is that the coherency mode of the pat_index should match
> the one at gem_create.
> 
It means that we set the incomplete pat_index at gem_create (coherency only)
And set another part at vm_bind (gpu cache) 
Why we separate them? Why not set a full value at vm_bind?

> > 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
> will it be handled by KMD automatically?
> > For example: if set 1-way, it means GPU could snoop CPU cache, we
> > could use WB  in mmap offset If it is COHERENCY_NONE, we could only use
> UC,  all these logic is handled by KMD automatically?
> 
> Yes, the mmap will use the same CPU side caching mode that you set at
> gem_create with smem_caching (wb/wc/uc).
> 
It means it  could not be changed after gem_create.  Sometimes, UMD doesn’t know
Whether the data need to be accessed by CPU at gem_create .  only application
Knows the purpose of allocation & the surface usage.  So, may need change UMD api
to let applications set whether the resource is "accessible", or it will hurt the perf.

> > 3.  about " For imported dma-buf (from a different device) the
> > coherency mode is also implicit and must also be either 1WAY or 2WAY"
> > it means it must be 1way or 2way, and UMD need not to set it?
> 
> If you are going to vm_bind it you need to supply the pat_index and in this
> case it needs to be either 1way or 2way. The object comes from a different
> device so we might not have smem_caching/coh_mode like we do for native
> objects.
> 
How about the GPU cache setting of imported dma-buf?
Is there possible conflict between dma-buf producer and consumer?
For example: 
In producer side , one bo is set GPU L3 cache,  also CPU cacheable, it should be 2Way 
In client side, we know the coherency should be 1-way or 2-way, if it suppose no GPU cache
Used. It just set 1-way.
It may cause some problems?

Also , when a bo is created, UMD does not know whether it will be exported. 
It maybe be set as COHERENCY_NONE, right?
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Wednesday, August 30, 2023 7:28 PM
> >> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> <pallavi.mishra@intel.com>;
> >> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
> >> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> Hazubski,
> >> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu,
> Effie
> >> <effie.yu@intel.com>
> >> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with
> vm_bind
> >>
> >> On 29/08/2023 22:36, Matt Roper wrote:
> >>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>> Allow userspace to directly control the pat_index for a given vm
> >>>> binding. This should allow directly controlling the coherency,
> >>>> caching and potentially other stuff in the future for the ppGTT binding.
> >>>>
> >>>> The exact meaning behind the pat_index is very platform specific (see
> >>>> BSpec or PRMs) but effectively maps to some predefined memory
> >>>> attributes. From the KMD pov we only care about the coherency that is
> >>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> >>>> The vm_bind coherency mode for the given pat_index needs to match
> the
> >>>> given coh_mode that was set at object creation. For platforms that
> >>>> lack
> >>>
> >>> Is it actually important to match the coherency mode?  I think one of
> >>> the main goals was to know up front if userspace might be using a
> >>> non-snooping PAT setting that would let it bypass the CPU cache (and
> >>> potentially read old, stale data from a different process if the
> >>> buffer's clear value is still sitting in cache and hasn't landed in
> >>> memory yet).
> >>>
> >>> If that's the only concern, then I think it should still be fine to
> >>> map with a non-matching PAT as long as it's more coherent than the one
> >>> specified at creation, right?  E.g., if the buffer was created with
> >>> 1-way coherency, it would be fine to map it with 2-way because
> >>> userspace still can't use that to observe the previous contents of the
> >>> buffer. Or
> >>
> >> Yeah, I guess we could in theory do something that.
> >>
> >>> if the buffer was created with "non-coherent" then we've already done
> >>> the necessary clflushing in kernel before handing to buffer over to
> >>> userspace to ensure the clear value landed in memory, so any valid PAT
> >>> index should be safe (from a security POV) after that, right?  Any
> >>> other problems that arise from mismatched coherency would just be
> >>> contained to the app possibly shooting itself in the foot, which isn't
> >>> really our concern.
> >>
> >> That is also my understanding, at least from the KMD security pov. If you
> >> allocate as wb then you must use at least 1way, since there is no flushing
> for
> >> clearing or swap-in. For uc/wc you could in theory use whatever you want.
> >>
> >>>
> >>>
> >>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
> >> 2WAY.
> >>>>
> >>>> For userptr mappings we lack a corresponding gem object, so the
> expected
> >>>> coherency mode is instead implicit and must fall into either 1WAY or
> >>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> >>>> dma-buf (from a different device) the coherency mode is also implicit
> >>>> and must also be either 1WAY or 2WAY.
> >>>>
> >>>> As part of adding pat_index support with vm_bind we also need stop
> using
> >>>> xe_cache_level and instead use the pat_index in various places. We still
> >>>> make use of xe_cache_level, but only as a convenience for kernel
> >>>> internal objectsi (internally it maps to some reasonable pat_index). For
> >>>
> >>> Maybe we should kill xe_cache_level completely and just assign
> >>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> >>> appropriate places, similar to what we do with MOCS (gt-
> >mocs.uc_index,
> >>> gt->mocs.wb_index)?
> >>
> >> OK, seems reasonable to me.
> >>
> >>>
> >>>> now this is just a 1:1 conversion of the existing code, however for
> >>>> platforms like MTL+ we might need to give more control through
> bo_create
> >>>> or stop using WB on the CPU side if we need CPU access.
> >>>>
> >>>> Bspec: 45101, 44235 #xe
> >>>> Bspec: 70552, 71582, 59400 #xe2
> >>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>> ---
> >>>>    drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> >>>>    drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> >>>>    drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> >>>>    drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> >>>>    drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> >>>>    drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> >>>>    drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> >>>>    drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++-----
> -
> >>>>    drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> >>>>    include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> >>>>    10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>>    		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> >>>>
> >>>>    	u64 (*pte_encode)(dma_addr_t addr,
> >>>> -			  unsigned int pat_index,
> >>>> +			  u32 pat_index,
> >>>>    			  u32 flags); /* Create a valid PTE */
> >>>>    #define PTE_READ_ONLY	BIT(0)
> >>>>    #define PTE_LM		BIT(1)
> >>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> xe_migrate
> >> *m, struct kunit *test)
> >>>>    	/* First part of the test, are we updating our pagetable bo with a new
> >> entry? */
> >>>>    	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE -
> 1),
> >> u64,
> >>>>    		  0xdeaddeadbeefbeef);
> >>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >> XE_CACHE_WB), 0);
> >>>>    	if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>>    		expected |= XE_PTE_PS64;
> >>>>    	if (xe_bo_is_vram(pt))
> >>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> index 209fa053d9fb..4134c26150a5 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
> >> bo_offset)
> >>>>    		pte |= XE_GGTT_PTE_DM;
> >>>>
> >>>>    	if ((ggtt->pat_encode).pte_encode)
> >>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >> XE_CACHE_WB_1_WAY);
> >>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>> +						    xe_pat_get_index(xe,
> >> XE_CACHE_WB_1_WAY));
> >>>>
> >>>>    	return pte;
> >>>>    }
> >>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> >>>>    }
> >>>>
> >>>>    static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64
> pte_pat,
> >>>> -						enum xe_cache_level cache)
> >>>> +				     u32 pat_index)
> >>>>    {
> >>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>> -
> >>>>    	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>
> >>>>    	if (pat_index & BIT(0))
> >>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>
> >>>>    	struct {
> >>>>    		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>> -						enum xe_cache_level cache);
> >>>> +				  u32 pat_index);
> >>>>    	} pat_encode;
> >>>>    };
> >>>>
> >>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>> index a782ea282cb6..54585e98452a 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>> @@ -24,6 +24,7 @@
> >>>>    #include "xe_lrc.h"
> >>>>    #include "xe_map.h"
> >>>>    #include "xe_mocs.h"
> >>>> +#include "xe_pat.h"
> >>>>    #include "xe_pt.h"
> >>>>    #include "xe_res_cursor.h"
> >>>>    #include "xe_sched_job.h"
> >>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>>    	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
> >>> level;
> >>>>    	u32 map_ofs, level, i;
> >>>>    	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>>    	u64 entry;
> >>>>    	int ret;
> >>>>
> >>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> xe_tile
> >> *tile, struct xe_migrate *m,
> >>>>    		return ret;
> >>>>    	}
> >>>>
> >>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> XE_CACHE_WB);
> >>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> >>>>    	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>
> >>>>    	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>
> >>>>    	/* Map the entire BO in our level 0 pt */
> >>>>    	for (i = 0, level = 0; i < num_entries; level++) {
> >>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
> >> 0);
> >>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> >>>>
> >>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> >>>>
> >>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>>    		for (i = 0; i < batch->size;
> >>>>    		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> >>>>    		     XE_PAGE_SIZE) {
> >>>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> >>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>
> >>>>    			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >>>>    				  entry);
> >>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>>    			flags = XE_PDE_64K;
> >>>>
> >>>>    		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>> -					XE_PAGE_SIZE, XE_CACHE_WB);
> >>>> +					XE_PAGE_SIZE, pat_index);
> >>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
> >> u64,
> >>>>    			  entry | flags);
> >>>>    	}
> >>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>>    	/* Write PDE's that point to our BO. */
> >>>>    	for (i = 0; i < num_entries - num_level; i++) {
> >>>>    		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>> -				      XE_CACHE_WB);
> >>>> +				      pat_index);
> >>>>
> >>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> >>>>    			  (i + 1) * 8, u64, entry);
> >>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct
> xe_migrate
> >> *m,
> >>>>
> >>>>    			XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>
> >>>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> >>>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
> >> XE_CACHE_WB), 0);
> >>>>    			bb->cs[bb->len++] = lower_32_bits(addr);
> >>>>    			bb->cs[bb->len++] = upper_32_bits(addr);
> >>>>    		}
> >>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> >>>> index 64713f400d94..019af2920078 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>> @@ -10,6 +10,7 @@
> >>>>    #include "xe_gt.h"
> >>>>    #include "xe_gt_tlb_invalidation.h"
> >>>>    #include "xe_migrate.h"
> >>>> +#include "xe_pat.h"
> >>>>    #include "xe_pt_types.h"
> >>>>    #include "xe_pt_walk.h"
> >>>>    #include "xe_res_cursor.h"
> >>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
> >> *pt_dir, unsigned int index)
> >>>>     *
> >>>>     * Return: An encoded page directory entry. No errors.
> >>>>     */
> >>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>> -		  const enum xe_cache_level cache)
> >>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> >>>>    {
> >>>>    	u64 pde;
> >>>>    	struct xe_vm *vm = bo->vm;
> >>>>    	struct xe_device *xe = vm->xe;
> >>>>
> >>>> -
> >>>>    	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>>    	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>
> >>>>    	if ((vm->pat_encode).pde_encode)
> >>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> >>>>
> >>>>    	return pde;
> >>>>    }
> >>>>
> >>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>>    			struct xe_vma *vma, u32 pt_level)
> >>>>    {
> >>>>    	struct xe_vm *vm = xe_vma_vm(vma);
> >>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
> >> xe_cache_level cache,
> >>>>    		pte |= XE_PTE_NULL;
> >>>>
> >>>>    	if ((vm->pat_encode).pte_encode)
> >>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> >>>>
> >>>>    	if (pt_level == 1)
> >>>>    		pte |= XE_PDE_PS_2M;
> >>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >> xe_cache_level cache,
> >>>>     *
> >>>>     * Return: An encoded page-table entry. No errors.
> >>>>     */
> >>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> >> cache,
> >>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>    		  u32 pt_level)
> >>>>    {
> >>>>    	u64 pte;
> >>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> offset,
> >> enum xe_cache_level cache,
> >>>>    	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>>    		pte |= XE_PPGTT_PTE_DM;
> >>>>
> >>>> -	return __pte_encode(pte, cache, NULL, pt_level);
> >>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>>    }
> >>>>
> >>>>    static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> >>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile
> *tile,
> >> struct xe_vm *vm,
> >>>>
> >>>>    	if (level == 0) {
> >>>>    		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>> -					  XE_CACHE_WB, 0);
> >>>> +					  xe_pat_get_index(vm->xe,
> >> XE_CACHE_WB), 0);
> >>>>
> >>>>    		return empty;
> >>>>    	} else {
> >>>>    		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> >>>> -				     XE_CACHE_WB);
> >>>> +				     xe_pat_get_index(vm->xe,
> >> XE_CACHE_WB));
> >>>>    	}
> >>>>    }
> >>>>
> >>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>>    	struct xe_vm *vm;
> >>>>    	/** @tile: The tile we're building for. */
> >>>>    	struct xe_tile *tile;
> >>>> -	/** @cache: Desired cache level for the ptes */
> >>>> -	enum xe_cache_level cache;
> >>>>    	/** @default_pte: PTE flag only template. No address is associated */
> >>>>    	u64 default_pte;
> >>>>    	/** @dma_offset: DMA offset to add to the PTE. */
> >>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >> pgoff_t offset,
> >>>>
> >>>>    		pte = __pte_encode(is_null ? 0 :
> >>>>    				   xe_res_dma(curs) + xe_walk->dma_offset,
> >>>> -				   xe_walk->cache, xe_walk->vma, level);
> >>>> +				   xe_walk->vma->pat_index, xe_walk->vma,
> >> level);
> >>>>    		pte |= xe_walk->default_pte;
> >>>>
> >>>>    		/*
> >>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >> pgoff_t offset,
> >>>>    			xe_child->is_compact = true;
> >>>>    		}
> >>>>
> >>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) |
> flags;
> >>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>> pat_index) | flags;
> >>>>    		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> >>>>    					 pte);
> >>>>    	}
> >>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma
> >> *vma,
> >>>>    		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> >>>>    			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> >>>>    		xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>> ttm.resource);
> >>>> -		xe_walk.cache = XE_CACHE_WB;
> >>>> -	} else {
> >>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
> >> XE_BO_SCANOUT_BIT)
> >>>> -			xe_walk.cache = XE_CACHE_WT;
> >>>> -		else
> >>>> -			xe_walk.cache = XE_CACHE_WB;
> >>>>    	}
> >>>> +
> >>>>    	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>>    		xe_walk.dma_offset =
> >> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>
> >>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> >>>> index 01be7ab08f87..1d433a5a96b4 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
> >> xe_vma *vma, struct xe_exec_queu
> >>>>
> >>>>    bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> >>>>
> >>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>> -		  const enum xe_cache_level level);
> >>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> >>>>
> >>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> >> cache,
> >>>> -		  u32 pt_level);
> >>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
> >> pt_level);
> >>>>
> >>>>    #endif
> >>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> >>>> index 7eeeed0411f3..34603a7e84b0 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>> @@ -6,6 +6,7 @@
> >>>>    #include "xe_vm.h"
> >>>>
> >>>>    #include <linux/dma-fence-array.h>
> >>>> +#include <linux/nospec.h>
> >>>>
> >>>>    #include <drm/drm_print.h>
> >>>>    #include <drm/ttm/ttm_execbuf_util.h>
> >>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct
> xe_vm
> >> *vm,
> >>>>    				    u64 start, u64 end,
> >>>>    				    bool read_only,
> >>>>    				    bool is_null,
> >>>> -				    u8 tile_mask)
> >>>> +				    u8 tile_mask,
> >>>> +				    u32 pat_index)
> >>>>    {
> >>>>    	struct xe_vma *vma;
> >>>>    	struct xe_tile *tile;
> >>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
> xe_vm
> >> *vm,
> >>>>    			vma->tile_mask |= 0x1 << id;
> >>>>    	}
> >>>>
> >>>> +	vma->pat_index = pat_index;
> >>>> +
> >>>>    	if (vm->xe->info.platform == XE_PVC)
> >>>>    		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>
> >>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
> >> work_struct *w);
> >>>>    static void vm_destroy_work_func(struct work_struct *w);
> >>>>
> >>>>    static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64
> pde_pat,
> >>>> -						enum xe_cache_level cache)
> >>>> +				     u32 pat_index)
> >>>>    {
> >>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>> -
> >>>>    	pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>
> >>>>    	if (pat_index & BIT(0))
> >>>> @@ -1213,10 +1215,8 @@ static u64
> xelp_ppgtt_pde_encode_pat(struct
> >> xe_device *xe, u64 pde_pat,
> >>>>    }
> >>>>
> >>>>    static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64
> pte_pat,
> >>>> -						enum xe_cache_level cache)
> >>>> +				     u32 pat_index)
> >>>>    {
> >>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>> -
> >>>>    	pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>
> >>>>    	if (pat_index & BIT(0))
> >>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
> *xef,
> >> u32 id)
> >>>>    u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> >>>>    {
> >>>>    	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>> -			     XE_CACHE_WB);
> >>>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> >>>>    }
> >>>>
> >>>>    static struct dma_fence *
> >>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
> struct
> >> drm_gpuva_op *op)
> >>>>    static struct drm_gpuva_ops *
> >>>>    vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>>>    			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> >>>> -			 u32 operation, u8 tile_mask, u32 region)
> >>>> +			 u32 operation, u8 tile_mask, u32 region, u32
> >> pat_index)
> >>>>    {
> >>>>    	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>>    	struct ww_acquire_ctx ww;
> >>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> >> struct xe_bo *bo,
> >>>>    			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>>>
> >>>>    			op->tile_mask = tile_mask;
> >>>> +			op->pat_index = pat_index;
> >>>>    			op->map.immediate =
> >>>>    				operation &
> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>>    			op->map.read_only =
> >>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> >> struct xe_bo *bo,
> >>>>    			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>>>
> >>>>    			op->tile_mask = tile_mask;
> >>>> +			op->pat_index = pat_index;
> >>>>    			op->prefetch.region = region;
> >>>>    		}
> >>>>    		break;
> >>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> >> struct xe_bo *bo,
> >>>>    }
> >>>>
> >>>>    static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >> drm_gpuva_op_map *op,
> >>>> -			      u8 tile_mask, bool read_only, bool is_null)
> >>>> +			      u8 tile_mask, bool read_only, bool is_null,
> >>>> +			      u32 pat_index)
> >>>>    {
> >>>>    	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) :
> NULL;
> >>>>    	struct xe_vma *vma;
> >>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
> >> *vm, struct drm_gpuva_op_map *op,
> >>>>    	vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>>    			    op->va.addr, op->va.addr +
> >>>>    			    op->va.range - 1, read_only, is_null,
> >>>> -			    tile_mask);
> >>>> +			    tile_mask, pat_index);
> >>>>    	if (bo)
> >>>>    		xe_bo_unlock(bo, &ww);
> >>>>
> >>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm
> >> *vm, struct xe_exec_queue *q,
> >>>>
> >>>>    				vma = new_vma(vm, &op->base.map,
> >>>>    					      op->tile_mask, op-
> >map.read_only,
> >>>> -					      op->map.is_null);
> >>>> +					      op->map.is_null, op->pat_index);
> >>>>    				if (IS_ERR(vma)) {
> >>>>    					err = PTR_ERR(vma);
> >>>>    					goto free_fence;
> >>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm
> >> *vm, struct xe_exec_queue *q,
> >>>>
> >>>>    					vma = new_vma(vm, op-
> >>> base.remap.prev,
> >>>>    						      op->tile_mask, read_only,
> >>>> -						      is_null);
> >>>> +						      is_null, op->pat_index);
> >>>>    					if (IS_ERR(vma)) {
> >>>>    						err = PTR_ERR(vma);
> >>>>    						goto free_fence;
> >>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm
> >> *vm, struct xe_exec_queue *q,
> >>>>
> >>>>    					vma = new_vma(vm, op-
> >>> base.remap.next,
> >>>>    						      op->tile_mask, read_only,
> >>>> -						      is_null);
> >>>> +						      is_null, op->pat_index);
> >>>>    					if (IS_ERR(vma)) {
> >>>>    						err = PTR_ERR(vma);
> >>>>    						goto free_fence;
> >>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
> >> xe_device *xe,
> >>>>    		u32 obj = (*bind_ops)[i].obj;
> >>>>    		u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>>    		u32 region = (*bind_ops)[i].region;
> >>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
> >>>>    		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>
> >>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>> info.pat_table_n_entries)) {
> >>>> +			err = -EINVAL;
> >>>> +			goto free_bind_ops;
> >>>> +		}
> >>>> +
> >>>>    		if (i == 0) {
> >>>>    			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> >>>>    		} else if (XE_IOCTL_DBG(xe, !*async) ||
> >>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device
> *dev,
> >> void *data, struct drm_file *file)
> >>>>    		struct drm_gem_object *gem_obj;
> >>>>    		u64 range = bind_ops[i].range;
> >>>>    		u64 addr = bind_ops[i].addr;
> >>>> +		u32 op = bind_ops[i].op;
> >>>>    		u32 obj = bind_ops[i].obj;
> >>>>    		u64 obj_offset = bind_ops[i].obj_offset;
> >>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>> +		u16 coh_mode;
> >>>> +
> >>>> +		pat_index = array_index_nospec(pat_index,
> >>>> +					       xe->info.pat_table_n_entries);
> >>>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> >>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>
> >>> Assuming we drop the unusable entries from the TGL table, this should be
> >>> impossible, right?  Any index that makes it past the n_entries check at
> >>> the top of the function should have a valid, non-zero coh_mode value.
> >>> So this should probably be an assertion (to highlight a KMD bug) rather
> >>> than just a silent uapi failure return.
> >>
> >> Makes sense.
> >>
> >>>
> >>>> +			err = -EINVAL;
> >>>> +			goto put_obj;
> >>>> +		}
> >>>> +
> >>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>> +				 coh_mode == XE_GEM_COHERENCY_NONE))
> {
> >>>> +			err = -EINVAL;
> >>>> +			goto put_obj;
> >>>> +		}
> >>>>
> >>>>    		if (!obj)
> >>>>    			continue;
> >>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
> *dev,
> >> void *data, struct drm_file *file)
> >>>>    				goto put_obj;
> >>>>    			}
> >>>>    		}
> >>>> +
> >>>> +		if (bos[i]->coh_mode) {
> >>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >> coh_mode)) {
> >>>> +				err = -EINVAL;
> >>>> +				goto put_obj;
> >>>> +			}
> >>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
> >> XE_GEM_COHERENCY_NONE)) {
> >>>> +			/*
> >>>> +			 * Imported dma-buf from a different device should
> >>>> +			 * require 1way or 2way coherency since we don't
> >> know
> >>>> +			 * how it was mapped on CPU. Just assume is it
> >>>> +			 * potentially cached on CPU side.
> >>>> +			 */
> >>>> +			err = -EINVAL;
> >>>> +			goto put_obj;
> >>>> +		}
> >>>>    	}
> >>>>
> >>>>    	if (args->num_syncs) {
> >>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
> *dev,
> >> void *data, struct drm_file *file)
> >>>>    		u64 obj_offset = bind_ops[i].obj_offset;
> >>>>    		u8 tile_mask = bind_ops[i].tile_mask;
> >>>>    		u32 region = bind_ops[i].region;
> >>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>>
> >>>>    		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> >>>>    						  addr, range, op, tile_mask,
> >>>> -						  region);
> >>>> +						  region, pat_index);
> >>>>    		if (IS_ERR(ops[i])) {
> >>>>    			err = PTR_ERR(ops[i]);
> >>>>    			ops[i] = NULL;
> >>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> index 83a1f87b6537..508679ed3c74 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>>    	 */
> >>>>    	u8 tile_present;
> >>>>
> >>>> +	/**
> >>>> +	 * @pat_index: The pat index to use when encoding the PTEs for this
> >> vma.
> >>>> +	 */
> >>>> +	u32 pat_index;
> >>>> +
> >>>>    	struct {
> >>>>    		struct list_head rebind_link;
> >>>>    	} notifier;
> >>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>>    	bool batch_invalidate_tlb;
> >>>>
> >>>>    	struct {
> >>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>> -						enum xe_cache_level cache);
> >>>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> >>>> -						enum xe_cache_level cache);
> >>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
> >> pat_index);
> >>>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
> >> pat_index);
> >>>>    	} pat_encode;
> >>>>    };
> >>>>
> >>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>>    	struct async_op_fence *fence;
> >>>>    	/** @tile_mask: gt mask for this operation */
> >>>>    	u8 tile_mask;
> >>>> +	/** @pat_index: The pat index to use for this operation. */
> >>>> +	u32 pat_index;
> >>>>    	/** @flags: operation flags */
> >>>>    	enum xe_vma_op_flags flags;
> >>>>
> >>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> >>>> index 64bc66d4b550..0c15b6f32447 100644
> >>>> --- a/include/uapi/drm/xe_drm.h
> >>>> +++ b/include/uapi/drm/xe_drm.h
> >>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>>    	 */
> >>>>    	__u32 obj;
> >>>>
> >>>> -	/** @pad: MBZ */
> >>>> -	__u32 pad;
> >>>> +	/**
> >>>> +	 * @pat_index: The platform defined @pat_index to use for this
> >> mapping.
> >>>> +	 * The index basically maps to some predefined memory attributes,
> >>>> +	 * including things like caching, coherency and likely other stuff in
> >>>> +	 * the future.  The exact meaning of the pat_index is platform specific
> >>>
> >>> BTW, "other stuff in the future" already includes compression on Xe2, we
> >>> just haven't landed the patches for the Xe2 table yet.
> >>
> >> Ok, good to know.
> >>
> >>>
> >>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the
> >> binding
> >>>> +	 * the index here is encoded into the ppGTT PTE.
> >>>> +	 *
> >>>> +	 * For coherency the @pat_index needs to match the
> >>>> +	 * drm_xe_gem_create.coh_mode, so either
> >> XE_GEM_COHERENCY_NONE,
> >>>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
> >> KMD will extract
> >>>> +	 * the coherency mode from the @pat_index and reject if there is a
> >>>> +	 * mismatch (see note below for pre-MTL platforms).
> >>>> +	 *
> >>>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
> >>>> +	 * explicit coherency mode, but on such hardware there is always a
> >>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent
> with
> >>>> +	 * CPU caches even with the caching mode set as uncached.  It's only
> >> the
> >>>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> >>>> +	 * is always mapped as WC on the CPU). However to keep the uapi
> >> somewhat
> >>>> +	 * consistent with newer platforms the KMD groups the different
> cache
> >>>> +	 * levels into the following coherency buckets on all pre-MTL
> platforms:
> >>>> +	 *
> >>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>
> >>> As noted on the previous patch, it seems like 2-way is appropriate for
> >>> LLC platforms, but 1-way might be a more accurate description of dGPU
> >>> behavior.
> >>>
> >>>> +	 *
> >>>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces
> on
> >>>> +	 * such platforms since it is only the display engine that is actually
> >>>> +	 * incoherent. Everything else should typically use WB given that we
> >>>
> >>> What if we're sharing our buffers with some other (non-GPU) device?  Are
> >>> there cases where that other device wouldn't be coherent with the LLC,
> >>> so we'd want to use one of these?
> >>
> >> Yeah, I guess there might be cases like that. I'll reword.
> >>
> >>>
> >>>
> >>> Matt
> >>>
> >>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also
> >> potentially
> >>>> +	 * no shared-LLC) and the HW defines the coherency mode as part of
> >> the
> >>>> +	 * @pat_index.
> >>>> +	 *
> >>>> +	 * Note: For userptr and externally imported dma-buf the kernel
> >> expects
> >>>> +	 * either 1WAY or 2WAY for the @pat_index.
> >>>> +	 */
> >>>> +	__u32 pat_index;
> >>>>
> >>>>    	union {
> >>>>    		/**
> >>>> --
> >>>> 2.41.0
> >>>>
> >>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-31  8:24           ` Zhang, Carl
@ 2023-08-31 10:44             ` Matthew Auld
  2023-09-01  9:34               ` Zhang, Carl
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-08-31 10:44 UTC (permalink / raw)
  To: Zhang, Carl, Roper, Matthew D
  Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
	intel-xe@lists.freedesktop.org

On 31/08/2023 09:24, Zhang, Carl wrote:
> 
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Thursday, August 31, 2023 12:02 AM
>>
>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>> Several questions:
>>> 1. the pat_index from vm_bind will override the setting from bo_create?
>>> How to keep the value from bo_create unchanged?
>>
>> You only set the pat_index during vm_bind. At gem_create you just need tell
>> the kernel what CPU side caching mode to use for system memory
>> (wb/wc/uc), plus the expected GPU coherency mode of the pat_index. The
>> current expectation is that the coherency mode of the pat_index should match
>> the one at gem_create.
>>
> It means that we set the incomplete pat_index at gem_create (coherency only)
> And set another part at vm_bind (gpu cache)
> Why we separate them? Why not set a full value at vm_bind?

You always give the full pat_index at vm_bind. It's platform specific 
but the BSpec for the most part tells you what the coherency mode of the 
pat_index is.

For example if you look at the MTL pat_index table it tells you the 
coh_mode:

pat_index 0-2 = coh_mode "No snoop" = COH_NONE
pat_index 3   = coh_mode "1way"     = COH_1WAY
pat_index 4   = coh_mode "2way"     = COH_2WAY

On future platforms the table is larger and also encodes stuff like 
compression, but KMD really only cares about the coh_mode, since that 
will restrict the allowed CPU smem_caching values that userspace can 
select.

 From KMD pov this matters since we always give userspace zeroed memory, 
which is useful for normal applications but is also needed for security 
reasons. However the actual clearing might result in the writes only 
being visible in the CPU cache (not flushed to main memory), and if the 
GPU doesn't snoop the CPU cache it can read directly from main memory, 
which effectively bypasses the clearing. To prevent that the KMD needs 
to know how userspace will map the memory via the GPU i.e what is the 
coh_mode for the pat_index. The gist is that if you select COH_NONE then 
it needs to clflush the pages before giving them to userspace, which is 
done by forcing you to use uc/wc for the smem_caching (the kernel will 
internally issue cflush on x86 when marking the pages as uc/wc). And if 
you select wb there is no flushing, but here you need to use 1way or 2way.

AFAIK separating them is just to prevent userspace from mixing different 
coherency modes for the same object by having it immutable (you can 
still use different pat_index but the coherency must be compatible). 
Also at creation time we can determine if the smem_caching is compatible 
with the coh_mode.

Note that it might be that we ditch the COH_2WAY and just have COH_NONE 
and COH_AT_LEAST_1WAY, in which case this might be slightly different.

> 
>>> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
>> will it be handled by KMD automatically?
>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
>>> could use WB  in mmap offset If it is COHERENCY_NONE, we could only use
>> UC,  all these logic is handled by KMD automatically?
>>
>> Yes, the mmap will use the same CPU side caching mode that you set at
>> gem_create with smem_caching (wb/wc/uc).
>>
> It means it  could not be changed after gem_create.  Sometimes, UMD doesn’t know
> Whether the data need to be accessed by CPU at gem_create .  only application
> Knows the purpose of allocation & the surface usage.  So, may need change UMD api
> to let applications set whether the resource is "accessible", or it will hurt the perf.

Right, the smem_caching is immutable.

> 
>>> 3.  about " For imported dma-buf (from a different device) the
>>> coherency mode is also implicit and must also be either 1WAY or 2WAY"
>>> it means it must be 1way or 2way, and UMD need not to set it?
>>
>> If you are going to vm_bind it you need to supply the pat_index and in this
>> case it needs to be either 1way or 2way. The object comes from a different
>> device so we might not have smem_caching/coh_mode like we do for native
>> objects.
>>
> How about the GPU cache setting of imported dma-buf?
> Is there possible conflict between dma-buf producer and consumer?
> For example:
> In producer side , one bo is set GPU L3 cache,  also CPU cacheable, it should be 2Way
> In client side, we know the coherency should be 1-way or 2-way, if it suppose no GPU cache
> Used. It just set 1-way.
> It may cause some problems?

If the producer and consumer are the same device, then I think the 
original gem_create.coh_mode is considered, since it's effectively still 
a native object. In which case the normal rules apply.

If it comes from a different device then we don't know what 
gem_create.coh_mode is (or if that even exists) since we don't have the 
usual native object, but just some opaque dma-buf object. In this case 
it must be at least 1way or 2way.

 From KMD pov we only care about userspace not being able to bypass the 
clearing. There might still be other coherency issues though.

> 
> Also , when a bo is created, UMD does not know whether it will be exported.
> It maybe be set as COHERENCY_NONE, right?

Yeah, you can use whatever you want. Just that your smem_caching will 
need to be wc/uc if using COH_NONE. On the importer side, if it comes 
from the same device then it's just a normal native object and the 
normal rules apply. Currently this would mean also using pat_index with 
COH_NONE. If it is imported from a different device then you need to use 
a pat_index with 1way or 2way, since KMD doesn't really know how the 
object is mapped on the CPU or if the CPU caches are dirty (potentially 
with clearing).

Is this going to be problematic?

>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>> <pallavi.mishra@intel.com>;
>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
>>>> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>> Hazubski,
>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu,
>> Effie
>>>> <effie.yu@intel.com>
>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with
>> vm_bind
>>>>
>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>> Allow userspace to directly control the pat_index for a given vm
>>>>>> binding. This should allow directly controlling the coherency,
>>>>>> caching and potentially other stuff in the future for the ppGTT binding.
>>>>>>
>>>>>> The exact meaning behind the pat_index is very platform specific (see
>>>>>> BSpec or PRMs) but effectively maps to some predefined memory
>>>>>> attributes. From the KMD pov we only care about the coherency that is
>>>>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>>>>>> The vm_bind coherency mode for the given pat_index needs to match
>> the
>>>>>> given coh_mode that was set at object creation. For platforms that
>>>>>> lack
>>>>>
>>>>> Is it actually important to match the coherency mode?  I think one of
>>>>> the main goals was to know up front if userspace might be using a
>>>>> non-snooping PAT setting that would let it bypass the CPU cache (and
>>>>> potentially read old, stale data from a different process if the
>>>>> buffer's clear value is still sitting in cache and hasn't landed in
>>>>> memory yet).
>>>>>
>>>>> If that's the only concern, then I think it should still be fine to
>>>>> map with a non-matching PAT as long as it's more coherent than the one
>>>>> specified at creation, right?  E.g., if the buffer was created with
>>>>> 1-way coherency, it would be fine to map it with 2-way because
>>>>> userspace still can't use that to observe the previous contents of the
>>>>> buffer. Or
>>>>
>>>> Yeah, I guess we could in theory do something that.
>>>>
>>>>> if the buffer was created with "non-coherent" then we've already done
>>>>> the necessary clflushing in kernel before handing to buffer over to
>>>>> userspace to ensure the clear value landed in memory, so any valid PAT
>>>>> index should be safe (from a security POV) after that, right?  Any
>>>>> other problems that arise from mismatched coherency would just be
>>>>> contained to the app possibly shooting itself in the foot, which isn't
>>>>> really our concern.
>>>>
>>>> That is also my understanding, at least from the KMD security pov. If you
>>>> allocate as wb then you must use at least 1way, since there is no flushing
>> for
>>>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
>>>>
>>>>>
>>>>>
>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
>>>> 2WAY.
>>>>>>
>>>>>> For userptr mappings we lack a corresponding gem object, so the
>> expected
>>>>>> coherency mode is instead implicit and must fall into either 1WAY or
>>>>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>>>>>> dma-buf (from a different device) the coherency mode is also implicit
>>>>>> and must also be either 1WAY or 2WAY.
>>>>>>
>>>>>> As part of adding pat_index support with vm_bind we also need stop
>> using
>>>>>> xe_cache_level and instead use the pat_index in various places. We still
>>>>>> make use of xe_cache_level, but only as a convenience for kernel
>>>>>> internal objectsi (internally it maps to some reasonable pat_index). For
>>>>>
>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
>>>>> appropriate places, similar to what we do with MOCS (gt-
>>> mocs.uc_index,
>>>>> gt->mocs.wb_index)?
>>>>
>>>> OK, seems reasonable to me.
>>>>
>>>>>
>>>>>> now this is just a 1:1 conversion of the existing code, however for
>>>>>> platforms like MTL+ we might need to give more control through
>> bo_create
>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>
>>>>>> Bspec: 45101, 44235 #xe
>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>>>>>     drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>>>>>     drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>>>>>     drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>>>>>     drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>>>>>     drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>>>>>     drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>>>>>     drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++-----
>> -
>>>>>>     drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>>>>>     include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>>>>>     10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>>     		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>>>>>
>>>>>>     	u64 (*pte_encode)(dma_addr_t addr,
>>>>>> -			  unsigned int pat_index,
>>>>>> +			  u32 pat_index,
>>>>>>     			  u32 flags); /* Create a valid PTE */
>>>>>>     #define PTE_READ_ONLY	BIT(0)
>>>>>>     #define PTE_LM		BIT(1)
>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>> xe_migrate
>>>> *m, struct kunit *test)
>>>>>>     	/* First part of the test, are we updating our pagetable bo with a new
>>>> entry? */
>>>>>>     	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE -
>> 1),
>>>> u64,
>>>>>>     		  0xdeaddeadbeefbeef);
>>>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>> XE_CACHE_WB), 0);
>>>>>>     	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>>     		expected |= XE_PTE_PS64;
>>>>>>     	if (xe_bo_is_vram(pt))
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
>>>> bo_offset)
>>>>>>     		pte |= XE_GGTT_PTE_DM;
>>>>>>
>>>>>>     	if ((ggtt->pat_encode).pte_encode)
>>>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>> XE_CACHE_WB_1_WAY);
>>>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>> +						    xe_pat_get_index(xe,
>>>> XE_CACHE_WB_1_WAY));
>>>>>>
>>>>>>     	return pte;
>>>>>>     }
>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>>>>     }
>>>>>>
>>>>>>     static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64
>> pte_pat,
>>>>>> -						enum xe_cache_level cache)
>>>>>> +				     u32 pat_index)
>>>>>>     {
>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>> -
>>>>>>     	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>
>>>>>>     	if (pat_index & BIT(0))
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>
>>>>>>     	struct {
>>>>>>     		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>> -						enum xe_cache_level cache);
>>>>>> +				  u32 pat_index);
>>>>>>     	} pat_encode;
>>>>>>     };
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> @@ -24,6 +24,7 @@
>>>>>>     #include "xe_lrc.h"
>>>>>>     #include "xe_map.h"
>>>>>>     #include "xe_mocs.h"
>>>>>> +#include "xe_pat.h"
>>>>>>     #include "xe_pt.h"
>>>>>>     #include "xe_res_cursor.h"
>>>>>>     #include "xe_sched_job.h"
>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>>     	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
>>>>> level;
>>>>>>     	u32 map_ofs, level, i;
>>>>>>     	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>>     	u64 entry;
>>>>>>     	int ret;
>>>>>>
>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>> xe_tile
>>>> *tile, struct xe_migrate *m,
>>>>>>     		return ret;
>>>>>>     	}
>>>>>>
>>>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>> XE_CACHE_WB);
>>>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>>>>>     	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>
>>>>>>     	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>
>>>>>>     	/* Map the entire BO in our level 0 pt */
>>>>>>     	for (i = 0, level = 0; i < num_entries; level++) {
>>>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
>>>> 0);
>>>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>>>>>
>>>>>>     		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>>>>>
>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>>     		for (i = 0; i < batch->size;
>>>>>>     		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>>>>>     		     XE_PAGE_SIZE) {
>>>>>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>>>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>
>>>>>>     			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>>>>     				  entry);
>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>>     			flags = XE_PDE_64K;
>>>>>>
>>>>>>     		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>> -					XE_PAGE_SIZE, XE_CACHE_WB);
>>>>>> +					XE_PAGE_SIZE, pat_index);
>>>>>>     		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
>>>> u64,
>>>>>>     			  entry | flags);
>>>>>>     	}
>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>>     	/* Write PDE's that point to our BO. */
>>>>>>     	for (i = 0; i < num_entries - num_level; i++) {
>>>>>>     		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>> -				      XE_CACHE_WB);
>>>>>> +				      pat_index);
>>>>>>
>>>>>>     		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>>>>>     			  (i + 1) * 8, u64, entry);
>>>>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct
>> xe_migrate
>>>> *m,
>>>>>>
>>>>>>     			XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>
>>>>>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>>>>>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
>>>> XE_CACHE_WB), 0);
>>>>>>     			bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>>     			bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>>     		}
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>>>>>> index 64713f400d94..019af2920078 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>> @@ -10,6 +10,7 @@
>>>>>>     #include "xe_gt.h"
>>>>>>     #include "xe_gt_tlb_invalidation.h"
>>>>>>     #include "xe_migrate.h"
>>>>>> +#include "xe_pat.h"
>>>>>>     #include "xe_pt_types.h"
>>>>>>     #include "xe_pt_walk.h"
>>>>>>     #include "xe_res_cursor.h"
>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
>>>> *pt_dir, unsigned int index)
>>>>>>      *
>>>>>>      * Return: An encoded page directory entry. No errors.
>>>>>>      */
>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>> -		  const enum xe_cache_level cache)
>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>>>>>     {
>>>>>>     	u64 pde;
>>>>>>     	struct xe_vm *vm = bo->vm;
>>>>>>     	struct xe_device *xe = vm->xe;
>>>>>>
>>>>>> -
>>>>>>     	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>>     	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>
>>>>>>     	if ((vm->pat_encode).pde_encode)
>>>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>>>>>
>>>>>>     	return pde;
>>>>>>     }
>>>>>>
>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>>     			struct xe_vma *vma, u32 pt_level)
>>>>>>     {
>>>>>>     	struct xe_vm *vm = xe_vma_vm(vma);
>>>>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
>>>> xe_cache_level cache,
>>>>>>     		pte |= XE_PTE_NULL;
>>>>>>
>>>>>>     	if ((vm->pat_encode).pte_encode)
>>>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>>>>>
>>>>>>     	if (pt_level == 1)
>>>>>>     		pte |= XE_PDE_PS_2M;
>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>> xe_cache_level cache,
>>>>>>      *
>>>>>>      * Return: An encoded page-table entry. No errors.
>>>>>>      */
>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>>>> cache,
>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>     		  u32 pt_level)
>>>>>>     {
>>>>>>     	u64 pte;
>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>> offset,
>>>> enum xe_cache_level cache,
>>>>>>     	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>>     		pte |= XE_PPGTT_PTE_DM;
>>>>>>
>>>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
>>>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>>     }
>>>>>>
>>>>>>     static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile
>> *tile,
>>>> struct xe_vm *vm,
>>>>>>
>>>>>>     	if (level == 0) {
>>>>>>     		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>> -					  XE_CACHE_WB, 0);
>>>>>> +					  xe_pat_get_index(vm->xe,
>>>> XE_CACHE_WB), 0);
>>>>>>
>>>>>>     		return empty;
>>>>>>     	} else {
>>>>>>     		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>>>>>> -				     XE_CACHE_WB);
>>>>>> +				     xe_pat_get_index(vm->xe,
>>>> XE_CACHE_WB));
>>>>>>     	}
>>>>>>     }
>>>>>>
>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>     	struct xe_vm *vm;
>>>>>>     	/** @tile: The tile we're building for. */
>>>>>>     	struct xe_tile *tile;
>>>>>> -	/** @cache: Desired cache level for the ptes */
>>>>>> -	enum xe_cache_level cache;
>>>>>>     	/** @default_pte: PTE flag only template. No address is associated */
>>>>>>     	u64 default_pte;
>>>>>>     	/** @dma_offset: DMA offset to add to the PTE. */
>>>>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>> pgoff_t offset,
>>>>>>
>>>>>>     		pte = __pte_encode(is_null ? 0 :
>>>>>>     				   xe_res_dma(curs) + xe_walk->dma_offset,
>>>>>> -				   xe_walk->cache, xe_walk->vma, level);
>>>>>> +				   xe_walk->vma->pat_index, xe_walk->vma,
>>>> level);
>>>>>>     		pte |= xe_walk->default_pte;
>>>>>>
>>>>>>     		/*
>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>> pgoff_t offset,
>>>>>>     			xe_child->is_compact = true;
>>>>>>     		}
>>>>>>
>>>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) |
>> flags;
>>>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>> pat_index) | flags;
>>>>>>     		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>>>>>     					 pte);
>>>>>>     	}
>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
>> xe_vma
>>>> *vma,
>>>>>>     		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>>>>>     			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>>>>>     		xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>> ttm.resource);
>>>>>> -		xe_walk.cache = XE_CACHE_WB;
>>>>>> -	} else {
>>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>> XE_BO_SCANOUT_BIT)
>>>>>> -			xe_walk.cache = XE_CACHE_WT;
>>>>>> -		else
>>>>>> -			xe_walk.cache = XE_CACHE_WB;
>>>>>>     	}
>>>>>> +
>>>>>>     	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>     		xe_walk.dma_offset =
>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>>>>>> index 01be7ab08f87..1d433a5a96b4 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>
>>>>>>     bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>>>>>
>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>> -		  const enum xe_cache_level level);
>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>>>>>
>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>>>> cache,
>>>>>> -		  u32 pt_level);
>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
>>>> pt_level);
>>>>>>
>>>>>>     #endif
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>>>> index 7eeeed0411f3..34603a7e84b0 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>> @@ -6,6 +6,7 @@
>>>>>>     #include "xe_vm.h"
>>>>>>
>>>>>>     #include <linux/dma-fence-array.h>
>>>>>> +#include <linux/nospec.h>
>>>>>>
>>>>>>     #include <drm/drm_print.h>
>>>>>>     #include <drm/ttm/ttm_execbuf_util.h>
>>>>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct
>> xe_vm
>>>> *vm,
>>>>>>     				    u64 start, u64 end,
>>>>>>     				    bool read_only,
>>>>>>     				    bool is_null,
>>>>>> -				    u8 tile_mask)
>>>>>> +				    u8 tile_mask,
>>>>>> +				    u32 pat_index)
>>>>>>     {
>>>>>>     	struct xe_vma *vma;
>>>>>>     	struct xe_tile *tile;
>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
>> xe_vm
>>>> *vm,
>>>>>>     			vma->tile_mask |= 0x1 << id;
>>>>>>     	}
>>>>>>
>>>>>> +	vma->pat_index = pat_index;
>>>>>> +
>>>>>>     	if (vm->xe->info.platform == XE_PVC)
>>>>>>     		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>
>>>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
>>>> work_struct *w);
>>>>>>     static void vm_destroy_work_func(struct work_struct *w);
>>>>>>
>>>>>>     static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64
>> pde_pat,
>>>>>> -						enum xe_cache_level cache)
>>>>>> +				     u32 pat_index)
>>>>>>     {
>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>> -
>>>>>>     	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>
>>>>>>     	if (pat_index & BIT(0))
>>>>>> @@ -1213,10 +1215,8 @@ static u64
>> xelp_ppgtt_pde_encode_pat(struct
>>>> xe_device *xe, u64 pde_pat,
>>>>>>     }
>>>>>>
>>>>>>     static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64
>> pte_pat,
>>>>>> -						enum xe_cache_level cache)
>>>>>> +				     u32 pat_index)
>>>>>>     {
>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>> -
>>>>>>     	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>
>>>>>>     	if (pat_index & BIT(0))
>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
>> *xef,
>>>> u32 id)
>>>>>>     u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>>>>     {
>>>>>>     	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>> -			     XE_CACHE_WB);
>>>>>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>>>>     }
>>>>>>
>>>>>>     static struct dma_fence *
>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
>> struct
>>>> drm_gpuva_op *op)
>>>>>>     static struct drm_gpuva_ops *
>>>>>>     vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>>>     			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>>>>>> -			 u32 operation, u8 tile_mask, u32 region)
>>>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
>>>> pat_index)
>>>>>>     {
>>>>>>     	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>     	struct ww_acquire_ctx ww;
>>>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>> struct xe_bo *bo,
>>>>>>     			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>>>
>>>>>>     			op->tile_mask = tile_mask;
>>>>>> +			op->pat_index = pat_index;
>>>>>>     			op->map.immediate =
>>>>>>     				operation &
>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>     			op->map.read_only =
>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>> struct xe_bo *bo,
>>>>>>     			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>>>
>>>>>>     			op->tile_mask = tile_mask;
>>>>>> +			op->pat_index = pat_index;
>>>>>>     			op->prefetch.region = region;
>>>>>>     		}
>>>>>>     		break;
>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>> struct xe_bo *bo,
>>>>>>     }
>>>>>>
>>>>>>     static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>> drm_gpuva_op_map *op,
>>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
>>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
>>>>>> +			      u32 pat_index)
>>>>>>     {
>>>>>>     	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) :
>> NULL;
>>>>>>     	struct xe_vma *vma;
>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>>     	vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>     			    op->va.addr, op->va.addr +
>>>>>>     			    op->va.range - 1, read_only, is_null,
>>>>>> -			    tile_mask);
>>>>>> +			    tile_mask, pat_index);
>>>>>>     	if (bo)
>>>>>>     		xe_bo_unlock(bo, &ww);
>>>>>>
>>>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
>> xe_vm
>>>> *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>>     				vma = new_vma(vm, &op->base.map,
>>>>>>     					      op->tile_mask, op-
>>> map.read_only,
>>>>>> -					      op->map.is_null);
>>>>>> +					      op->map.is_null, op->pat_index);
>>>>>>     				if (IS_ERR(vma)) {
>>>>>>     					err = PTR_ERR(vma);
>>>>>>     					goto free_fence;
>>>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
>> xe_vm
>>>> *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>>     					vma = new_vma(vm, op-
>>>>> base.remap.prev,
>>>>>>     						      op->tile_mask, read_only,
>>>>>> -						      is_null);
>>>>>> +						      is_null, op->pat_index);
>>>>>>     					if (IS_ERR(vma)) {
>>>>>>     						err = PTR_ERR(vma);
>>>>>>     						goto free_fence;
>>>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
>> xe_vm
>>>> *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>>     					vma = new_vma(vm, op-
>>>>> base.remap.next,
>>>>>>     						      op->tile_mask, read_only,
>>>>>> -						      is_null);
>>>>>> +						      is_null, op->pat_index);
>>>>>>     					if (IS_ERR(vma)) {
>>>>>>     						err = PTR_ERR(vma);
>>>>>>     						goto free_fence;
>>>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
>>>> xe_device *xe,
>>>>>>     		u32 obj = (*bind_ops)[i].obj;
>>>>>>     		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>     		u32 region = (*bind_ops)[i].region;
>>>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>>     		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>
>>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>> info.pat_table_n_entries)) {
>>>>>> +			err = -EINVAL;
>>>>>> +			goto free_bind_ops;
>>>>>> +		}
>>>>>> +
>>>>>>     		if (i == 0) {
>>>>>>     			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>>>>     		} else if (XE_IOCTL_DBG(xe, !*async) ||
>>>>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device
>> *dev,
>>>> void *data, struct drm_file *file)
>>>>>>     		struct drm_gem_object *gem_obj;
>>>>>>     		u64 range = bind_ops[i].range;
>>>>>>     		u64 addr = bind_ops[i].addr;
>>>>>> +		u32 op = bind_ops[i].op;
>>>>>>     		u32 obj = bind_ops[i].obj;
>>>>>>     		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>> +		u16 coh_mode;
>>>>>> +
>>>>>> +		pat_index = array_index_nospec(pat_index,
>>>>>> +					       xe->info.pat_table_n_entries);
>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>
>>>>> Assuming we drop the unusable entries from the TGL table, this should be
>>>>> impossible, right?  Any index that makes it past the n_entries check at
>>>>> the top of the function should have a valid, non-zero coh_mode value.
>>>>> So this should probably be an assertion (to highlight a KMD bug) rather
>>>>> than just a silent uapi failure return.
>>>>
>>>> Makes sense.
>>>>
>>>>>
>>>>>> +			err = -EINVAL;
>>>>>> +			goto put_obj;
>>>>>> +		}
>>>>>> +
>>>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>> +				 coh_mode == XE_GEM_COHERENCY_NONE))
>> {
>>>>>> +			err = -EINVAL;
>>>>>> +			goto put_obj;
>>>>>> +		}
>>>>>>
>>>>>>     		if (!obj)
>>>>>>     			continue;
>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
>> *dev,
>>>> void *data, struct drm_file *file)
>>>>>>     				goto put_obj;
>>>>>>     			}
>>>>>>     		}
>>>>>> +
>>>>>> +		if (bos[i]->coh_mode) {
>>>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>> coh_mode)) {
>>>>>> +				err = -EINVAL;
>>>>>> +				goto put_obj;
>>>>>> +			}
>>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>> +			/*
>>>>>> +			 * Imported dma-buf from a different device should
>>>>>> +			 * require 1way or 2way coherency since we don't
>>>> know
>>>>>> +			 * how it was mapped on CPU. Just assume is it
>>>>>> +			 * potentially cached on CPU side.
>>>>>> +			 */
>>>>>> +			err = -EINVAL;
>>>>>> +			goto put_obj;
>>>>>> +		}
>>>>>>     	}
>>>>>>
>>>>>>     	if (args->num_syncs) {
>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
>> *dev,
>>>> void *data, struct drm_file *file)
>>>>>>     		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>     		u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>     		u32 region = bind_ops[i].region;
>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>>
>>>>>>     		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>>>>>     						  addr, range, op, tile_mask,
>>>>>> -						  region);
>>>>>> +						  region, pat_index);
>>>>>>     		if (IS_ERR(ops[i])) {
>>>>>>     			err = PTR_ERR(ops[i]);
>>>>>>     			ops[i] = NULL;
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>>     	 */
>>>>>>     	u8 tile_present;
>>>>>>
>>>>>> +	/**
>>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs for this
>>>> vma.
>>>>>> +	 */
>>>>>> +	u32 pat_index;
>>>>>> +
>>>>>>     	struct {
>>>>>>     		struct list_head rebind_link;
>>>>>>     	} notifier;
>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>>     	bool batch_invalidate_tlb;
>>>>>>
>>>>>>     	struct {
>>>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>> -						enum xe_cache_level cache);
>>>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>>>> -						enum xe_cache_level cache);
>>>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
>>>> pat_index);
>>>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
>>>> pat_index);
>>>>>>     	} pat_encode;
>>>>>>     };
>>>>>>
>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>>     	struct async_op_fence *fence;
>>>>>>     	/** @tile_mask: gt mask for this operation */
>>>>>>     	u8 tile_mask;
>>>>>> +	/** @pat_index: The pat index to use for this operation. */
>>>>>> +	u32 pat_index;
>>>>>>     	/** @flags: operation flags */
>>>>>>     	enum xe_vma_op_flags flags;
>>>>>>
>>>>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>>>> index 64bc66d4b550..0c15b6f32447 100644
>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>>     	 */
>>>>>>     	__u32 obj;
>>>>>>
>>>>>> -	/** @pad: MBZ */
>>>>>> -	__u32 pad;
>>>>>> +	/**
>>>>>> +	 * @pat_index: The platform defined @pat_index to use for this
>>>> mapping.
>>>>>> +	 * The index basically maps to some predefined memory attributes,
>>>>>> +	 * including things like caching, coherency and likely other stuff in
>>>>>> +	 * the future.  The exact meaning of the pat_index is platform specific
>>>>>
>>>>> BTW, "other stuff in the future" already includes compression on Xe2, we
>>>>> just haven't landed the patches for the Xe2 table yet.
>>>>
>>>> Ok, good to know.
>>>>
>>>>>
>>>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the
>>>> binding
>>>>>> +	 * the index here is encoded into the ppGTT PTE.
>>>>>> +	 *
>>>>>> +	 * For coherency the @pat_index needs to match the
>>>>>> +	 * drm_xe_gem_create.coh_mode, so either
>>>> XE_GEM_COHERENCY_NONE,
>>>>>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
>>>> KMD will extract
>>>>>> +	 * the coherency mode from the @pat_index and reject if there is a
>>>>>> +	 * mismatch (see note below for pre-MTL platforms).
>>>>>> +	 *
>>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>>>>>> +	 * explicit coherency mode, but on such hardware there is always a
>>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent
>> with
>>>>>> +	 * CPU caches even with the caching mode set as uncached.  It's only
>>>> the
>>>>>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>>>>>> +	 * is always mapped as WC on the CPU). However to keep the uapi
>>>> somewhat
>>>>>> +	 * consistent with newer platforms the KMD groups the different
>> cache
>>>>>> +	 * levels into the following coherency buckets on all pre-MTL
>> platforms:
>>>>>> +	 *
>>>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>
>>>>> As noted on the previous patch, it seems like 2-way is appropriate for
>>>>> LLC platforms, but 1-way might be a more accurate description of dGPU
>>>>> behavior.
>>>>>
>>>>>> +	 *
>>>>>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces
>> on
>>>>>> +	 * such platforms since it is only the display engine that is actually
>>>>>> +	 * incoherent. Everything else should typically use WB given that we
>>>>>
>>>>> What if we're sharing our buffers with some other (non-GPU) device?  Are
>>>>> there cases where that other device wouldn't be coherent with the LLC,
>>>>> so we'd want to use one of these?
>>>>
>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also
>>>> potentially
>>>>>> +	 * no shared-LLC) and the HW defines the coherency mode as part of
>>>> the
>>>>>> +	 * @pat_index.
>>>>>> +	 *
>>>>>> +	 * Note: For userptr and externally imported dma-buf the kernel
>>>> expects
>>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
>>>>>> +	 */
>>>>>> +	__u32 pat_index;
>>>>>>
>>>>>>     	union {
>>>>>>     		/**
>>>>>> --
>>>>>> 2.41.0
>>>>>>
>>>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-31 10:44             ` Matthew Auld
@ 2023-09-01  9:34               ` Zhang, Carl
  2023-09-04  9:23                 ` Matthew Auld
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-09-01  9:34 UTC (permalink / raw)
  To: Auld, Matthew, Roper, Matthew D
  Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
	intel-xe@lists.freedesktop.org



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Thursday, August 31, 2023 6:44 PM
> 
> On 31/08/2023 09:24, Zhang, Carl wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Thursday, August 31, 2023 12:02 AM
> >>
> >> On 30/08/2023 16:27, Zhang, Carl wrote:
> >>> Several questions:
> >>> 1. the pat_index from vm_bind will override the setting from bo_create?
> >>> How to keep the value from bo_create unchanged?
> >>
> >> You only set the pat_index during vm_bind. At gem_create you just
> >> need tell the kernel what CPU side caching mode to use for system
> >> memory (wb/wc/uc), plus the expected GPU coherency mode of the
> >> pat_index. The current expectation is that the coherency mode of the
> >> pat_index should match the one at gem_create.
> >>
> > It means that we set the incomplete pat_index at gem_create (coherency
> > only) And set another part at vm_bind (gpu cache) Why we separate
> > them? Why not set a full value at vm_bind?
> 
> You always give the full pat_index at vm_bind. It's platform specific but the
> BSpec for the most part tells you what the coherency mode of the pat_index is.
> 
> For example if you look at the MTL pat_index table it tells you the
> coh_mode:
> 
> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
> pat_index 3   = coh_mode "1way"     = COH_1WAY
> pat_index 4   = coh_mode "2way"     = COH_2WAY
> 
> On future platforms the table is larger and also encodes stuff like compression,
> but KMD really only cares about the coh_mode, since that will restrict the
> allowed CPU smem_caching values that userspace can select.
> 
>  From KMD pov this matters since we always give userspace zeroed memory,
> which is useful for normal applications but is also needed for security reasons.
> However the actual clearing might result in the writes only being visible in the
> CPU cache (not flushed to main memory), and if the GPU doesn't snoop the
> CPU cache it can read directly from main memory, which effectively bypasses
> the clearing. To prevent that the KMD needs to know how userspace will map
> the memory via the GPU i.e what is the coh_mode for the pat_index. The gist is
> that if you select COH_NONE then it needs to clflush the pages before giving
> them to userspace, which is done by forcing you to use uc/wc for the
> smem_caching (the kernel will internally issue cflush on x86 when marking the
> pages as uc/wc). And if you select wb there is no flushing, but here you need to
> use 1way or 2way.
>

 It sounds like a perf optimization, if no considering the perf, you could always call
Clflush to flush the cpu cache to main memory before giving it to user space


> AFAIK separating them is just to prevent userspace from mixing different
> coherency modes for the same object by having it immutable (you can still use
> different pat_index but the coherency must be compatible).
> Also at creation time we can determine if the smem_caching is compatible with
> the coh_mode.
> 
> Note that it might be that we ditch the COH_2WAY and just have COH_NONE
> and COH_AT_LEAST_1WAY, in which case this might be slightly different.
> 
> >
> >>> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
> >> will it be handled by KMD automatically?
> >>> For example: if set 1-way, it means GPU could snoop CPU cache, we
> >>> could use WB  in mmap offset If it is COHERENCY_NONE, we could only
> >>> use
> >> UC,  all these logic is handled by KMD automatically?
> >>
> >> Yes, the mmap will use the same CPU side caching mode that you set at
> >> gem_create with smem_caching (wb/wc/uc).
> >>
> > It means it  could not be changed after gem_create.  Sometimes, UMD
> > doesn’t know Whether the data need to be accessed by CPU at gem_create
> > .  only application Knows the purpose of allocation & the surface
> > usage.  So, may need change UMD api to let applications set whether the
> resource is "accessible", or it will hurt the perf.
> 
> Right, the smem_caching is immutable.
> 
> >
> >>> 3.  about " For imported dma-buf (from a different device) the
> >>> coherency mode is also implicit and must also be either 1WAY or 2WAY"
> >>> it means it must be 1way or 2way, and UMD need not to set it?
> >>
> >> If you are going to vm_bind it you need to supply the pat_index and
> >> in this case it needs to be either 1way or 2way. The object comes
> >> from a different device so we might not have smem_caching/coh_mode
> >> like we do for native objects.
> >>
> > How about the GPU cache setting of imported dma-buf?
> > Is there possible conflict between dma-buf producer and consumer?
> > For example:
> > In producer side , one bo is set GPU L3 cache,  also CPU cacheable, it
> > should be 2Way In client side, we know the coherency should be 1-way
> > or 2-way, if it suppose no GPU cache Used. It just set 1-way.
> > It may cause some problems?
> 
> If the producer and consumer are the same device, then I think the original
> gem_create.coh_mode is considered, since it's effectively still a native object. In
> which case the normal rules apply.

But the consumer does not know the coh_mode, so , it is possible that it set will set
Another value, then it will failed. 
And because consumer does not know it is 1-way or 2-way,  consumer could not
Set the gpu cache correctly.  Because if it is 1-way, cpu could not snoop gpu cache

> 
> If it comes from a different device then we don't know what
> gem_create.coh_mode is (or if that even exists) since we don't have the usual
> native object, but just some opaque dma-buf object. In this case it must be at
> least 1way or 2way.
> 
>  From KMD pov we only care about userspace not being able to bypass the
> clearing. There might still be other coherency issues though.
> 
> >
> > Also , when a bo is created, UMD does not know whether it will be exported.
> > It maybe be set as COHERENCY_NONE, right?
> 
> Yeah, you can use whatever you want. Just that your smem_caching will need
> to be wc/uc if using COH_NONE. On the importer side, if it comes from the
> same device then it's just a normal native object and the normal rules apply.
> Currently this would mean also using pat_index with COH_NONE. If it is
> imported from a different device then you need to use a pat_index with 1way
> or 2way, since KMD doesn't really know how the object is mapped on the CPU
> or if the CPU caches are dirty (potentially with clearing).
> 
> Is this going to be problematic?
> 
If imported surfaces is from same device, how consumer know it is original coh_non
Or 1-way or 2-way?

And it also means producer and consumer could have different pat_index for same bo. 

> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>> Sent: Wednesday, August 30, 2023 7:28 PM
> >>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> >> <pallavi.mishra@intel.com>;
> >>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
> >>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> >> Hazubski,
> >>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
> >>>> <carl.zhang@intel.com>; Yu,
> >> Effie
> >>>> <effie.yu@intel.com>
> >>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
> >>>> with
> >> vm_bind
> >>>>
> >>>> On 29/08/2023 22:36, Matt Roper wrote:
> >>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>>>> Allow userspace to directly control the pat_index for a given vm
> >>>>>> binding. This should allow directly controlling the coherency,
> >>>>>> caching and potentially other stuff in the future for the ppGTT binding.
> >>>>>>
> >>>>>> The exact meaning behind the pat_index is very platform specific
> >>>>>> (see BSpec or PRMs) but effectively maps to some predefined
> >>>>>> memory attributes. From the KMD pov we only care about the
> >>>>>> coherency that is provided by the pat_index, which falls into either
> NONE, 1WAY or 2WAY.
> >>>>>> The vm_bind coherency mode for the given pat_index needs to match
> >> the
> >>>>>> given coh_mode that was set at object creation. For platforms
> >>>>>> that lack
> >>>>>
> >>>>> Is it actually important to match the coherency mode?  I think one
> >>>>> of the main goals was to know up front if userspace might be using
> >>>>> a non-snooping PAT setting that would let it bypass the CPU cache
> >>>>> (and potentially read old, stale data from a different process if
> >>>>> the buffer's clear value is still sitting in cache and hasn't
> >>>>> landed in memory yet).
> >>>>>
> >>>>> If that's the only concern, then I think it should still be fine
> >>>>> to map with a non-matching PAT as long as it's more coherent than
> >>>>> the one specified at creation, right?  E.g., if the buffer was
> >>>>> created with 1-way coherency, it would be fine to map it with
> >>>>> 2-way because userspace still can't use that to observe the
> >>>>> previous contents of the buffer. Or
> >>>>
> >>>> Yeah, I guess we could in theory do something that.
> >>>>
> >>>>> if the buffer was created with "non-coherent" then we've already
> >>>>> done the necessary clflushing in kernel before handing to buffer
> >>>>> over to userspace to ensure the clear value landed in memory, so
> >>>>> any valid PAT index should be safe (from a security POV) after
> >>>>> that, right?  Any other problems that arise from mismatched
> >>>>> coherency would just be contained to the app possibly shooting
> >>>>> itself in the foot, which isn't really our concern.
> >>>>
> >>>> That is also my understanding, at least from the KMD security pov.
> >>>> If you allocate as wb then you must use at least 1way, since there
> >>>> is no flushing
> >> for
> >>>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
> >>>>
> >>>>>
> >>>>>
> >>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
> >>>> 2WAY.
> >>>>>>
> >>>>>> For userptr mappings we lack a corresponding gem object, so the
> >> expected
> >>>>>> coherency mode is instead implicit and must fall into either 1WAY
> >>>>>> or 2WAY. Trying to use NONE will be rejected by the kernel. For
> >>>>>> imported dma-buf (from a different device) the coherency mode is
> >>>>>> also implicit and must also be either 1WAY or 2WAY.
> >>>>>>
> >>>>>> As part of adding pat_index support with vm_bind we also need
> >>>>>> stop
> >> using
> >>>>>> xe_cache_level and instead use the pat_index in various places.
> >>>>>> We still make use of xe_cache_level, but only as a convenience
> >>>>>> for kernel internal objectsi (internally it maps to some
> >>>>>> reasonable pat_index). For
> >>>>>
> >>>>> Maybe we should kill xe_cache_level completely and just assign
> >>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
> >>>>> in appropriate places, similar to what we do with MOCS (gt-
> >>> mocs.uc_index,
> >>>>> gt->mocs.wb_index)?
> >>>>
> >>>> OK, seems reasonable to me.
> >>>>
> >>>>>
> >>>>>> now this is just a 1:1 conversion of the existing code, however
> >>>>>> for platforms like MTL+ we might need to give more control
> >>>>>> through
> >> bo_create
> >>>>>> or stop using WB on the CPU side if we need CPU access.
> >>>>>>
> >>>>>> Bspec: 45101, 44235 #xe
> >>>>>> Bspec: 70552, 71582, 59400 #xe2
> >>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> >>>>>>     drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> >>>>>>     drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> >>>>>>     drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> >>>>>>     drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> >>>>>>     drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> >>>>>>     drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> >>>>>>     drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++----
> -
> >> -
> >>>>>>     drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> >>>>>>     include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> >>>>>>     10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>>>>     		(*alloc_scratch_dma)(struct i915_address_space *vm,
> int
> >>>>>> sz);
> >>>>>>
> >>>>>>     	u64 (*pte_encode)(dma_addr_t addr,
> >>>>>> -			  unsigned int pat_index,
> >>>>>> +			  u32 pat_index,
> >>>>>>     			  u32 flags); /* Create a valid PTE */
> >>>>>>     #define PTE_READ_ONLY	BIT(0)
> >>>>>>     #define PTE_LM		BIT(1)
> >>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> >> xe_migrate
> >>>> *m, struct kunit *test)
> >>>>>>     	/* First part of the test, are we updating our pagetable bo
> >>>>>> with a new
> >>>> entry? */
> >>>>>>     	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
> (NUM_KERNEL_PDE -
> >> 1),
> >>>> u64,
> >>>>>>     		  0xdeaddeadbeefbeef);
> >>>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >>>> XE_CACHE_WB), 0);
> >>>>>>     	if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>>>>     		expected |= XE_PTE_PS64;
> >>>>>>     	if (xe_bo_is_vram(pt))
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
> >> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> index 209fa053d9fb..4134c26150a5 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
> >>>> bo_offset)
> >>>>>>     		pte |= XE_GGTT_PTE_DM;
> >>>>>>
> >>>>>>     	if ((ggtt->pat_encode).pte_encode)
> >>>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>> XE_CACHE_WB_1_WAY);
> >>>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>> +
> xe_pat_get_index(xe,
> >>>> XE_CACHE_WB_1_WAY));
> >>>>>>
> >>>>>>     	return pte;
> >>>>>>     }
> >>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> >>>>>>     }
> >>>>>>
> >>>>>>     static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
> >>>>>> u64
> >> pte_pat,
> >>>>>> -						enum xe_cache_level
> cache)
> >>>>>> +				     u32 pat_index)
> >>>>>>     {
> >>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>> -
> >>>>>>     	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>>>
> >>>>>>     	if (pat_index & BIT(0))
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>>>
> >>>>>>     	struct {
> >>>>>>     		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>> -						enum xe_cache_level
> cache);
> >>>>>> +				  u32 pat_index);
> >>>>>>     	} pat_encode;
> >>>>>>     };
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >>>> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> index a782ea282cb6..54585e98452a 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> @@ -24,6 +24,7 @@
> >>>>>>     #include "xe_lrc.h"
> >>>>>>     #include "xe_map.h"
> >>>>>>     #include "xe_mocs.h"
> >>>>>> +#include "xe_pat.h"
> >>>>>>     #include "xe_pt.h"
> >>>>>>     #include "xe_res_cursor.h"
> >>>>>>     #include "xe_sched_job.h"
> >>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>>     	u32 num_entries = NUM_PT_SLOTS, num_level = vm-
> >pt_root[id]-
> >>>>> level;
> >>>>>>     	u32 map_ofs, level, i;
> >>>>>>     	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >>>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>>>>     	u64 entry;
> >>>>>>     	int ret;
> >>>>>>
> >>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> >> xe_tile
> >>>> *tile, struct xe_migrate *m,
> >>>>>>     		return ret;
> >>>>>>     	}
> >>>>>>
> >>>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >> XE_CACHE_WB);
> >>>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> pat_index);
> >>>>>>     	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>>>
> >>>>>>     	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>>>
> >>>>>>     	/* Map the entire BO in our level 0 pt */
> >>>>>>     	for (i = 0, level = 0; i < num_entries; level++) {
> >>>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> XE_CACHE_WB,
> >>>> 0);
> >>>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> pat_index, 0);
> >>>>>>
> >>>>>>     		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> entry);
> >>>>>>
> >>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>>     		for (i = 0; i < batch->size;
> >>>>>>     		     i += vm->flags & XE_VM_FLAG_64K ?
> XE_64K_PAGE_SIZE :
> >>>>>>     		     XE_PAGE_SIZE) {
> >>>>>> -			entry = xe_pte_encode(batch, i,
> XE_CACHE_WB, 0);
> >>>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>>>
> >>>>>>     			xe_map_wr(xe, &bo->vmap, map_ofs + level *
> 8, u64,
> >>>>>>     				  entry);
> >>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>>     			flags = XE_PDE_64K;
> >>>>>>
> >>>>>>     		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>>>> -					XE_PAGE_SIZE,
> XE_CACHE_WB);
> >>>>>> +					XE_PAGE_SIZE, pat_index);
> >>>>>>     		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
> * level,
> >>>> u64,
> >>>>>>     			  entry | flags);
> >>>>>>     	}
> >>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>>     	/* Write PDE's that point to our BO. */
> >>>>>>     	for (i = 0; i < num_entries - num_level; i++) {
> >>>>>>     		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>>>> -				      XE_CACHE_WB);
> >>>>>> +				      pat_index);
> >>>>>>
> >>>>>>     		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
> +
> >>>>>>     			  (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
> @@
> >>>>>> xe_migrate_update_pgtables(struct
> >> xe_migrate
> >>>> *m,
> >>>>>>
> >>>>>>     			XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>>>
> >>>>>> -			addr = xe_pte_encode(pt_bo, 0,
> XE_CACHE_WB, 0);
> >>>>>> +			addr = xe_pte_encode(pt_bo, 0,
> xe_pat_get_index(xe,
> >>>> XE_CACHE_WB), 0);
> >>>>>>     			bb->cs[bb->len++] = lower_32_bits(addr);
> >>>>>>     			bb->cs[bb->len++] = upper_32_bits(addr);
> >>>>>>     		}
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
> >>>>>> 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>>>> @@ -10,6 +10,7 @@
> >>>>>>     #include "xe_gt.h"
> >>>>>>     #include "xe_gt_tlb_invalidation.h"
> >>>>>>     #include "xe_migrate.h"
> >>>>>> +#include "xe_pat.h"
> >>>>>>     #include "xe_pt_types.h"
> >>>>>>     #include "xe_pt_walk.h"
> >>>>>>     #include "xe_res_cursor.h"
> >>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
> >>>>>> xe_pt_dir
> >>>> *pt_dir, unsigned int index)
> >>>>>>      *
> >>>>>>      * Return: An encoded page directory entry. No errors.
> >>>>>>      */
> >>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>> -		  const enum xe_cache_level cache)
> >>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>> +pat_index)
> >>>>>>     {
> >>>>>>     	u64 pde;
> >>>>>>     	struct xe_vm *vm = bo->vm;
> >>>>>>     	struct xe_device *xe = vm->xe;
> >>>>>>
> >>>>>> -
> >>>>>>     	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>>>>     	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>>>
> >>>>>>     	if ((vm->pat_encode).pde_encode)
> >>>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde,
> pat_index);
> >>>>>>
> >>>>>>     	return pde;
> >>>>>>     }
> >>>>>>
> >>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>>>>     			struct xe_vma *vma, u32 pt_level)
> >>>>>>     {
> >>>>>>     	struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
> static
> >>>>>> u64 __pte_encode(u64 pte, enum
> >>>> xe_cache_level cache,
> >>>>>>     		pte |= XE_PTE_NULL;
> >>>>>>
> >>>>>>     	if ((vm->pat_encode).pte_encode)
> >>>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte,
> pat_index);
> >>>>>>
> >>>>>>     	if (pt_level == 1)
> >>>>>>     		pte |= XE_PDE_PS_2M;
> >>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >>>> xe_cache_level cache,
> >>>>>>      *
> >>>>>>      * Return: An encoded page-table entry. No errors.
> >>>>>>      */
> >>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>> xe_cache_level
> >>>> cache,
> >>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>>     		  u32 pt_level)
> >>>>>>     {
> >>>>>>     	u64 pte;
> >>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> >> offset,
> >>>> enum xe_cache_level cache,
> >>>>>>     	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>>>>     		pte |= XE_PPGTT_PTE_DM;
> >>>>>>
> >>>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
> >>>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>>>>     }
> >>>>>>
> >>>>>>     static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
> >>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
> >>>>>> __xe_pt_empty_pte(struct xe_tile
> >> *tile,
> >>>> struct xe_vm *vm,
> >>>>>>
> >>>>>>     	if (level == 0) {
> >>>>>>     		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>>>> -					  XE_CACHE_WB, 0);
> >>>>>> +					  xe_pat_get_index(vm->xe,
> >>>> XE_CACHE_WB), 0);
> >>>>>>
> >>>>>>     		return empty;
> >>>>>>     	} else {
> >>>>>>     		return xe_pde_encode(vm->scratch_pt[id][level - 1]-
> >bo, 0,
> >>>>>> -				     XE_CACHE_WB);
> >>>>>> +				     xe_pat_get_index(vm->xe,
> >>>> XE_CACHE_WB));
> >>>>>>     	}
> >>>>>>     }
> >>>>>>
> >>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>>>>     	struct xe_vm *vm;
> >>>>>>     	/** @tile: The tile we're building for. */
> >>>>>>     	struct xe_tile *tile;
> >>>>>> -	/** @cache: Desired cache level for the ptes */
> >>>>>> -	enum xe_cache_level cache;
> >>>>>>     	/** @default_pte: PTE flag only template. No address is
> associated */
> >>>>>>     	u64 default_pte;
> >>>>>>     	/** @dma_offset: DMA offset to add to the PTE. */ @@ -604,7
> >>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>> pgoff_t offset,
> >>>>>>
> >>>>>>     		pte = __pte_encode(is_null ? 0 :
> >>>>>>     				   xe_res_dma(curs) + xe_walk-
> >dma_offset,
> >>>>>> -				   xe_walk->cache, xe_walk->vma,
> level);
> >>>>>> +				   xe_walk->vma->pat_index, xe_walk-
> >vma,
> >>>> level);
> >>>>>>     		pte |= xe_walk->default_pte;
> >>>>>>
> >>>>>>     		/*
> >>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>> pgoff_t offset,
> >>>>>>     			xe_child->is_compact = true;
> >>>>>>     		}
> >>>>>>
> >>>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache)
> |
> >> flags;
> >>>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>>>> pat_index) | flags;
> >>>>>>     		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
> xe_child,
> >>>>>>     					 pte);
> >>>>>>     	}
> >>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
> >>>>>> struct
> >> xe_vma
> >>>> *vma,
> >>>>>>     		if (vma && vma->gpuva.flags &
> XE_VMA_ATOMIC_PTE_BIT)
> >>>>>>     			xe_walk.default_pte |=
> XE_USM_PPGTT_PTE_AE;
> >>>>>>     		xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>>>> ttm.resource);
> >>>>>> -		xe_walk.cache = XE_CACHE_WB;
> >>>>>> -	} else {
> >>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
> >>>> XE_BO_SCANOUT_BIT)
> >>>>>> -			xe_walk.cache = XE_CACHE_WT;
> >>>>>> -		else
> >>>>>> -			xe_walk.cache = XE_CACHE_WB;
> >>>>>>     	}
> >>>>>> +
> >>>>>>     	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>>>>     		xe_walk.dma_offset =
> >>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
> >>>>>> 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
> >>>>>> struct
> >>>> xe_vma *vma, struct xe_exec_queu
> >>>>>>
> >>>>>>     bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
> >>>>>> *vma);
> >>>>>>
> >>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>> -		  const enum xe_cache_level level);
> >>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>> +pat_index);
> >>>>>>
> >>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>> xe_cache_level
> >>>> cache,
> >>>>>> -		  u32 pt_level);
> >>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>> +u32
> >>>> pt_level);
> >>>>>>
> >>>>>>     #endif
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>> b/drivers/gpu/drm/xe/xe_vm.c index 7eeeed0411f3..34603a7e84b0
> >>>>>> 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>>>> @@ -6,6 +6,7 @@
> >>>>>>     #include "xe_vm.h"
> >>>>>>
> >>>>>>     #include <linux/dma-fence-array.h>
> >>>>>> +#include <linux/nospec.h>
> >>>>>>
> >>>>>>     #include <drm/drm_print.h>
> >>>>>>     #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
> >>>>>> static struct xe_vma *xe_vma_create(struct
> >> xe_vm
> >>>> *vm,
> >>>>>>     				    u64 start, u64 end,
> >>>>>>     				    bool read_only,
> >>>>>>     				    bool is_null,
> >>>>>> -				    u8 tile_mask)
> >>>>>> +				    u8 tile_mask,
> >>>>>> +				    u32 pat_index)
> >>>>>>     {
> >>>>>>     	struct xe_vma *vma;
> >>>>>>     	struct xe_tile *tile;
> >>>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
> >> xe_vm
> >>>> *vm,
> >>>>>>     			vma->tile_mask |= 0x1 << id;
> >>>>>>     	}
> >>>>>>
> >>>>>> +	vma->pat_index = pat_index;
> >>>>>> +
> >>>>>>     	if (vm->xe->info.platform == XE_PVC)
> >>>>>>     		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>>>
> >>>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
> >>>> work_struct *w);
> >>>>>>     static void vm_destroy_work_func(struct work_struct *w);
> >>>>>>
> >>>>>>     static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
> >>>>>> u64
> >> pde_pat,
> >>>>>> -						enum xe_cache_level
> cache)
> >>>>>> +				     u32 pat_index)
> >>>>>>     {
> >>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>> -
> >>>>>>     	pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>>>
> >>>>>>     	if (pat_index & BIT(0))
> >>>>>> @@ -1213,10 +1215,8 @@ static u64
> >> xelp_ppgtt_pde_encode_pat(struct
> >>>> xe_device *xe, u64 pde_pat,
> >>>>>>     }
> >>>>>>
> >>>>>>     static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
> >>>>>> u64
> >> pte_pat,
> >>>>>> -						enum xe_cache_level
> cache)
> >>>>>> +				     u32 pat_index)
> >>>>>>     {
> >>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>> -
> >>>>>>     	pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>>>
> >>>>>>     	if (pat_index & BIT(0))
> >>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
> >> *xef,
> >>>> u32 id)
> >>>>>>     u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> >>>>>>     {
> >>>>>>     	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>>>> -			     XE_CACHE_WB);
> >>>>>> +			     xe_pat_get_index(vm->xe,
> XE_CACHE_WB));
> >>>>>>     }
> >>>>>>
> >>>>>>     static struct dma_fence *
> >>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
> >> struct
> >>>> drm_gpuva_op *op)
> >>>>>>     static struct drm_gpuva_ops *
> >>>>>>     vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>>>>>     			 u64 bo_offset_or_userptr, u64 addr, u64
> range,
> >>>>>> -			 u32 operation, u8 tile_mask, u32 region)
> >>>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
> >>>> pat_index)
> >>>>>>     {
> >>>>>>     	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>>>>     	struct ww_acquire_ctx ww;
> >>>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> *vm,
> >>>> struct xe_bo *bo,
> >>>>>>     			struct xe_vma_op *op =
> gpuva_op_to_vma_op(__op);
> >>>>>>
> >>>>>>     			op->tile_mask = tile_mask;
> >>>>>> +			op->pat_index = pat_index;
> >>>>>>     			op->map.immediate =
> >>>>>>     				operation &
> >> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>>>>     			op->map.read_only =
> >>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> *vm,
> >>>> struct xe_bo *bo,
> >>>>>>     			struct xe_vma_op *op =
> gpuva_op_to_vma_op(__op);
> >>>>>>
> >>>>>>     			op->tile_mask = tile_mask;
> >>>>>> +			op->pat_index = pat_index;
> >>>>>>     			op->prefetch.region = region;
> >>>>>>     		}
> >>>>>>     		break;
> >>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
> *vm,
> >>>> struct xe_bo *bo,
> >>>>>>     }
> >>>>>>
> >>>>>>     static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >>>> drm_gpuva_op_map *op,
> >>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
> >>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
> >>>>>> +			      u32 pat_index)
> >>>>>>     {
> >>>>>>     	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
> >gem.obj) :
> >> NULL;
> >>>>>>     	struct xe_vma *vma;
> >>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
> >>>> *vm, struct drm_gpuva_op_map *op,
> >>>>>>     	vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>>>>     			    op->va.addr, op->va.addr +
> >>>>>>     			    op->va.range - 1, read_only, is_null,
> >>>>>> -			    tile_mask);
> >>>>>> +			    tile_mask, pat_index);
> >>>>>>     	if (bo)
> >>>>>>     		xe_bo_unlock(bo, &ww);
> >>>>>>
> >>>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
> >> xe_vm
> >>>> *vm, struct xe_exec_queue *q,
> >>>>>>
> >>>>>>     				vma = new_vma(vm, &op->base.map,
> >>>>>>     					      op->tile_mask, op-
> >>> map.read_only,
> >>>>>> -					      op->map.is_null);
> >>>>>> +					      op->map.is_null, op-
> >pat_index);
> >>>>>>     				if (IS_ERR(vma)) {
> >>>>>>     					err = PTR_ERR(vma);
> >>>>>>     					goto free_fence;
> >>>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
> >> xe_vm
> >>>> *vm, struct xe_exec_queue *q,
> >>>>>>
> >>>>>>     					vma = new_vma(vm, op-
> >>>>> base.remap.prev,
> >>>>>>     						      op->tile_mask,
> read_only,
> >>>>>> -						      is_null);
> >>>>>> +						      is_null, op-
> >pat_index);
> >>>>>>     					if (IS_ERR(vma)) {
> >>>>>>     						err = PTR_ERR(vma);
> >>>>>>     						goto free_fence;
> >>>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
> >> xe_vm
> >>>> *vm, struct xe_exec_queue *q,
> >>>>>>
> >>>>>>     					vma = new_vma(vm, op-
> >>>>> base.remap.next,
> >>>>>>     						      op->tile_mask,
> read_only,
> >>>>>> -						      is_null);
> >>>>>> +						      is_null, op-
> >pat_index);
> >>>>>>     					if (IS_ERR(vma)) {
> >>>>>>     						err = PTR_ERR(vma);
> >>>>>>     						goto free_fence;
> >>>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
> >>>> xe_device *xe,
> >>>>>>     		u32 obj = (*bind_ops)[i].obj;
> >>>>>>     		u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>>>>     		u32 region = (*bind_ops)[i].region;
> >>>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
> >>>>>>     		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>>>
> >>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>>>> info.pat_table_n_entries)) {
> >>>>>> +			err = -EINVAL;
> >>>>>> +			goto free_bind_ops;
> >>>>>> +		}
> >>>>>> +
> >>>>>>     		if (i == 0) {
> >>>>>>     			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> >>>>>>     		} else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
> +3355,25
> >>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
> >> *dev,
> >>>> void *data, struct drm_file *file)
> >>>>>>     		struct drm_gem_object *gem_obj;
> >>>>>>     		u64 range = bind_ops[i].range;
> >>>>>>     		u64 addr = bind_ops[i].addr;
> >>>>>> +		u32 op = bind_ops[i].op;
> >>>>>>     		u32 obj = bind_ops[i].obj;
> >>>>>>     		u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>>>> +		u16 coh_mode;
> >>>>>> +
> >>>>>> +		pat_index = array_index_nospec(pat_index,
> >>>>>> +					       xe-
> >info.pat_table_n_entries);
> >>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe,
> pat_index);
> >>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>>>
> >>>>> Assuming we drop the unusable entries from the TGL table, this
> >>>>> should be impossible, right?  Any index that makes it past the
> >>>>> n_entries check at the top of the function should have a valid, non-zero
> coh_mode value.
> >>>>> So this should probably be an assertion (to highlight a KMD bug)
> >>>>> rather than just a silent uapi failure return.
> >>>>
> >>>> Makes sense.
> >>>>
> >>>>>
> >>>>>> +			err = -EINVAL;
> >>>>>> +			goto put_obj;
> >>>>>> +		}
> >>>>>> +
> >>>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >>>> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>>>> +				 coh_mode ==
> XE_GEM_COHERENCY_NONE))
> >> {
> >>>>>> +			err = -EINVAL;
> >>>>>> +			goto put_obj;
> >>>>>> +		}
> >>>>>>
> >>>>>>     		if (!obj)
> >>>>>>     			continue;
> >>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
> >> *dev,
> >>>> void *data, struct drm_file *file)
> >>>>>>     				goto put_obj;
> >>>>>>     			}
> >>>>>>     		}
> >>>>>> +
> >>>>>> +		if (bos[i]->coh_mode) {
> >>>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >>>> coh_mode)) {
> >>>>>> +				err = -EINVAL;
> >>>>>> +				goto put_obj;
> >>>>>> +			}
> >>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
> >>>> XE_GEM_COHERENCY_NONE)) {
> >>>>>> +			/*
> >>>>>> +			 * Imported dma-buf from a different device
> should
> >>>>>> +			 * require 1way or 2way coherency since we
> don't
> >>>> know
> >>>>>> +			 * how it was mapped on CPU. Just assume is it
> >>>>>> +			 * potentially cached on CPU side.
> >>>>>> +			 */
> >>>>>> +			err = -EINVAL;
> >>>>>> +			goto put_obj;
> >>>>>> +		}
> >>>>>>     	}
> >>>>>>
> >>>>>>     	if (args->num_syncs) {
> >>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
> >> *dev,
> >>>> void *data, struct drm_file *file)
> >>>>>>     		u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>     		u8 tile_mask = bind_ops[i].tile_mask;
> >>>>>>     		u32 region = bind_ops[i].region;
> >>>>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>>>>
> >>>>>>     		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
> obj_offset,
> >>>>>>     						  addr, range, op,
> tile_mask,
> >>>>>> -						  region);
> >>>>>> +						  region, pat_index);
> >>>>>>     		if (IS_ERR(ops[i])) {
> >>>>>>     			err = PTR_ERR(ops[i]);
> >>>>>>     			ops[i] = NULL;
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> index 83a1f87b6537..508679ed3c74 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>>>>     	 */
> >>>>>>     	u8 tile_present;
> >>>>>>
> >>>>>> +	/**
> >>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs
> for
> >>>>>> +this
> >>>> vma.
> >>>>>> +	 */
> >>>>>> +	u32 pat_index;
> >>>>>> +
> >>>>>>     	struct {
> >>>>>>     		struct list_head rebind_link;
> >>>>>>     	} notifier;
> >>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>>>>     	bool batch_invalidate_tlb;
> >>>>>>
> >>>>>>     	struct {
> >>>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>> -						enum xe_cache_level
> cache);
> >>>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> >>>>>> -						enum xe_cache_level
> cache);
> >>>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> u32
> >>>> pat_index);
> >>>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> u32
> >>>> pat_index);
> >>>>>>     	} pat_encode;
> >>>>>>     };
> >>>>>>
> >>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>>>>     	struct async_op_fence *fence;
> >>>>>>     	/** @tile_mask: gt mask for this operation */
> >>>>>>     	u8 tile_mask;
> >>>>>> +	/** @pat_index: The pat index to use for this operation. */
> >>>>>> +	u32 pat_index;
> >>>>>>     	/** @flags: operation flags */
> >>>>>>     	enum xe_vma_op_flags flags;
> >>>>>>
> >>>>>> diff --git a/include/uapi/drm/xe_drm.h
> >>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
> >>>>>> 100644
> >>>>>> --- a/include/uapi/drm/xe_drm.h
> >>>>>> +++ b/include/uapi/drm/xe_drm.h
> >>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>>>>     	 */
> >>>>>>     	__u32 obj;
> >>>>>>
> >>>>>> -	/** @pad: MBZ */
> >>>>>> -	__u32 pad;
> >>>>>> +	/**
> >>>>>> +	 * @pat_index: The platform defined @pat_index to use for
> this
> >>>> mapping.
> >>>>>> +	 * The index basically maps to some predefined memory
> attributes,
> >>>>>> +	 * including things like caching, coherency and likely other stuff
> in
> >>>>>> +	 * the future.  The exact meaning of the pat_index is platform
> >>>>>> +specific
> >>>>>
> >>>>> BTW, "other stuff in the future" already includes compression on
> >>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
> >>>>
> >>>> Ok, good to know.
> >>>>
> >>>>>
> >>>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up
> the
> >>>> binding
> >>>>>> +	 * the index here is encoded into the ppGTT PTE.
> >>>>>> +	 *
> >>>>>> +	 * For coherency the @pat_index needs to match the
> >>>>>> +	 * drm_xe_gem_create.coh_mode, so either
> >>>> XE_GEM_COHERENCY_NONE,
> >>>>>> +	 * XE_GEM_COHERENCY_1WAY or
> XE_GEM_COHERENCY_2WAY. The
> >>>> KMD will extract
> >>>>>> +	 * the coherency mode from the @pat_index and reject if
> there is a
> >>>>>> +	 * mismatch (see note below for pre-MTL platforms).
> >>>>>> +	 *
> >>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode
> and no
> >>>>>> +	 * explicit coherency mode, but on such hardware there is
> always a
> >>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are
> >>>>>> +coherent
> >> with
> >>>>>> +	 * CPU caches even with the caching mode set as uncached.
> It's
> >>>>>> +only
> >>>> the
> >>>>>> +	 * display engine that is incoherent (on dgpu it must be in
> VRAM which
> >>>>>> +	 * is always mapped as WC on the CPU). However to keep the
> uapi
> >>>> somewhat
> >>>>>> +	 * consistent with newer platforms the KMD groups the
> different
> >> cache
> >>>>>> +	 * levels into the following coherency buckets on all pre-MTL
> >> platforms:
> >>>>>> +	 *
> >>>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>>>
> >>>>> As noted on the previous patch, it seems like 2-way is appropriate
> >>>>> for LLC platforms, but 1-way might be a more accurate description
> >>>>> of dGPU behavior.
> >>>>>
> >>>>>> +	 *
> >>>>>> +	 * In practice UC/WC/WT should only ever used for scanout
> >>>>>> +surfaces
> >> on
> >>>>>> +	 * such platforms since it is only the display engine that is
> actually
> >>>>>> +	 * incoherent. Everything else should typically use WB given
> >>>>>> +that we
> >>>>>
> >>>>> What if we're sharing our buffers with some other (non-GPU)
> >>>>> device?  Are there cases where that other device wouldn't be
> >>>>> coherent with the LLC, so we'd want to use one of these?
> >>>>
> >>>> Yeah, I guess there might be cases like that. I'll reword.
> >>>>
> >>>>>
> >>>>>
> >>>>> Matt
> >>>>>
> >>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also
> >>>> potentially
> >>>>>> +	 * no shared-LLC) and the HW defines the coherency mode as
> part
> >>>>>> +of
> >>>> the
> >>>>>> +	 * @pat_index.
> >>>>>> +	 *
> >>>>>> +	 * Note: For userptr and externally imported dma-buf the
> kernel
> >>>> expects
> >>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
> >>>>>> +	 */
> >>>>>> +	__u32 pat_index;
> >>>>>>
> >>>>>>     	union {
> >>>>>>     		/**
> >>>>>> --
> >>>>>> 2.41.0
> >>>>>>
> >>>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-01  9:34               ` Zhang, Carl
@ 2023-09-04  9:23                 ` Matthew Auld
  2023-09-05  9:12                   ` Zhang, Carl
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-09-04  9:23 UTC (permalink / raw)
  To: Zhang, Carl, Roper, Matthew D
  Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
	intel-xe@lists.freedesktop.org

On 01/09/2023 10:34, Zhang, Carl wrote:
> 
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Thursday, August 31, 2023 6:44 PM
>>
>> On 31/08/2023 09:24, Zhang, Carl wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Thursday, August 31, 2023 12:02 AM
>>>>
>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>>>> Several questions:
>>>>> 1. the pat_index from vm_bind will override the setting from bo_create?
>>>>> How to keep the value from bo_create unchanged?
>>>>
>>>> You only set the pat_index during vm_bind. At gem_create you just
>>>> need tell the kernel what CPU side caching mode to use for system
>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
>>>> pat_index. The current expectation is that the coherency mode of the
>>>> pat_index should match the one at gem_create.
>>>>
>>> It means that we set the incomplete pat_index at gem_create (coherency
>>> only) And set another part at vm_bind (gpu cache) Why we separate
>>> them? Why not set a full value at vm_bind?
>>
>> You always give the full pat_index at vm_bind. It's platform specific but the
>> BSpec for the most part tells you what the coherency mode of the pat_index is.
>>
>> For example if you look at the MTL pat_index table it tells you the
>> coh_mode:
>>
>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
>> pat_index 3   = coh_mode "1way"     = COH_1WAY
>> pat_index 4   = coh_mode "2way"     = COH_2WAY
>>
>> On future platforms the table is larger and also encodes stuff like compression,
>> but KMD really only cares about the coh_mode, since that will restrict the
>> allowed CPU smem_caching values that userspace can select.
>>
>>   From KMD pov this matters since we always give userspace zeroed memory,
>> which is useful for normal applications but is also needed for security reasons.
>> However the actual clearing might result in the writes only being visible in the
>> CPU cache (not flushed to main memory), and if the GPU doesn't snoop the
>> CPU cache it can read directly from main memory, which effectively bypasses
>> the clearing. To prevent that the KMD needs to know how userspace will map
>> the memory via the GPU i.e what is the coh_mode for the pat_index. The gist is
>> that if you select COH_NONE then it needs to clflush the pages before giving
>> them to userspace, which is done by forcing you to use uc/wc for the
>> smem_caching (the kernel will internally issue cflush on x86 when marking the
>> pages as uc/wc). And if you select wb there is no flushing, but here you need to
>> use 1way or 2way.
>>
> 
>   It sounds like a perf optimization, if no considering the perf, you could always call
> Clflush to flush the cpu cache to main memory before giving it to user space

I think the other thing is that the KMD is no longer doing any manual 
flushing, but instead the core kernel will now do that for us when 
marking the pages as wc/uc on x86.

> 
> 
>> AFAIK separating them is just to prevent userspace from mixing different
>> coherency modes for the same object by having it immutable (you can still use
>> different pat_index but the coherency must be compatible).
>> Also at creation time we can determine if the smem_caching is compatible with
>> the coh_mode.
>>
>> Note that it might be that we ditch the COH_2WAY and just have COH_NONE
>> and COH_AT_LEAST_1WAY, in which case this might be slightly different.
>>
>>>
>>>>> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
>>>> will it be handled by KMD automatically?
>>>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
>>>>> could use WB  in mmap offset If it is COHERENCY_NONE, we could only
>>>>> use
>>>> UC,  all these logic is handled by KMD automatically?
>>>>
>>>> Yes, the mmap will use the same CPU side caching mode that you set at
>>>> gem_create with smem_caching (wb/wc/uc).
>>>>
>>> It means it  could not be changed after gem_create.  Sometimes, UMD
>>> doesn’t know Whether the data need to be accessed by CPU at gem_create
>>> .  only application Knows the purpose of allocation & the surface
>>> usage.  So, may need change UMD api to let applications set whether the
>> resource is "accessible", or it will hurt the perf.
>>
>> Right, the smem_caching is immutable.
>>
>>>
>>>>> 3.  about " For imported dma-buf (from a different device) the
>>>>> coherency mode is also implicit and must also be either 1WAY or 2WAY"
>>>>> it means it must be 1way or 2way, and UMD need not to set it?
>>>>
>>>> If you are going to vm_bind it you need to supply the pat_index and
>>>> in this case it needs to be either 1way or 2way. The object comes
>>>> from a different device so we might not have smem_caching/coh_mode
>>>> like we do for native objects.
>>>>
>>> How about the GPU cache setting of imported dma-buf?
>>> Is there possible conflict between dma-buf producer and consumer?
>>> For example:
>>> In producer side , one bo is set GPU L3 cache,  also CPU cacheable, it
>>> should be 2Way In client side, we know the coherency should be 1-way
>>> or 2-way, if it suppose no GPU cache Used. It just set 1-way.
>>> It may cause some problems?
>>
>> If the producer and consumer are the same device, then I think the original
>> gem_create.coh_mode is considered, since it's effectively still a native object. In
>> which case the normal rules apply.
> 
> But the consumer does not know the coh_mode, so , it is possible that it set will set
> Another value, then it will failed.
> And because consumer does not know it is 1-way or 2-way,  consumer could not
> Set the gpu cache correctly.  Because if it is 1-way, cpu could not snoop gpu cache
> 
>>
>> If it comes from a different device then we don't know what
>> gem_create.coh_mode is (or if that even exists) since we don't have the usual
>> native object, but just some opaque dma-buf object. In this case it must be at
>> least 1way or 2way.
>>
>>   From KMD pov we only care about userspace not being able to bypass the
>> clearing. There might still be other coherency issues though.
>>
>>>
>>> Also , when a bo is created, UMD does not know whether it will be exported.
>>> It maybe be set as COHERENCY_NONE, right?
>>
>> Yeah, you can use whatever you want. Just that your smem_caching will need
>> to be wc/uc if using COH_NONE. On the importer side, if it comes from the
>> same device then it's just a normal native object and the normal rules apply.
>> Currently this would mean also using pat_index with COH_NONE. If it is
>> imported from a different device then you need to use a pat_index with 1way
>> or 2way, since KMD doesn't really know how the object is mapped on the CPU
>> or if the CPU caches are dirty (potentially with clearing).
>>
>> Is this going to be problematic?
>>
> If imported surfaces is from same device, how consumer know it is original coh_non
> Or 1-way or 2-way?
> 
> And it also means producer and consumer could have different pat_index for same bo.

Ok, if KMD instead allows 1way or 2way for dma-buf when it's external or 
from the same device? Would that help here? i.e you can always select 
1way or 2way, even if it came from the same device and was coh_none?

Other option is you could call the vm_bind ioctl to check if the 
coh_mode is supported by checking if it returns an error? But maybe that 
is too nasy.


> 
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>>>> <pallavi.mishra@intel.com>;
>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>>>> Hazubski,
>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
>>>>>> <carl.zhang@intel.com>; Yu,
>>>> Effie
>>>>>> <effie.yu@intel.com>
>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
>>>>>> with
>>>> vm_bind
>>>>>>
>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>>>> Allow userspace to directly control the pat_index for a given vm
>>>>>>>> binding. This should allow directly controlling the coherency,
>>>>>>>> caching and potentially other stuff in the future for the ppGTT binding.
>>>>>>>>
>>>>>>>> The exact meaning behind the pat_index is very platform specific
>>>>>>>> (see BSpec or PRMs) but effectively maps to some predefined
>>>>>>>> memory attributes. From the KMD pov we only care about the
>>>>>>>> coherency that is provided by the pat_index, which falls into either
>> NONE, 1WAY or 2WAY.
>>>>>>>> The vm_bind coherency mode for the given pat_index needs to match
>>>> the
>>>>>>>> given coh_mode that was set at object creation. For platforms
>>>>>>>> that lack
>>>>>>>
>>>>>>> Is it actually important to match the coherency mode?  I think one
>>>>>>> of the main goals was to know up front if userspace might be using
>>>>>>> a non-snooping PAT setting that would let it bypass the CPU cache
>>>>>>> (and potentially read old, stale data from a different process if
>>>>>>> the buffer's clear value is still sitting in cache and hasn't
>>>>>>> landed in memory yet).
>>>>>>>
>>>>>>> If that's the only concern, then I think it should still be fine
>>>>>>> to map with a non-matching PAT as long as it's more coherent than
>>>>>>> the one specified at creation, right?  E.g., if the buffer was
>>>>>>> created with 1-way coherency, it would be fine to map it with
>>>>>>> 2-way because userspace still can't use that to observe the
>>>>>>> previous contents of the buffer. Or
>>>>>>
>>>>>> Yeah, I guess we could in theory do something that.
>>>>>>
>>>>>>> if the buffer was created with "non-coherent" then we've already
>>>>>>> done the necessary clflushing in kernel before handing to buffer
>>>>>>> over to userspace to ensure the clear value landed in memory, so
>>>>>>> any valid PAT index should be safe (from a security POV) after
>>>>>>> that, right?  Any other problems that arise from mismatched
>>>>>>> coherency would just be contained to the app possibly shooting
>>>>>>> itself in the foot, which isn't really our concern.
>>>>>>
>>>>>> That is also my understanding, at least from the KMD security pov.
>>>>>> If you allocate as wb then you must use at least 1way, since there
>>>>>> is no flushing
>>>> for
>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
>>>>>> 2WAY.
>>>>>>>>
>>>>>>>> For userptr mappings we lack a corresponding gem object, so the
>>>> expected
>>>>>>>> coherency mode is instead implicit and must fall into either 1WAY
>>>>>>>> or 2WAY. Trying to use NONE will be rejected by the kernel. For
>>>>>>>> imported dma-buf (from a different device) the coherency mode is
>>>>>>>> also implicit and must also be either 1WAY or 2WAY.
>>>>>>>>
>>>>>>>> As part of adding pat_index support with vm_bind we also need
>>>>>>>> stop
>>>> using
>>>>>>>> xe_cache_level and instead use the pat_index in various places.
>>>>>>>> We still make use of xe_cache_level, but only as a convenience
>>>>>>>> for kernel internal objectsi (internally it maps to some
>>>>>>>> reasonable pat_index). For
>>>>>>>
>>>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
>>>>>>> in appropriate places, similar to what we do with MOCS (gt-
>>>>> mocs.uc_index,
>>>>>>> gt->mocs.wb_index)?
>>>>>>
>>>>>> OK, seems reasonable to me.
>>>>>>
>>>>>>>
>>>>>>>> now this is just a 1:1 conversion of the existing code, however
>>>>>>>> for platforms like MTL+ we might need to give more control
>>>>>>>> through
>>>> bo_create
>>>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>>>
>>>>>>>> Bspec: 45101, 44235 #xe
>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>>>>>>>      drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>>>>>>>      drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>>>>>>>      drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>>>>>>>      drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>>>>>>>      drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>>>>>>>      drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>>>>>>>      drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++----
>> -
>>>> -
>>>>>>>>      drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>>>>>>>      include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>>>>>>>      10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>>>>      		(*alloc_scratch_dma)(struct i915_address_space *vm,
>> int
>>>>>>>> sz);
>>>>>>>>
>>>>>>>>      	u64 (*pte_encode)(dma_addr_t addr,
>>>>>>>> -			  unsigned int pat_index,
>>>>>>>> +			  u32 pat_index,
>>>>>>>>      			  u32 flags); /* Create a valid PTE */
>>>>>>>>      #define PTE_READ_ONLY	BIT(0)
>>>>>>>>      #define PTE_LM		BIT(1)
>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>>>> xe_migrate
>>>>>> *m, struct kunit *test)
>>>>>>>>      	/* First part of the test, are we updating our pagetable bo
>>>>>>>> with a new
>>>>>> entry? */
>>>>>>>>      	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
>> (NUM_KERNEL_PDE -
>>>> 1),
>>>>>> u64,
>>>>>>>>      		  0xdeaddeadbeefbeef);
>>>>>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>>>> XE_CACHE_WB), 0);
>>>>>>>>      	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>>>>      		expected |= XE_PTE_PS64;
>>>>>>>>      	if (xe_bo_is_vram(pt))
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
>>>>>> bo_offset)
>>>>>>>>      		pte |= XE_GGTT_PTE_DM;
>>>>>>>>
>>>>>>>>      	if ((ggtt->pat_encode).pte_encode)
>>>>>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>> XE_CACHE_WB_1_WAY);
>>>>>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>> +
>> xe_pat_get_index(xe,
>>>>>> XE_CACHE_WB_1_WAY));
>>>>>>>>
>>>>>>>>      	return pte;
>>>>>>>>      }
>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>> u64
>>>> pte_pat,
>>>>>>>> -						enum xe_cache_level
>> cache)
>>>>>>>> +				     u32 pat_index)
>>>>>>>>      {
>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>> -
>>>>>>>>      	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>>>
>>>>>>>>      	if (pat_index & BIT(0))
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>>>
>>>>>>>>      	struct {
>>>>>>>>      		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>> -						enum xe_cache_level
>> cache);
>>>>>>>> +				  u32 pat_index);
>>>>>>>>      	} pat_encode;
>>>>>>>>      };
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> @@ -24,6 +24,7 @@
>>>>>>>>      #include "xe_lrc.h"
>>>>>>>>      #include "xe_map.h"
>>>>>>>>      #include "xe_mocs.h"
>>>>>>>> +#include "xe_pat.h"
>>>>>>>>      #include "xe_pt.h"
>>>>>>>>      #include "xe_res_cursor.h"
>>>>>>>>      #include "xe_sched_job.h"
>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>>      	u32 num_entries = NUM_PT_SLOTS, num_level = vm-
>>> pt_root[id]-
>>>>>>> level;
>>>>>>>>      	u32 map_ofs, level, i;
>>>>>>>>      	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>>>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>>>>      	u64 entry;
>>>>>>>>      	int ret;
>>>>>>>>
>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>>>> xe_tile
>>>>>> *tile, struct xe_migrate *m,
>>>>>>>>      		return ret;
>>>>>>>>      	}
>>>>>>>>
>>>>>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>> XE_CACHE_WB);
>>>>>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>> pat_index);
>>>>>>>>      	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>>>
>>>>>>>>      	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>>>
>>>>>>>>      	/* Map the entire BO in our level 0 pt */
>>>>>>>>      	for (i = 0, level = 0; i < num_entries; level++) {
>>>>>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>> XE_CACHE_WB,
>>>>>> 0);
>>>>>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>> pat_index, 0);
>>>>>>>>
>>>>>>>>      		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>> entry);
>>>>>>>>
>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>>      		for (i = 0; i < batch->size;
>>>>>>>>      		     i += vm->flags & XE_VM_FLAG_64K ?
>> XE_64K_PAGE_SIZE :
>>>>>>>>      		     XE_PAGE_SIZE) {
>>>>>>>> -			entry = xe_pte_encode(batch, i,
>> XE_CACHE_WB, 0);
>>>>>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>>>
>>>>>>>>      			xe_map_wr(xe, &bo->vmap, map_ofs + level *
>> 8, u64,
>>>>>>>>      				  entry);
>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>>      			flags = XE_PDE_64K;
>>>>>>>>
>>>>>>>>      		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>>>> -					XE_PAGE_SIZE,
>> XE_CACHE_WB);
>>>>>>>> +					XE_PAGE_SIZE, pat_index);
>>>>>>>>      		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
>> * level,
>>>>>> u64,
>>>>>>>>      			  entry | flags);
>>>>>>>>      	}
>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>>      	/* Write PDE's that point to our BO. */
>>>>>>>>      	for (i = 0; i < num_entries - num_level; i++) {
>>>>>>>>      		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>>>> -				      XE_CACHE_WB);
>>>>>>>> +				      pat_index);
>>>>>>>>
>>>>>>>>      		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
>> +
>>>>>>>>      			  (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
>> @@
>>>>>>>> xe_migrate_update_pgtables(struct
>>>> xe_migrate
>>>>>> *m,
>>>>>>>>
>>>>>>>>      			XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>>>
>>>>>>>> -			addr = xe_pte_encode(pt_bo, 0,
>> XE_CACHE_WB, 0);
>>>>>>>> +			addr = xe_pte_encode(pt_bo, 0,
>> xe_pat_get_index(xe,
>>>>>> XE_CACHE_WB), 0);
>>>>>>>>      			bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>>>>      			bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>>>>      		}
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
>>>>>>>> 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>> @@ -10,6 +10,7 @@
>>>>>>>>      #include "xe_gt.h"
>>>>>>>>      #include "xe_gt_tlb_invalidation.h"
>>>>>>>>      #include "xe_migrate.h"
>>>>>>>> +#include "xe_pat.h"
>>>>>>>>      #include "xe_pt_types.h"
>>>>>>>>      #include "xe_pt_walk.h"
>>>>>>>>      #include "xe_res_cursor.h"
>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
>>>>>>>> xe_pt_dir
>>>>>> *pt_dir, unsigned int index)
>>>>>>>>       *
>>>>>>>>       * Return: An encoded page directory entry. No errors.
>>>>>>>>       */
>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>> -		  const enum xe_cache_level cache)
>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>> +pat_index)
>>>>>>>>      {
>>>>>>>>      	u64 pde;
>>>>>>>>      	struct xe_vm *vm = bo->vm;
>>>>>>>>      	struct xe_device *xe = vm->xe;
>>>>>>>>
>>>>>>>> -
>>>>>>>>      	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>>>>      	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>>>
>>>>>>>>      	if ((vm->pat_encode).pde_encode)
>>>>>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde,
>> pat_index);
>>>>>>>>
>>>>>>>>      	return pde;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>>>>      			struct xe_vma *vma, u32 pt_level)
>>>>>>>>      {
>>>>>>>>      	struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
>> static
>>>>>>>> u64 __pte_encode(u64 pte, enum
>>>>>> xe_cache_level cache,
>>>>>>>>      		pte |= XE_PTE_NULL;
>>>>>>>>
>>>>>>>>      	if ((vm->pat_encode).pte_encode)
>>>>>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte,
>> pat_index);
>>>>>>>>
>>>>>>>>      	if (pt_level == 1)
>>>>>>>>      		pte |= XE_PDE_PS_2M;
>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>>>> xe_cache_level cache,
>>>>>>>>       *
>>>>>>>>       * Return: An encoded page-table entry. No errors.
>>>>>>>>       */
>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>> xe_cache_level
>>>>>> cache,
>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>>      		  u32 pt_level)
>>>>>>>>      {
>>>>>>>>      	u64 pte;
>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>>>> offset,
>>>>>> enum xe_cache_level cache,
>>>>>>>>      	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>>>>      		pte |= XE_PPGTT_PTE_DM;
>>>>>>>>
>>>>>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
>>>>>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
>>>>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
>>>>>>>> __xe_pt_empty_pte(struct xe_tile
>>>> *tile,
>>>>>> struct xe_vm *vm,
>>>>>>>>
>>>>>>>>      	if (level == 0) {
>>>>>>>>      		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>>>> -					  XE_CACHE_WB, 0);
>>>>>>>> +					  xe_pat_get_index(vm->xe,
>>>>>> XE_CACHE_WB), 0);
>>>>>>>>
>>>>>>>>      		return empty;
>>>>>>>>      	} else {
>>>>>>>>      		return xe_pde_encode(vm->scratch_pt[id][level - 1]-
>>> bo, 0,
>>>>>>>> -				     XE_CACHE_WB);
>>>>>>>> +				     xe_pat_get_index(vm->xe,
>>>>>> XE_CACHE_WB));
>>>>>>>>      	}
>>>>>>>>      }
>>>>>>>>
>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>>>      	struct xe_vm *vm;
>>>>>>>>      	/** @tile: The tile we're building for. */
>>>>>>>>      	struct xe_tile *tile;
>>>>>>>> -	/** @cache: Desired cache level for the ptes */
>>>>>>>> -	enum xe_cache_level cache;
>>>>>>>>      	/** @default_pte: PTE flag only template. No address is
>> associated */
>>>>>>>>      	u64 default_pte;
>>>>>>>>      	/** @dma_offset: DMA offset to add to the PTE. */ @@ -604,7
>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>> pgoff_t offset,
>>>>>>>>
>>>>>>>>      		pte = __pte_encode(is_null ? 0 :
>>>>>>>>      				   xe_res_dma(curs) + xe_walk-
>>> dma_offset,
>>>>>>>> -				   xe_walk->cache, xe_walk->vma,
>> level);
>>>>>>>> +				   xe_walk->vma->pat_index, xe_walk-
>>> vma,
>>>>>> level);
>>>>>>>>      		pte |= xe_walk->default_pte;
>>>>>>>>
>>>>>>>>      		/*
>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>> pgoff_t offset,
>>>>>>>>      			xe_child->is_compact = true;
>>>>>>>>      		}
>>>>>>>>
>>>>>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache)
>> |
>>>> flags;
>>>>>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>>>> pat_index) | flags;
>>>>>>>>      		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
>> xe_child,
>>>>>>>>      					 pte);
>>>>>>>>      	}
>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
>>>>>>>> struct
>>>> xe_vma
>>>>>> *vma,
>>>>>>>>      		if (vma && vma->gpuva.flags &
>> XE_VMA_ATOMIC_PTE_BIT)
>>>>>>>>      			xe_walk.default_pte |=
>> XE_USM_PPGTT_PTE_AE;
>>>>>>>>      		xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>>>> ttm.resource);
>>>>>>>> -		xe_walk.cache = XE_CACHE_WB;
>>>>>>>> -	} else {
>>>>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>>>> XE_BO_SCANOUT_BIT)
>>>>>>>> -			xe_walk.cache = XE_CACHE_WT;
>>>>>>>> -		else
>>>>>>>> -			xe_walk.cache = XE_CACHE_WB;
>>>>>>>>      	}
>>>>>>>> +
>>>>>>>>      	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>>>      		xe_walk.dma_offset =
>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
>>>>>>>> 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
>>>>>>>> struct
>>>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>>>
>>>>>>>>      bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
>>>>>>>> *vma);
>>>>>>>>
>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>> -		  const enum xe_cache_level level);
>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>> +pat_index);
>>>>>>>>
>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>> xe_cache_level
>>>>>> cache,
>>>>>>>> -		  u32 pt_level);
>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>> +u32
>>>>>> pt_level);
>>>>>>>>
>>>>>>>>      #endif
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index 7eeeed0411f3..34603a7e84b0
>>>>>>>> 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>>      #include "xe_vm.h"
>>>>>>>>
>>>>>>>>      #include <linux/dma-fence-array.h>
>>>>>>>> +#include <linux/nospec.h>
>>>>>>>>
>>>>>>>>      #include <drm/drm_print.h>
>>>>>>>>      #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
>>>>>>>> static struct xe_vma *xe_vma_create(struct
>>>> xe_vm
>>>>>> *vm,
>>>>>>>>      				    u64 start, u64 end,
>>>>>>>>      				    bool read_only,
>>>>>>>>      				    bool is_null,
>>>>>>>> -				    u8 tile_mask)
>>>>>>>> +				    u8 tile_mask,
>>>>>>>> +				    u32 pat_index)
>>>>>>>>      {
>>>>>>>>      	struct xe_vma *vma;
>>>>>>>>      	struct xe_tile *tile;
>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
>>>> xe_vm
>>>>>> *vm,
>>>>>>>>      			vma->tile_mask |= 0x1 << id;
>>>>>>>>      	}
>>>>>>>>
>>>>>>>> +	vma->pat_index = pat_index;
>>>>>>>> +
>>>>>>>>      	if (vm->xe->info.platform == XE_PVC)
>>>>>>>>      		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>>>
>>>>>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
>>>>>> work_struct *w);
>>>>>>>>      static void vm_destroy_work_func(struct work_struct *w);
>>>>>>>>
>>>>>>>>      static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
>>>>>>>> u64
>>>> pde_pat,
>>>>>>>> -						enum xe_cache_level
>> cache)
>>>>>>>> +				     u32 pat_index)
>>>>>>>>      {
>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>> -
>>>>>>>>      	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>>>
>>>>>>>>      	if (pat_index & BIT(0))
>>>>>>>> @@ -1213,10 +1215,8 @@ static u64
>>>> xelp_ppgtt_pde_encode_pat(struct
>>>>>> xe_device *xe, u64 pde_pat,
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>> u64
>>>> pte_pat,
>>>>>>>> -						enum xe_cache_level
>> cache)
>>>>>>>> +				     u32 pat_index)
>>>>>>>>      {
>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>> -
>>>>>>>>      	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>>>
>>>>>>>>      	if (pat_index & BIT(0))
>>>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
>>>> *xef,
>>>>>> u32 id)
>>>>>>>>      u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>>>>>>      {
>>>>>>>>      	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>>>> -			     XE_CACHE_WB);
>>>>>>>> +			     xe_pat_get_index(vm->xe,
>> XE_CACHE_WB));
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      static struct dma_fence *
>>>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
>>>> struct
>>>>>> drm_gpuva_op *op)
>>>>>>>>      static struct drm_gpuva_ops *
>>>>>>>>      vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>>>>>      			 u64 bo_offset_or_userptr, u64 addr, u64
>> range,
>>>>>>>> -			 u32 operation, u8 tile_mask, u32 region)
>>>>>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
>>>>>> pat_index)
>>>>>>>>      {
>>>>>>>>      	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>>>      	struct ww_acquire_ctx ww;
>>>>>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>> *vm,
>>>>>> struct xe_bo *bo,
>>>>>>>>      			struct xe_vma_op *op =
>> gpuva_op_to_vma_op(__op);
>>>>>>>>
>>>>>>>>      			op->tile_mask = tile_mask;
>>>>>>>> +			op->pat_index = pat_index;
>>>>>>>>      			op->map.immediate =
>>>>>>>>      				operation &
>>>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>>>      			op->map.read_only =
>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>> *vm,
>>>>>> struct xe_bo *bo,
>>>>>>>>      			struct xe_vma_op *op =
>> gpuva_op_to_vma_op(__op);
>>>>>>>>
>>>>>>>>      			op->tile_mask = tile_mask;
>>>>>>>> +			op->pat_index = pat_index;
>>>>>>>>      			op->prefetch.region = region;
>>>>>>>>      		}
>>>>>>>>      		break;
>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
>> *vm,
>>>>>> struct xe_bo *bo,
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>>>> drm_gpuva_op_map *op,
>>>>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
>>>>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
>>>>>>>> +			      u32 pat_index)
>>>>>>>>      {
>>>>>>>>      	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
>>> gem.obj) :
>>>> NULL;
>>>>>>>>      	struct xe_vma *vma;
>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
>>>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>>>>      	vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>>>      			    op->va.addr, op->va.addr +
>>>>>>>>      			    op->va.range - 1, read_only, is_null,
>>>>>>>> -			    tile_mask);
>>>>>>>> +			    tile_mask, pat_index);
>>>>>>>>      	if (bo)
>>>>>>>>      		xe_bo_unlock(bo, &ww);
>>>>>>>>
>>>>>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>> xe_vm
>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>
>>>>>>>>      				vma = new_vma(vm, &op->base.map,
>>>>>>>>      					      op->tile_mask, op-
>>>>> map.read_only,
>>>>>>>> -					      op->map.is_null);
>>>>>>>> +					      op->map.is_null, op-
>>> pat_index);
>>>>>>>>      				if (IS_ERR(vma)) {
>>>>>>>>      					err = PTR_ERR(vma);
>>>>>>>>      					goto free_fence;
>>>>>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>> xe_vm
>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>
>>>>>>>>      					vma = new_vma(vm, op-
>>>>>>> base.remap.prev,
>>>>>>>>      						      op->tile_mask,
>> read_only,
>>>>>>>> -						      is_null);
>>>>>>>> +						      is_null, op-
>>> pat_index);
>>>>>>>>      					if (IS_ERR(vma)) {
>>>>>>>>      						err = PTR_ERR(vma);
>>>>>>>>      						goto free_fence;
>>>>>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>> xe_vm
>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>
>>>>>>>>      					vma = new_vma(vm, op-
>>>>>>> base.remap.next,
>>>>>>>>      						      op->tile_mask,
>> read_only,
>>>>>>>> -						      is_null);
>>>>>>>> +						      is_null, op-
>>> pat_index);
>>>>>>>>      					if (IS_ERR(vma)) {
>>>>>>>>      						err = PTR_ERR(vma);
>>>>>>>>      						goto free_fence;
>>>>>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
>>>>>> xe_device *xe,
>>>>>>>>      		u32 obj = (*bind_ops)[i].obj;
>>>>>>>>      		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>>>      		u32 region = (*bind_ops)[i].region;
>>>>>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>>>>      		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>>>
>>>>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>>>> info.pat_table_n_entries)) {
>>>>>>>> +			err = -EINVAL;
>>>>>>>> +			goto free_bind_ops;
>>>>>>>> +		}
>>>>>>>> +
>>>>>>>>      		if (i == 0) {
>>>>>>>>      			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>>>>>>      		} else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
>> +3355,25
>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
>>>> *dev,
>>>>>> void *data, struct drm_file *file)
>>>>>>>>      		struct drm_gem_object *gem_obj;
>>>>>>>>      		u64 range = bind_ops[i].range;
>>>>>>>>      		u64 addr = bind_ops[i].addr;
>>>>>>>> +		u32 op = bind_ops[i].op;
>>>>>>>>      		u32 obj = bind_ops[i].obj;
>>>>>>>>      		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>>>> +		u16 coh_mode;
>>>>>>>> +
>>>>>>>> +		pat_index = array_index_nospec(pat_index,
>>>>>>>> +					       xe-
>>> info.pat_table_n_entries);
>>>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe,
>> pat_index);
>>>>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>>>
>>>>>>> Assuming we drop the unusable entries from the TGL table, this
>>>>>>> should be impossible, right?  Any index that makes it past the
>>>>>>> n_entries check at the top of the function should have a valid, non-zero
>> coh_mode value.
>>>>>>> So this should probably be an assertion (to highlight a KMD bug)
>>>>>>> rather than just a silent uapi failure return.
>>>>>>
>>>>>> Makes sense.
>>>>>>
>>>>>>>
>>>>>>>> +			err = -EINVAL;
>>>>>>>> +			goto put_obj;
>>>>>>>> +		}
>>>>>>>> +
>>>>>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>>>> +				 coh_mode ==
>> XE_GEM_COHERENCY_NONE))
>>>> {
>>>>>>>> +			err = -EINVAL;
>>>>>>>> +			goto put_obj;
>>>>>>>> +		}
>>>>>>>>
>>>>>>>>      		if (!obj)
>>>>>>>>      			continue;
>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
>>>> *dev,
>>>>>> void *data, struct drm_file *file)
>>>>>>>>      				goto put_obj;
>>>>>>>>      			}
>>>>>>>>      		}
>>>>>>>> +
>>>>>>>> +		if (bos[i]->coh_mode) {
>>>>>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>>>> coh_mode)) {
>>>>>>>> +				err = -EINVAL;
>>>>>>>> +				goto put_obj;
>>>>>>>> +			}
>>>>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>>>> +			/*
>>>>>>>> +			 * Imported dma-buf from a different device
>> should
>>>>>>>> +			 * require 1way or 2way coherency since we
>> don't
>>>>>> know
>>>>>>>> +			 * how it was mapped on CPU. Just assume is it
>>>>>>>> +			 * potentially cached on CPU side.
>>>>>>>> +			 */
>>>>>>>> +			err = -EINVAL;
>>>>>>>> +			goto put_obj;
>>>>>>>> +		}
>>>>>>>>      	}
>>>>>>>>
>>>>>>>>      	if (args->num_syncs) {
>>>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
>>>> *dev,
>>>>>> void *data, struct drm_file *file)
>>>>>>>>      		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>      		u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>>>      		u32 region = bind_ops[i].region;
>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>
>>>>>>>>      		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
>> obj_offset,
>>>>>>>>      						  addr, range, op,
>> tile_mask,
>>>>>>>> -						  region);
>>>>>>>> +						  region, pat_index);
>>>>>>>>      		if (IS_ERR(ops[i])) {
>>>>>>>>      			err = PTR_ERR(ops[i]);
>>>>>>>>      			ops[i] = NULL;
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>>>>      	 */
>>>>>>>>      	u8 tile_present;
>>>>>>>>
>>>>>>>> +	/**
>>>>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs
>> for
>>>>>>>> +this
>>>>>> vma.
>>>>>>>> +	 */
>>>>>>>> +	u32 pat_index;
>>>>>>>> +
>>>>>>>>      	struct {
>>>>>>>>      		struct list_head rebind_link;
>>>>>>>>      	} notifier;
>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>>>>      	bool batch_invalidate_tlb;
>>>>>>>>
>>>>>>>>      	struct {
>>>>>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>> -						enum xe_cache_level
>> cache);
>>>>>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>>>>>> -						enum xe_cache_level
>> cache);
>>>>>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> u32
>>>>>> pat_index);
>>>>>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> u32
>>>>>> pat_index);
>>>>>>>>      	} pat_encode;
>>>>>>>>      };
>>>>>>>>
>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>>>>      	struct async_op_fence *fence;
>>>>>>>>      	/** @tile_mask: gt mask for this operation */
>>>>>>>>      	u8 tile_mask;
>>>>>>>> +	/** @pat_index: The pat index to use for this operation. */
>>>>>>>> +	u32 pat_index;
>>>>>>>>      	/** @flags: operation flags */
>>>>>>>>      	enum xe_vma_op_flags flags;
>>>>>>>>
>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
>>>>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
>>>>>>>> 100644
>>>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>>>>      	 */
>>>>>>>>      	__u32 obj;
>>>>>>>>
>>>>>>>> -	/** @pad: MBZ */
>>>>>>>> -	__u32 pad;
>>>>>>>> +	/**
>>>>>>>> +	 * @pat_index: The platform defined @pat_index to use for
>> this
>>>>>> mapping.
>>>>>>>> +	 * The index basically maps to some predefined memory
>> attributes,
>>>>>>>> +	 * including things like caching, coherency and likely other stuff
>> in
>>>>>>>> +	 * the future.  The exact meaning of the pat_index is platform
>>>>>>>> +specific
>>>>>>>
>>>>>>> BTW, "other stuff in the future" already includes compression on
>>>>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
>>>>>>
>>>>>> Ok, good to know.
>>>>>>
>>>>>>>
>>>>>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up
>> the
>>>>>> binding
>>>>>>>> +	 * the index here is encoded into the ppGTT PTE.
>>>>>>>> +	 *
>>>>>>>> +	 * For coherency the @pat_index needs to match the
>>>>>>>> +	 * drm_xe_gem_create.coh_mode, so either
>>>>>> XE_GEM_COHERENCY_NONE,
>>>>>>>> +	 * XE_GEM_COHERENCY_1WAY or
>> XE_GEM_COHERENCY_2WAY. The
>>>>>> KMD will extract
>>>>>>>> +	 * the coherency mode from the @pat_index and reject if
>> there is a
>>>>>>>> +	 * mismatch (see note below for pre-MTL platforms).
>>>>>>>> +	 *
>>>>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode
>> and no
>>>>>>>> +	 * explicit coherency mode, but on such hardware there is
>> always a
>>>>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are
>>>>>>>> +coherent
>>>> with
>>>>>>>> +	 * CPU caches even with the caching mode set as uncached.
>> It's
>>>>>>>> +only
>>>>>> the
>>>>>>>> +	 * display engine that is incoherent (on dgpu it must be in
>> VRAM which
>>>>>>>> +	 * is always mapped as WC on the CPU). However to keep the
>> uapi
>>>>>> somewhat
>>>>>>>> +	 * consistent with newer platforms the KMD groups the
>> different
>>>> cache
>>>>>>>> +	 * levels into the following coherency buckets on all pre-MTL
>>>> platforms:
>>>>>>>> +	 *
>>>>>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>>>
>>>>>>> As noted on the previous patch, it seems like 2-way is appropriate
>>>>>>> for LLC platforms, but 1-way might be a more accurate description
>>>>>>> of dGPU behavior.
>>>>>>>
>>>>>>>> +	 *
>>>>>>>> +	 * In practice UC/WC/WT should only ever used for scanout
>>>>>>>> +surfaces
>>>> on
>>>>>>>> +	 * such platforms since it is only the display engine that is
>> actually
>>>>>>>> +	 * incoherent. Everything else should typically use WB given
>>>>>>>> +that we
>>>>>>>
>>>>>>> What if we're sharing our buffers with some other (non-GPU)
>>>>>>> device?  Are there cases where that other device wouldn't be
>>>>>>> coherent with the LLC, so we'd want to use one of these?
>>>>>>
>>>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also
>>>>>> potentially
>>>>>>>> +	 * no shared-LLC) and the HW defines the coherency mode as
>> part
>>>>>>>> +of
>>>>>> the
>>>>>>>> +	 * @pat_index.
>>>>>>>> +	 *
>>>>>>>> +	 * Note: For userptr and externally imported dma-buf the
>> kernel
>>>>>> expects
>>>>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
>>>>>>>> +	 */
>>>>>>>> +	__u32 pat_index;
>>>>>>>>
>>>>>>>>      	union {
>>>>>>>>      		/**
>>>>>>>> --
>>>>>>>> 2.41.0
>>>>>>>>
>>>>>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-04  9:23                 ` Matthew Auld
@ 2023-09-05  9:12                   ` Zhang, Carl
  2023-09-05  9:46                     ` Matthew Auld
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-09-05  9:12 UTC (permalink / raw)
  To: Auld, Matthew, Roper, Matthew D
  Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
	Yu, Effie, intel-xe@lists.freedesktop.org


> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Monday, September 4, 2023 5:24 PM
> 
> On 01/09/2023 10:34, Zhang, Carl wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Thursday, August 31, 2023 6:44 PM
> >>
> >> On 31/08/2023 09:24, Zhang, Carl wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>> Sent: Thursday, August 31, 2023 12:02 AM
> >>>>
> >>>> On 30/08/2023 16:27, Zhang, Carl wrote:
> >>>>> Several questions:
> >>>>> 1. the pat_index from vm_bind will override the setting from
> bo_create?
> >>>>> How to keep the value from bo_create unchanged?
> >>>>
> >>>> You only set the pat_index during vm_bind. At gem_create you just
> >>>> need tell the kernel what CPU side caching mode to use for system
> >>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
> >>>> pat_index. The current expectation is that the coherency mode of
> >>>> the pat_index should match the one at gem_create.
> >>>>
> >>> It means that we set the incomplete pat_index at gem_create
> >>> (coherency
> >>> only) And set another part at vm_bind (gpu cache) Why we separate
> >>> them? Why not set a full value at vm_bind?
> >>
> >> You always give the full pat_index at vm_bind. It's platform specific
> >> but the BSpec for the most part tells you what the coherency mode of the
> pat_index is.
> >>
> >> For example if you look at the MTL pat_index table it tells you the
> >> coh_mode:
> >>
> >> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
> >> pat_index 3   = coh_mode "1way"     = COH_1WAY
> >> pat_index 4   = coh_mode "2way"     = COH_2WAY
> >>
> >> On future platforms the table is larger and also encodes stuff like
> >> compression, but KMD really only cares about the coh_mode, since that
> >> will restrict the allowed CPU smem_caching values that userspace can
> select.
> >>
> >>   From KMD pov this matters since we always give userspace zeroed
> >> memory, which is useful for normal applications but is also needed for
> security reasons.
> >> However the actual clearing might result in the writes only being
> >> visible in the CPU cache (not flushed to main memory), and if the GPU
> >> doesn't snoop the CPU cache it can read directly from main memory,
> >> which effectively bypasses the clearing. To prevent that the KMD
> >> needs to know how userspace will map the memory via the GPU i.e what
> >> is the coh_mode for the pat_index. The gist is that if you select
> >> COH_NONE then it needs to clflush the pages before giving them to
> >> userspace, which is done by forcing you to use uc/wc for the
> >> smem_caching (the kernel will internally issue cflush on x86 when
> >> marking the pages as uc/wc). And if you select wb there is no flushing, but
> here you need to use 1way or 2way.
> >>
> >
> >   It sounds like a perf optimization, if no considering the perf, you
> > could always call Clflush to flush the cpu cache to main memory before
> > giving it to user space
> 
> I think the other thing is that the KMD is no longer doing any manual flushing,
> but instead the core kernel will now do that for us when marking the pages as
> wc/uc on x86.
> 
> >
> >
> >> AFAIK separating them is just to prevent userspace from mixing
> >> different coherency modes for the same object by having it immutable
> >> (you can still use different pat_index but the coherency must be
> compatible).
> >> Also at creation time we can determine if the smem_caching is
> >> compatible with the coh_mode.
> >>
> >> Note that it might be that we ditch the COH_2WAY and just have
> >> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be slightly
> different.
> >>
> >>>
> >>>>> 2. no UC/WB/WC definition (CPU cachable) in
> >>>>> drm_xe_gem_mmap_offset,
> >>>> will it be handled by KMD automatically?
> >>>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
> >>>>> could use WB  in mmap offset If it is COHERENCY_NONE, we could
> >>>>> only use
> >>>> UC,  all these logic is handled by KMD automatically?
> >>>>
> >>>> Yes, the mmap will use the same CPU side caching mode that you set
> >>>> at gem_create with smem_caching (wb/wc/uc).
> >>>>
> >>> It means it  could not be changed after gem_create.  Sometimes, UMD
> >>> doesn’t know Whether the data need to be accessed by CPU at
> >>> gem_create .  only application Knows the purpose of allocation & the
> >>> surface usage.  So, may need change UMD api to let applications set
> >>> whether the
> >> resource is "accessible", or it will hurt the perf.
> >>
> >> Right, the smem_caching is immutable.
> >>
> >>>
> >>>>> 3.  about " For imported dma-buf (from a different device) the
> >>>>> coherency mode is also implicit and must also be either 1WAY or
> 2WAY"
> >>>>> it means it must be 1way or 2way, and UMD need not to set it?
> >>>>
> >>>> If you are going to vm_bind it you need to supply the pat_index and
> >>>> in this case it needs to be either 1way or 2way. The object comes
> >>>> from a different device so we might not have smem_caching/coh_mode
> >>>> like we do for native objects.
> >>>>
> >>> How about the GPU cache setting of imported dma-buf?
> >>> Is there possible conflict between dma-buf producer and consumer?
> >>> For example:
> >>> In producer side , one bo is set GPU L3 cache,  also CPU cacheable,
> >>> it should be 2Way In client side, we know the coherency should be
> >>> 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-way.
> >>> It may cause some problems?
> >>
> >> If the producer and consumer are the same device, then I think the
> >> original gem_create.coh_mode is considered, since it's effectively
> >> still a native object. In which case the normal rules apply.
> >
> > But the consumer does not know the coh_mode, so , it is possible that
> > it set will set Another value, then it will failed.
> > And because consumer does not know it is 1-way or 2-way,  consumer
> > could not Set the gpu cache correctly.  Because if it is 1-way, cpu
> > could not snoop gpu cache
> >
> >>
> >> If it comes from a different device then we don't know what
> >> gem_create.coh_mode is (or if that even exists) since we don't have
> >> the usual native object, but just some opaque dma-buf object. In this
> >> case it must be at least 1way or 2way.
> >>
> >>   From KMD pov we only care about userspace not being able to bypass
> >> the clearing. There might still be other coherency issues though.
> >>
> >>>
> >>> Also , when a bo is created, UMD does not know whether it will be
> exported.
> >>> It maybe be set as COHERENCY_NONE, right?
> >>
> >> Yeah, you can use whatever you want. Just that your smem_caching will
> >> need to be wc/uc if using COH_NONE. On the importer side, if it comes
> >> from the same device then it's just a normal native object and the normal
> rules apply.
> >> Currently this would mean also using pat_index with COH_NONE. If it
> >> is imported from a different device then you need to use a pat_index
> >> with 1way or 2way, since KMD doesn't really know how the object is
> >> mapped on the CPU or if the CPU caches are dirty (potentially with
> clearing).
> >>
> >> Is this going to be problematic?
> >>
> > If imported surfaces is from same device, how consumer know it is
> > original coh_non Or 1-way or 2-way?
> >
> > And it also means producer and consumer could have different pat_index for
> same bo.
> 
> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external or
> from the same device? Would that help here? i.e you can always select 1way
> or 2way, even if it came from the same device and was coh_none?
> 
> Other option is you could call the vm_bind ioctl to check if the coh_mode is
> supported by checking if it returns an error? But maybe that is too nasy.
> 

Still a bit confuse,  I create a bo with coh_none, then export it to a PRIME fd.
In consumer side, it does not know it is coh_none, and will call vm_bind to set pat_index.
    
You mean it will return a error when I set different coh value though vm_bind?
And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.  

My concern is: if it is coh_non or 1-way,  it means I could not use GPU cache if the bo also is CPU accessible. 
Because CPU could not snoop GPU cache.
So, GPU cache specified by pat_index also will be rejected?

Could we add some query interface to query the CPU cache and coherency setting of a bo?

> 
> >
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
> >>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> >>>> <pallavi.mishra@intel.com>;
> >>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
> >>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> >>>> Hazubski,
> >>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
> >>>>>> <carl.zhang@intel.com>; Yu,
> >>>> Effie
> >>>>>> <effie.yu@intel.com>
> >>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
> >>>>>> with
> >>>> vm_bind
> >>>>>>
> >>>>>> On 29/08/2023 22:36, Matt Roper wrote:
> >>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>>>>>> Allow userspace to directly control the pat_index for a given
> >>>>>>>> vm binding. This should allow directly controlling the
> >>>>>>>> coherency, caching and potentially other stuff in the future for the
> ppGTT binding.
> >>>>>>>>
> >>>>>>>> The exact meaning behind the pat_index is very platform
> >>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
> >>>>>>>> predefined memory attributes. From the KMD pov we only care
> >>>>>>>> about the coherency that is provided by the pat_index, which
> >>>>>>>> falls into either
> >> NONE, 1WAY or 2WAY.
> >>>>>>>> The vm_bind coherency mode for the given pat_index needs to
> >>>>>>>> match
> >>>> the
> >>>>>>>> given coh_mode that was set at object creation. For platforms
> >>>>>>>> that lack
> >>>>>>>
> >>>>>>> Is it actually important to match the coherency mode?  I think
> >>>>>>> one of the main goals was to know up front if userspace might be
> >>>>>>> using a non-snooping PAT setting that would let it bypass the
> >>>>>>> CPU cache (and potentially read old, stale data from a different
> >>>>>>> process if the buffer's clear value is still sitting in cache
> >>>>>>> and hasn't landed in memory yet).
> >>>>>>>
> >>>>>>> If that's the only concern, then I think it should still be fine
> >>>>>>> to map with a non-matching PAT as long as it's more coherent
> >>>>>>> than the one specified at creation, right?  E.g., if the buffer
> >>>>>>> was created with 1-way coherency, it would be fine to map it
> >>>>>>> with 2-way because userspace still can't use that to observe the
> >>>>>>> previous contents of the buffer. Or
> >>>>>>
> >>>>>> Yeah, I guess we could in theory do something that.
> >>>>>>
> >>>>>>> if the buffer was created with "non-coherent" then we've already
> >>>>>>> done the necessary clflushing in kernel before handing to buffer
> >>>>>>> over to userspace to ensure the clear value landed in memory, so
> >>>>>>> any valid PAT index should be safe (from a security POV) after
> >>>>>>> that, right?  Any other problems that arise from mismatched
> >>>>>>> coherency would just be contained to the app possibly shooting
> >>>>>>> itself in the foot, which isn't really our concern.
> >>>>>>
> >>>>>> That is also my understanding, at least from the KMD security pov.
> >>>>>> If you allocate as wb then you must use at least 1way, since
> >>>>>> there is no flushing
> >>>> for
> >>>>>> clearing or swap-in. For uc/wc you could in theory use whatever you
> want.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
> WB
> >>>>>>>> as
> >>>>>> 2WAY.
> >>>>>>>>
> >>>>>>>> For userptr mappings we lack a corresponding gem object, so the
> >>>> expected
> >>>>>>>> coherency mode is instead implicit and must fall into either
> >>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
> >>>>>>>> kernel. For imported dma-buf (from a different device) the
> >>>>>>>> coherency mode is also implicit and must also be either 1WAY or
> 2WAY.
> >>>>>>>>
> >>>>>>>> As part of adding pat_index support with vm_bind we also need
> >>>>>>>> stop
> >>>> using
> >>>>>>>> xe_cache_level and instead use the pat_index in various places.
> >>>>>>>> We still make use of xe_cache_level, but only as a convenience
> >>>>>>>> for kernel internal objectsi (internally it maps to some
> >>>>>>>> reasonable pat_index). For
> >>>>>>>
> >>>>>>> Maybe we should kill xe_cache_level completely and just assign
> >>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
> >>>>>>> in appropriate places, similar to what we do with MOCS (gt-
> >>>>> mocs.uc_index,
> >>>>>>> gt->mocs.wb_index)?
> >>>>>>
> >>>>>> OK, seems reasonable to me.
> >>>>>>
> >>>>>>>
> >>>>>>>> now this is just a 1:1 conversion of the existing code, however
> >>>>>>>> for platforms like MTL+ we might need to give more control
> >>>>>>>> through
> >>>> bo_create
> >>>>>>>> or stop using WB on the CPU side if we need CPU access.
> >>>>>>>>
> >>>>>>>> Bspec: 45101, 44235 #xe
> >>>>>>>> Bspec: 70552, 71582, 59400 #xe2
> >>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>>>>>> ---
> >>>>>>>>      drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> >>>>>>>>      drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> >>>>>>>>      drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> >>>>>>>>      drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> >>>>>>>>      drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> >>>>>>>>      drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> >>>>>>>>      drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> >>>>>>>>      drivers/gpu/drm/xe/xe_vm.c            | 73
> +++++++++++++++++++++----
> >> -
> >>>> -
> >>>>>>>>      drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> >>>>>>>>      include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> >>>>>>>>      10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>>>>>>      		(*alloc_scratch_dma)(struct i915_address_space
> *vm,
> >> int
> >>>>>>>> sz);
> >>>>>>>>
> >>>>>>>>      	u64 (*pte_encode)(dma_addr_t addr,
> >>>>>>>> -			  unsigned int pat_index,
> >>>>>>>> +			  u32 pat_index,
> >>>>>>>>      			  u32 flags); /* Create a valid PTE */
> >>>>>>>>      #define PTE_READ_ONLY	BIT(0)
> >>>>>>>>      #define PTE_LM		BIT(1)
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> >>>> xe_migrate
> >>>>>> *m, struct kunit *test)
> >>>>>>>>      	/* First part of the test, are we updating our pagetable
> >>>>>>>> bo with a new
> >>>>>> entry? */
> >>>>>>>>      	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
> >> (NUM_KERNEL_PDE -
> >>>> 1),
> >>>>>> u64,
> >>>>>>>>      		  0xdeaddeadbeefbeef);
> >>>>>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>>>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >>>>>> XE_CACHE_WB), 0);
> >>>>>>>>      	if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>>>>>>      		expected |= XE_PTE_PS64;
> >>>>>>>>      	if (xe_bo_is_vram(pt))
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>> index 209fa053d9fb..4134c26150a5 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
> u64
> >>>>>> bo_offset)
> >>>>>>>>      		pte |= XE_GGTT_PTE_DM;
> >>>>>>>>
> >>>>>>>>      	if ((ggtt->pat_encode).pte_encode)
> >>>>>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>> XE_CACHE_WB_1_WAY);
> >>>>>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>>>> +
> >> xe_pat_get_index(xe,
> >>>>>> XE_CACHE_WB_1_WAY));
> >>>>>>>>
> >>>>>>>>      	return pte;
> >>>>>>>>      }
> >>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
> *ggtt)
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>>      static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
> >>>>>>>> u64
> >>>> pte_pat,
> >>>>>>>> -						enum xe_cache_level
> >> cache)
> >>>>>>>> +				     u32 pat_index)
> >>>>>>>>      {
> >>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>> -
> >>>>>>>>      	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>>>>>
> >>>>>>>>      	if (pat_index & BIT(0))
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>>>>>
> >>>>>>>>      	struct {
> >>>>>>>>      		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>> -						enum xe_cache_level
> >> cache);
> >>>>>>>> +				  u32 pat_index);
> >>>>>>>>      	} pat_encode;
> >>>>>>>>      };
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> index a782ea282cb6..54585e98452a 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> @@ -24,6 +24,7 @@
> >>>>>>>>      #include "xe_lrc.h"
> >>>>>>>>      #include "xe_map.h"
> >>>>>>>>      #include "xe_mocs.h"
> >>>>>>>> +#include "xe_pat.h"
> >>>>>>>>      #include "xe_pt.h"
> >>>>>>>>      #include "xe_res_cursor.h"
> >>>>>>>>      #include "xe_sched_job.h"
> >>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>>      	u32 num_entries = NUM_PT_SLOTS, num_level = vm-
> >>> pt_root[id]-
> >>>>>>> level;
> >>>>>>>>      	u32 map_ofs, level, i;
> >>>>>>>>      	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >>>>>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>>>>>>      	u64 entry;
> >>>>>>>>      	int ret;
> >>>>>>>>
> >>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> >>>> xe_tile
> >>>>>> *tile, struct xe_migrate *m,
> >>>>>>>>      		return ret;
> >>>>>>>>      	}
> >>>>>>>>
> >>>>>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >>>> XE_CACHE_WB);
> >>>>>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >> pat_index);
> >>>>>>>>      	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>>>>>
> >>>>>>>>      	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>>>>>
> >>>>>>>>      	/* Map the entire BO in our level 0 pt */
> >>>>>>>>      	for (i = 0, level = 0; i < num_entries; level++) {
> >>>>>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >> XE_CACHE_WB,
> >>>>>> 0);
> >>>>>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >> pat_index, 0);
> >>>>>>>>
> >>>>>>>>      		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >> entry);
> >>>>>>>>
> >>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>>      		for (i = 0; i < batch->size;
> >>>>>>>>      		     i += vm->flags & XE_VM_FLAG_64K ?
> >> XE_64K_PAGE_SIZE :
> >>>>>>>>      		     XE_PAGE_SIZE) {
> >>>>>>>> -			entry = xe_pte_encode(batch, i,
> >> XE_CACHE_WB, 0);
> >>>>>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>>>>>
> >>>>>>>>      			xe_map_wr(xe, &bo->vmap, map_ofs + level
> *
> >> 8, u64,
> >>>>>>>>      				  entry);
> >>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>>      			flags = XE_PDE_64K;
> >>>>>>>>
> >>>>>>>>      		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>>>>>> -					XE_PAGE_SIZE,
> >> XE_CACHE_WB);
> >>>>>>>> +					XE_PAGE_SIZE, pat_index);
> >>>>>>>>      		xe_map_wr(xe, &bo->vmap, map_ofs +
> XE_PAGE_SIZE
> >> * level,
> >>>>>> u64,
> >>>>>>>>      			  entry | flags);
> >>>>>>>>      	}
> >>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>>      	/* Write PDE's that point to our BO. */
> >>>>>>>>      	for (i = 0; i < num_entries - num_level; i++) {
> >>>>>>>>      		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>>>>>> -				      XE_CACHE_WB);
> >>>>>>>> +				      pat_index);
> >>>>>>>>
> >>>>>>>>      		xe_map_wr(xe, &bo->vmap, map_ofs +
> XE_PAGE_SIZE
> >> +
> >>>>>>>>      			  (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
> >> @@
> >>>>>>>> xe_migrate_update_pgtables(struct
> >>>> xe_migrate
> >>>>>> *m,
> >>>>>>>>
> >>>>>>>>      			XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>>>>>
> >>>>>>>> -			addr = xe_pte_encode(pt_bo, 0,
> >> XE_CACHE_WB, 0);
> >>>>>>>> +			addr = xe_pte_encode(pt_bo, 0,
> >> xe_pat_get_index(xe,
> >>>>>> XE_CACHE_WB), 0);
> >>>>>>>>      			bb->cs[bb->len++] = lower_32_bits(addr);
> >>>>>>>>      			bb->cs[bb->len++] = upper_32_bits(addr);
> >>>>>>>>      		}
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
> >>>>>>>> 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>> @@ -10,6 +10,7 @@
> >>>>>>>>      #include "xe_gt.h"
> >>>>>>>>      #include "xe_gt_tlb_invalidation.h"
> >>>>>>>>      #include "xe_migrate.h"
> >>>>>>>> +#include "xe_pat.h"
> >>>>>>>>      #include "xe_pt_types.h"
> >>>>>>>>      #include "xe_pt_walk.h"
> >>>>>>>>      #include "xe_res_cursor.h"
> >>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
> >>>>>>>> xe_pt_dir
> >>>>>> *pt_dir, unsigned int index)
> >>>>>>>>       *
> >>>>>>>>       * Return: An encoded page directory entry. No errors.
> >>>>>>>>       */
> >>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>> -		  const enum xe_cache_level cache)
> >>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>> +pat_index)
> >>>>>>>>      {
> >>>>>>>>      	u64 pde;
> >>>>>>>>      	struct xe_vm *vm = bo->vm;
> >>>>>>>>      	struct xe_device *xe = vm->xe;
> >>>>>>>>
> >>>>>>>> -
> >>>>>>>>      	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>>>>>>      	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>>>>>
> >>>>>>>>      	if ((vm->pat_encode).pde_encode)
> >>>>>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>>>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde,
> >> pat_index);
> >>>>>>>>
> >>>>>>>>      	return pde;
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>>>>>>      			struct xe_vma *vma, u32 pt_level)
> >>>>>>>>      {
> >>>>>>>>      	struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
> >> static
> >>>>>>>> u64 __pte_encode(u64 pte, enum
> >>>>>> xe_cache_level cache,
> >>>>>>>>      		pte |= XE_PTE_NULL;
> >>>>>>>>
> >>>>>>>>      	if ((vm->pat_encode).pte_encode)
> >>>>>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>>>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte,
> >> pat_index);
> >>>>>>>>
> >>>>>>>>      	if (pt_level == 1)
> >>>>>>>>      		pte |= XE_PDE_PS_2M;
> >>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >>>>>> xe_cache_level cache,
> >>>>>>>>       *
> >>>>>>>>       * Return: An encoded page-table entry. No errors.
> >>>>>>>>       */
> >>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>> xe_cache_level
> >>>>>> cache,
> >>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>>>>      		  u32 pt_level)
> >>>>>>>>      {
> >>>>>>>>      	u64 pte;
> >>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> >>>> offset,
> >>>>>> enum xe_cache_level cache,
> >>>>>>>>      	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>>>>>>      		pte |= XE_PPGTT_PTE_DM;
> >>>>>>>>
> >>>>>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
> >>>>>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>>      static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
> >>>>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
> >>>>>>>> __xe_pt_empty_pte(struct xe_tile
> >>>> *tile,
> >>>>>> struct xe_vm *vm,
> >>>>>>>>
> >>>>>>>>      	if (level == 0) {
> >>>>>>>>      		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>>>>>> -					  XE_CACHE_WB, 0);
> >>>>>>>> +					  xe_pat_get_index(vm->xe,
> >>>>>> XE_CACHE_WB), 0);
> >>>>>>>>
> >>>>>>>>      		return empty;
> >>>>>>>>      	} else {
> >>>>>>>>      		return xe_pde_encode(vm->scratch_pt[id][level - 1]-
> >>> bo, 0,
> >>>>>>>> -				     XE_CACHE_WB);
> >>>>>>>> +				     xe_pat_get_index(vm->xe,
> >>>>>> XE_CACHE_WB));
> >>>>>>>>      	}
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>>>>>>      	struct xe_vm *vm;
> >>>>>>>>      	/** @tile: The tile we're building for. */
> >>>>>>>>      	struct xe_tile *tile;
> >>>>>>>> -	/** @cache: Desired cache level for the ptes */
> >>>>>>>> -	enum xe_cache_level cache;
> >>>>>>>>      	/** @default_pte: PTE flag only template. No address is
> >> associated */
> >>>>>>>>      	u64 default_pte;
> >>>>>>>>      	/** @dma_offset: DMA offset to add to the PTE. */ @@
> >>>>>>>> -604,7
> >>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>>>> pgoff_t offset,
> >>>>>>>>
> >>>>>>>>      		pte = __pte_encode(is_null ? 0 :
> >>>>>>>>      				   xe_res_dma(curs) + xe_walk-
> >>> dma_offset,
> >>>>>>>> -				   xe_walk->cache, xe_walk->vma,
> >> level);
> >>>>>>>> +				   xe_walk->vma->pat_index,
> xe_walk-
> >>> vma,
> >>>>>> level);
> >>>>>>>>      		pte |= xe_walk->default_pte;
> >>>>>>>>
> >>>>>>>>      		/*
> >>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
> >>>>>>>> *parent,
> >>>>>> pgoff_t offset,
> >>>>>>>>      			xe_child->is_compact = true;
> >>>>>>>>      		}
> >>>>>>>>
> >>>>>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
> >cache)
> >> |
> >>>> flags;
> >>>>>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>>>>>> pat_index) | flags;
> >>>>>>>>      		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
> >> xe_child,
> >>>>>>>>      					 pte);
> >>>>>>>>      	}
> >>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
> >>>>>>>> struct
> >>>> xe_vma
> >>>>>> *vma,
> >>>>>>>>      		if (vma && vma->gpuva.flags &
> >> XE_VMA_ATOMIC_PTE_BIT)
> >>>>>>>>      			xe_walk.default_pte |=
> >> XE_USM_PPGTT_PTE_AE;
> >>>>>>>>      		xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>>>>>> ttm.resource);
> >>>>>>>> -		xe_walk.cache = XE_CACHE_WB;
> >>>>>>>> -	} else {
> >>>>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
> >>>>>> XE_BO_SCANOUT_BIT)
> >>>>>>>> -			xe_walk.cache = XE_CACHE_WT;
> >>>>>>>> -		else
> >>>>>>>> -			xe_walk.cache = XE_CACHE_WB;
> >>>>>>>>      	}
> >>>>>>>> +
> >>>>>>>>      	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>>>>>>      		xe_walk.dma_offset =
> >>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
> >>>>>>>> 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
> >>>>>>>> struct
> >>>>>> xe_vma *vma, struct xe_exec_queu
> >>>>>>>>
> >>>>>>>>      bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
> >>>>>>>> *vma);
> >>>>>>>>
> >>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>> -		  const enum xe_cache_level level);
> >>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>> +pat_index);
> >>>>>>>>
> >>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>> xe_cache_level
> >>>>>> cache,
> >>>>>>>> -		  u32 pt_level);
> >>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>>>> +u32
> >>>>>> pt_level);
> >>>>>>>>
> >>>>>>>>      #endif
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
> 7eeeed0411f3..34603a7e84b0
> >>>>>>>> 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>> @@ -6,6 +6,7 @@
> >>>>>>>>      #include "xe_vm.h"
> >>>>>>>>
> >>>>>>>>      #include <linux/dma-fence-array.h>
> >>>>>>>> +#include <linux/nospec.h>
> >>>>>>>>
> >>>>>>>>      #include <drm/drm_print.h>
> >>>>>>>>      #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
> >>>>>>>> static struct xe_vma *xe_vma_create(struct
> >>>> xe_vm
> >>>>>> *vm,
> >>>>>>>>      				    u64 start, u64 end,
> >>>>>>>>      				    bool read_only,
> >>>>>>>>      				    bool is_null,
> >>>>>>>> -				    u8 tile_mask)
> >>>>>>>> +				    u8 tile_mask,
> >>>>>>>> +				    u32 pat_index)
> >>>>>>>>      {
> >>>>>>>>      	struct xe_vma *vma;
> >>>>>>>>      	struct xe_tile *tile;
> >>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
> *xe_vma_create(struct
> >>>> xe_vm
> >>>>>> *vm,
> >>>>>>>>      			vma->tile_mask |= 0x1 << id;
> >>>>>>>>      	}
> >>>>>>>>
> >>>>>>>> +	vma->pat_index = pat_index;
> >>>>>>>> +
> >>>>>>>>      	if (vm->xe->info.platform == XE_PVC)
> >>>>>>>>      		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>>>>>
> >>>>>>>> @@ -1194,10 +1198,8 @@ static void
> xe_vma_op_work_func(struct
> >>>>>> work_struct *w);
> >>>>>>>>      static void vm_destroy_work_func(struct work_struct *w);
> >>>>>>>>
> >>>>>>>>      static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
> >>>>>>>> u64
> >>>> pde_pat,
> >>>>>>>> -						enum xe_cache_level
> >> cache)
> >>>>>>>> +				     u32 pat_index)
> >>>>>>>>      {
> >>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>> -
> >>>>>>>>      	pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>>>>>
> >>>>>>>>      	if (pat_index & BIT(0))
> >>>>>>>> @@ -1213,10 +1215,8 @@ static u64
> >>>> xelp_ppgtt_pde_encode_pat(struct
> >>>>>> xe_device *xe, u64 pde_pat,
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>>      static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
> >>>>>>>> u64
> >>>> pte_pat,
> >>>>>>>> -						enum xe_cache_level
> >> cache)
> >>>>>>>> +				     u32 pat_index)
> >>>>>>>>      {
> >>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>> -
> >>>>>>>>      	pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>>>>>
> >>>>>>>>      	if (pat_index & BIT(0))
> >>>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct
> xe_file
> >>>> *xef,
> >>>>>> u32 id)
> >>>>>>>>      u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile
> *tile)
> >>>>>>>>      {
> >>>>>>>>      	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>>>>>> -			     XE_CACHE_WB);
> >>>>>>>> +			     xe_pat_get_index(vm->xe,
> >> XE_CACHE_WB));
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>>      static struct dma_fence *
> >>>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device
> >>>>>>>> *xe,
> >>>> struct
> >>>>>> drm_gpuva_op *op)
> >>>>>>>>      static struct drm_gpuva_ops *
> >>>>>>>>      vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>>>>>>>      			 u64 bo_offset_or_userptr, u64 addr, u64
> >> range,
> >>>>>>>> -			 u32 operation, u8 tile_mask, u32 region)
> >>>>>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
> >>>>>> pat_index)
> >>>>>>>>      {
> >>>>>>>>      	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>>>>>>      	struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
> >>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
> >> *vm,
> >>>>>> struct xe_bo *bo,
> >>>>>>>>      			struct xe_vma_op *op =
> >> gpuva_op_to_vma_op(__op);
> >>>>>>>>
> >>>>>>>>      			op->tile_mask = tile_mask;
> >>>>>>>> +			op->pat_index = pat_index;
> >>>>>>>>      			op->map.immediate =
> >>>>>>>>      				operation &
> >>>> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>>>>>>      			op->map.read_only =
> >>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >> *vm,
> >>>>>> struct xe_bo *bo,
> >>>>>>>>      			struct xe_vma_op *op =
> >> gpuva_op_to_vma_op(__op);
> >>>>>>>>
> >>>>>>>>      			op->tile_mask = tile_mask;
> >>>>>>>> +			op->pat_index = pat_index;
> >>>>>>>>      			op->prefetch.region = region;
> >>>>>>>>      		}
> >>>>>>>>      		break;
> >>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >> *vm,
> >>>>>> struct xe_bo *bo,
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>>      static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >>>>>> drm_gpuva_op_map *op,
> >>>>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
> >>>>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
> >>>>>>>> +			      u32 pat_index)
> >>>>>>>>      {
> >>>>>>>>      	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
> >>> gem.obj) :
> >>>> NULL;
> >>>>>>>>      	struct xe_vma *vma;
> >>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
> >>>>>>>> xe_vm
> >>>>>> *vm, struct drm_gpuva_op_map *op,
> >>>>>>>>      	vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>>>>>>      			    op->va.addr, op->va.addr +
> >>>>>>>>      			    op->va.range - 1, read_only, is_null,
> >>>>>>>> -			    tile_mask);
> >>>>>>>> +			    tile_mask, pat_index);
> >>>>>>>>      	if (bo)
> >>>>>>>>      		xe_bo_unlock(bo, &ww);
> >>>>>>>>
> >>>>>>>> @@ -2539,7 +2542,7 @@ static int
> vm_bind_ioctl_ops_parse(struct
> >>>> xe_vm
> >>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>
> >>>>>>>>      				vma = new_vma(vm, &op-
> >base.map,
> >>>>>>>>      					      op->tile_mask, op-
> >>>>> map.read_only,
> >>>>>>>> -					      op->map.is_null);
> >>>>>>>> +					      op->map.is_null, op-
> >>> pat_index);
> >>>>>>>>      				if (IS_ERR(vma)) {
> >>>>>>>>      					err = PTR_ERR(vma);
> >>>>>>>>      					goto free_fence;
> >>>>>>>> @@ -2567,7 +2570,7 @@ static int
> vm_bind_ioctl_ops_parse(struct
> >>>> xe_vm
> >>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>
> >>>>>>>>      					vma = new_vma(vm, op-
> >>>>>>> base.remap.prev,
> >>>>>>>>      						      op->tile_mask,
> >> read_only,
> >>>>>>>> -						      is_null);
> >>>>>>>> +						      is_null, op-
> >>> pat_index);
> >>>>>>>>      					if (IS_ERR(vma)) {
> >>>>>>>>      						err = PTR_ERR(vma);
> >>>>>>>>      						goto free_fence;
> >>>>>>>> @@ -2603,7 +2606,7 @@ static int
> vm_bind_ioctl_ops_parse(struct
> >>>> xe_vm
> >>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>
> >>>>>>>>      					vma = new_vma(vm, op-
> >>>>>>> base.remap.next,
> >>>>>>>>      						      op->tile_mask,
> >> read_only,
> >>>>>>>> -						      is_null);
> >>>>>>>> +						      is_null, op-
> >>> pat_index);
> >>>>>>>>      					if (IS_ERR(vma)) {
> >>>>>>>>      						err = PTR_ERR(vma);
> >>>>>>>>      						goto free_fence;
> >>>>>>>> @@ -3158,8 +3161,14 @@ static int
> >>>>>>>> vm_bind_ioctl_check_args(struct
> >>>>>> xe_device *xe,
> >>>>>>>>      		u32 obj = (*bind_ops)[i].obj;
> >>>>>>>>      		u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>>>>>>      		u32 region = (*bind_ops)[i].region;
> >>>>>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
> >>>>>>>>      		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>>>>>
> >>>>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>>>>>> info.pat_table_n_entries)) {
> >>>>>>>> +			err = -EINVAL;
> >>>>>>>> +			goto free_bind_ops;
> >>>>>>>> +		}
> >>>>>>>> +
> >>>>>>>>      		if (i == 0) {
> >>>>>>>>      			*async = !!(op &
> XE_VM_BIND_FLAG_ASYNC);
> >>>>>>>>      		} else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
> >> +3355,25
> >>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
> >>>> *dev,
> >>>>>> void *data, struct drm_file *file)
> >>>>>>>>      		struct drm_gem_object *gem_obj;
> >>>>>>>>      		u64 range = bind_ops[i].range;
> >>>>>>>>      		u64 addr = bind_ops[i].addr;
> >>>>>>>> +		u32 op = bind_ops[i].op;
> >>>>>>>>      		u32 obj = bind_ops[i].obj;
> >>>>>>>>      		u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>> +		u16 coh_mode;
> >>>>>>>> +
> >>>>>>>> +		pat_index = array_index_nospec(pat_index,
> >>>>>>>> +					       xe-
> >>> info.pat_table_n_entries);
> >>>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe,
> >> pat_index);
> >>>>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>>>>>
> >>>>>>> Assuming we drop the unusable entries from the TGL table, this
> >>>>>>> should be impossible, right?  Any index that makes it past the
> >>>>>>> n_entries check at the top of the function should have a valid,
> >>>>>>> non-zero
> >> coh_mode value.
> >>>>>>> So this should probably be an assertion (to highlight a KMD bug)
> >>>>>>> rather than just a silent uapi failure return.
> >>>>>>
> >>>>>> Makes sense.
> >>>>>>
> >>>>>>>
> >>>>>>>> +			err = -EINVAL;
> >>>>>>>> +			goto put_obj;
> >>>>>>>> +		}
> >>>>>>>> +
> >>>>>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>>>>>> +				 coh_mode ==
> >> XE_GEM_COHERENCY_NONE))
> >>>> {
> >>>>>>>> +			err = -EINVAL;
> >>>>>>>> +			goto put_obj;
> >>>>>>>> +		}
> >>>>>>>>
> >>>>>>>>      		if (!obj)
> >>>>>>>>      			continue;
> >>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
> drm_device
> >>>> *dev,
> >>>>>> void *data, struct drm_file *file)
> >>>>>>>>      				goto put_obj;
> >>>>>>>>      			}
> >>>>>>>>      		}
> >>>>>>>> +
> >>>>>>>> +		if (bos[i]->coh_mode) {
> >>>>>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >>>>>> coh_mode)) {
> >>>>>>>> +				err = -EINVAL;
> >>>>>>>> +				goto put_obj;
> >>>>>>>> +			}
> >>>>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
> >>>>>> XE_GEM_COHERENCY_NONE)) {
> >>>>>>>> +			/*
> >>>>>>>> +			 * Imported dma-buf from a different device
> >> should
> >>>>>>>> +			 * require 1way or 2way coherency since we
> >> don't
> >>>>>> know
> >>>>>>>> +			 * how it was mapped on CPU. Just assume is
> it
> >>>>>>>> +			 * potentially cached on CPU side.
> >>>>>>>> +			 */
> >>>>>>>> +			err = -EINVAL;
> >>>>>>>> +			goto put_obj;
> >>>>>>>> +		}
> >>>>>>>>      	}
> >>>>>>>>
> >>>>>>>>      	if (args->num_syncs) {
> >>>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct
> drm_device
> >>>> *dev,
> >>>>>> void *data, struct drm_file *file)
> >>>>>>>>      		u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>>      		u8 tile_mask = bind_ops[i].tile_mask;
> >>>>>>>>      		u32 region = bind_ops[i].region;
> >>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>>
> >>>>>>>>      		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
> >> obj_offset,
> >>>>>>>>      						  addr, range, op,
> >> tile_mask,
> >>>>>>>> -						  region);
> >>>>>>>> +						  region, pat_index);
> >>>>>>>>      		if (IS_ERR(ops[i])) {
> >>>>>>>>      			err = PTR_ERR(ops[i]);
> >>>>>>>>      			ops[i] = NULL;
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> index 83a1f87b6537..508679ed3c74 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>>>>>>      	 */
> >>>>>>>>      	u8 tile_present;
> >>>>>>>>
> >>>>>>>> +	/**
> >>>>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs
> >> for
> >>>>>>>> +this
> >>>>>> vma.
> >>>>>>>> +	 */
> >>>>>>>> +	u32 pat_index;
> >>>>>>>> +
> >>>>>>>>      	struct {
> >>>>>>>>      		struct list_head rebind_link;
> >>>>>>>>      	} notifier;
> >>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>>>>>>      	bool batch_invalidate_tlb;
> >>>>>>>>
> >>>>>>>>      	struct {
> >>>>>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>> -						enum xe_cache_level
> >> cache);
> >>>>>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64
> pde_pat,
> >>>>>>>> -						enum xe_cache_level
> >> cache);
> >>>>>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> u32
> >>>>>> pat_index);
> >>>>>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64
> pde_pat,
> >> u32
> >>>>>> pat_index);
> >>>>>>>>      	} pat_encode;
> >>>>>>>>      };
> >>>>>>>>
> >>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>>>>>>      	struct async_op_fence *fence;
> >>>>>>>>      	/** @tile_mask: gt mask for this operation */
> >>>>>>>>      	u8 tile_mask;
> >>>>>>>> +	/** @pat_index: The pat index to use for this operation. */
> >>>>>>>> +	u32 pat_index;
> >>>>>>>>      	/** @flags: operation flags */
> >>>>>>>>      	enum xe_vma_op_flags flags;
> >>>>>>>>
> >>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
> >>>>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
> >>>>>>>> 100644
> >>>>>>>> --- a/include/uapi/drm/xe_drm.h
> >>>>>>>> +++ b/include/uapi/drm/xe_drm.h
> >>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>>>>>>      	 */
> >>>>>>>>      	__u32 obj;
> >>>>>>>>
> >>>>>>>> -	/** @pad: MBZ */
> >>>>>>>> -	__u32 pad;
> >>>>>>>> +	/**
> >>>>>>>> +	 * @pat_index: The platform defined @pat_index to use for
> >> this
> >>>>>> mapping.
> >>>>>>>> +	 * The index basically maps to some predefined memory
> >> attributes,
> >>>>>>>> +	 * including things like caching, coherency and likely other
> >>>>>>>> +stuff
> >> in
> >>>>>>>> +	 * the future.  The exact meaning of the pat_index is
> >>>>>>>> +platform specific
> >>>>>>>
> >>>>>>> BTW, "other stuff in the future" already includes compression on
> >>>>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
> >>>>>>
> >>>>>> Ok, good to know.
> >>>>>>
> >>>>>>>
> >>>>>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets
> up
> >> the
> >>>>>> binding
> >>>>>>>> +	 * the index here is encoded into the ppGTT PTE.
> >>>>>>>> +	 *
> >>>>>>>> +	 * For coherency the @pat_index needs to match the
> >>>>>>>> +	 * drm_xe_gem_create.coh_mode, so either
> >>>>>> XE_GEM_COHERENCY_NONE,
> >>>>>>>> +	 * XE_GEM_COHERENCY_1WAY or
> >> XE_GEM_COHERENCY_2WAY. The
> >>>>>> KMD will extract
> >>>>>>>> +	 * the coherency mode from the @pat_index and reject if
> >> there is a
> >>>>>>>> +	 * mismatch (see note below for pre-MTL platforms).
> >>>>>>>> +	 *
> >>>>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode
> >> and no
> >>>>>>>> +	 * explicit coherency mode, but on such hardware there is
> >> always a
> >>>>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are
> >>>>>>>> +coherent
> >>>> with
> >>>>>>>> +	 * CPU caches even with the caching mode set as uncached.
> >> It's
> >>>>>>>> +only
> >>>>>> the
> >>>>>>>> +	 * display engine that is incoherent (on dgpu it must be in
> >> VRAM which
> >>>>>>>> +	 * is always mapped as WC on the CPU). However to keep the
> >> uapi
> >>>>>> somewhat
> >>>>>>>> +	 * consistent with newer platforms the KMD groups the
> >> different
> >>>> cache
> >>>>>>>> +	 * levels into the following coherency buckets on all pre-MTL
> >>>> platforms:
> >>>>>>>> +	 *
> >>>>>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>>>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>>>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>>>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>>>>>
> >>>>>>> As noted on the previous patch, it seems like 2-way is
> >>>>>>> appropriate for LLC platforms, but 1-way might be a more
> >>>>>>> accurate description of dGPU behavior.
> >>>>>>>
> >>>>>>>> +	 *
> >>>>>>>> +	 * In practice UC/WC/WT should only ever used for scanout
> >>>>>>>> +surfaces
> >>>> on
> >>>>>>>> +	 * such platforms since it is only the display engine that is
> >> actually
> >>>>>>>> +	 * incoherent. Everything else should typically use WB given
> >>>>>>>> +that we
> >>>>>>>
> >>>>>>> What if we're sharing our buffers with some other (non-GPU)
> >>>>>>> device?  Are there cases where that other device wouldn't be
> >>>>>>> coherent with the LLC, so we'd want to use one of these?
> >>>>>>
> >>>>>> Yeah, I guess there might be cases like that. I'll reword.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also
> >>>>>> potentially
> >>>>>>>> +	 * no shared-LLC) and the HW defines the coherency mode as
> >> part
> >>>>>>>> +of
> >>>>>> the
> >>>>>>>> +	 * @pat_index.
> >>>>>>>> +	 *
> >>>>>>>> +	 * Note: For userptr and externally imported dma-buf the
> >> kernel
> >>>>>> expects
> >>>>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
> >>>>>>>> +	 */
> >>>>>>>> +	__u32 pat_index;
> >>>>>>>>
> >>>>>>>>      	union {
> >>>>>>>>      		/**
> >>>>>>>> --
> >>>>>>>> 2.41.0
> >>>>>>>>
> >>>>>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-05  9:12                   ` Zhang, Carl
@ 2023-09-05  9:46                     ` Matthew Auld
  2023-09-05 13:50                       ` Zhang, Carl
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-09-05  9:46 UTC (permalink / raw)
  To: Zhang, Carl, Roper, Matthew D
  Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
	Yu, Effie, intel-xe@lists.freedesktop.org

On 05/09/2023 10:12, Zhang, Carl wrote:
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Monday, September 4, 2023 5:24 PM
>>
>> On 01/09/2023 10:34, Zhang, Carl wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Thursday, August 31, 2023 6:44 PM
>>>>
>>>> On 31/08/2023 09:24, Zhang, Carl wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>> Sent: Thursday, August 31, 2023 12:02 AM
>>>>>>
>>>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>>>>>> Several questions:
>>>>>>> 1. the pat_index from vm_bind will override the setting from
>> bo_create?
>>>>>>> How to keep the value from bo_create unchanged?
>>>>>>
>>>>>> You only set the pat_index during vm_bind. At gem_create you just
>>>>>> need tell the kernel what CPU side caching mode to use for system
>>>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
>>>>>> pat_index. The current expectation is that the coherency mode of
>>>>>> the pat_index should match the one at gem_create.
>>>>>>
>>>>> It means that we set the incomplete pat_index at gem_create
>>>>> (coherency
>>>>> only) And set another part at vm_bind (gpu cache) Why we separate
>>>>> them? Why not set a full value at vm_bind?
>>>>
>>>> You always give the full pat_index at vm_bind. It's platform specific
>>>> but the BSpec for the most part tells you what the coherency mode of the
>> pat_index is.
>>>>
>>>> For example if you look at the MTL pat_index table it tells you the
>>>> coh_mode:
>>>>
>>>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
>>>> pat_index 3   = coh_mode "1way"     = COH_1WAY
>>>> pat_index 4   = coh_mode "2way"     = COH_2WAY
>>>>
>>>> On future platforms the table is larger and also encodes stuff like
>>>> compression, but KMD really only cares about the coh_mode, since that
>>>> will restrict the allowed CPU smem_caching values that userspace can
>> select.
>>>>
>>>>    From KMD pov this matters since we always give userspace zeroed
>>>> memory, which is useful for normal applications but is also needed for
>> security reasons.
>>>> However the actual clearing might result in the writes only being
>>>> visible in the CPU cache (not flushed to main memory), and if the GPU
>>>> doesn't snoop the CPU cache it can read directly from main memory,
>>>> which effectively bypasses the clearing. To prevent that the KMD
>>>> needs to know how userspace will map the memory via the GPU i.e what
>>>> is the coh_mode for the pat_index. The gist is that if you select
>>>> COH_NONE then it needs to clflush the pages before giving them to
>>>> userspace, which is done by forcing you to use uc/wc for the
>>>> smem_caching (the kernel will internally issue cflush on x86 when
>>>> marking the pages as uc/wc). And if you select wb there is no flushing, but
>> here you need to use 1way or 2way.
>>>>
>>>
>>>    It sounds like a perf optimization, if no considering the perf, you
>>> could always call Clflush to flush the cpu cache to main memory before
>>> giving it to user space
>>
>> I think the other thing is that the KMD is no longer doing any manual flushing,
>> but instead the core kernel will now do that for us when marking the pages as
>> wc/uc on x86.
>>
>>>
>>>
>>>> AFAIK separating them is just to prevent userspace from mixing
>>>> different coherency modes for the same object by having it immutable
>>>> (you can still use different pat_index but the coherency must be
>> compatible).
>>>> Also at creation time we can determine if the smem_caching is
>>>> compatible with the coh_mode.
>>>>
>>>> Note that it might be that we ditch the COH_2WAY and just have
>>>> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be slightly
>> different.
>>>>
>>>>>
>>>>>>> 2. no UC/WB/WC definition (CPU cachable) in
>>>>>>> drm_xe_gem_mmap_offset,
>>>>>> will it be handled by KMD automatically?
>>>>>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
>>>>>>> could use WB  in mmap offset If it is COHERENCY_NONE, we could
>>>>>>> only use
>>>>>> UC,  all these logic is handled by KMD automatically?
>>>>>>
>>>>>> Yes, the mmap will use the same CPU side caching mode that you set
>>>>>> at gem_create with smem_caching (wb/wc/uc).
>>>>>>
>>>>> It means it  could not be changed after gem_create.  Sometimes, UMD
>>>>> doesn’t know Whether the data need to be accessed by CPU at
>>>>> gem_create .  only application Knows the purpose of allocation & the
>>>>> surface usage.  So, may need change UMD api to let applications set
>>>>> whether the
>>>> resource is "accessible", or it will hurt the perf.
>>>>
>>>> Right, the smem_caching is immutable.
>>>>
>>>>>
>>>>>>> 3.  about " For imported dma-buf (from a different device) the
>>>>>>> coherency mode is also implicit and must also be either 1WAY or
>> 2WAY"
>>>>>>> it means it must be 1way or 2way, and UMD need not to set it?
>>>>>>
>>>>>> If you are going to vm_bind it you need to supply the pat_index and
>>>>>> in this case it needs to be either 1way or 2way. The object comes
>>>>>> from a different device so we might not have smem_caching/coh_mode
>>>>>> like we do for native objects.
>>>>>>
>>>>> How about the GPU cache setting of imported dma-buf?
>>>>> Is there possible conflict between dma-buf producer and consumer?
>>>>> For example:
>>>>> In producer side , one bo is set GPU L3 cache,  also CPU cacheable,
>>>>> it should be 2Way In client side, we know the coherency should be
>>>>> 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-way.
>>>>> It may cause some problems?
>>>>
>>>> If the producer and consumer are the same device, then I think the
>>>> original gem_create.coh_mode is considered, since it's effectively
>>>> still a native object. In which case the normal rules apply.
>>>
>>> But the consumer does not know the coh_mode, so , it is possible that
>>> it set will set Another value, then it will failed.
>>> And because consumer does not know it is 1-way or 2-way,  consumer
>>> could not Set the gpu cache correctly.  Because if it is 1-way, cpu
>>> could not snoop gpu cache
>>>
>>>>
>>>> If it comes from a different device then we don't know what
>>>> gem_create.coh_mode is (or if that even exists) since we don't have
>>>> the usual native object, but just some opaque dma-buf object. In this
>>>> case it must be at least 1way or 2way.
>>>>
>>>>    From KMD pov we only care about userspace not being able to bypass
>>>> the clearing. There might still be other coherency issues though.
>>>>
>>>>>
>>>>> Also , when a bo is created, UMD does not know whether it will be
>> exported.
>>>>> It maybe be set as COHERENCY_NONE, right?
>>>>
>>>> Yeah, you can use whatever you want. Just that your smem_caching will
>>>> need to be wc/uc if using COH_NONE. On the importer side, if it comes
>>>> from the same device then it's just a normal native object and the normal
>> rules apply.
>>>> Currently this would mean also using pat_index with COH_NONE. If it
>>>> is imported from a different device then you need to use a pat_index
>>>> with 1way or 2way, since KMD doesn't really know how the object is
>>>> mapped on the CPU or if the CPU caches are dirty (potentially with
>> clearing).
>>>>
>>>> Is this going to be problematic?
>>>>
>>> If imported surfaces is from same device, how consumer know it is
>>> original coh_non Or 1-way or 2-way?
>>>
>>> And it also means producer and consumer could have different pat_index for
>> same bo.
>>
>> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external or
>> from the same device? Would that help here? i.e you can always select 1way
>> or 2way, even if it came from the same device and was coh_none?
>>
>> Other option is you could call the vm_bind ioctl to check if the coh_mode is
>> supported by checking if it returns an error? But maybe that is too nasy.
>>
> 
> Still a bit confuse,  I create a bo with coh_none, then export it to a PRIME fd.
> In consumer side, it does not know it is coh_none, and will call vm_bind to set pat_index.
>      
> You mean it will return a error when I set different coh value though vm_bind?
> And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.

Yeah, it is nasty.

> 
> My concern is: if it is coh_non or 1-way,  it means I could not use GPU cache if the bo also is CPU accessible.
> Because CPU could not snoop GPU cache.
> So, GPU cache specified by pat_index also will be rejected?
> 
> Could we add some query interface to query the CPU cache and coherency setting of a bo?

Say if the buffer comes from a different device + driver, what pat_index 
would you choose for that? There is no such thing as 
coh_mode/smem_caching for such an object. If you have to handle that 
case can't you just treat all imported dma-buf the same as that? The new 
proposal was to allow 1way or 2way for any dma-buf object, even if it 
comes from the same device and has different BO coh_mode. Do you need 
more than that? AFAICT the query would only really work for objects 
imported from the same device.

> 
>>
>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>>>>>> <pallavi.mishra@intel.com>;
>>>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
>>>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>>>>>> Hazubski,
>>>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
>>>>>>>> <carl.zhang@intel.com>; Yu,
>>>>>> Effie
>>>>>>>> <effie.yu@intel.com>
>>>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
>>>>>>>> with
>>>>>> vm_bind
>>>>>>>>
>>>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>>>>>> Allow userspace to directly control the pat_index for a given
>>>>>>>>>> vm binding. This should allow directly controlling the
>>>>>>>>>> coherency, caching and potentially other stuff in the future for the
>> ppGTT binding.
>>>>>>>>>>
>>>>>>>>>> The exact meaning behind the pat_index is very platform
>>>>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
>>>>>>>>>> predefined memory attributes. From the KMD pov we only care
>>>>>>>>>> about the coherency that is provided by the pat_index, which
>>>>>>>>>> falls into either
>>>> NONE, 1WAY or 2WAY.
>>>>>>>>>> The vm_bind coherency mode for the given pat_index needs to
>>>>>>>>>> match
>>>>>> the
>>>>>>>>>> given coh_mode that was set at object creation. For platforms
>>>>>>>>>> that lack
>>>>>>>>>
>>>>>>>>> Is it actually important to match the coherency mode?  I think
>>>>>>>>> one of the main goals was to know up front if userspace might be
>>>>>>>>> using a non-snooping PAT setting that would let it bypass the
>>>>>>>>> CPU cache (and potentially read old, stale data from a different
>>>>>>>>> process if the buffer's clear value is still sitting in cache
>>>>>>>>> and hasn't landed in memory yet).
>>>>>>>>>
>>>>>>>>> If that's the only concern, then I think it should still be fine
>>>>>>>>> to map with a non-matching PAT as long as it's more coherent
>>>>>>>>> than the one specified at creation, right?  E.g., if the buffer
>>>>>>>>> was created with 1-way coherency, it would be fine to map it
>>>>>>>>> with 2-way because userspace still can't use that to observe the
>>>>>>>>> previous contents of the buffer. Or
>>>>>>>>
>>>>>>>> Yeah, I guess we could in theory do something that.
>>>>>>>>
>>>>>>>>> if the buffer was created with "non-coherent" then we've already
>>>>>>>>> done the necessary clflushing in kernel before handing to buffer
>>>>>>>>> over to userspace to ensure the clear value landed in memory, so
>>>>>>>>> any valid PAT index should be safe (from a security POV) after
>>>>>>>>> that, right?  Any other problems that arise from mismatched
>>>>>>>>> coherency would just be contained to the app possibly shooting
>>>>>>>>> itself in the foot, which isn't really our concern.
>>>>>>>>
>>>>>>>> That is also my understanding, at least from the KMD security pov.
>>>>>>>> If you allocate as wb then you must use at least 1way, since
>>>>>>>> there is no flushing
>>>>>> for
>>>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever you
>> want.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
>> WB
>>>>>>>>>> as
>>>>>>>> 2WAY.
>>>>>>>>>>
>>>>>>>>>> For userptr mappings we lack a corresponding gem object, so the
>>>>>> expected
>>>>>>>>>> coherency mode is instead implicit and must fall into either
>>>>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
>>>>>>>>>> kernel. For imported dma-buf (from a different device) the
>>>>>>>>>> coherency mode is also implicit and must also be either 1WAY or
>> 2WAY.
>>>>>>>>>>
>>>>>>>>>> As part of adding pat_index support with vm_bind we also need
>>>>>>>>>> stop
>>>>>> using
>>>>>>>>>> xe_cache_level and instead use the pat_index in various places.
>>>>>>>>>> We still make use of xe_cache_level, but only as a convenience
>>>>>>>>>> for kernel internal objectsi (internally it maps to some
>>>>>>>>>> reasonable pat_index). For
>>>>>>>>>
>>>>>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
>>>>>>>>> in appropriate places, similar to what we do with MOCS (gt-
>>>>>>> mocs.uc_index,
>>>>>>>>> gt->mocs.wb_index)?
>>>>>>>>
>>>>>>>> OK, seems reasonable to me.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> now this is just a 1:1 conversion of the existing code, however
>>>>>>>>>> for platforms like MTL+ we might need to give more control
>>>>>>>>>> through
>>>>>> bo_create
>>>>>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>>>>>
>>>>>>>>>> Bspec: 45101, 44235 #xe
>>>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>>>>>> ---
>>>>>>>>>>       drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>>>>>>>>>       drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>>>>>>>>>       drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>>>>>>>>>       drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>>>>>>>>>       drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>>>>>>>>>       drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>>>>>>>>>       drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>>>>>>>>>       drivers/gpu/drm/xe/xe_vm.c            | 73
>> +++++++++++++++++++++----
>>>> -
>>>>>> -
>>>>>>>>>>       drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>>>>>>>>>       include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>>>>>>>>>       10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>>>>>>       		(*alloc_scratch_dma)(struct i915_address_space
>> *vm,
>>>> int
>>>>>>>>>> sz);
>>>>>>>>>>
>>>>>>>>>>       	u64 (*pte_encode)(dma_addr_t addr,
>>>>>>>>>> -			  unsigned int pat_index,
>>>>>>>>>> +			  u32 pat_index,
>>>>>>>>>>       			  u32 flags); /* Create a valid PTE */
>>>>>>>>>>       #define PTE_READ_ONLY	BIT(0)
>>>>>>>>>>       #define PTE_LM		BIT(1)
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>>>>>> xe_migrate
>>>>>>>> *m, struct kunit *test)
>>>>>>>>>>       	/* First part of the test, are we updating our pagetable
>>>>>>>>>> bo with a new
>>>>>>>> entry? */
>>>>>>>>>>       	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
>>>> (NUM_KERNEL_PDE -
>>>>>> 1),
>>>>>>>> u64,
>>>>>>>>>>       		  0xdeaddeadbeefbeef);
>>>>>>>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>>>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>       	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>>>>>>       		expected |= XE_PTE_PS64;
>>>>>>>>>>       	if (xe_bo_is_vram(pt))
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
>> u64
>>>>>>>> bo_offset)
>>>>>>>>>>       		pte |= XE_GGTT_PTE_DM;
>>>>>>>>>>
>>>>>>>>>>       	if ((ggtt->pat_encode).pte_encode)
>>>>>>>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>> XE_CACHE_WB_1_WAY);
>>>>>>>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>>>> +
>>>> xe_pat_get_index(xe,
>>>>>>>> XE_CACHE_WB_1_WAY));
>>>>>>>>>>
>>>>>>>>>>       	return pte;
>>>>>>>>>>       }
>>>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
>> *ggtt)
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>       static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>>>> u64
>>>>>> pte_pat,
>>>>>>>>>> -						enum xe_cache_level
>>>> cache)
>>>>>>>>>> +				     u32 pat_index)
>>>>>>>>>>       {
>>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>> -
>>>>>>>>>>       	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>>>>>
>>>>>>>>>>       	if (pat_index & BIT(0))
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>>>>>
>>>>>>>>>>       	struct {
>>>>>>>>>>       		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>> -						enum xe_cache_level
>>>> cache);
>>>>>>>>>> +				  u32 pat_index);
>>>>>>>>>>       	} pat_encode;
>>>>>>>>>>       };
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> @@ -24,6 +24,7 @@
>>>>>>>>>>       #include "xe_lrc.h"
>>>>>>>>>>       #include "xe_map.h"
>>>>>>>>>>       #include "xe_mocs.h"
>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>>       #include "xe_pt.h"
>>>>>>>>>>       #include "xe_res_cursor.h"
>>>>>>>>>>       #include "xe_sched_job.h"
>>>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>       	u32 num_entries = NUM_PT_SLOTS, num_level = vm-
>>>>> pt_root[id]-
>>>>>>>>> level;
>>>>>>>>>>       	u32 map_ofs, level, i;
>>>>>>>>>>       	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>>>>>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>>>>>>       	u64 entry;
>>>>>>>>>>       	int ret;
>>>>>>>>>>
>>>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>>>>>> xe_tile
>>>>>>>> *tile, struct xe_migrate *m,
>>>>>>>>>>       		return ret;
>>>>>>>>>>       	}
>>>>>>>>>>
>>>>>>>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>>>> XE_CACHE_WB);
>>>>>>>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>> pat_index);
>>>>>>>>>>       	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>>>>>
>>>>>>>>>>       	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>>>>>
>>>>>>>>>>       	/* Map the entire BO in our level 0 pt */
>>>>>>>>>>       	for (i = 0, level = 0; i < num_entries; level++) {
>>>>>>>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>> XE_CACHE_WB,
>>>>>>>> 0);
>>>>>>>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>> pat_index, 0);
>>>>>>>>>>
>>>>>>>>>>       		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>> entry);
>>>>>>>>>>
>>>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>       		for (i = 0; i < batch->size;
>>>>>>>>>>       		     i += vm->flags & XE_VM_FLAG_64K ?
>>>> XE_64K_PAGE_SIZE :
>>>>>>>>>>       		     XE_PAGE_SIZE) {
>>>>>>>>>> -			entry = xe_pte_encode(batch, i,
>>>> XE_CACHE_WB, 0);
>>>>>>>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>>>>>
>>>>>>>>>>       			xe_map_wr(xe, &bo->vmap, map_ofs + level
>> *
>>>> 8, u64,
>>>>>>>>>>       				  entry);
>>>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>       			flags = XE_PDE_64K;
>>>>>>>>>>
>>>>>>>>>>       		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>>>>>> -					XE_PAGE_SIZE,
>>>> XE_CACHE_WB);
>>>>>>>>>> +					XE_PAGE_SIZE, pat_index);
>>>>>>>>>>       		xe_map_wr(xe, &bo->vmap, map_ofs +
>> XE_PAGE_SIZE
>>>> * level,
>>>>>>>> u64,
>>>>>>>>>>       			  entry | flags);
>>>>>>>>>>       	}
>>>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>       	/* Write PDE's that point to our BO. */
>>>>>>>>>>       	for (i = 0; i < num_entries - num_level; i++) {
>>>>>>>>>>       		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>>>>>> -				      XE_CACHE_WB);
>>>>>>>>>> +				      pat_index);
>>>>>>>>>>
>>>>>>>>>>       		xe_map_wr(xe, &bo->vmap, map_ofs +
>> XE_PAGE_SIZE
>>>> +
>>>>>>>>>>       			  (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
>>>> @@
>>>>>>>>>> xe_migrate_update_pgtables(struct
>>>>>> xe_migrate
>>>>>>>> *m,
>>>>>>>>>>
>>>>>>>>>>       			XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>>>>>
>>>>>>>>>> -			addr = xe_pte_encode(pt_bo, 0,
>>>> XE_CACHE_WB, 0);
>>>>>>>>>> +			addr = xe_pte_encode(pt_bo, 0,
>>>> xe_pat_get_index(xe,
>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>       			bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>>>>>>       			bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>>>>>>       		}
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
>>>>>>>>>> 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>> @@ -10,6 +10,7 @@
>>>>>>>>>>       #include "xe_gt.h"
>>>>>>>>>>       #include "xe_gt_tlb_invalidation.h"
>>>>>>>>>>       #include "xe_migrate.h"
>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>>       #include "xe_pt_types.h"
>>>>>>>>>>       #include "xe_pt_walk.h"
>>>>>>>>>>       #include "xe_res_cursor.h"
>>>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
>>>>>>>>>> xe_pt_dir
>>>>>>>> *pt_dir, unsigned int index)
>>>>>>>>>>        *
>>>>>>>>>>        * Return: An encoded page directory entry. No errors.
>>>>>>>>>>        */
>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>> -		  const enum xe_cache_level cache)
>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>> +pat_index)
>>>>>>>>>>       {
>>>>>>>>>>       	u64 pde;
>>>>>>>>>>       	struct xe_vm *vm = bo->vm;
>>>>>>>>>>       	struct xe_device *xe = vm->xe;
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>>       	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>>>>>>       	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>>>>>
>>>>>>>>>>       	if ((vm->pat_encode).pde_encode)
>>>>>>>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>>>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde,
>>>> pat_index);
>>>>>>>>>>
>>>>>>>>>>       	return pde;
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>>>>>>       			struct xe_vma *vma, u32 pt_level)
>>>>>>>>>>       {
>>>>>>>>>>       	struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
>>>> static
>>>>>>>>>> u64 __pte_encode(u64 pte, enum
>>>>>>>> xe_cache_level cache,
>>>>>>>>>>       		pte |= XE_PTE_NULL;
>>>>>>>>>>
>>>>>>>>>>       	if ((vm->pat_encode).pte_encode)
>>>>>>>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>>>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte,
>>>> pat_index);
>>>>>>>>>>
>>>>>>>>>>       	if (pt_level == 1)
>>>>>>>>>>       		pte |= XE_PDE_PS_2M;
>>>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>>>>>> xe_cache_level cache,
>>>>>>>>>>        *
>>>>>>>>>>        * Return: An encoded page-table entry. No errors.
>>>>>>>>>>        */
>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>> xe_cache_level
>>>>>>>> cache,
>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>>>>       		  u32 pt_level)
>>>>>>>>>>       {
>>>>>>>>>>       	u64 pte;
>>>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>>>>>> offset,
>>>>>>>> enum xe_cache_level cache,
>>>>>>>>>>       	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>>>>>>       		pte |= XE_PPGTT_PTE_DM;
>>>>>>>>>>
>>>>>>>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
>>>>>>>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>       static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
>>>>>>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
>>>>>>>>>> __xe_pt_empty_pte(struct xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_vm *vm,
>>>>>>>>>>
>>>>>>>>>>       	if (level == 0) {
>>>>>>>>>>       		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>>>>>> -					  XE_CACHE_WB, 0);
>>>>>>>>>> +					  xe_pat_get_index(vm->xe,
>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>
>>>>>>>>>>       		return empty;
>>>>>>>>>>       	} else {
>>>>>>>>>>       		return xe_pde_encode(vm->scratch_pt[id][level - 1]-
>>>>> bo, 0,
>>>>>>>>>> -				     XE_CACHE_WB);
>>>>>>>>>> +				     xe_pat_get_index(vm->xe,
>>>>>>>> XE_CACHE_WB));
>>>>>>>>>>       	}
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>>>>>       	struct xe_vm *vm;
>>>>>>>>>>       	/** @tile: The tile we're building for. */
>>>>>>>>>>       	struct xe_tile *tile;
>>>>>>>>>> -	/** @cache: Desired cache level for the ptes */
>>>>>>>>>> -	enum xe_cache_level cache;
>>>>>>>>>>       	/** @default_pte: PTE flag only template. No address is
>>>> associated */
>>>>>>>>>>       	u64 default_pte;
>>>>>>>>>>       	/** @dma_offset: DMA offset to add to the PTE. */ @@
>>>>>>>>>> -604,7
>>>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>>>> pgoff_t offset,
>>>>>>>>>>
>>>>>>>>>>       		pte = __pte_encode(is_null ? 0 :
>>>>>>>>>>       				   xe_res_dma(curs) + xe_walk-
>>>>> dma_offset,
>>>>>>>>>> -				   xe_walk->cache, xe_walk->vma,
>>>> level);
>>>>>>>>>> +				   xe_walk->vma->pat_index,
>> xe_walk-
>>>>> vma,
>>>>>>>> level);
>>>>>>>>>>       		pte |= xe_walk->default_pte;
>>>>>>>>>>
>>>>>>>>>>       		/*
>>>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
>>>>>>>>>> *parent,
>>>>>>>> pgoff_t offset,
>>>>>>>>>>       			xe_child->is_compact = true;
>>>>>>>>>>       		}
>>>>>>>>>>
>>>>>>>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
>>> cache)
>>>> |
>>>>>> flags;
>>>>>>>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>>>>>> pat_index) | flags;
>>>>>>>>>>       		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
>>>> xe_child,
>>>>>>>>>>       					 pte);
>>>>>>>>>>       	}
>>>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
>>>>>>>>>> struct
>>>>>> xe_vma
>>>>>>>> *vma,
>>>>>>>>>>       		if (vma && vma->gpuva.flags &
>>>> XE_VMA_ATOMIC_PTE_BIT)
>>>>>>>>>>       			xe_walk.default_pte |=
>>>> XE_USM_PPGTT_PTE_AE;
>>>>>>>>>>       		xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>>>>>> ttm.resource);
>>>>>>>>>> -		xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>> -	} else {
>>>>>>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>>>>>> XE_BO_SCANOUT_BIT)
>>>>>>>>>> -			xe_walk.cache = XE_CACHE_WT;
>>>>>>>>>> -		else
>>>>>>>>>> -			xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>>       	}
>>>>>>>>>> +
>>>>>>>>>>       	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>>>>>       		xe_walk.dma_offset =
>>>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
>>>>>>>>>> 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
>>>>>>>>>> struct
>>>>>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>>>>>
>>>>>>>>>>       bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
>>>>>>>>>> *vma);
>>>>>>>>>>
>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>> -		  const enum xe_cache_level level);
>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>> +pat_index);
>>>>>>>>>>
>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>> xe_cache_level
>>>>>>>> cache,
>>>>>>>>>> -		  u32 pt_level);
>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>>>> +u32
>>>>>>>> pt_level);
>>>>>>>>>>
>>>>>>>>>>       #endif
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
>> 7eeeed0411f3..34603a7e84b0
>>>>>>>>>> 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>>>>       #include "xe_vm.h"
>>>>>>>>>>
>>>>>>>>>>       #include <linux/dma-fence-array.h>
>>>>>>>>>> +#include <linux/nospec.h>
>>>>>>>>>>
>>>>>>>>>>       #include <drm/drm_print.h>
>>>>>>>>>>       #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
>>>>>>>>>> static struct xe_vma *xe_vma_create(struct
>>>>>> xe_vm
>>>>>>>> *vm,
>>>>>>>>>>       				    u64 start, u64 end,
>>>>>>>>>>       				    bool read_only,
>>>>>>>>>>       				    bool is_null,
>>>>>>>>>> -				    u8 tile_mask)
>>>>>>>>>> +				    u8 tile_mask,
>>>>>>>>>> +				    u32 pat_index)
>>>>>>>>>>       {
>>>>>>>>>>       	struct xe_vma *vma;
>>>>>>>>>>       	struct xe_tile *tile;
>>>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
>> *xe_vma_create(struct
>>>>>> xe_vm
>>>>>>>> *vm,
>>>>>>>>>>       			vma->tile_mask |= 0x1 << id;
>>>>>>>>>>       	}
>>>>>>>>>>
>>>>>>>>>> +	vma->pat_index = pat_index;
>>>>>>>>>> +
>>>>>>>>>>       	if (vm->xe->info.platform == XE_PVC)
>>>>>>>>>>       		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>>>>>
>>>>>>>>>> @@ -1194,10 +1198,8 @@ static void
>> xe_vma_op_work_func(struct
>>>>>>>> work_struct *w);
>>>>>>>>>>       static void vm_destroy_work_func(struct work_struct *w);
>>>>>>>>>>
>>>>>>>>>>       static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
>>>>>>>>>> u64
>>>>>> pde_pat,
>>>>>>>>>> -						enum xe_cache_level
>>>> cache)
>>>>>>>>>> +				     u32 pat_index)
>>>>>>>>>>       {
>>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>> -
>>>>>>>>>>       	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>>>>>
>>>>>>>>>>       	if (pat_index & BIT(0))
>>>>>>>>>> @@ -1213,10 +1215,8 @@ static u64
>>>>>> xelp_ppgtt_pde_encode_pat(struct
>>>>>>>> xe_device *xe, u64 pde_pat,
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>       static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>>>> u64
>>>>>> pte_pat,
>>>>>>>>>> -						enum xe_cache_level
>>>> cache)
>>>>>>>>>> +				     u32 pat_index)
>>>>>>>>>>       {
>>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>> -
>>>>>>>>>>       	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>>>>>
>>>>>>>>>>       	if (pat_index & BIT(0))
>>>>>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct
>> xe_file
>>>>>> *xef,
>>>>>>>> u32 id)
>>>>>>>>>>       u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile
>> *tile)
>>>>>>>>>>       {
>>>>>>>>>>       	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>>>>>> -			     XE_CACHE_WB);
>>>>>>>>>> +			     xe_pat_get_index(vm->xe,
>>>> XE_CACHE_WB));
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>       static struct dma_fence *
>>>>>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device
>>>>>>>>>> *xe,
>>>>>> struct
>>>>>>>> drm_gpuva_op *op)
>>>>>>>>>>       static struct drm_gpuva_ops *
>>>>>>>>>>       vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>>>>>>>       			 u64 bo_offset_or_userptr, u64 addr, u64
>>>> range,
>>>>>>>>>> -			 u32 operation, u8 tile_mask, u32 region)
>>>>>>>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
>>>>>>>> pat_index)
>>>>>>>>>>       {
>>>>>>>>>>       	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>>>>>       	struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
>>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
>>>> *vm,
>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>       			struct xe_vma_op *op =
>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>
>>>>>>>>>>       			op->tile_mask = tile_mask;
>>>>>>>>>> +			op->pat_index = pat_index;
>>>>>>>>>>       			op->map.immediate =
>>>>>>>>>>       				operation &
>>>>>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>>>>>       			op->map.read_only =
>>>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>> *vm,
>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>       			struct xe_vma_op *op =
>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>
>>>>>>>>>>       			op->tile_mask = tile_mask;
>>>>>>>>>> +			op->pat_index = pat_index;
>>>>>>>>>>       			op->prefetch.region = region;
>>>>>>>>>>       		}
>>>>>>>>>>       		break;
>>>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>> *vm,
>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>       static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>>>>>> drm_gpuva_op_map *op,
>>>>>>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
>>>>>>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
>>>>>>>>>> +			      u32 pat_index)
>>>>>>>>>>       {
>>>>>>>>>>       	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
>>>>> gem.obj) :
>>>>>> NULL;
>>>>>>>>>>       	struct xe_vma *vma;
>>>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
>>>>>>>>>> xe_vm
>>>>>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>>>>>>       	vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>>>>>       			    op->va.addr, op->va.addr +
>>>>>>>>>>       			    op->va.range - 1, read_only, is_null,
>>>>>>>>>> -			    tile_mask);
>>>>>>>>>> +			    tile_mask, pat_index);
>>>>>>>>>>       	if (bo)
>>>>>>>>>>       		xe_bo_unlock(bo, &ww);
>>>>>>>>>>
>>>>>>>>>> @@ -2539,7 +2542,7 @@ static int
>> vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm
>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>
>>>>>>>>>>       				vma = new_vma(vm, &op-
>>> base.map,
>>>>>>>>>>       					      op->tile_mask, op-
>>>>>>> map.read_only,
>>>>>>>>>> -					      op->map.is_null);
>>>>>>>>>> +					      op->map.is_null, op-
>>>>> pat_index);
>>>>>>>>>>       				if (IS_ERR(vma)) {
>>>>>>>>>>       					err = PTR_ERR(vma);
>>>>>>>>>>       					goto free_fence;
>>>>>>>>>> @@ -2567,7 +2570,7 @@ static int
>> vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm
>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>
>>>>>>>>>>       					vma = new_vma(vm, op-
>>>>>>>>> base.remap.prev,
>>>>>>>>>>       						      op->tile_mask,
>>>> read_only,
>>>>>>>>>> -						      is_null);
>>>>>>>>>> +						      is_null, op-
>>>>> pat_index);
>>>>>>>>>>       					if (IS_ERR(vma)) {
>>>>>>>>>>       						err = PTR_ERR(vma);
>>>>>>>>>>       						goto free_fence;
>>>>>>>>>> @@ -2603,7 +2606,7 @@ static int
>> vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm
>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>
>>>>>>>>>>       					vma = new_vma(vm, op-
>>>>>>>>> base.remap.next,
>>>>>>>>>>       						      op->tile_mask,
>>>> read_only,
>>>>>>>>>> -						      is_null);
>>>>>>>>>> +						      is_null, op-
>>>>> pat_index);
>>>>>>>>>>       					if (IS_ERR(vma)) {
>>>>>>>>>>       						err = PTR_ERR(vma);
>>>>>>>>>>       						goto free_fence;
>>>>>>>>>> @@ -3158,8 +3161,14 @@ static int
>>>>>>>>>> vm_bind_ioctl_check_args(struct
>>>>>>>> xe_device *xe,
>>>>>>>>>>       		u32 obj = (*bind_ops)[i].obj;
>>>>>>>>>>       		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>>>>>       		u32 region = (*bind_ops)[i].region;
>>>>>>>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>>>>>>       		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>>>>>
>>>>>>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>>>>>> info.pat_table_n_entries)) {
>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>> +			goto free_bind_ops;
>>>>>>>>>> +		}
>>>>>>>>>> +
>>>>>>>>>>       		if (i == 0) {
>>>>>>>>>>       			*async = !!(op &
>> XE_VM_BIND_FLAG_ASYNC);
>>>>>>>>>>       		} else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
>>>> +3355,25
>>>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
>>>>>> *dev,
>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>       		struct drm_gem_object *gem_obj;
>>>>>>>>>>       		u64 range = bind_ops[i].range;
>>>>>>>>>>       		u64 addr = bind_ops[i].addr;
>>>>>>>>>> +		u32 op = bind_ops[i].op;
>>>>>>>>>>       		u32 obj = bind_ops[i].obj;
>>>>>>>>>>       		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>> +		u16 coh_mode;
>>>>>>>>>> +
>>>>>>>>>> +		pat_index = array_index_nospec(pat_index,
>>>>>>>>>> +					       xe-
>>>>> info.pat_table_n_entries);
>>>>>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe,
>>>> pat_index);
>>>>>>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>>>>>
>>>>>>>>> Assuming we drop the unusable entries from the TGL table, this
>>>>>>>>> should be impossible, right?  Any index that makes it past the
>>>>>>>>> n_entries check at the top of the function should have a valid,
>>>>>>>>> non-zero
>>>> coh_mode value.
>>>>>>>>> So this should probably be an assertion (to highlight a KMD bug)
>>>>>>>>> rather than just a silent uapi failure return.
>>>>>>>>
>>>>>>>> Makes sense.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>> +			goto put_obj;
>>>>>>>>>> +		}
>>>>>>>>>> +
>>>>>>>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>>>>>> +				 coh_mode ==
>>>> XE_GEM_COHERENCY_NONE))
>>>>>> {
>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>> +			goto put_obj;
>>>>>>>>>> +		}
>>>>>>>>>>
>>>>>>>>>>       		if (!obj)
>>>>>>>>>>       			continue;
>>>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
>> drm_device
>>>>>> *dev,
>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>       				goto put_obj;
>>>>>>>>>>       			}
>>>>>>>>>>       		}
>>>>>>>>>> +
>>>>>>>>>> +		if (bos[i]->coh_mode) {
>>>>>>>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>>>>>> coh_mode)) {
>>>>>>>>>> +				err = -EINVAL;
>>>>>>>>>> +				goto put_obj;
>>>>>>>>>> +			}
>>>>>>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>>>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>>>>>> +			/*
>>>>>>>>>> +			 * Imported dma-buf from a different device
>>>> should
>>>>>>>>>> +			 * require 1way or 2way coherency since we
>>>> don't
>>>>>>>> know
>>>>>>>>>> +			 * how it was mapped on CPU. Just assume is
>> it
>>>>>>>>>> +			 * potentially cached on CPU side.
>>>>>>>>>> +			 */
>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>> +			goto put_obj;
>>>>>>>>>> +		}
>>>>>>>>>>       	}
>>>>>>>>>>
>>>>>>>>>>       	if (args->num_syncs) {
>>>>>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct
>> drm_device
>>>>>> *dev,
>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>       		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>>       		u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>>>>>       		u32 region = bind_ops[i].region;
>>>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>>
>>>>>>>>>>       		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
>>>> obj_offset,
>>>>>>>>>>       						  addr, range, op,
>>>> tile_mask,
>>>>>>>>>> -						  region);
>>>>>>>>>> +						  region, pat_index);
>>>>>>>>>>       		if (IS_ERR(ops[i])) {
>>>>>>>>>>       			err = PTR_ERR(ops[i]);
>>>>>>>>>>       			ops[i] = NULL;
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>>>>>>       	 */
>>>>>>>>>>       	u8 tile_present;
>>>>>>>>>>
>>>>>>>>>> +	/**
>>>>>>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs
>>>> for
>>>>>>>>>> +this
>>>>>>>> vma.
>>>>>>>>>> +	 */
>>>>>>>>>> +	u32 pat_index;
>>>>>>>>>> +
>>>>>>>>>>       	struct {
>>>>>>>>>>       		struct list_head rebind_link;
>>>>>>>>>>       	} notifier;
>>>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>>>>>>       	bool batch_invalidate_tlb;
>>>>>>>>>>
>>>>>>>>>>       	struct {
>>>>>>>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>> -						enum xe_cache_level
>>>> cache);
>>>>>>>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64
>> pde_pat,
>>>>>>>>>> -						enum xe_cache_level
>>>> cache);
>>>>>>>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> u32
>>>>>>>> pat_index);
>>>>>>>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64
>> pde_pat,
>>>> u32
>>>>>>>> pat_index);
>>>>>>>>>>       	} pat_encode;
>>>>>>>>>>       };
>>>>>>>>>>
>>>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>>>>>>       	struct async_op_fence *fence;
>>>>>>>>>>       	/** @tile_mask: gt mask for this operation */
>>>>>>>>>>       	u8 tile_mask;
>>>>>>>>>> +	/** @pat_index: The pat index to use for this operation. */
>>>>>>>>>> +	u32 pat_index;
>>>>>>>>>>       	/** @flags: operation flags */
>>>>>>>>>>       	enum xe_vma_op_flags flags;
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
>>>>>>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
>>>>>>>>>> 100644
>>>>>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>>>>>>       	 */
>>>>>>>>>>       	__u32 obj;
>>>>>>>>>>
>>>>>>>>>> -	/** @pad: MBZ */
>>>>>>>>>> -	__u32 pad;
>>>>>>>>>> +	/**
>>>>>>>>>> +	 * @pat_index: The platform defined @pat_index to use for
>>>> this
>>>>>>>> mapping.
>>>>>>>>>> +	 * The index basically maps to some predefined memory
>>>> attributes,
>>>>>>>>>> +	 * including things like caching, coherency and likely other
>>>>>>>>>> +stuff
>>>> in
>>>>>>>>>> +	 * the future.  The exact meaning of the pat_index is
>>>>>>>>>> +platform specific
>>>>>>>>>
>>>>>>>>> BTW, "other stuff in the future" already includes compression on
>>>>>>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
>>>>>>>>
>>>>>>>> Ok, good to know.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets
>> up
>>>> the
>>>>>>>> binding
>>>>>>>>>> +	 * the index here is encoded into the ppGTT PTE.
>>>>>>>>>> +	 *
>>>>>>>>>> +	 * For coherency the @pat_index needs to match the
>>>>>>>>>> +	 * drm_xe_gem_create.coh_mode, so either
>>>>>>>> XE_GEM_COHERENCY_NONE,
>>>>>>>>>> +	 * XE_GEM_COHERENCY_1WAY or
>>>> XE_GEM_COHERENCY_2WAY. The
>>>>>>>> KMD will extract
>>>>>>>>>> +	 * the coherency mode from the @pat_index and reject if
>>>> there is a
>>>>>>>>>> +	 * mismatch (see note below for pre-MTL platforms).
>>>>>>>>>> +	 *
>>>>>>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode
>>>> and no
>>>>>>>>>> +	 * explicit coherency mode, but on such hardware there is
>>>> always a
>>>>>>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are
>>>>>>>>>> +coherent
>>>>>> with
>>>>>>>>>> +	 * CPU caches even with the caching mode set as uncached.
>>>> It's
>>>>>>>>>> +only
>>>>>>>> the
>>>>>>>>>> +	 * display engine that is incoherent (on dgpu it must be in
>>>> VRAM which
>>>>>>>>>> +	 * is always mapped as WC on the CPU). However to keep the
>>>> uapi
>>>>>>>> somewhat
>>>>>>>>>> +	 * consistent with newer platforms the KMD groups the
>>>> different
>>>>>> cache
>>>>>>>>>> +	 * levels into the following coherency buckets on all pre-MTL
>>>>>> platforms:
>>>>>>>>>> +	 *
>>>>>>>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>>>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>>>>>
>>>>>>>>> As noted on the previous patch, it seems like 2-way is
>>>>>>>>> appropriate for LLC platforms, but 1-way might be a more
>>>>>>>>> accurate description of dGPU behavior.
>>>>>>>>>
>>>>>>>>>> +	 *
>>>>>>>>>> +	 * In practice UC/WC/WT should only ever used for scanout
>>>>>>>>>> +surfaces
>>>>>> on
>>>>>>>>>> +	 * such platforms since it is only the display engine that is
>>>> actually
>>>>>>>>>> +	 * incoherent. Everything else should typically use WB given
>>>>>>>>>> +that we
>>>>>>>>>
>>>>>>>>> What if we're sharing our buffers with some other (non-GPU)
>>>>>>>>> device?  Are there cases where that other device wouldn't be
>>>>>>>>> coherent with the LLC, so we'd want to use one of these?
>>>>>>>>
>>>>>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also
>>>>>>>> potentially
>>>>>>>>>> +	 * no shared-LLC) and the HW defines the coherency mode as
>>>> part
>>>>>>>>>> +of
>>>>>>>> the
>>>>>>>>>> +	 * @pat_index.
>>>>>>>>>> +	 *
>>>>>>>>>> +	 * Note: For userptr and externally imported dma-buf the
>>>> kernel
>>>>>>>> expects
>>>>>>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
>>>>>>>>>> +	 */
>>>>>>>>>> +	__u32 pat_index;
>>>>>>>>>>
>>>>>>>>>>       	union {
>>>>>>>>>>       		/**
>>>>>>>>>> --
>>>>>>>>>> 2.41.0
>>>>>>>>>>
>>>>>>>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-05  9:46                     ` Matthew Auld
@ 2023-09-05 13:50                       ` Zhang, Carl
  2023-09-05 14:07                         ` Matthew Auld
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-09-05 13:50 UTC (permalink / raw)
  To: Auld, Matthew, Roper, Matthew D
  Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
	Yu, Effie, intel-xe@lists.freedesktop.org


> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Tuesday, September 5, 2023 5:47 PM
> 
> On 05/09/2023 10:12, Zhang, Carl wrote:
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Monday, September 4, 2023 5:24 PM
> >>
> >> On 01/09/2023 10:34, Zhang, Carl wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>> Sent: Thursday, August 31, 2023 6:44 PM
> >>>>
> >>>> On 31/08/2023 09:24, Zhang, Carl wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>>>> Sent: Thursday, August 31, 2023 12:02 AM
> >>>>>>
> >>>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
> >>>>>>> Several questions:
> >>>>>>> 1. the pat_index from vm_bind will override the setting from
> >> bo_create?
> >>>>>>> How to keep the value from bo_create unchanged?
> >>>>>>
> >>>>>> You only set the pat_index during vm_bind. At gem_create you just
> >>>>>> need tell the kernel what CPU side caching mode to use for system
> >>>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
> >>>>>> pat_index. The current expectation is that the coherency mode of
> >>>>>> the pat_index should match the one at gem_create.
> >>>>>>
> >>>>> It means that we set the incomplete pat_index at gem_create
> >>>>> (coherency
> >>>>> only) And set another part at vm_bind (gpu cache) Why we separate
> >>>>> them? Why not set a full value at vm_bind?
> >>>>
> >>>> You always give the full pat_index at vm_bind. It's platform
> >>>> specific but the BSpec for the most part tells you what the
> >>>> coherency mode of the
> >> pat_index is.
> >>>>
> >>>> For example if you look at the MTL pat_index table it tells you the
> >>>> coh_mode:
> >>>>
> >>>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
> >>>> pat_index 3   = coh_mode "1way"     = COH_1WAY
> >>>> pat_index 4   = coh_mode "2way"     = COH_2WAY
> >>>>
> >>>> On future platforms the table is larger and also encodes stuff like
> >>>> compression, but KMD really only cares about the coh_mode, since
> >>>> that will restrict the allowed CPU smem_caching values that
> >>>> userspace can
> >> select.
> >>>>
> >>>>    From KMD pov this matters since we always give userspace zeroed
> >>>> memory, which is useful for normal applications but is also needed
> >>>> for
> >> security reasons.
> >>>> However the actual clearing might result in the writes only being
> >>>> visible in the CPU cache (not flushed to main memory), and if the
> >>>> GPU doesn't snoop the CPU cache it can read directly from main
> >>>> memory, which effectively bypasses the clearing. To prevent that
> >>>> the KMD needs to know how userspace will map the memory via the GPU
> >>>> i.e what is the coh_mode for the pat_index. The gist is that if you
> >>>> select COH_NONE then it needs to clflush the pages before giving
> >>>> them to userspace, which is done by forcing you to use uc/wc for
> >>>> the smem_caching (the kernel will internally issue cflush on x86
> >>>> when marking the pages as uc/wc). And if you select wb there is no
> >>>> flushing, but
> >> here you need to use 1way or 2way.
> >>>>
> >>>
> >>>    It sounds like a perf optimization, if no considering the perf,
> >>> you could always call Clflush to flush the cpu cache to main memory
> >>> before giving it to user space
> >>
> >> I think the other thing is that the KMD is no longer doing any manual
> >> flushing, but instead the core kernel will now do that for us when
> >> marking the pages as wc/uc on x86.
> >>
> >>>
> >>>
> >>>> AFAIK separating them is just to prevent userspace from mixing
> >>>> different coherency modes for the same object by having it
> >>>> immutable (you can still use different pat_index but the coherency
> >>>> must be
> >> compatible).
> >>>> Also at creation time we can determine if the smem_caching is
> >>>> compatible with the coh_mode.
> >>>>
> >>>> Note that it might be that we ditch the COH_2WAY and just have
> >>>> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be
> >>>> slightly
> >> different.
> >>>>
> >>>>>
> >>>>>>> 2. no UC/WB/WC definition (CPU cachable) in
> >>>>>>> drm_xe_gem_mmap_offset,
> >>>>>> will it be handled by KMD automatically?
> >>>>>>> For example: if set 1-way, it means GPU could snoop CPU cache,
> >>>>>>> we could use WB  in mmap offset If it is COHERENCY_NONE, we
> >>>>>>> could only use
> >>>>>> UC,  all these logic is handled by KMD automatically?
> >>>>>>
> >>>>>> Yes, the mmap will use the same CPU side caching mode that you
> >>>>>> set at gem_create with smem_caching (wb/wc/uc).
> >>>>>>
> >>>>> It means it  could not be changed after gem_create.  Sometimes,
> >>>>> UMD doesn’t know Whether the data need to be accessed by CPU at
> >>>>> gem_create .  only application Knows the purpose of allocation &
> >>>>> the surface usage.  So, may need change UMD api to let
> >>>>> applications set whether the
> >>>> resource is "accessible", or it will hurt the perf.
> >>>>
> >>>> Right, the smem_caching is immutable.
> >>>>
> >>>>>
> >>>>>>> 3.  about " For imported dma-buf (from a different device) the
> >>>>>>> coherency mode is also implicit and must also be either 1WAY or
> >> 2WAY"
> >>>>>>> it means it must be 1way or 2way, and UMD need not to set it?
> >>>>>>
> >>>>>> If you are going to vm_bind it you need to supply the pat_index
> >>>>>> and in this case it needs to be either 1way or 2way. The object
> >>>>>> comes from a different device so we might not have
> >>>>>> smem_caching/coh_mode like we do for native objects.
> >>>>>>
> >>>>> How about the GPU cache setting of imported dma-buf?
> >>>>> Is there possible conflict between dma-buf producer and consumer?
> >>>>> For example:
> >>>>> In producer side , one bo is set GPU L3 cache,  also CPU
> >>>>> cacheable, it should be 2Way In client side, we know the coherency
> >>>>> should be 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-
> way.
> >>>>> It may cause some problems?
> >>>>
> >>>> If the producer and consumer are the same device, then I think the
> >>>> original gem_create.coh_mode is considered, since it's effectively
> >>>> still a native object. In which case the normal rules apply.
> >>>
> >>> But the consumer does not know the coh_mode, so , it is possible
> >>> that it set will set Another value, then it will failed.
> >>> And because consumer does not know it is 1-way or 2-way,  consumer
> >>> could not Set the gpu cache correctly.  Because if it is 1-way, cpu
> >>> could not snoop gpu cache
> >>>
> >>>>
> >>>> If it comes from a different device then we don't know what
> >>>> gem_create.coh_mode is (or if that even exists) since we don't have
> >>>> the usual native object, but just some opaque dma-buf object. In
> >>>> this case it must be at least 1way or 2way.
> >>>>
> >>>>    From KMD pov we only care about userspace not being able to
> >>>> bypass the clearing. There might still be other coherency issues though.
> >>>>
> >>>>>
> >>>>> Also , when a bo is created, UMD does not know whether it will be
> >> exported.
> >>>>> It maybe be set as COHERENCY_NONE, right?
> >>>>
> >>>> Yeah, you can use whatever you want. Just that your smem_caching
> >>>> will need to be wc/uc if using COH_NONE. On the importer side, if
> >>>> it comes from the same device then it's just a normal native object
> >>>> and the normal
> >> rules apply.
> >>>> Currently this would mean also using pat_index with COH_NONE. If it
> >>>> is imported from a different device then you need to use a
> >>>> pat_index with 1way or 2way, since KMD doesn't really know how the
> >>>> object is mapped on the CPU or if the CPU caches are dirty
> >>>> (potentially with
> >> clearing).
> >>>>
> >>>> Is this going to be problematic?
> >>>>
> >>> If imported surfaces is from same device, how consumer know it is
> >>> original coh_non Or 1-way or 2-way?
> >>>
> >>> And it also means producer and consumer could have different
> >>> pat_index for
> >> same bo.
> >>
> >> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external
> >> or from the same device? Would that help here? i.e you can always
> >> select 1way or 2way, even if it came from the same device and was
> coh_none?
> >>
> >> Other option is you could call the vm_bind ioctl to check if the
> >> coh_mode is supported by checking if it returns an error? But maybe that is
> too nasy.
> >>
> >
> > Still a bit confuse,  I create a bo with coh_none, then export it to a PRIME fd.
> > In consumer side, it does not know it is coh_none, and will call vm_bind to
> set pat_index.
> >
> > You mean it will return a error when I set different coh value though
> vm_bind?
> > And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.
> 
> Yeah, it is nasty.
> 
> >
> > My concern is: if it is coh_non or 1-way,  it means I could not use GPU cache if
> the bo also is CPU accessible.
> > Because CPU could not snoop GPU cache.
> > So, GPU cache specified by pat_index also will be rejected?
> >
> > Could we add some query interface to query the CPU cache and coherency
> setting of a bo?
> 
> Say if the buffer comes from a different device + driver, what pat_index would
> you choose for that? There is no such thing as coh_mode/smem_caching for
> such an object. If you have to handle that case can't you just treat all imported
> dma-buf the same as that? The new proposal was to allow 1way or 2way for
> any dma-buf object, even if it comes from the same device and has different BO
> coh_mode. Do you need more than that? AFAICT the query would only really
> work for objects imported from the same device.
> 
Looks I misunderstood the proposal, so, you mean if it shared in one device,
Consumer could choose 1-way ,2-way , any one should work , and no error returned, 
no rejection from vm_bind, right?

> >
> >>
> >>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
> >>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> >>>>>> <pallavi.mishra@intel.com>;
> >>>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
> >>>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >>>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> >>>>>> Hazubski,
> >>>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
> >>>>>>>> <carl.zhang@intel.com>; Yu,
> >>>>>> Effie
> >>>>>>>> <effie.yu@intel.com>
> >>>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
> >>>>>>>> with
> >>>>>> vm_bind
> >>>>>>>>
> >>>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
> >>>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>>>>>>>> Allow userspace to directly control the pat_index for a given
> >>>>>>>>>> vm binding. This should allow directly controlling the
> >>>>>>>>>> coherency, caching and potentially other stuff in the future
> >>>>>>>>>> for the
> >> ppGTT binding.
> >>>>>>>>>>
> >>>>>>>>>> The exact meaning behind the pat_index is very platform
> >>>>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
> >>>>>>>>>> predefined memory attributes. From the KMD pov we only care
> >>>>>>>>>> about the coherency that is provided by the pat_index, which
> >>>>>>>>>> falls into either
> >>>> NONE, 1WAY or 2WAY.
> >>>>>>>>>> The vm_bind coherency mode for the given pat_index needs to
> >>>>>>>>>> match
> >>>>>> the
> >>>>>>>>>> given coh_mode that was set at object creation. For platforms
> >>>>>>>>>> that lack
> >>>>>>>>>
> >>>>>>>>> Is it actually important to match the coherency mode?  I think
> >>>>>>>>> one of the main goals was to know up front if userspace might
> >>>>>>>>> be using a non-snooping PAT setting that would let it bypass
> >>>>>>>>> the CPU cache (and potentially read old, stale data from a
> >>>>>>>>> different process if the buffer's clear value is still sitting
> >>>>>>>>> in cache and hasn't landed in memory yet).
> >>>>>>>>>
> >>>>>>>>> If that's the only concern, then I think it should still be
> >>>>>>>>> fine to map with a non-matching PAT as long as it's more
> >>>>>>>>> coherent than the one specified at creation, right?  E.g., if
> >>>>>>>>> the buffer was created with 1-way coherency, it would be fine
> >>>>>>>>> to map it with 2-way because userspace still can't use that to
> >>>>>>>>> observe the previous contents of the buffer. Or
> >>>>>>>>
> >>>>>>>> Yeah, I guess we could in theory do something that.
> >>>>>>>>
> >>>>>>>>> if the buffer was created with "non-coherent" then we've
> >>>>>>>>> already done the necessary clflushing in kernel before handing
> >>>>>>>>> to buffer over to userspace to ensure the clear value landed
> >>>>>>>>> in memory, so any valid PAT index should be safe (from a
> >>>>>>>>> security POV) after that, right?  Any other problems that
> >>>>>>>>> arise from mismatched coherency would just be contained to the
> >>>>>>>>> app possibly shooting itself in the foot, which isn't really our
> concern.
> >>>>>>>>
> >>>>>>>> That is also my understanding, at least from the KMD security pov.
> >>>>>>>> If you allocate as wb then you must use at least 1way, since
> >>>>>>>> there is no flushing
> >>>>>> for
> >>>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever
> >>>>>>>> you
> >> want.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
> >> WB
> >>>>>>>>>> as
> >>>>>>>> 2WAY.
> >>>>>>>>>>
> >>>>>>>>>> For userptr mappings we lack a corresponding gem object, so
> >>>>>>>>>> the
> >>>>>> expected
> >>>>>>>>>> coherency mode is instead implicit and must fall into either
> >>>>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
> >>>>>>>>>> kernel. For imported dma-buf (from a different device) the
> >>>>>>>>>> coherency mode is also implicit and must also be either 1WAY
> >>>>>>>>>> or
> >> 2WAY.
> >>>>>>>>>>
> >>>>>>>>>> As part of adding pat_index support with vm_bind we also need
> >>>>>>>>>> stop
> >>>>>> using
> >>>>>>>>>> xe_cache_level and instead use the pat_index in various places.
> >>>>>>>>>> We still make use of xe_cache_level, but only as a
> >>>>>>>>>> convenience for kernel internal objectsi (internally it maps
> >>>>>>>>>> to some reasonable pat_index). For
> >>>>>>>>>
> >>>>>>>>> Maybe we should kill xe_cache_level completely and just assign
> >>>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be
> >>>>>>>>> used in appropriate places, similar to what we do with MOCS
> >>>>>>>>> (gt-
> >>>>>>> mocs.uc_index,
> >>>>>>>>> gt->mocs.wb_index)?
> >>>>>>>>
> >>>>>>>> OK, seems reasonable to me.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> now this is just a 1:1 conversion of the existing code,
> >>>>>>>>>> however for platforms like MTL+ we might need to give more
> >>>>>>>>>> control through
> >>>>>> bo_create
> >>>>>>>>>> or stop using WB on the CPU side if we need CPU access.
> >>>>>>>>>>
> >>>>>>>>>> Bspec: 45101, 44235 #xe
> >>>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
> >>>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>>>>>>>> ---
> >>>>>>>>>>       drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> >>>>>>>>>>       drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> >>>>>>>>>>       drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> >>>>>>>>>>       drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> >>>>>>>>>>       drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> >>>>>>>>>>       drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> >>>>>>>>>>       drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> >>>>>>>>>>       drivers/gpu/drm/xe/xe_vm.c            | 73
> >> +++++++++++++++++++++----
> >>>> -
> >>>>>> -
> >>>>>>>>>>       drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> >>>>>>>>>>       include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> >>>>>>>>>>       10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>>>>>>>>       		(*alloc_scratch_dma)(struct i915_address_space
> >> *vm,
> >>>> int
> >>>>>>>>>> sz);
> >>>>>>>>>>
> >>>>>>>>>>       	u64 (*pte_encode)(dma_addr_t addr,
> >>>>>>>>>> -			  unsigned int pat_index,
> >>>>>>>>>> +			  u32 pat_index,
> >>>>>>>>>>       			  u32 flags); /* Create a valid PTE */
> >>>>>>>>>>       #define PTE_READ_ONLY	BIT(0)
> >>>>>>>>>>       #define PTE_LM		BIT(1)
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> >>>>>> xe_migrate
> >>>>>>>> *m, struct kunit *test)
> >>>>>>>>>>       	/* First part of the test, are we updating our
> >>>>>>>>>> pagetable bo with a new
> >>>>>>>> entry? */
> >>>>>>>>>>       	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
> >>>> (NUM_KERNEL_PDE -
> >>>>>> 1),
> >>>>>>>> u64,
> >>>>>>>>>>       		  0xdeaddeadbeefbeef);
> >>>>>>>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>>>>>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >>>>>>>> XE_CACHE_WB), 0);
> >>>>>>>>>>       	if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>>>>>>>>       		expected |= XE_PTE_PS64;
> >>>>>>>>>>       	if (xe_bo_is_vram(pt)) diff --git
> >>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
> >> u64
> >>>>>>>> bo_offset)
> >>>>>>>>>>       		pte |= XE_GGTT_PTE_DM;
> >>>>>>>>>>
> >>>>>>>>>>       	if ((ggtt->pat_encode).pte_encode)
> >>>>>>>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>>>> XE_CACHE_WB_1_WAY);
> >>>>>>>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>>>>>> +
> >>>> xe_pat_get_index(xe,
> >>>>>>>> XE_CACHE_WB_1_WAY));
> >>>>>>>>>>
> >>>>>>>>>>       	return pte;
> >>>>>>>>>>       }
> >>>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
> >> *ggtt)
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>>>       static u64 xelpg_ggtt_pte_encode_pat(struct xe_device
> >>>>>>>>>> *xe,
> >>>>>>>>>> u64
> >>>>>> pte_pat,
> >>>>>>>>>> -						enum xe_cache_level
> >>>> cache)
> >>>>>>>>>> +				     u32 pat_index)
> >>>>>>>>>>       {
> >>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>>>> -
> >>>>>>>>>>       	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>>>>>>>
> >>>>>>>>>>       	if (pat_index & BIT(0)) diff --git
> >>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>>>>>>>
> >>>>>>>>>>       	struct {
> >>>>>>>>>>       		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>>>> -						enum xe_cache_level
> >>>> cache);
> >>>>>>>>>> +				  u32 pat_index);
> >>>>>>>>>>       	} pat_encode;
> >>>>>>>>>>       };
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>>>> index a782ea282cb6..54585e98452a 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>>>> @@ -24,6 +24,7 @@
> >>>>>>>>>>       #include "xe_lrc.h"
> >>>>>>>>>>       #include "xe_map.h"
> >>>>>>>>>>       #include "xe_mocs.h"
> >>>>>>>>>> +#include "xe_pat.h"
> >>>>>>>>>>       #include "xe_pt.h"
> >>>>>>>>>>       #include "xe_res_cursor.h"
> >>>>>>>>>>       #include "xe_sched_job.h"
> >>>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>>       	u32 num_entries = NUM_PT_SLOTS, num_level = vm-
> >>>>> pt_root[id]-
> >>>>>>>>> level;
> >>>>>>>>>>       	u32 map_ofs, level, i;
> >>>>>>>>>>       	struct xe_bo *bo, *batch =
> >>>>>>>>>> tile->mem.kernel_bb_pool->bo;
> >>>>>>>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>>>>>>>>       	u64 entry;
> >>>>>>>>>>       	int ret;
> >>>>>>>>>>
> >>>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >>>>>>>> *tile, struct xe_migrate *m,
> >>>>>>>>>>       		return ret;
> >>>>>>>>>>       	}
> >>>>>>>>>>
> >>>>>>>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >>>>>> XE_CACHE_WB);
> >>>>>>>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >>>> pat_index);
> >>>>>>>>>>       	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>>>>>>>
> >>>>>>>>>>       	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>>>>>>>
> >>>>>>>>>>       	/* Map the entire BO in our level 0 pt */
> >>>>>>>>>>       	for (i = 0, level = 0; i < num_entries; level++) {
> >>>>>>>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >>>> XE_CACHE_WB,
> >>>>>>>> 0);
> >>>>>>>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >>>> pat_index, 0);
> >>>>>>>>>>
> >>>>>>>>>>       		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >>>> entry);
> >>>>>>>>>>
> >>>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>>       		for (i = 0; i < batch->size;
> >>>>>>>>>>       		     i += vm->flags & XE_VM_FLAG_64K ?
> >>>> XE_64K_PAGE_SIZE :
> >>>>>>>>>>       		     XE_PAGE_SIZE) {
> >>>>>>>>>> -			entry = xe_pte_encode(batch, i,
> >>>> XE_CACHE_WB, 0);
> >>>>>>>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>>>>>>>
> >>>>>>>>>>       			xe_map_wr(xe, &bo->vmap, map_ofs + level
> >> *
> >>>> 8, u64,
> >>>>>>>>>>       				  entry);
> >>>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>>       			flags = XE_PDE_64K;
> >>>>>>>>>>
> >>>>>>>>>>       		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>>>>>>>> -					XE_PAGE_SIZE,
> >>>> XE_CACHE_WB);
> >>>>>>>>>> +					XE_PAGE_SIZE, pat_index);
> >>>>>>>>>>       		xe_map_wr(xe, &bo->vmap, map_ofs +
> >> XE_PAGE_SIZE
> >>>> * level,
> >>>>>>>> u64,
> >>>>>>>>>>       			  entry | flags);
> >>>>>>>>>>       	}
> >>>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>>       	/* Write PDE's that point to our BO. */
> >>>>>>>>>>       	for (i = 0; i < num_entries - num_level; i++) {
> >>>>>>>>>>       		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>>>>>>>> -				      XE_CACHE_WB);
> >>>>>>>>>> +				      pat_index);
> >>>>>>>>>>
> >>>>>>>>>>       		xe_map_wr(xe, &bo->vmap, map_ofs +
> >> XE_PAGE_SIZE
> >>>> +
> >>>>>>>>>>       			  (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
> >>>> @@
> >>>>>>>>>> xe_migrate_update_pgtables(struct
> >>>>>> xe_migrate
> >>>>>>>> *m,
> >>>>>>>>>>
> >>>>>>>>>>       			XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>>>>>>>
> >>>>>>>>>> -			addr = xe_pte_encode(pt_bo, 0,
> >>>> XE_CACHE_WB, 0);
> >>>>>>>>>> +			addr = xe_pte_encode(pt_bo, 0,
> >>>> xe_pat_get_index(xe,
> >>>>>>>> XE_CACHE_WB), 0);
> >>>>>>>>>>       			bb->cs[bb->len++] = lower_32_bits(addr);
> >>>>>>>>>>       			bb->cs[bb->len++] = upper_32_bits(addr);
> >>>>>>>>>>       		}
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>>>> @@ -10,6 +10,7 @@
> >>>>>>>>>>       #include "xe_gt.h"
> >>>>>>>>>>       #include "xe_gt_tlb_invalidation.h"
> >>>>>>>>>>       #include "xe_migrate.h"
> >>>>>>>>>> +#include "xe_pat.h"
> >>>>>>>>>>       #include "xe_pt_types.h"
> >>>>>>>>>>       #include "xe_pt_walk.h"
> >>>>>>>>>>       #include "xe_res_cursor.h"
> >>>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
> >>>>>>>>>> xe_pt_dir
> >>>>>>>> *pt_dir, unsigned int index)
> >>>>>>>>>>        *
> >>>>>>>>>>        * Return: An encoded page directory entry. No errors.
> >>>>>>>>>>        */
> >>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>>>> -		  const enum xe_cache_level cache)
> >>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>>>> +pat_index)
> >>>>>>>>>>       {
> >>>>>>>>>>       	u64 pde;
> >>>>>>>>>>       	struct xe_vm *vm = bo->vm;
> >>>>>>>>>>       	struct xe_device *xe = vm->xe;
> >>>>>>>>>>
> >>>>>>>>>> -
> >>>>>>>>>>       	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>>>>>>>>       	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>>>>>>>
> >>>>>>>>>>       	if ((vm->pat_encode).pde_encode)
> >>>>>>>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>>>>>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde,
> >>>> pat_index);
> >>>>>>>>>>
> >>>>>>>>>>       	return pde;
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>>>>>>>>       			struct xe_vma *vma, u32 pt_level)
> >>>>>>>>>>       {
> >>>>>>>>>>       	struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
> >>>> static
> >>>>>>>>>> u64 __pte_encode(u64 pte, enum
> >>>>>>>> xe_cache_level cache,
> >>>>>>>>>>       		pte |= XE_PTE_NULL;
> >>>>>>>>>>
> >>>>>>>>>>       	if ((vm->pat_encode).pte_encode)
> >>>>>>>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>>>>>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte,
> >>>> pat_index);
> >>>>>>>>>>
> >>>>>>>>>>       	if (pt_level == 1)
> >>>>>>>>>>       		pte |= XE_PDE_PS_2M;
> >>>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >>>>>>>> xe_cache_level cache,
> >>>>>>>>>>        *
> >>>>>>>>>>        * Return: An encoded page-table entry. No errors.
> >>>>>>>>>>        */
> >>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>>>> xe_cache_level
> >>>>>>>> cache,
> >>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
> >>>>>>>>>> +pat_index,
> >>>>>>>>>>       		  u32 pt_level)
> >>>>>>>>>>       {
> >>>>>>>>>>       	u64 pte;
> >>>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> >>>>>> offset,
> >>>>>>>> enum xe_cache_level cache,
> >>>>>>>>>>       	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>>>>>>>>       		pte |= XE_PPGTT_PTE_DM;
> >>>>>>>>>>
> >>>>>>>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
> >>>>>>>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>>>       static u64 __xe_pt_empty_pte(struct xe_tile *tile,
> >>>>>>>>>> struct xe_vm *vm, @@ -134,12 +133,12 @@ static u64
> >>>>>>>>>> __xe_pt_empty_pte(struct xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_vm *vm,
> >>>>>>>>>>
> >>>>>>>>>>       	if (level == 0) {
> >>>>>>>>>>       		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>>>>>>>> -					  XE_CACHE_WB, 0);
> >>>>>>>>>> +					  xe_pat_get_index(vm->xe,
> >>>>>>>> XE_CACHE_WB), 0);
> >>>>>>>>>>
> >>>>>>>>>>       		return empty;
> >>>>>>>>>>       	} else {
> >>>>>>>>>>       		return xe_pde_encode(vm->scratch_pt[id][level - 1]-
> >>>>> bo, 0,
> >>>>>>>>>> -				     XE_CACHE_WB);
> >>>>>>>>>> +				     xe_pat_get_index(vm->xe,
> >>>>>>>> XE_CACHE_WB));
> >>>>>>>>>>       	}
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>>>>>>>>       	struct xe_vm *vm;
> >>>>>>>>>>       	/** @tile: The tile we're building for. */
> >>>>>>>>>>       	struct xe_tile *tile;
> >>>>>>>>>> -	/** @cache: Desired cache level for the ptes */
> >>>>>>>>>> -	enum xe_cache_level cache;
> >>>>>>>>>>       	/** @default_pte: PTE flag only template. No address
> >>>>>>>>>> is
> >>>> associated */
> >>>>>>>>>>       	u64 default_pte;
> >>>>>>>>>>       	/** @dma_offset: DMA offset to add to the PTE. */ @@
> >>>>>>>>>> -604,7
> >>>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>>>>>> pgoff_t offset,
> >>>>>>>>>>
> >>>>>>>>>>       		pte = __pte_encode(is_null ? 0 :
> >>>>>>>>>>       				   xe_res_dma(curs) + xe_walk-
> >>>>> dma_offset,
> >>>>>>>>>> -				   xe_walk->cache, xe_walk->vma,
> >>>> level);
> >>>>>>>>>> +				   xe_walk->vma->pat_index,
> >> xe_walk-
> >>>>> vma,
> >>>>>>>> level);
> >>>>>>>>>>       		pte |= xe_walk->default_pte;
> >>>>>>>>>>
> >>>>>>>>>>       		/*
> >>>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
> >>>>>>>>>> *parent,
> >>>>>>>> pgoff_t offset,
> >>>>>>>>>>       			xe_child->is_compact = true;
> >>>>>>>>>>       		}
> >>>>>>>>>>
> >>>>>>>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
> >>> cache)
> >>>> |
> >>>>>> flags;
> >>>>>>>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>>>>>>>> pat_index) | flags;
> >>>>>>>>>>       		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
> >>>> xe_child,
> >>>>>>>>>>       					 pte);
> >>>>>>>>>>       	}
> >>>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
> >>>>>>>>>> struct
> >>>>>> xe_vma
> >>>>>>>> *vma,
> >>>>>>>>>>       		if (vma && vma->gpuva.flags &
> >>>> XE_VMA_ATOMIC_PTE_BIT)
> >>>>>>>>>>       			xe_walk.default_pte |=
> >>>> XE_USM_PPGTT_PTE_AE;
> >>>>>>>>>>       		xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>>>>>>>> ttm.resource);
> >>>>>>>>>> -		xe_walk.cache = XE_CACHE_WB;
> >>>>>>>>>> -	} else {
> >>>>>>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
> >>>>>>>> XE_BO_SCANOUT_BIT)
> >>>>>>>>>> -			xe_walk.cache = XE_CACHE_WT;
> >>>>>>>>>> -		else
> >>>>>>>>>> -			xe_walk.cache = XE_CACHE_WB;
> >>>>>>>>>>       	}
> >>>>>>>>>> +
> >>>>>>>>>>       	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>>>>>>>>       		xe_walk.dma_offset =
> >>>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index
> 01be7ab08f87..1d433a5a96b4
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
> >>>>>>>>>> struct
> >>>>>>>> xe_vma *vma, struct xe_exec_queu
> >>>>>>>>>>
> >>>>>>>>>>       bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
> >>>>>>>>>> *vma);
> >>>>>>>>>>
> >>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>>>> -		  const enum xe_cache_level level);
> >>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>>>> +pat_index);
> >>>>>>>>>>
> >>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>>>> xe_cache_level
> >>>>>>>> cache,
> >>>>>>>>>> -		  u32 pt_level);
> >>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
> >>>>>>>>>> +pat_index,
> >>>>>>>>>> +u32
> >>>>>>>> pt_level);
> >>>>>>>>>>
> >>>>>>>>>>       #endif
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
> >> 7eeeed0411f3..34603a7e84b0
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>>>> @@ -6,6 +6,7 @@
> >>>>>>>>>>       #include "xe_vm.h"
> >>>>>>>>>>
> >>>>>>>>>>       #include <linux/dma-fence-array.h>
> >>>>>>>>>> +#include <linux/nospec.h>
> >>>>>>>>>>
> >>>>>>>>>>       #include <drm/drm_print.h>
> >>>>>>>>>>       #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8
> >>>>>>>>>> @@ static struct xe_vma *xe_vma_create(struct
> >>>>>> xe_vm
> >>>>>>>> *vm,
> >>>>>>>>>>       				    u64 start, u64 end,
> >>>>>>>>>>       				    bool read_only,
> >>>>>>>>>>       				    bool is_null,
> >>>>>>>>>> -				    u8 tile_mask)
> >>>>>>>>>> +				    u8 tile_mask,
> >>>>>>>>>> +				    u32 pat_index)
> >>>>>>>>>>       {
> >>>>>>>>>>       	struct xe_vma *vma;
> >>>>>>>>>>       	struct xe_tile *tile;
> >>>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
> >> *xe_vma_create(struct
> >>>>>> xe_vm
> >>>>>>>> *vm,
> >>>>>>>>>>       			vma->tile_mask |= 0x1 << id;
> >>>>>>>>>>       	}
> >>>>>>>>>>
> >>>>>>>>>> +	vma->pat_index = pat_index;
> >>>>>>>>>> +
> >>>>>>>>>>       	if (vm->xe->info.platform == XE_PVC)
> >>>>>>>>>>       		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>>>>>>>
> >>>>>>>>>> @@ -1194,10 +1198,8 @@ static void
> >> xe_vma_op_work_func(struct
> >>>>>>>> work_struct *w);
> >>>>>>>>>>       static void vm_destroy_work_func(struct work_struct
> >>>>>>>>>> *w);
> >>>>>>>>>>
> >>>>>>>>>>       static u64 xelp_ppgtt_pde_encode_pat(struct xe_device
> >>>>>>>>>> *xe,
> >>>>>>>>>> u64
> >>>>>> pde_pat,
> >>>>>>>>>> -						enum xe_cache_level
> >>>> cache)
> >>>>>>>>>> +				     u32 pat_index)
> >>>>>>>>>>       {
> >>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>>>> -
> >>>>>>>>>>       	pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>>>>>>>
> >>>>>>>>>>       	if (pat_index & BIT(0)) @@ -1213,10 +1215,8 @@ static
> >>>>>>>>>> u64
> >>>>>> xelp_ppgtt_pde_encode_pat(struct
> >>>>>>>> xe_device *xe, u64 pde_pat,
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>>>       static u64 xelp_ppgtt_pte_encode_pat(struct xe_device
> >>>>>>>>>> *xe,
> >>>>>>>>>> u64
> >>>>>> pte_pat,
> >>>>>>>>>> -						enum xe_cache_level
> >>>> cache)
> >>>>>>>>>> +				     u32 pat_index)
> >>>>>>>>>>       {
> >>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>>>> -
> >>>>>>>>>>       	pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>>>>>>>
> >>>>>>>>>>       	if (pat_index & BIT(0)) @@ -1622,7 +1622,7 @@ struct
> >>>>>>>>>> xe_vm *xe_vm_lookup(struct
> >> xe_file
> >>>>>> *xef,
> >>>>>>>> u32 id)
> >>>>>>>>>>       u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct
> >>>>>>>>>> xe_tile
> >> *tile)
> >>>>>>>>>>       {
> >>>>>>>>>>       	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>>>>>>>> -			     XE_CACHE_WB);
> >>>>>>>>>> +			     xe_pat_get_index(vm->xe,
> >>>> XE_CACHE_WB));
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>>>       static struct dma_fence * @@ -2311,7 +2311,7 @@ static
> >>>>>>>>>> void print_op(struct xe_device *xe,
> >>>>>> struct
> >>>>>>>> drm_gpuva_op *op)
> >>>>>>>>>>       static struct drm_gpuva_ops *
> >>>>>>>>>>       vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo
> *bo,
> >>>>>>>>>>       			 u64 bo_offset_or_userptr, u64 addr, u64
> >>>> range,
> >>>>>>>>>> -			 u32 operation, u8 tile_mask, u32 region)
> >>>>>>>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
> >>>>>>>> pat_index)
> >>>>>>>>>>       {
> >>>>>>>>>>       	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>>>>>>>>       	struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
> >>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
> >>>> *vm,
> >>>>>>>> struct xe_bo *bo,
> >>>>>>>>>>       			struct xe_vma_op *op =
> >>>> gpuva_op_to_vma_op(__op);
> >>>>>>>>>>
> >>>>>>>>>>       			op->tile_mask = tile_mask;
> >>>>>>>>>> +			op->pat_index = pat_index;
> >>>>>>>>>>       			op->map.immediate =
> >>>>>>>>>>       				operation &
> >>>>>> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>>>>>>>>       			op->map.read_only =
> >>>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >>>> *vm,
> >>>>>>>> struct xe_bo *bo,
> >>>>>>>>>>       			struct xe_vma_op *op =
> >>>> gpuva_op_to_vma_op(__op);
> >>>>>>>>>>
> >>>>>>>>>>       			op->tile_mask = tile_mask;
> >>>>>>>>>> +			op->pat_index = pat_index;
> >>>>>>>>>>       			op->prefetch.region = region;
> >>>>>>>>>>       		}
> >>>>>>>>>>       		break;
> >>>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >>>> *vm,
> >>>>>>>> struct xe_bo *bo,
> >>>>>>>>>>       }
> >>>>>>>>>>
> >>>>>>>>>>       static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >>>>>>>> drm_gpuva_op_map *op,
> >>>>>>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
> >>>>>>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
> >>>>>>>>>> +			      u32 pat_index)
> >>>>>>>>>>       {
> >>>>>>>>>>       	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
> >>>>> gem.obj) :
> >>>>>> NULL;
> >>>>>>>>>>       	struct xe_vma *vma;
> >>>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
> >>>>>>>>>> xe_vm
> >>>>>>>> *vm, struct drm_gpuva_op_map *op,
> >>>>>>>>>>       	vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>>>>>>>>       			    op->va.addr, op->va.addr +
> >>>>>>>>>>       			    op->va.range - 1, read_only, is_null,
> >>>>>>>>>> -			    tile_mask);
> >>>>>>>>>> +			    tile_mask, pat_index);
> >>>>>>>>>>       	if (bo)
> >>>>>>>>>>       		xe_bo_unlock(bo, &ww);
> >>>>>>>>>>
> >>>>>>>>>> @@ -2539,7 +2542,7 @@ static int
> >> vm_bind_ioctl_ops_parse(struct
> >>>>>> xe_vm
> >>>>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>>>
> >>>>>>>>>>       				vma = new_vma(vm, &op-
> >>> base.map,
> >>>>>>>>>>       					      op->tile_mask, op-
> >>>>>>> map.read_only,
> >>>>>>>>>> -					      op->map.is_null);
> >>>>>>>>>> +					      op->map.is_null, op-
> >>>>> pat_index);
> >>>>>>>>>>       				if (IS_ERR(vma)) {
> >>>>>>>>>>       					err = PTR_ERR(vma);
> >>>>>>>>>>       					goto free_fence;
> >>>>>>>>>> @@ -2567,7 +2570,7 @@ static int
> >> vm_bind_ioctl_ops_parse(struct
> >>>>>> xe_vm
> >>>>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>>>
> >>>>>>>>>>       					vma = new_vma(vm, op-
> >>>>>>>>> base.remap.prev,
> >>>>>>>>>>       						      op->tile_mask,
> >>>> read_only,
> >>>>>>>>>> -						      is_null);
> >>>>>>>>>> +						      is_null, op-
> >>>>> pat_index);
> >>>>>>>>>>       					if (IS_ERR(vma)) {
> >>>>>>>>>>       						err = PTR_ERR(vma);
> >>>>>>>>>>       						goto free_fence;
> >>>>>>>>>> @@ -2603,7 +2606,7 @@ static int
> >> vm_bind_ioctl_ops_parse(struct
> >>>>>> xe_vm
> >>>>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>>>
> >>>>>>>>>>       					vma = new_vma(vm, op-
> >>>>>>>>> base.remap.next,
> >>>>>>>>>>       						      op->tile_mask,
> >>>> read_only,
> >>>>>>>>>> -						      is_null);
> >>>>>>>>>> +						      is_null, op-
> >>>>> pat_index);
> >>>>>>>>>>       					if (IS_ERR(vma)) {
> >>>>>>>>>>       						err = PTR_ERR(vma);
> >>>>>>>>>>       						goto free_fence;
> >>>>>>>>>> @@ -3158,8 +3161,14 @@ static int
> >>>>>>>>>> vm_bind_ioctl_check_args(struct
> >>>>>>>> xe_device *xe,
> >>>>>>>>>>       		u32 obj = (*bind_ops)[i].obj;
> >>>>>>>>>>       		u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>>>>>>>>       		u32 region = (*bind_ops)[i].region;
> >>>>>>>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
> >>>>>>>>>>       		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>>>>>>>
> >>>>>>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>>>>>>>> info.pat_table_n_entries)) {
> >>>>>>>>>> +			err = -EINVAL;
> >>>>>>>>>> +			goto free_bind_ops;
> >>>>>>>>>> +		}
> >>>>>>>>>> +
> >>>>>>>>>>       		if (i == 0) {
> >>>>>>>>>>       			*async = !!(op &
> >> XE_VM_BIND_FLAG_ASYNC);
> >>>>>>>>>>       		} else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
> >>>> +3355,25
> >>>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
> >>>>>> *dev,
> >>>>>>>> void *data, struct drm_file *file)
> >>>>>>>>>>       		struct drm_gem_object *gem_obj;
> >>>>>>>>>>       		u64 range = bind_ops[i].range;
> >>>>>>>>>>       		u64 addr = bind_ops[i].addr;
> >>>>>>>>>> +		u32 op = bind_ops[i].op;
> >>>>>>>>>>       		u32 obj = bind_ops[i].obj;
> >>>>>>>>>>       		u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>>>> +		u16 coh_mode;
> >>>>>>>>>> +
> >>>>>>>>>> +		pat_index = array_index_nospec(pat_index,
> >>>>>>>>>> +					       xe-
> >>>>> info.pat_table_n_entries);
> >>>>>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe,
> >>>> pat_index);
> >>>>>>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>>>>>>>
> >>>>>>>>> Assuming we drop the unusable entries from the TGL table, this
> >>>>>>>>> should be impossible, right?  Any index that makes it past the
> >>>>>>>>> n_entries check at the top of the function should have a
> >>>>>>>>> valid, non-zero
> >>>> coh_mode value.
> >>>>>>>>> So this should probably be an assertion (to highlight a KMD
> >>>>>>>>> bug) rather than just a silent uapi failure return.
> >>>>>>>>
> >>>>>>>> Makes sense.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> +			err = -EINVAL;
> >>>>>>>>>> +			goto put_obj;
> >>>>>>>>>> +		}
> >>>>>>>>>> +
> >>>>>>>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >>>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>>>>>>>> +				 coh_mode ==
> >>>> XE_GEM_COHERENCY_NONE))
> >>>>>> {
> >>>>>>>>>> +			err = -EINVAL;
> >>>>>>>>>> +			goto put_obj;
> >>>>>>>>>> +		}
> >>>>>>>>>>
> >>>>>>>>>>       		if (!obj)
> >>>>>>>>>>       			continue;
> >>>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
> >> drm_device
> >>>>>> *dev,
> >>>>>>>> void *data, struct drm_file *file)
> >>>>>>>>>>       				goto put_obj;
> >>>>>>>>>>       			}
> >>>>>>>>>>       		}
> >>>>>>>>>> +
> >>>>>>>>>> +		if (bos[i]->coh_mode) {
> >>>>>>>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >>>>>>>> coh_mode)) {
> >>>>>>>>>> +				err = -EINVAL;
> >>>>>>>>>> +				goto put_obj;
> >>>>>>>>>> +			}
> >>>>>>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
> >>>>>>>> XE_GEM_COHERENCY_NONE)) {
> >>>>>>>>>> +			/*
> >>>>>>>>>> +			 * Imported dma-buf from a different device
> >>>> should
> >>>>>>>>>> +			 * require 1way or 2way coherency since we
> >>>> don't
> >>>>>>>> know
> >>>>>>>>>> +			 * how it was mapped on CPU. Just assume is
> >> it
> >>>>>>>>>> +			 * potentially cached on CPU side.
> >>>>>>>>>> +			 */
> >>>>>>>>>> +			err = -EINVAL;
> >>>>>>>>>> +			goto put_obj;
> >>>>>>>>>> +		}
> >>>>>>>>>>       	}
> >>>>>>>>>>
> >>>>>>>>>>       	if (args->num_syncs) { @@ -3412,10 +3454,11 @@ int
> >>>>>>>>>> xe_vm_bind_ioctl(struct
> >> drm_device
> >>>>>> *dev,
> >>>>>>>> void *data, struct drm_file *file)
> >>>>>>>>>>       		u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>>>>       		u8 tile_mask = bind_ops[i].tile_mask;
> >>>>>>>>>>       		u32 region = bind_ops[i].region;
> >>>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>>>>
> >>>>>>>>>>       		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
> >>>> obj_offset,
> >>>>>>>>>>       						  addr, range, op,
> >>>> tile_mask,
> >>>>>>>>>> -						  region);
> >>>>>>>>>> +						  region, pat_index);
> >>>>>>>>>>       		if (IS_ERR(ops[i])) {
> >>>>>>>>>>       			err = PTR_ERR(ops[i]);
> >>>>>>>>>>       			ops[i] = NULL;
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>>>>>>>>       	 */
> >>>>>>>>>>       	u8 tile_present;
> >>>>>>>>>>
> >>>>>>>>>> +	/**
> >>>>>>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs
> >>>> for
> >>>>>>>>>> +this
> >>>>>>>> vma.
> >>>>>>>>>> +	 */
> >>>>>>>>>> +	u32 pat_index;
> >>>>>>>>>> +
> >>>>>>>>>>       	struct {
> >>>>>>>>>>       		struct list_head rebind_link;
> >>>>>>>>>>       	} notifier;
> >>>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>>>>>>>>       	bool batch_invalidate_tlb;
> >>>>>>>>>>
> >>>>>>>>>>       	struct {
> >>>>>>>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>>>> -						enum xe_cache_level
> >>>> cache);
> >>>>>>>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64
> >> pde_pat,
> >>>>>>>>>> -						enum xe_cache_level
> >>>> cache);
> >>>>>>>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>> u32
> >>>>>>>> pat_index);
> >>>>>>>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64
> >> pde_pat,
> >>>> u32
> >>>>>>>> pat_index);
> >>>>>>>>>>       	} pat_encode;
> >>>>>>>>>>       };
> >>>>>>>>>>
> >>>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>>>>>>>>       	struct async_op_fence *fence;
> >>>>>>>>>>       	/** @tile_mask: gt mask for this operation */
> >>>>>>>>>>       	u8 tile_mask;
> >>>>>>>>>> +	/** @pat_index: The pat index to use for this operation. */
> >>>>>>>>>> +	u32 pat_index;
> >>>>>>>>>>       	/** @flags: operation flags */
> >>>>>>>>>>       	enum xe_vma_op_flags flags;
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
> >>>>>>>>>> b/include/uapi/drm/xe_drm.h index
> 64bc66d4b550..0c15b6f32447
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/include/uapi/drm/xe_drm.h
> >>>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
> >>>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>>>>>>>>       	 */
> >>>>>>>>>>       	__u32 obj;
> >>>>>>>>>>
> >>>>>>>>>> -	/** @pad: MBZ */
> >>>>>>>>>> -	__u32 pad;
> >>>>>>>>>> +	/**
> >>>>>>>>>> +	 * @pat_index: The platform defined @pat_index to use for
> >>>> this
> >>>>>>>> mapping.
> >>>>>>>>>> +	 * The index basically maps to some predefined memory
> >>>> attributes,
> >>>>>>>>>> +	 * including things like caching, coherency and likely
> >>>>>>>>>> +other stuff
> >>>> in
> >>>>>>>>>> +	 * the future.  The exact meaning of the pat_index is
> >>>>>>>>>> +platform specific
> >>>>>>>>>
> >>>>>>>>> BTW, "other stuff in the future" already includes compression
> >>>>>>>>> on Xe2, we just haven't landed the patches for the Xe2 table yet.
> >>>>>>>>
> >>>>>>>> Ok, good to know.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets
> >> up
> >>>> the
> >>>>>>>> binding
> >>>>>>>>>> +	 * the index here is encoded into the ppGTT PTE.
> >>>>>>>>>> +	 *
> >>>>>>>>>> +	 * For coherency the @pat_index needs to match the
> >>>>>>>>>> +	 * drm_xe_gem_create.coh_mode, so either
> >>>>>>>> XE_GEM_COHERENCY_NONE,
> >>>>>>>>>> +	 * XE_GEM_COHERENCY_1WAY or
> >>>> XE_GEM_COHERENCY_2WAY. The
> >>>>>>>> KMD will extract
> >>>>>>>>>> +	 * the coherency mode from the @pat_index and reject if
> >>>> there is a
> >>>>>>>>>> +	 * mismatch (see note below for pre-MTL platforms).
> >>>>>>>>>> +	 *
> >>>>>>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode
> >>>> and no
> >>>>>>>>>> +	 * explicit coherency mode, but on such hardware there is
> >>>> always a
> >>>>>>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are
> >>>>>>>>>> +coherent
> >>>>>> with
> >>>>>>>>>> +	 * CPU caches even with the caching mode set as uncached.
> >>>> It's
> >>>>>>>>>> +only
> >>>>>>>> the
> >>>>>>>>>> +	 * display engine that is incoherent (on dgpu it must be in
> >>>> VRAM which
> >>>>>>>>>> +	 * is always mapped as WC on the CPU). However to keep the
> >>>> uapi
> >>>>>>>> somewhat
> >>>>>>>>>> +	 * consistent with newer platforms the KMD groups the
> >>>> different
> >>>>>> cache
> >>>>>>>>>> +	 * levels into the following coherency buckets on all
> >>>>>>>>>> +pre-MTL
> >>>>>> platforms:
> >>>>>>>>>> +	 *
> >>>>>>>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>>>>>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>>>>>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>>>>>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>>>>>>>
> >>>>>>>>> As noted on the previous patch, it seems like 2-way is
> >>>>>>>>> appropriate for LLC platforms, but 1-way might be a more
> >>>>>>>>> accurate description of dGPU behavior.
> >>>>>>>>>
> >>>>>>>>>> +	 *
> >>>>>>>>>> +	 * In practice UC/WC/WT should only ever used for scanout
> >>>>>>>>>> +surfaces
> >>>>>> on
> >>>>>>>>>> +	 * such platforms since it is only the display engine that
> >>>>>>>>>> +is
> >>>> actually
> >>>>>>>>>> +	 * incoherent. Everything else should typically use WB
> >>>>>>>>>> +given that we
> >>>>>>>>>
> >>>>>>>>> What if we're sharing our buffers with some other (non-GPU)
> >>>>>>>>> device?  Are there cases where that other device wouldn't be
> >>>>>>>>> coherent with the LLC, so we'd want to use one of these?
> >>>>>>>>
> >>>>>>>> Yeah, I guess there might be cases like that. I'll reword.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Matt
> >>>>>>>>>
> >>>>>>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes
> >>>>>>>>>> +(also
> >>>>>>>> potentially
> >>>>>>>>>> +	 * no shared-LLC) and the HW defines the coherency mode as
> >>>> part
> >>>>>>>>>> +of
> >>>>>>>> the
> >>>>>>>>>> +	 * @pat_index.
> >>>>>>>>>> +	 *
> >>>>>>>>>> +	 * Note: For userptr and externally imported dma-buf the
> >>>> kernel
> >>>>>>>> expects
> >>>>>>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
> >>>>>>>>>> +	 */
> >>>>>>>>>> +	__u32 pat_index;
> >>>>>>>>>>
> >>>>>>>>>>       	union {
> >>>>>>>>>>       		/**
> >>>>>>>>>> --
> >>>>>>>>>> 2.41.0
> >>>>>>>>>>
> >>>>>>>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-05 13:50                       ` Zhang, Carl
@ 2023-09-05 14:07                         ` Matthew Auld
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-09-05 14:07 UTC (permalink / raw)
  To: Zhang, Carl, Roper, Matthew D
  Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
	Yu, Effie, intel-xe@lists.freedesktop.org

On 05/09/2023 14:50, Zhang, Carl wrote:
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Tuesday, September 5, 2023 5:47 PM
>>
>> On 05/09/2023 10:12, Zhang, Carl wrote:
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Monday, September 4, 2023 5:24 PM
>>>>
>>>> On 01/09/2023 10:34, Zhang, Carl wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>> Sent: Thursday, August 31, 2023 6:44 PM
>>>>>>
>>>>>> On 31/08/2023 09:24, Zhang, Carl wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>>>> Sent: Thursday, August 31, 2023 12:02 AM
>>>>>>>>
>>>>>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>>>>>>>> Several questions:
>>>>>>>>> 1. the pat_index from vm_bind will override the setting from
>>>> bo_create?
>>>>>>>>> How to keep the value from bo_create unchanged?
>>>>>>>>
>>>>>>>> You only set the pat_index during vm_bind. At gem_create you just
>>>>>>>> need tell the kernel what CPU side caching mode to use for system
>>>>>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
>>>>>>>> pat_index. The current expectation is that the coherency mode of
>>>>>>>> the pat_index should match the one at gem_create.
>>>>>>>>
>>>>>>> It means that we set the incomplete pat_index at gem_create
>>>>>>> (coherency
>>>>>>> only) And set another part at vm_bind (gpu cache) Why we separate
>>>>>>> them? Why not set a full value at vm_bind?
>>>>>>
>>>>>> You always give the full pat_index at vm_bind. It's platform
>>>>>> specific but the BSpec for the most part tells you what the
>>>>>> coherency mode of the
>>>> pat_index is.
>>>>>>
>>>>>> For example if you look at the MTL pat_index table it tells you the
>>>>>> coh_mode:
>>>>>>
>>>>>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
>>>>>> pat_index 3   = coh_mode "1way"     = COH_1WAY
>>>>>> pat_index 4   = coh_mode "2way"     = COH_2WAY
>>>>>>
>>>>>> On future platforms the table is larger and also encodes stuff like
>>>>>> compression, but KMD really only cares about the coh_mode, since
>>>>>> that will restrict the allowed CPU smem_caching values that
>>>>>> userspace can
>>>> select.
>>>>>>
>>>>>>     From KMD pov this matters since we always give userspace zeroed
>>>>>> memory, which is useful for normal applications but is also needed
>>>>>> for
>>>> security reasons.
>>>>>> However the actual clearing might result in the writes only being
>>>>>> visible in the CPU cache (not flushed to main memory), and if the
>>>>>> GPU doesn't snoop the CPU cache it can read directly from main
>>>>>> memory, which effectively bypasses the clearing. To prevent that
>>>>>> the KMD needs to know how userspace will map the memory via the GPU
>>>>>> i.e what is the coh_mode for the pat_index. The gist is that if you
>>>>>> select COH_NONE then it needs to clflush the pages before giving
>>>>>> them to userspace, which is done by forcing you to use uc/wc for
>>>>>> the smem_caching (the kernel will internally issue cflush on x86
>>>>>> when marking the pages as uc/wc). And if you select wb there is no
>>>>>> flushing, but
>>>> here you need to use 1way or 2way.
>>>>>>
>>>>>
>>>>>     It sounds like a perf optimization, if no considering the perf,
>>>>> you could always call Clflush to flush the cpu cache to main memory
>>>>> before giving it to user space
>>>>
>>>> I think the other thing is that the KMD is no longer doing any manual
>>>> flushing, but instead the core kernel will now do that for us when
>>>> marking the pages as wc/uc on x86.
>>>>
>>>>>
>>>>>
>>>>>> AFAIK separating them is just to prevent userspace from mixing
>>>>>> different coherency modes for the same object by having it
>>>>>> immutable (you can still use different pat_index but the coherency
>>>>>> must be
>>>> compatible).
>>>>>> Also at creation time we can determine if the smem_caching is
>>>>>> compatible with the coh_mode.
>>>>>>
>>>>>> Note that it might be that we ditch the COH_2WAY and just have
>>>>>> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be
>>>>>> slightly
>>>> different.
>>>>>>
>>>>>>>
>>>>>>>>> 2. no UC/WB/WC definition (CPU cachable) in
>>>>>>>>> drm_xe_gem_mmap_offset,
>>>>>>>> will it be handled by KMD automatically?
>>>>>>>>> For example: if set 1-way, it means GPU could snoop CPU cache,
>>>>>>>>> we could use WB  in mmap offset If it is COHERENCY_NONE, we
>>>>>>>>> could only use
>>>>>>>> UC,  all these logic is handled by KMD automatically?
>>>>>>>>
>>>>>>>> Yes, the mmap will use the same CPU side caching mode that you
>>>>>>>> set at gem_create with smem_caching (wb/wc/uc).
>>>>>>>>
>>>>>>> It means it  could not be changed after gem_create.  Sometimes,
>>>>>>> UMD doesn’t know Whether the data need to be accessed by CPU at
>>>>>>> gem_create .  only application Knows the purpose of allocation &
>>>>>>> the surface usage.  So, may need change UMD api to let
>>>>>>> applications set whether the
>>>>>> resource is "accessible", or it will hurt the perf.
>>>>>>
>>>>>> Right, the smem_caching is immutable.
>>>>>>
>>>>>>>
>>>>>>>>> 3.  about " For imported dma-buf (from a different device) the
>>>>>>>>> coherency mode is also implicit and must also be either 1WAY or
>>>> 2WAY"
>>>>>>>>> it means it must be 1way or 2way, and UMD need not to set it?
>>>>>>>>
>>>>>>>> If you are going to vm_bind it you need to supply the pat_index
>>>>>>>> and in this case it needs to be either 1way or 2way. The object
>>>>>>>> comes from a different device so we might not have
>>>>>>>> smem_caching/coh_mode like we do for native objects.
>>>>>>>>
>>>>>>> How about the GPU cache setting of imported dma-buf?
>>>>>>> Is there possible conflict between dma-buf producer and consumer?
>>>>>>> For example:
>>>>>>> In producer side , one bo is set GPU L3 cache,  also CPU
>>>>>>> cacheable, it should be 2Way In client side, we know the coherency
>>>>>>> should be 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-
>> way.
>>>>>>> It may cause some problems?
>>>>>>
>>>>>> If the producer and consumer are the same device, then I think the
>>>>>> original gem_create.coh_mode is considered, since it's effectively
>>>>>> still a native object. In which case the normal rules apply.
>>>>>
>>>>> But the consumer does not know the coh_mode, so , it is possible
>>>>> that it set will set Another value, then it will failed.
>>>>> And because consumer does not know it is 1-way or 2-way,  consumer
>>>>> could not Set the gpu cache correctly.  Because if it is 1-way, cpu
>>>>> could not snoop gpu cache
>>>>>
>>>>>>
>>>>>> If it comes from a different device then we don't know what
>>>>>> gem_create.coh_mode is (or if that even exists) since we don't have
>>>>>> the usual native object, but just some opaque dma-buf object. In
>>>>>> this case it must be at least 1way or 2way.
>>>>>>
>>>>>>     From KMD pov we only care about userspace not being able to
>>>>>> bypass the clearing. There might still be other coherency issues though.
>>>>>>
>>>>>>>
>>>>>>> Also , when a bo is created, UMD does not know whether it will be
>>>> exported.
>>>>>>> It maybe be set as COHERENCY_NONE, right?
>>>>>>
>>>>>> Yeah, you can use whatever you want. Just that your smem_caching
>>>>>> will need to be wc/uc if using COH_NONE. On the importer side, if
>>>>>> it comes from the same device then it's just a normal native object
>>>>>> and the normal
>>>> rules apply.
>>>>>> Currently this would mean also using pat_index with COH_NONE. If it
>>>>>> is imported from a different device then you need to use a
>>>>>> pat_index with 1way or 2way, since KMD doesn't really know how the
>>>>>> object is mapped on the CPU or if the CPU caches are dirty
>>>>>> (potentially with
>>>> clearing).
>>>>>>
>>>>>> Is this going to be problematic?
>>>>>>
>>>>> If imported surfaces is from same device, how consumer know it is
>>>>> original coh_non Or 1-way or 2-way?
>>>>>
>>>>> And it also means producer and consumer could have different
>>>>> pat_index for
>>>> same bo.
>>>>
>>>> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external
>>>> or from the same device? Would that help here? i.e you can always
>>>> select 1way or 2way, even if it came from the same device and was
>> coh_none?
>>>>
>>>> Other option is you could call the vm_bind ioctl to check if the
>>>> coh_mode is supported by checking if it returns an error? But maybe that is
>> too nasy.
>>>>
>>>
>>> Still a bit confuse,  I create a bo with coh_none, then export it to a PRIME fd.
>>> In consumer side, it does not know it is coh_none, and will call vm_bind to
>> set pat_index.
>>>
>>> You mean it will return a error when I set different coh value though
>> vm_bind?
>>> And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.
>>
>> Yeah, it is nasty.
>>
>>>
>>> My concern is: if it is coh_non or 1-way,  it means I could not use GPU cache if
>> the bo also is CPU accessible.
>>> Because CPU could not snoop GPU cache.
>>> So, GPU cache specified by pat_index also will be rejected?
>>>
>>> Could we add some query interface to query the CPU cache and coherency
>> setting of a bo?
>>
>> Say if the buffer comes from a different device + driver, what pat_index would
>> you choose for that? There is no such thing as coh_mode/smem_caching for
>> such an object. If you have to handle that case can't you just treat all imported
>> dma-buf the same as that? The new proposal was to allow 1way or 2way for
>> any dma-buf object, even if it comes from the same device and has different BO
>> coh_mode. Do you need more than that? AFAICT the query would only really
>> work for objects imported from the same device.
>>
> Looks I misunderstood the proposal, so, you mean if it shared in one device,
> Consumer could choose 1-way ,2-way , any one should work , and no error returned,
> no rejection from vm_bind, right?

Yeah, treat same-device and different-device the same in KMD, and allow 
1way or 2way (anything but coh_none), since you highlighted the issue 
that for importer from same device it might not even know the original 
coh_mode.

> 
>>>
>>>>
>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>>>>>>>> <pallavi.mishra@intel.com>;
>>>>>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
>>>>>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>>>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>>>>>>>> Hazubski,
>>>>>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
>>>>>>>>>> <carl.zhang@intel.com>; Yu,
>>>>>>>> Effie
>>>>>>>>>> <effie.yu@intel.com>
>>>>>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
>>>>>>>>>> with
>>>>>>>> vm_bind
>>>>>>>>>>
>>>>>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>>>>>>>> Allow userspace to directly control the pat_index for a given
>>>>>>>>>>>> vm binding. This should allow directly controlling the
>>>>>>>>>>>> coherency, caching and potentially other stuff in the future
>>>>>>>>>>>> for the
>>>> ppGTT binding.
>>>>>>>>>>>>
>>>>>>>>>>>> The exact meaning behind the pat_index is very platform
>>>>>>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
>>>>>>>>>>>> predefined memory attributes. From the KMD pov we only care
>>>>>>>>>>>> about the coherency that is provided by the pat_index, which
>>>>>>>>>>>> falls into either
>>>>>> NONE, 1WAY or 2WAY.
>>>>>>>>>>>> The vm_bind coherency mode for the given pat_index needs to
>>>>>>>>>>>> match
>>>>>>>> the
>>>>>>>>>>>> given coh_mode that was set at object creation. For platforms
>>>>>>>>>>>> that lack
>>>>>>>>>>>
>>>>>>>>>>> Is it actually important to match the coherency mode?  I think
>>>>>>>>>>> one of the main goals was to know up front if userspace might
>>>>>>>>>>> be using a non-snooping PAT setting that would let it bypass
>>>>>>>>>>> the CPU cache (and potentially read old, stale data from a
>>>>>>>>>>> different process if the buffer's clear value is still sitting
>>>>>>>>>>> in cache and hasn't landed in memory yet).
>>>>>>>>>>>
>>>>>>>>>>> If that's the only concern, then I think it should still be
>>>>>>>>>>> fine to map with a non-matching PAT as long as it's more
>>>>>>>>>>> coherent than the one specified at creation, right?  E.g., if
>>>>>>>>>>> the buffer was created with 1-way coherency, it would be fine
>>>>>>>>>>> to map it with 2-way because userspace still can't use that to
>>>>>>>>>>> observe the previous contents of the buffer. Or
>>>>>>>>>>
>>>>>>>>>> Yeah, I guess we could in theory do something that.
>>>>>>>>>>
>>>>>>>>>>> if the buffer was created with "non-coherent" then we've
>>>>>>>>>>> already done the necessary clflushing in kernel before handing
>>>>>>>>>>> to buffer over to userspace to ensure the clear value landed
>>>>>>>>>>> in memory, so any valid PAT index should be safe (from a
>>>>>>>>>>> security POV) after that, right?  Any other problems that
>>>>>>>>>>> arise from mismatched coherency would just be contained to the
>>>>>>>>>>> app possibly shooting itself in the foot, which isn't really our
>> concern.
>>>>>>>>>>
>>>>>>>>>> That is also my understanding, at least from the KMD security pov.
>>>>>>>>>> If you allocate as wb then you must use at least 1way, since
>>>>>>>>>> there is no flushing
>>>>>>>> for
>>>>>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever
>>>>>>>>>> you
>>>> want.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
>>>> WB
>>>>>>>>>>>> as
>>>>>>>>>> 2WAY.
>>>>>>>>>>>>
>>>>>>>>>>>> For userptr mappings we lack a corresponding gem object, so
>>>>>>>>>>>> the
>>>>>>>> expected
>>>>>>>>>>>> coherency mode is instead implicit and must fall into either
>>>>>>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
>>>>>>>>>>>> kernel. For imported dma-buf (from a different device) the
>>>>>>>>>>>> coherency mode is also implicit and must also be either 1WAY
>>>>>>>>>>>> or
>>>> 2WAY.
>>>>>>>>>>>>
>>>>>>>>>>>> As part of adding pat_index support with vm_bind we also need
>>>>>>>>>>>> stop
>>>>>>>> using
>>>>>>>>>>>> xe_cache_level and instead use the pat_index in various places.
>>>>>>>>>>>> We still make use of xe_cache_level, but only as a
>>>>>>>>>>>> convenience for kernel internal objectsi (internally it maps
>>>>>>>>>>>> to some reasonable pat_index). For
>>>>>>>>>>>
>>>>>>>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be
>>>>>>>>>>> used in appropriate places, similar to what we do with MOCS
>>>>>>>>>>> (gt-
>>>>>>>>> mocs.uc_index,
>>>>>>>>>>> gt->mocs.wb_index)?
>>>>>>>>>>
>>>>>>>>>> OK, seems reasonable to me.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> now this is just a 1:1 conversion of the existing code,
>>>>>>>>>>>> however for platforms like MTL+ we might need to give more
>>>>>>>>>>>> control through
>>>>>>>> bo_create
>>>>>>>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>>>>>>>
>>>>>>>>>>>> Bspec: 45101, 44235 #xe
>>>>>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>>>>>>>>>>>        drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>>>>>>>>>>>        drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>>>>>>>>>>>        drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>>>>>>>>>>>        drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>>>>>>>>>>>        drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>>>>>>>>>>>        drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>>>>>>>>>>>        drivers/gpu/drm/xe/xe_vm.c            | 73
>>>> +++++++++++++++++++++----
>>>>>> -
>>>>>>>> -
>>>>>>>>>>>>        drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>>>>>>>>>>>        include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>>>>>>>>>>>        10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>>>>>>>>        		(*alloc_scratch_dma)(struct i915_address_space
>>>> *vm,
>>>>>> int
>>>>>>>>>>>> sz);
>>>>>>>>>>>>
>>>>>>>>>>>>        	u64 (*pte_encode)(dma_addr_t addr,
>>>>>>>>>>>> -			  unsigned int pat_index,
>>>>>>>>>>>> +			  u32 pat_index,
>>>>>>>>>>>>        			  u32 flags); /* Create a valid PTE */
>>>>>>>>>>>>        #define PTE_READ_ONLY	BIT(0)
>>>>>>>>>>>>        #define PTE_LM		BIT(1)
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>>>>>>>> xe_migrate
>>>>>>>>>> *m, struct kunit *test)
>>>>>>>>>>>>        	/* First part of the test, are we updating our
>>>>>>>>>>>> pagetable bo with a new
>>>>>>>>>> entry? */
>>>>>>>>>>>>        	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
>>>>>> (NUM_KERNEL_PDE -
>>>>>>>> 1),
>>>>>>>>>> u64,
>>>>>>>>>>>>        		  0xdeaddeadbeefbeef);
>>>>>>>>>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>>>>>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>>>        	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>>>>>>>>        		expected |= XE_PTE_PS64;
>>>>>>>>>>>>        	if (xe_bo_is_vram(pt)) diff --git
>>>>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
>>>> u64
>>>>>>>>>> bo_offset)
>>>>>>>>>>>>        		pte |= XE_GGTT_PTE_DM;
>>>>>>>>>>>>
>>>>>>>>>>>>        	if ((ggtt->pat_encode).pte_encode)
>>>>>>>>>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>>>> XE_CACHE_WB_1_WAY);
>>>>>>>>>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>>>>>> +
>>>>>> xe_pat_get_index(xe,
>>>>>>>>>> XE_CACHE_WB_1_WAY));
>>>>>>>>>>>>
>>>>>>>>>>>>        	return pte;
>>>>>>>>>>>>        }
>>>>>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
>>>> *ggtt)
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>>        static u64 xelpg_ggtt_pte_encode_pat(struct xe_device
>>>>>>>>>>>> *xe,
>>>>>>>>>>>> u64
>>>>>>>> pte_pat,
>>>>>>>>>>>> -						enum xe_cache_level
>>>>>> cache)
>>>>>>>>>>>> +				     u32 pat_index)
>>>>>>>>>>>>        {
>>>>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>>>> -
>>>>>>>>>>>>        	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>>>>>>>
>>>>>>>>>>>>        	if (pat_index & BIT(0)) diff --git
>>>>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>>>>>>>
>>>>>>>>>>>>        	struct {
>>>>>>>>>>>>        		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>>>> -						enum xe_cache_level
>>>>>> cache);
>>>>>>>>>>>> +				  u32 pat_index);
>>>>>>>>>>>>        	} pat_encode;
>>>>>>>>>>>>        };
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>>>> @@ -24,6 +24,7 @@
>>>>>>>>>>>>        #include "xe_lrc.h"
>>>>>>>>>>>>        #include "xe_map.h"
>>>>>>>>>>>>        #include "xe_mocs.h"
>>>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>>>>        #include "xe_pt.h"
>>>>>>>>>>>>        #include "xe_res_cursor.h"
>>>>>>>>>>>>        #include "xe_sched_job.h"
>>>>>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>>        	u32 num_entries = NUM_PT_SLOTS, num_level = vm-
>>>>>>> pt_root[id]-
>>>>>>>>>>> level;
>>>>>>>>>>>>        	u32 map_ofs, level, i;
>>>>>>>>>>>>        	struct xe_bo *bo, *batch =
>>>>>>>>>>>> tile->mem.kernel_bb_pool->bo;
>>>>>>>>>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>>>>>>>>        	u64 entry;
>>>>>>>>>>>>        	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>>>>>>>> *tile, struct xe_migrate *m,
>>>>>>>>>>>>        		return ret;
>>>>>>>>>>>>        	}
>>>>>>>>>>>>
>>>>>>>>>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>>>>>> XE_CACHE_WB);
>>>>>>>>>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>>>> pat_index);
>>>>>>>>>>>>        	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>>>>>>>
>>>>>>>>>>>>        	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>>>>>>>
>>>>>>>>>>>>        	/* Map the entire BO in our level 0 pt */
>>>>>>>>>>>>        	for (i = 0, level = 0; i < num_entries; level++) {
>>>>>>>>>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>>>> XE_CACHE_WB,
>>>>>>>>>> 0);
>>>>>>>>>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>>>> pat_index, 0);
>>>>>>>>>>>>
>>>>>>>>>>>>        		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>>>> entry);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>>        		for (i = 0; i < batch->size;
>>>>>>>>>>>>        		     i += vm->flags & XE_VM_FLAG_64K ?
>>>>>> XE_64K_PAGE_SIZE :
>>>>>>>>>>>>        		     XE_PAGE_SIZE) {
>>>>>>>>>>>> -			entry = xe_pte_encode(batch, i,
>>>>>> XE_CACHE_WB, 0);
>>>>>>>>>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>>>>>>>
>>>>>>>>>>>>        			xe_map_wr(xe, &bo->vmap, map_ofs + level
>>>> *
>>>>>> 8, u64,
>>>>>>>>>>>>        				  entry);
>>>>>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>>        			flags = XE_PDE_64K;
>>>>>>>>>>>>
>>>>>>>>>>>>        		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>>>>>>>> -					XE_PAGE_SIZE,
>>>>>> XE_CACHE_WB);
>>>>>>>>>>>> +					XE_PAGE_SIZE, pat_index);
>>>>>>>>>>>>        		xe_map_wr(xe, &bo->vmap, map_ofs +
>>>> XE_PAGE_SIZE
>>>>>> * level,
>>>>>>>>>> u64,
>>>>>>>>>>>>        			  entry | flags);
>>>>>>>>>>>>        	}
>>>>>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>>        	/* Write PDE's that point to our BO. */
>>>>>>>>>>>>        	for (i = 0; i < num_entries - num_level; i++) {
>>>>>>>>>>>>        		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>>>>>>>> -				      XE_CACHE_WB);
>>>>>>>>>>>> +				      pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>>        		xe_map_wr(xe, &bo->vmap, map_ofs +
>>>> XE_PAGE_SIZE
>>>>>> +
>>>>>>>>>>>>        			  (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
>>>>>> @@
>>>>>>>>>>>> xe_migrate_update_pgtables(struct
>>>>>>>> xe_migrate
>>>>>>>>>> *m,
>>>>>>>>>>>>
>>>>>>>>>>>>        			XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>>>>>>>
>>>>>>>>>>>> -			addr = xe_pte_encode(pt_bo, 0,
>>>>>> XE_CACHE_WB, 0);
>>>>>>>>>>>> +			addr = xe_pte_encode(pt_bo, 0,
>>>>>> xe_pat_get_index(xe,
>>>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>>>        			bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>>>>>>>>        			bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>>>>>>>>        		}
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>>>> @@ -10,6 +10,7 @@
>>>>>>>>>>>>        #include "xe_gt.h"
>>>>>>>>>>>>        #include "xe_gt_tlb_invalidation.h"
>>>>>>>>>>>>        #include "xe_migrate.h"
>>>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>>>>        #include "xe_pt_types.h"
>>>>>>>>>>>>        #include "xe_pt_walk.h"
>>>>>>>>>>>>        #include "xe_res_cursor.h"
>>>>>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
>>>>>>>>>>>> xe_pt_dir
>>>>>>>>>> *pt_dir, unsigned int index)
>>>>>>>>>>>>         *
>>>>>>>>>>>>         * Return: An encoded page directory entry. No errors.
>>>>>>>>>>>>         */
>>>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>>>> -		  const enum xe_cache_level cache)
>>>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>>>> +pat_index)
>>>>>>>>>>>>        {
>>>>>>>>>>>>        	u64 pde;
>>>>>>>>>>>>        	struct xe_vm *vm = bo->vm;
>>>>>>>>>>>>        	struct xe_device *xe = vm->xe;
>>>>>>>>>>>>
>>>>>>>>>>>> -
>>>>>>>>>>>>        	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>>>>>>>>        	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>>>>>>>
>>>>>>>>>>>>        	if ((vm->pat_encode).pde_encode)
>>>>>>>>>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>>>>>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde,
>>>>>> pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>>        	return pde;
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>>>>>>>>        			struct xe_vma *vma, u32 pt_level)
>>>>>>>>>>>>        {
>>>>>>>>>>>>        	struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
>>>>>> static
>>>>>>>>>>>> u64 __pte_encode(u64 pte, enum
>>>>>>>>>> xe_cache_level cache,
>>>>>>>>>>>>        		pte |= XE_PTE_NULL;
>>>>>>>>>>>>
>>>>>>>>>>>>        	if ((vm->pat_encode).pte_encode)
>>>>>>>>>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>>>>>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte,
>>>>>> pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>>        	if (pt_level == 1)
>>>>>>>>>>>>        		pte |= XE_PDE_PS_2M;
>>>>>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>>>>>>>> xe_cache_level cache,
>>>>>>>>>>>>         *
>>>>>>>>>>>>         * Return: An encoded page-table entry. No errors.
>>>>>>>>>>>>         */
>>>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>>>> xe_cache_level
>>>>>>>>>> cache,
>>>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
>>>>>>>>>>>> +pat_index,
>>>>>>>>>>>>        		  u32 pt_level)
>>>>>>>>>>>>        {
>>>>>>>>>>>>        	u64 pte;
>>>>>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>>>>>>>> offset,
>>>>>>>>>> enum xe_cache_level cache,
>>>>>>>>>>>>        	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>>>>>>>>        		pte |= XE_PPGTT_PTE_DM;
>>>>>>>>>>>>
>>>>>>>>>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
>>>>>>>>>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>>        static u64 __xe_pt_empty_pte(struct xe_tile *tile,
>>>>>>>>>>>> struct xe_vm *vm, @@ -134,12 +133,12 @@ static u64
>>>>>>>>>>>> __xe_pt_empty_pte(struct xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_vm *vm,
>>>>>>>>>>>>
>>>>>>>>>>>>        	if (level == 0) {
>>>>>>>>>>>>        		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>>>>>>>> -					  XE_CACHE_WB, 0);
>>>>>>>>>>>> +					  xe_pat_get_index(vm->xe,
>>>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>>>
>>>>>>>>>>>>        		return empty;
>>>>>>>>>>>>        	} else {
>>>>>>>>>>>>        		return xe_pde_encode(vm->scratch_pt[id][level - 1]-
>>>>>>> bo, 0,
>>>>>>>>>>>> -				     XE_CACHE_WB);
>>>>>>>>>>>> +				     xe_pat_get_index(vm->xe,
>>>>>>>>>> XE_CACHE_WB));
>>>>>>>>>>>>        	}
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>>>>>>>        	struct xe_vm *vm;
>>>>>>>>>>>>        	/** @tile: The tile we're building for. */
>>>>>>>>>>>>        	struct xe_tile *tile;
>>>>>>>>>>>> -	/** @cache: Desired cache level for the ptes */
>>>>>>>>>>>> -	enum xe_cache_level cache;
>>>>>>>>>>>>        	/** @default_pte: PTE flag only template. No address
>>>>>>>>>>>> is
>>>>>> associated */
>>>>>>>>>>>>        	u64 default_pte;
>>>>>>>>>>>>        	/** @dma_offset: DMA offset to add to the PTE. */ @@
>>>>>>>>>>>> -604,7
>>>>>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>>>>>> pgoff_t offset,
>>>>>>>>>>>>
>>>>>>>>>>>>        		pte = __pte_encode(is_null ? 0 :
>>>>>>>>>>>>        				   xe_res_dma(curs) + xe_walk-
>>>>>>> dma_offset,
>>>>>>>>>>>> -				   xe_walk->cache, xe_walk->vma,
>>>>>> level);
>>>>>>>>>>>> +				   xe_walk->vma->pat_index,
>>>> xe_walk-
>>>>>>> vma,
>>>>>>>>>> level);
>>>>>>>>>>>>        		pte |= xe_walk->default_pte;
>>>>>>>>>>>>
>>>>>>>>>>>>        		/*
>>>>>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
>>>>>>>>>>>> *parent,
>>>>>>>>>> pgoff_t offset,
>>>>>>>>>>>>        			xe_child->is_compact = true;
>>>>>>>>>>>>        		}
>>>>>>>>>>>>
>>>>>>>>>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
>>>>> cache)
>>>>>> |
>>>>>>>> flags;
>>>>>>>>>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>>>>>>>> pat_index) | flags;
>>>>>>>>>>>>        		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
>>>>>> xe_child,
>>>>>>>>>>>>        					 pte);
>>>>>>>>>>>>        	}
>>>>>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
>>>>>>>>>>>> struct
>>>>>>>> xe_vma
>>>>>>>>>> *vma,
>>>>>>>>>>>>        		if (vma && vma->gpuva.flags &
>>>>>> XE_VMA_ATOMIC_PTE_BIT)
>>>>>>>>>>>>        			xe_walk.default_pte |=
>>>>>> XE_USM_PPGTT_PTE_AE;
>>>>>>>>>>>>        		xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>>>>>>>> ttm.resource);
>>>>>>>>>>>> -		xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>>>> -	} else {
>>>>>>>>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>>>>>>>> XE_BO_SCANOUT_BIT)
>>>>>>>>>>>> -			xe_walk.cache = XE_CACHE_WT;
>>>>>>>>>>>> -		else
>>>>>>>>>>>> -			xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>>>>        	}
>>>>>>>>>>>> +
>>>>>>>>>>>>        	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>>>>>>>        		xe_walk.dma_offset =
>>>>>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index
>> 01be7ab08f87..1d433a5a96b4
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
>>>>>>>>>>>> struct
>>>>>>>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>>>>>>>
>>>>>>>>>>>>        bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
>>>>>>>>>>>> *vma);
>>>>>>>>>>>>
>>>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>>>> -		  const enum xe_cache_level level);
>>>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>>>> +pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>>>> xe_cache_level
>>>>>>>>>> cache,
>>>>>>>>>>>> -		  u32 pt_level);
>>>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
>>>>>>>>>>>> +pat_index,
>>>>>>>>>>>> +u32
>>>>>>>>>> pt_level);
>>>>>>>>>>>>
>>>>>>>>>>>>        #endif
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
>>>> 7eeeed0411f3..34603a7e84b0
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>>>>>>        #include "xe_vm.h"
>>>>>>>>>>>>
>>>>>>>>>>>>        #include <linux/dma-fence-array.h>
>>>>>>>>>>>> +#include <linux/nospec.h>
>>>>>>>>>>>>
>>>>>>>>>>>>        #include <drm/drm_print.h>
>>>>>>>>>>>>        #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8
>>>>>>>>>>>> @@ static struct xe_vma *xe_vma_create(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm,
>>>>>>>>>>>>        				    u64 start, u64 end,
>>>>>>>>>>>>        				    bool read_only,
>>>>>>>>>>>>        				    bool is_null,
>>>>>>>>>>>> -				    u8 tile_mask)
>>>>>>>>>>>> +				    u8 tile_mask,
>>>>>>>>>>>> +				    u32 pat_index)
>>>>>>>>>>>>        {
>>>>>>>>>>>>        	struct xe_vma *vma;
>>>>>>>>>>>>        	struct xe_tile *tile;
>>>>>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
>>>> *xe_vma_create(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm,
>>>>>>>>>>>>        			vma->tile_mask |= 0x1 << id;
>>>>>>>>>>>>        	}
>>>>>>>>>>>>
>>>>>>>>>>>> +	vma->pat_index = pat_index;
>>>>>>>>>>>> +
>>>>>>>>>>>>        	if (vm->xe->info.platform == XE_PVC)
>>>>>>>>>>>>        		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -1194,10 +1198,8 @@ static void
>>>> xe_vma_op_work_func(struct
>>>>>>>>>> work_struct *w);
>>>>>>>>>>>>        static void vm_destroy_work_func(struct work_struct
>>>>>>>>>>>> *w);
>>>>>>>>>>>>
>>>>>>>>>>>>        static u64 xelp_ppgtt_pde_encode_pat(struct xe_device
>>>>>>>>>>>> *xe,
>>>>>>>>>>>> u64
>>>>>>>> pde_pat,
>>>>>>>>>>>> -						enum xe_cache_level
>>>>>> cache)
>>>>>>>>>>>> +				     u32 pat_index)
>>>>>>>>>>>>        {
>>>>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>>>> -
>>>>>>>>>>>>        	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>>>>>>>
>>>>>>>>>>>>        	if (pat_index & BIT(0)) @@ -1213,10 +1215,8 @@ static
>>>>>>>>>>>> u64
>>>>>>>> xelp_ppgtt_pde_encode_pat(struct
>>>>>>>>>> xe_device *xe, u64 pde_pat,
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>>        static u64 xelp_ppgtt_pte_encode_pat(struct xe_device
>>>>>>>>>>>> *xe,
>>>>>>>>>>>> u64
>>>>>>>> pte_pat,
>>>>>>>>>>>> -						enum xe_cache_level
>>>>>> cache)
>>>>>>>>>>>> +				     u32 pat_index)
>>>>>>>>>>>>        {
>>>>>>>>>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>>>> -
>>>>>>>>>>>>        	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>>>>>>>
>>>>>>>>>>>>        	if (pat_index & BIT(0)) @@ -1622,7 +1622,7 @@ struct
>>>>>>>>>>>> xe_vm *xe_vm_lookup(struct
>>>> xe_file
>>>>>>>> *xef,
>>>>>>>>>> u32 id)
>>>>>>>>>>>>        u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct
>>>>>>>>>>>> xe_tile
>>>> *tile)
>>>>>>>>>>>>        {
>>>>>>>>>>>>        	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>>>>>>>> -			     XE_CACHE_WB);
>>>>>>>>>>>> +			     xe_pat_get_index(vm->xe,
>>>>>> XE_CACHE_WB));
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>>        static struct dma_fence * @@ -2311,7 +2311,7 @@ static
>>>>>>>>>>>> void print_op(struct xe_device *xe,
>>>>>>>> struct
>>>>>>>>>> drm_gpuva_op *op)
>>>>>>>>>>>>        static struct drm_gpuva_ops *
>>>>>>>>>>>>        vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo
>> *bo,
>>>>>>>>>>>>        			 u64 bo_offset_or_userptr, u64 addr, u64
>>>>>> range,
>>>>>>>>>>>> -			 u32 operation, u8 tile_mask, u32 region)
>>>>>>>>>>>> +			 u32 operation, u8 tile_mask, u32 region, u32
>>>>>>>>>> pat_index)
>>>>>>>>>>>>        {
>>>>>>>>>>>>        	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>>>>>>>        	struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
>>>>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
>>>>>> *vm,
>>>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>>>        			struct xe_vma_op *op =
>>>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>>>
>>>>>>>>>>>>        			op->tile_mask = tile_mask;
>>>>>>>>>>>> +			op->pat_index = pat_index;
>>>>>>>>>>>>        			op->map.immediate =
>>>>>>>>>>>>        				operation &
>>>>>>>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>>>>>>>        			op->map.read_only =
>>>>>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>>>> *vm,
>>>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>>>        			struct xe_vma_op *op =
>>>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>>>
>>>>>>>>>>>>        			op->tile_mask = tile_mask;
>>>>>>>>>>>> +			op->pat_index = pat_index;
>>>>>>>>>>>>        			op->prefetch.region = region;
>>>>>>>>>>>>        		}
>>>>>>>>>>>>        		break;
>>>>>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>>>> *vm,
>>>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>>        static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>>>>>>>> drm_gpuva_op_map *op,
>>>>>>>>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
>>>>>>>>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
>>>>>>>>>>>> +			      u32 pat_index)
>>>>>>>>>>>>        {
>>>>>>>>>>>>        	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
>>>>>>> gem.obj) :
>>>>>>>> NULL;
>>>>>>>>>>>>        	struct xe_vma *vma;
>>>>>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
>>>>>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>>>>>>>>        	vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>>>>>>>        			    op->va.addr, op->va.addr +
>>>>>>>>>>>>        			    op->va.range - 1, read_only, is_null,
>>>>>>>>>>>> -			    tile_mask);
>>>>>>>>>>>> +			    tile_mask, pat_index);
>>>>>>>>>>>>        	if (bo)
>>>>>>>>>>>>        		xe_bo_unlock(bo, &ww);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -2539,7 +2542,7 @@ static int
>>>> vm_bind_ioctl_ops_parse(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>>>
>>>>>>>>>>>>        				vma = new_vma(vm, &op-
>>>>> base.map,
>>>>>>>>>>>>        					      op->tile_mask, op-
>>>>>>>>> map.read_only,
>>>>>>>>>>>> -					      op->map.is_null);
>>>>>>>>>>>> +					      op->map.is_null, op-
>>>>>>> pat_index);
>>>>>>>>>>>>        				if (IS_ERR(vma)) {
>>>>>>>>>>>>        					err = PTR_ERR(vma);
>>>>>>>>>>>>        					goto free_fence;
>>>>>>>>>>>> @@ -2567,7 +2570,7 @@ static int
>>>> vm_bind_ioctl_ops_parse(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>>>
>>>>>>>>>>>>        					vma = new_vma(vm, op-
>>>>>>>>>>> base.remap.prev,
>>>>>>>>>>>>        						      op->tile_mask,
>>>>>> read_only,
>>>>>>>>>>>> -						      is_null);
>>>>>>>>>>>> +						      is_null, op-
>>>>>>> pat_index);
>>>>>>>>>>>>        					if (IS_ERR(vma)) {
>>>>>>>>>>>>        						err = PTR_ERR(vma);
>>>>>>>>>>>>        						goto free_fence;
>>>>>>>>>>>> @@ -2603,7 +2606,7 @@ static int
>>>> vm_bind_ioctl_ops_parse(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>>>
>>>>>>>>>>>>        					vma = new_vma(vm, op-
>>>>>>>>>>> base.remap.next,
>>>>>>>>>>>>        						      op->tile_mask,
>>>>>> read_only,
>>>>>>>>>>>> -						      is_null);
>>>>>>>>>>>> +						      is_null, op-
>>>>>>> pat_index);
>>>>>>>>>>>>        					if (IS_ERR(vma)) {
>>>>>>>>>>>>        						err = PTR_ERR(vma);
>>>>>>>>>>>>        						goto free_fence;
>>>>>>>>>>>> @@ -3158,8 +3161,14 @@ static int
>>>>>>>>>>>> vm_bind_ioctl_check_args(struct
>>>>>>>>>> xe_device *xe,
>>>>>>>>>>>>        		u32 obj = (*bind_ops)[i].obj;
>>>>>>>>>>>>        		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>>>>>>>        		u32 region = (*bind_ops)[i].region;
>>>>>>>>>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>>>>>>>>        		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>>>>>>>
>>>>>>>>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>>>>>>>> info.pat_table_n_entries)) {
>>>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>>>> +			goto free_bind_ops;
>>>>>>>>>>>> +		}
>>>>>>>>>>>> +
>>>>>>>>>>>>        		if (i == 0) {
>>>>>>>>>>>>        			*async = !!(op &
>>>> XE_VM_BIND_FLAG_ASYNC);
>>>>>>>>>>>>        		} else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
>>>>>> +3355,25
>>>>>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
>>>>>>>> *dev,
>>>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>>>        		struct drm_gem_object *gem_obj;
>>>>>>>>>>>>        		u64 range = bind_ops[i].range;
>>>>>>>>>>>>        		u64 addr = bind_ops[i].addr;
>>>>>>>>>>>> +		u32 op = bind_ops[i].op;
>>>>>>>>>>>>        		u32 obj = bind_ops[i].obj;
>>>>>>>>>>>>        		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>>>> +		u16 coh_mode;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		pat_index = array_index_nospec(pat_index,
>>>>>>>>>>>> +					       xe-
>>>>>>> info.pat_table_n_entries);
>>>>>>>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe,
>>>>>> pat_index);
>>>>>>>>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>>>>>>>
>>>>>>>>>>> Assuming we drop the unusable entries from the TGL table, this
>>>>>>>>>>> should be impossible, right?  Any index that makes it past the
>>>>>>>>>>> n_entries check at the top of the function should have a
>>>>>>>>>>> valid, non-zero
>>>>>> coh_mode value.
>>>>>>>>>>> So this should probably be an assertion (to highlight a KMD
>>>>>>>>>>> bug) rather than just a silent uapi failure return.
>>>>>>>>>>
>>>>>>>>>> Makes sense.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>>>> +			goto put_obj;
>>>>>>>>>>>> +		}
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>>>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>>>>>>>> +				 coh_mode ==
>>>>>> XE_GEM_COHERENCY_NONE))
>>>>>>>> {
>>>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>>>> +			goto put_obj;
>>>>>>>>>>>> +		}
>>>>>>>>>>>>
>>>>>>>>>>>>        		if (!obj)
>>>>>>>>>>>>        			continue;
>>>>>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
>>>> drm_device
>>>>>>>> *dev,
>>>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>>>        				goto put_obj;
>>>>>>>>>>>>        			}
>>>>>>>>>>>>        		}
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (bos[i]->coh_mode) {
>>>>>>>>>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>>>>>>>> coh_mode)) {
>>>>>>>>>>>> +				err = -EINVAL;
>>>>>>>>>>>> +				goto put_obj;
>>>>>>>>>>>> +			}
>>>>>>>>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>>>>>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>>>>>>>> +			/*
>>>>>>>>>>>> +			 * Imported dma-buf from a different device
>>>>>> should
>>>>>>>>>>>> +			 * require 1way or 2way coherency since we
>>>>>> don't
>>>>>>>>>> know
>>>>>>>>>>>> +			 * how it was mapped on CPU. Just assume is
>>>> it
>>>>>>>>>>>> +			 * potentially cached on CPU side.
>>>>>>>>>>>> +			 */
>>>>>>>>>>>> +			err = -EINVAL;
>>>>>>>>>>>> +			goto put_obj;
>>>>>>>>>>>> +		}
>>>>>>>>>>>>        	}
>>>>>>>>>>>>
>>>>>>>>>>>>        	if (args->num_syncs) { @@ -3412,10 +3454,11 @@ int
>>>>>>>>>>>> xe_vm_bind_ioctl(struct
>>>> drm_device
>>>>>>>> *dev,
>>>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>>>        		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>>>>        		u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>>>>>>>        		u32 region = bind_ops[i].region;
>>>>>>>>>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>>>>
>>>>>>>>>>>>        		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
>>>>>> obj_offset,
>>>>>>>>>>>>        						  addr, range, op,
>>>>>> tile_mask,
>>>>>>>>>>>> -						  region);
>>>>>>>>>>>> +						  region, pat_index);
>>>>>>>>>>>>        		if (IS_ERR(ops[i])) {
>>>>>>>>>>>>        			err = PTR_ERR(ops[i]);
>>>>>>>>>>>>        			ops[i] = NULL;
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>>>>>>>>        	 */
>>>>>>>>>>>>        	u8 tile_present;
>>>>>>>>>>>>
>>>>>>>>>>>> +	/**
>>>>>>>>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs
>>>>>> for
>>>>>>>>>>>> +this
>>>>>>>>>> vma.
>>>>>>>>>>>> +	 */
>>>>>>>>>>>> +	u32 pat_index;
>>>>>>>>>>>> +
>>>>>>>>>>>>        	struct {
>>>>>>>>>>>>        		struct list_head rebind_link;
>>>>>>>>>>>>        	} notifier;
>>>>>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>>>>>>>>        	bool batch_invalidate_tlb;
>>>>>>>>>>>>
>>>>>>>>>>>>        	struct {
>>>>>>>>>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>>>> -						enum xe_cache_level
>>>>>> cache);
>>>>>>>>>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64
>>>> pde_pat,
>>>>>>>>>>>> -						enum xe_cache_level
>>>>>> cache);
>>>>>>>>>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>> u32
>>>>>>>>>> pat_index);
>>>>>>>>>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64
>>>> pde_pat,
>>>>>> u32
>>>>>>>>>> pat_index);
>>>>>>>>>>>>        	} pat_encode;
>>>>>>>>>>>>        };
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>>>>>>>>        	struct async_op_fence *fence;
>>>>>>>>>>>>        	/** @tile_mask: gt mask for this operation */
>>>>>>>>>>>>        	u8 tile_mask;
>>>>>>>>>>>> +	/** @pat_index: The pat index to use for this operation. */
>>>>>>>>>>>> +	u32 pat_index;
>>>>>>>>>>>>        	/** @flags: operation flags */
>>>>>>>>>>>>        	enum xe_vma_op_flags flags;
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
>>>>>>>>>>>> b/include/uapi/drm/xe_drm.h index
>> 64bc66d4b550..0c15b6f32447
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>>>>>>>>        	 */
>>>>>>>>>>>>        	__u32 obj;
>>>>>>>>>>>>
>>>>>>>>>>>> -	/** @pad: MBZ */
>>>>>>>>>>>> -	__u32 pad;
>>>>>>>>>>>> +	/**
>>>>>>>>>>>> +	 * @pat_index: The platform defined @pat_index to use for
>>>>>> this
>>>>>>>>>> mapping.
>>>>>>>>>>>> +	 * The index basically maps to some predefined memory
>>>>>> attributes,
>>>>>>>>>>>> +	 * including things like caching, coherency and likely
>>>>>>>>>>>> +other stuff
>>>>>> in
>>>>>>>>>>>> +	 * the future.  The exact meaning of the pat_index is
>>>>>>>>>>>> +platform specific
>>>>>>>>>>>
>>>>>>>>>>> BTW, "other stuff in the future" already includes compression
>>>>>>>>>>> on Xe2, we just haven't landed the patches for the Xe2 table yet.
>>>>>>>>>>
>>>>>>>>>> Ok, good to know.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets
>>>> up
>>>>>> the
>>>>>>>>>> binding
>>>>>>>>>>>> +	 * the index here is encoded into the ppGTT PTE.
>>>>>>>>>>>> +	 *
>>>>>>>>>>>> +	 * For coherency the @pat_index needs to match the
>>>>>>>>>>>> +	 * drm_xe_gem_create.coh_mode, so either
>>>>>>>>>> XE_GEM_COHERENCY_NONE,
>>>>>>>>>>>> +	 * XE_GEM_COHERENCY_1WAY or
>>>>>> XE_GEM_COHERENCY_2WAY. The
>>>>>>>>>> KMD will extract
>>>>>>>>>>>> +	 * the coherency mode from the @pat_index and reject if
>>>>>> there is a
>>>>>>>>>>>> +	 * mismatch (see note below for pre-MTL platforms).
>>>>>>>>>>>> +	 *
>>>>>>>>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode
>>>>>> and no
>>>>>>>>>>>> +	 * explicit coherency mode, but on such hardware there is
>>>>>> always a
>>>>>>>>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are
>>>>>>>>>>>> +coherent
>>>>>>>> with
>>>>>>>>>>>> +	 * CPU caches even with the caching mode set as uncached.
>>>>>> It's
>>>>>>>>>>>> +only
>>>>>>>>>> the
>>>>>>>>>>>> +	 * display engine that is incoherent (on dgpu it must be in
>>>>>> VRAM which
>>>>>>>>>>>> +	 * is always mapped as WC on the CPU). However to keep the
>>>>>> uapi
>>>>>>>>>> somewhat
>>>>>>>>>>>> +	 * consistent with newer platforms the KMD groups the
>>>>>> different
>>>>>>>> cache
>>>>>>>>>>>> +	 * levels into the following coherency buckets on all
>>>>>>>>>>>> +pre-MTL
>>>>>>>> platforms:
>>>>>>>>>>>> +	 *
>>>>>>>>>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>>>>>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>>>>>>>
>>>>>>>>>>> As noted on the previous patch, it seems like 2-way is
>>>>>>>>>>> appropriate for LLC platforms, but 1-way might be a more
>>>>>>>>>>> accurate description of dGPU behavior.
>>>>>>>>>>>
>>>>>>>>>>>> +	 *
>>>>>>>>>>>> +	 * In practice UC/WC/WT should only ever used for scanout
>>>>>>>>>>>> +surfaces
>>>>>>>> on
>>>>>>>>>>>> +	 * such platforms since it is only the display engine that
>>>>>>>>>>>> +is
>>>>>> actually
>>>>>>>>>>>> +	 * incoherent. Everything else should typically use WB
>>>>>>>>>>>> +given that we
>>>>>>>>>>>
>>>>>>>>>>> What if we're sharing our buffers with some other (non-GPU)
>>>>>>>>>>> device?  Are there cases where that other device wouldn't be
>>>>>>>>>>> coherent with the LLC, so we'd want to use one of these?
>>>>>>>>>>
>>>>>>>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes
>>>>>>>>>>>> +(also
>>>>>>>>>> potentially
>>>>>>>>>>>> +	 * no shared-LLC) and the HW defines the coherency mode as
>>>>>> part
>>>>>>>>>>>> +of
>>>>>>>>>> the
>>>>>>>>>>>> +	 * @pat_index.
>>>>>>>>>>>> +	 *
>>>>>>>>>>>> +	 * Note: For userptr and externally imported dma-buf the
>>>>>> kernel
>>>>>>>>>> expects
>>>>>>>>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
>>>>>>>>>>>> +	 */
>>>>>>>>>>>> +	__u32 pat_index;
>>>>>>>>>>>>
>>>>>>>>>>>>        	union {
>>>>>>>>>>>>        		/**
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.41.0
>>>>>>>>>>>>
>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
  2023-08-29 21:36   ` Matt Roper
@ 2023-09-04 20:21   ` Souza, Jose
  2023-09-05  9:08     ` Matthew Auld
  2023-09-07 18:56   ` Souza, Jose
  2 siblings, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-04 20:21 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Auld,  Matthew

On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> Allow userspace to directly control the pat_index for a given vm
> binding. This should allow directly controlling the coherency, caching
> and potentially other stuff in the future for the ppGTT binding.
> 
> The exact meaning behind the pat_index is very platform specific (see
> BSpec or PRMs) but effectively maps to some predefined memory
> attributes. From the KMD pov we only care about the coherency that is
> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> The vm_bind coherency mode for the given pat_index needs to match the
> given coh_mode that was set at object creation. For platforms that lack
> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> 
> For userptr mappings we lack a corresponding gem object, so the expected
> coherency mode is instead implicit and must fall into either 1WAY or
> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY.
> 
> As part of adding pat_index support with vm_bind we also need stop using
> xe_cache_level and instead use the pat_index in various places. We still
> make use of xe_cache_level, but only as a convenience for kernel
> internal objectsi (internally it maps to some reasonable pat_index). For
> now this is just a 1:1 conversion of the existing code, however for
> platforms like MTL+ we might need to give more control through bo_create
> or stop using WB on the CPU side if we need CPU access.
> 
> Bspec: 45101, 44235 #xe
> Bspec: 70552, 71582, 59400 #xe2
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>  drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>  drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>  drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>  drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>  drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>  drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>  drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>  drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>  include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>  10 files changed, 134 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 4d6296cdbcfd..bb4c182048c3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -302,7 +302,7 @@ struct i915_address_space {
>  		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>  
>  	u64 (*pte_encode)(dma_addr_t addr,
> -			  unsigned int pat_index,
> +			  u32 pat_index,
>  			  u32 flags); /* Create a valid PTE */
>  #define PTE_READ_ONLY	BIT(0)
>  #define PTE_LM		BIT(1)
> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> index 5c8d5e78d9bc..7a128fd20a29 100644
> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>  	/* First part of the test, are we updating our pagetable bo with a new entry? */
>  	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>  		  0xdeaddeadbeefbeef);
> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>  	if (m->q->vm->flags & XE_VM_FLAG_64K)
>  		expected |= XE_PTE_PS64;
>  	if (xe_bo_is_vram(pt))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> index 209fa053d9fb..4134c26150a5 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>  		pte |= XE_GGTT_PTE_DM;
>  
>  	if ((ggtt->pat_encode).pte_encode)
> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>  
>  	return pte;
>  }
> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>  }
>  
>  static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> index 7e55fac1a8a9..0bc40cb072e3 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> @@ -31,7 +31,7 @@ struct xe_ggtt {
>  
>  	struct {
>  		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache);
> +				  u32 pat_index);
>  	} pat_encode;
>  };
>  
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index a782ea282cb6..54585e98452a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -24,6 +24,7 @@
>  #include "xe_lrc.h"
>  #include "xe_map.h"
>  #include "xe_mocs.h"
> +#include "xe_pat.h"
>  #include "xe_pt.h"
>  #include "xe_res_cursor.h"
>  #include "xe_sched_job.h"
> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>  	u32 map_ofs, level, i;
>  	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>  	u64 entry;
>  	int ret;
>  
> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  		return ret;
>  	}
>  
> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>  	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>  
>  	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>  
>  	/* Map the entire BO in our level 0 pt */
>  	for (i = 0, level = 0; i < num_entries; level++) {
> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>  
>  		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>  
> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  		for (i = 0; i < batch->size;
>  		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>  		     XE_PAGE_SIZE) {
> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>  
>  			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>  				  entry);
> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  			flags = XE_PDE_64K;
>  
>  		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> -					XE_PAGE_SIZE, XE_CACHE_WB);
> +					XE_PAGE_SIZE, pat_index);
>  		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>  			  entry | flags);
>  	}
> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	/* Write PDE's that point to our BO. */
>  	for (i = 0; i < num_entries - num_level; i++) {
>  		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> -				      XE_CACHE_WB);
> +				      pat_index);
>  
>  		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>  			  (i + 1) * 8, u64, entry);
> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>  
>  			XE_WARN_ON(pt_bo->size != SZ_4K);
>  
> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>  			bb->cs[bb->len++] = lower_32_bits(addr);
>  			bb->cs[bb->len++] = upper_32_bits(addr);
>  		}
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 64713f400d94..019af2920078 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -10,6 +10,7 @@
>  #include "xe_gt.h"
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_migrate.h"
> +#include "xe_pat.h"
>  #include "xe_pt_types.h"
>  #include "xe_pt_walk.h"
>  #include "xe_res_cursor.h"
> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>   *
>   * Return: An encoded page directory entry. No errors.
>   */
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> -		  const enum xe_cache_level cache)
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>  {
>  	u64 pde;
>  	struct xe_vm *vm = bo->vm;
>  	struct xe_device *xe = vm->xe;
>  
> -
>  	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>  	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>  
>  	if ((vm->pat_encode).pde_encode)
> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>  
>  	return pde;
>  }
>  
> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> +static u64 __pte_encode(u64 pte, u32 pat_index,
>  			struct xe_vma *vma, u32 pt_level)
>  {
>  	struct xe_vm *vm = xe_vma_vm(vma);
> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>  		pte |= XE_PTE_NULL;
>  
>  	if ((vm->pat_encode).pte_encode)
> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>  
>  	if (pt_level == 1)
>  		pte |= XE_PDE_PS_2M;
> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>   *
>   * Return: An encoded page-table entry. No errors.
>   */
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>  		  u32 pt_level)
>  {
>  	u64 pte;
> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>  	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>  		pte |= XE_PPGTT_PTE_DM;
>  
> -	return __pte_encode(pte, cache, NULL, pt_level);
> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>  }
>  
>  static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>  
>  	if (level == 0) {
>  		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> -					  XE_CACHE_WB, 0);
> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>  
>  		return empty;
>  	} else {
>  		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> -				     XE_CACHE_WB);
> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>  	}
>  }
>  
> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>  	struct xe_vm *vm;
>  	/** @tile: The tile we're building for. */
>  	struct xe_tile *tile;
> -	/** @cache: Desired cache level for the ptes */
> -	enum xe_cache_level cache;
>  	/** @default_pte: PTE flag only template. No address is associated */
>  	u64 default_pte;
>  	/** @dma_offset: DMA offset to add to the PTE. */
> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>  
>  		pte = __pte_encode(is_null ? 0 :
>  				   xe_res_dma(curs) + xe_walk->dma_offset,
> -				   xe_walk->cache, xe_walk->vma, level);
> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>  		pte |= xe_walk->default_pte;
>  
>  		/*
> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>  			xe_child->is_compact = true;
>  		}
>  
> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>  		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>  					 pte);
>  	}
> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>  		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>  			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>  		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> -		xe_walk.cache = XE_CACHE_WB;
> -	} else {
> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> -			xe_walk.cache = XE_CACHE_WT;
> -		else
> -			xe_walk.cache = XE_CACHE_WB;
>  	}
> +
>  	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>  		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>  
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 01be7ab08f87..1d433a5a96b4 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>  
>  bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>  
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> -		  const enum xe_cache_level level);
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>  
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> -		  u32 pt_level);
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 7eeeed0411f3..34603a7e84b0 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -6,6 +6,7 @@
>  #include "xe_vm.h"
>  
>  #include <linux/dma-fence-array.h>
> +#include <linux/nospec.h>
>  
>  #include <drm/drm_print.h>
>  #include <drm/ttm/ttm_execbuf_util.h>
> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>  				    u64 start, u64 end,
>  				    bool read_only,
>  				    bool is_null,
> -				    u8 tile_mask)
> +				    u8 tile_mask,
> +				    u32 pat_index)
>  {
>  	struct xe_vma *vma;
>  	struct xe_tile *tile;
> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>  			vma->tile_mask |= 0x1 << id;
>  	}
>  
> +	vma->pat_index = pat_index;
> +
>  	if (vm->xe->info.platform == XE_PVC)
>  		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>  
> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>  static void vm_destroy_work_func(struct work_struct *w);
>  
>  static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pde_pat &= ~(XELP_PDE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>  }
>  
>  static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pte_pat &= ~(XELP_PTE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>  u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>  {
>  	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> -			     XE_CACHE_WB);
> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>  }
>  
>  static struct dma_fence *
> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>  static struct drm_gpuva_ops *
>  vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> -			 u32 operation, u8 tile_mask, u32 region)
> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>  {
>  	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>  	struct ww_acquire_ctx ww;
> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>  
>  			op->tile_mask = tile_mask;
> +			op->pat_index = pat_index;
>  			op->map.immediate =
>  				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>  			op->map.read_only =
> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>  
>  			op->tile_mask = tile_mask;
> +			op->pat_index = pat_index;
>  			op->prefetch.region = region;
>  		}
>  		break;
> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  }
>  
>  static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> -			      u8 tile_mask, bool read_only, bool is_null)
> +			      u8 tile_mask, bool read_only, bool is_null,
> +			      u32 pat_index)
>  {
>  	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>  	struct xe_vma *vma;
> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>  	vma = xe_vma_create(vm, bo, op->gem.offset,
>  			    op->va.addr, op->va.addr +
>  			    op->va.range - 1, read_only, is_null,
> -			    tile_mask);
> +			    tile_mask, pat_index);
>  	if (bo)
>  		xe_bo_unlock(bo, &ww);
>  
> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  				vma = new_vma(vm, &op->base.map,
>  					      op->tile_mask, op->map.read_only,
> -					      op->map.is_null);
> +					      op->map.is_null, op->pat_index);
>  				if (IS_ERR(vma)) {
>  					err = PTR_ERR(vma);
>  					goto free_fence;
> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  					vma = new_vma(vm, op->base.remap.prev,
>  						      op->tile_mask, read_only,
> -						      is_null);
> +						      is_null, op->pat_index);
>  					if (IS_ERR(vma)) {
>  						err = PTR_ERR(vma);
>  						goto free_fence;
> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  					vma = new_vma(vm, op->base.remap.next,
>  						      op->tile_mask, read_only,
> -						      is_null);
> +						      is_null, op->pat_index);
>  					if (IS_ERR(vma)) {
>  						err = PTR_ERR(vma);
>  						goto free_fence;
> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>  		u32 obj = (*bind_ops)[i].obj;
>  		u64 obj_offset = (*bind_ops)[i].obj_offset;
>  		u32 region = (*bind_ops)[i].region;
> +		u32 pat_index = (*bind_ops)[i].pat_index;
>  		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>  
> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> +			err = -EINVAL;
> +			goto free_bind_ops;
> +		}
> +
>  		if (i == 0) {
>  			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>  		} else if (XE_IOCTL_DBG(xe, !*async) ||
> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  		struct drm_gem_object *gem_obj;
>  		u64 range = bind_ops[i].range;
>  		u64 addr = bind_ops[i].addr;
> +		u32 op = bind_ops[i].op;
>  		u32 obj = bind_ops[i].obj;
>  		u64 obj_offset = bind_ops[i].obj_offset;
> +		u32 pat_index = bind_ops[i].pat_index;
> +		u16 coh_mode;
> +
> +		pat_index = array_index_nospec(pat_index,
> +					       xe->info.pat_table_n_entries);
> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> +			err = -EINVAL;
> +			goto put_obj;
> +		}

In my opinion this check belongs to vm_bind_ioctl_check_args().

> +
> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
> +			err = -EINVAL;
> +			goto put_obj;
> +		}

Same for this one.

>  
>  		if (!obj)
>  			continue;
> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  				goto put_obj;
>  			}
>  		}
> +
> +		if (bos[i]->coh_mode) {
> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> +				err = -EINVAL;
> +				goto put_obj;
> +			}
> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> +			/*
> +			 * Imported dma-buf from a different device should
> +			 * require 1way or 2way coherency since we don't know
> +			 * how it was mapped on CPU. Just assume is it
> +			 * potentially cached on CPU side.
> +			 */
> +			err = -EINVAL;
> +			goto put_obj;
> +		}
>  	}
>  
>  	if (args->num_syncs) {
> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  		u64 obj_offset = bind_ops[i].obj_offset;
>  		u8 tile_mask = bind_ops[i].tile_mask;
>  		u32 region = bind_ops[i].region;
> +		u32 pat_index = bind_ops[i].pat_index;
>  
>  		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>  						  addr, range, op, tile_mask,
> -						  region);
> +						  region, pat_index);
>  		if (IS_ERR(ops[i])) {
>  			err = PTR_ERR(ops[i]);
>  			ops[i] = NULL;
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 83a1f87b6537..508679ed3c74 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -111,6 +111,11 @@ struct xe_vma {
>  	 */
>  	u8 tile_present;
>  
> +	/**
> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
> +	 */
> +	u32 pat_index;

u32 is overkill for pat_index.

> +
>  	struct {
>  		struct list_head rebind_link;
>  	} notifier;
> @@ -338,10 +343,8 @@ struct xe_vm {
>  	bool batch_invalidate_tlb;
>  
>  	struct {
> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache);
> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> -						enum xe_cache_level cache);
> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>  	} pat_encode;
>  };
>  
> @@ -417,6 +420,8 @@ struct xe_vma_op {
>  	struct async_op_fence *fence;
>  	/** @tile_mask: gt mask for this operation */
>  	u8 tile_mask;
> +	/** @pat_index: The pat index to use for this operation. */
> +	u32 pat_index;
>  	/** @flags: operation flags */
>  	enum xe_vma_op_flags flags;
>  
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 64bc66d4b550..0c15b6f32447 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>  	 */
>  	__u32 obj;
>  
> -	/** @pad: MBZ */
> -	__u32 pad;
> +	/**
> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
> +	 * The index basically maps to some predefined memory attributes,
> +	 * including things like caching, coherency and likely other stuff in
> +	 * the future.  The exact meaning of the pat_index is platform specific
> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
> +	 * the index here is encoded into the ppGTT PTE.
> +	 *
> +	 * For coherency the @pat_index needs to match the
> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> +	 * the coherency mode from the @pat_index and reject if there is a
> +	 * mismatch (see note below for pre-MTL platforms).
> +	 *
> +	 * Note: On pre-MTL platforms there is only a caching mode and no
> +	 * explicit coherency mode, but on such hardware there is always a
> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> +	 * CPU caches even with the caching mode set as uncached.  It's only the
> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
> +	 * consistent with newer platforms the KMD groups the different cache
> +	 * levels into the following coherency buckets on all pre-MTL platforms:
> +	 *
> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> +	 *
> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
> +	 * such platforms since it is only the display engine that is actually
> +	 * incoherent. Everything else should typically use WB given that we
> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
> +	 * @pat_index.
> +	 *
> +	 * Note: For userptr and externally imported dma-buf the kernel expects
> +	 * either 1WAY or 2WAY for the @pat_index.
> +	 */
> +	__u32 pat_index;
>  
>  	union {
>  		/**


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-04 20:21   ` Souza, Jose
@ 2023-09-05  9:08     ` Matthew Auld
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-09-05  9:08 UTC (permalink / raw)
  To: Souza, Jose, intel-xe@lists.freedesktop.org

On 04/09/2023 21:21, Souza, Jose wrote:
> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>>   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>   10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>   
>>   	u64 (*pte_encode)(dma_addr_t addr,
>> -			  unsigned int pat_index,
>> +			  u32 pat_index,
>>   			  u32 flags); /* Create a valid PTE */
>>   #define PTE_READ_ONLY	BIT(0)
>>   #define PTE_LM		BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>>   	/* First part of the test, are we updating our pagetable bo with a new entry? */
>>   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>>   		  0xdeaddeadbeefbeef);
>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>   		expected |= XE_PTE_PS64;
>>   	if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>>   		pte |= XE_GGTT_PTE_DM;
>>   
>>   	if ((ggtt->pat_encode).pte_encode)
>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>   
>>   	return pte;
>>   }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>   }
>>   
>>   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>   
>>   	struct {
>>   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> +				  u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>>   #include "xe_lrc.h"
>>   #include "xe_map.h"
>>   #include "xe_mocs.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt.h"
>>   #include "xe_res_cursor.h"
>>   #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>>   	u32 map_ofs, level, i;
>>   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>   	u64 entry;
>>   	int ret;
>>   
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		return ret;
>>   	}
>>   
>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>   
>>   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>   
>>   	/* Map the entire BO in our level 0 pt */
>>   	for (i = 0, level = 0; i < num_entries; level++) {
>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>   
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		for (i = 0; i < batch->size;
>>   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>   		     XE_PAGE_SIZE) {
>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>   
>>   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>   				  entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   			flags = XE_PDE_64K;
>>   
>>   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> -					XE_PAGE_SIZE, XE_CACHE_WB);
>> +					XE_PAGE_SIZE, pat_index);
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>>   			  entry | flags);
>>   	}
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	/* Write PDE's that point to our BO. */
>>   	for (i = 0; i < num_entries - num_level; i++) {
>>   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> -				      XE_CACHE_WB);
>> +				      pat_index);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>   			  (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>   
>>   			XE_WARN_ON(pt_bo->size != SZ_4K);
>>   
>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   			bb->cs[bb->len++] = lower_32_bits(addr);
>>   			bb->cs[bb->len++] = upper_32_bits(addr);
>>   		}
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>>   #include "xe_gt.h"
>>   #include "xe_gt_tlb_invalidation.h"
>>   #include "xe_migrate.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt_types.h"
>>   #include "xe_pt_walk.h"
>>   #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>>    *
>>    * Return: An encoded page directory entry. No errors.
>>    */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>   {
>>   	u64 pde;
>>   	struct xe_vm *vm = bo->vm;
>>   	struct xe_device *xe = vm->xe;
>>   
>> -
>>   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>   
>>   	if ((vm->pat_encode).pde_encode)
>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>   
>>   	return pde;
>>   }
>>   
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>   			struct xe_vma *vma, u32 pt_level)
>>   {
>>   	struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>   		pte |= XE_PTE_NULL;
>>   
>>   	if ((vm->pat_encode).pte_encode)
>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>   
>>   	if (pt_level == 1)
>>   		pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>    *
>>    * Return: An encoded page-table entry. No errors.
>>    */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>   		  u32 pt_level)
>>   {
>>   	u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>   		pte |= XE_PPGTT_PTE_DM;
>>   
>> -	return __pte_encode(pte, cache, NULL, pt_level);
>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>   }
>>   
>>   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>   
>>   	if (level == 0) {
>>   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> -					  XE_CACHE_WB, 0);
>> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>   
>>   		return empty;
>>   	} else {
>>   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> -				     XE_CACHE_WB);
>> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   	}
>>   }
>>   
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>   	struct xe_vm *vm;
>>   	/** @tile: The tile we're building for. */
>>   	struct xe_tile *tile;
>> -	/** @cache: Desired cache level for the ptes */
>> -	enum xe_cache_level cache;
>>   	/** @default_pte: PTE flag only template. No address is associated */
>>   	u64 default_pte;
>>   	/** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   
>>   		pte = __pte_encode(is_null ? 0 :
>>   				   xe_res_dma(curs) + xe_walk->dma_offset,
>> -				   xe_walk->cache, xe_walk->vma, level);
>> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>>   		pte |= xe_walk->default_pte;
>>   
>>   		/*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   			xe_child->is_compact = true;
>>   		}
>>   
>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>>   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>   					 pte);
>>   	}
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>>   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>   		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> -		xe_walk.cache = XE_CACHE_WB;
>> -	} else {
>> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> -			xe_walk.cache = XE_CACHE_WT;
>> -		else
>> -			xe_walk.cache = XE_CACHE_WB;
>>   	}
>> +
>>   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>   		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>   
>>   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>   
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>   
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> -		  u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>   
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>>   #include "xe_vm.h"
>>   
>>   #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>   
>>   #include <drm/drm_print.h>
>>   #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   				    u64 start, u64 end,
>>   				    bool read_only,
>>   				    bool is_null,
>> -				    u8 tile_mask)
>> +				    u8 tile_mask,
>> +				    u32 pat_index)
>>   {
>>   	struct xe_vma *vma;
>>   	struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   			vma->tile_mask |= 0x1 << id;
>>   	}
>>   
>> +	vma->pat_index = pat_index;
>> +
>>   	if (vm->xe->info.platform == XE_PVC)
>>   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>   
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>>   static void vm_destroy_work_func(struct work_struct *w);
>>   
>>   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>   }
>>   
>>   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>>   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>   {
>>   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> -			     XE_CACHE_WB);
>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   }
>>   
>>   static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>>   static struct drm_gpuva_ops *
>>   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>> -			 u32 operation, u8 tile_mask, u32 region)
>> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>>   {
>>   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>   	struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->map.immediate =
>>   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>   			op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->prefetch.region = region;
>>   		}
>>   		break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   }
>>   
>>   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> -			      u8 tile_mask, bool read_only, bool is_null)
>> +			      u8 tile_mask, bool read_only, bool is_null,
>> +			      u32 pat_index)
>>   {
>>   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>   	struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>   	vma = xe_vma_create(vm, bo, op->gem.offset,
>>   			    op->va.addr, op->va.addr +
>>   			    op->va.range - 1, read_only, is_null,
>> -			    tile_mask);
>> +			    tile_mask, pat_index);
>>   	if (bo)
>>   		xe_bo_unlock(bo, &ww);
>>   
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   				vma = new_vma(vm, &op->base.map,
>>   					      op->tile_mask, op->map.read_only,
>> -					      op->map.is_null);
>> +					      op->map.is_null, op->pat_index);
>>   				if (IS_ERR(vma)) {
>>   					err = PTR_ERR(vma);
>>   					goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.prev,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.next,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>>   		u32 obj = (*bind_ops)[i].obj;
>>   		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>   		u32 region = (*bind_ops)[i].region;
>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>   
>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> +			err = -EINVAL;
>> +			goto free_bind_ops;
>> +		}
>> +
>>   		if (i == 0) {
>>   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>   		} else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		struct drm_gem_object *gem_obj;
>>   		u64 range = bind_ops[i].range;
>>   		u64 addr = bind_ops[i].addr;
>> +		u32 op = bind_ops[i].op;
>>   		u32 obj = bind_ops[i].obj;
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>> +		u32 pat_index = bind_ops[i].pat_index;
>> +		u16 coh_mode;
>> +
>> +		pat_index = array_index_nospec(pat_index,
>> +					       xe->info.pat_table_n_entries);
>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
> 
> In my opinion this check belongs to vm_bind_ioctl_check_args().
> 
>> +
>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
> 
> Same for this one.


OK, will take a look.

> 
>>   
>>   		if (!obj)
>>   			continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   				goto put_obj;
>>   			}
>>   		}
>> +
>> +		if (bos[i]->coh_mode) {
>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> +				err = -EINVAL;
>> +				goto put_obj;
>> +			}
>> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			/*
>> +			 * Imported dma-buf from a different device should
>> +			 * require 1way or 2way coherency since we don't know
>> +			 * how it was mapped on CPU. Just assume is it
>> +			 * potentially cached on CPU side.
>> +			 */
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   	}
>>   
>>   	if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>>   		u8 tile_mask = bind_ops[i].tile_mask;
>>   		u32 region = bind_ops[i].region;
>> +		u32 pat_index = bind_ops[i].pat_index;
>>   
>>   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>   						  addr, range, op, tile_mask,
>> -						  region);
>> +						  region, pat_index);
>>   		if (IS_ERR(ops[i])) {
>>   			err = PTR_ERR(ops[i]);
>>   			ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>>   	 */
>>   	u8 tile_present;
>>   
>> +	/**
>> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> +	 */
>> +	u32 pat_index;
> 
> u32 is overkill for pat_index.

What about u16 instead?

> 
>> +
>>   	struct {
>>   		struct list_head rebind_link;
>>   	} notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>>   	bool batch_invalidate_tlb;
>>   
>>   	struct {
>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache);
>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>   	struct async_op_fence *fence;
>>   	/** @tile_mask: gt mask for this operation */
>>   	u8 tile_mask;
>> +	/** @pat_index: The pat index to use for this operation. */
>> +	u32 pat_index;
>>   	/** @flags: operation flags */
>>   	enum xe_vma_op_flags flags;
>>   
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>   	 */
>>   	__u32 obj;
>>   
>> -	/** @pad: MBZ */
>> -	__u32 pad;
>> +	/**
>> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
>> +	 * The index basically maps to some predefined memory attributes,
>> +	 * including things like caching, coherency and likely other stuff in
>> +	 * the future.  The exact meaning of the pat_index is platform specific
>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
>> +	 * the index here is encoded into the ppGTT PTE.
>> +	 *
>> +	 * For coherency the @pat_index needs to match the
>> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> +	 * the coherency mode from the @pat_index and reject if there is a
>> +	 * mismatch (see note below for pre-MTL platforms).
>> +	 *
>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>> +	 * explicit coherency mode, but on such hardware there is always a
>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> +	 * CPU caches even with the caching mode set as uncached.  It's only the
>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> +	 * consistent with newer platforms the KMD groups the different cache
>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>> +	 *
>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>> +	 *
>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>> +	 * such platforms since it is only the display engine that is actually
>> +	 * incoherent. Everything else should typically use WB given that we
>> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
>> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
>> +	 * @pat_index.
>> +	 *
>> +	 * Note: For userptr and externally imported dma-buf the kernel expects
>> +	 * either 1WAY or 2WAY for the @pat_index.
>> +	 */
>> +	__u32 pat_index;
>>   
>>   	union {
>>   		/**
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
  2023-08-29 21:36   ` Matt Roper
  2023-09-04 20:21   ` Souza, Jose
@ 2023-09-07 18:56   ` Souza, Jose
  2023-09-08  6:51     ` Matthew Auld
  2 siblings, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-07 18:56 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Auld,  Matthew

On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> Allow userspace to directly control the pat_index for a given vm
> binding. This should allow directly controlling the coherency, caching
> and potentially other stuff in the future for the ppGTT binding.
> 
> The exact meaning behind the pat_index is very platform specific (see
> BSpec or PRMs) but effectively maps to some predefined memory
> attributes. From the KMD pov we only care about the coherency that is
> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> The vm_bind coherency mode for the given pat_index needs to match the
> given coh_mode that was set at object creation. For platforms that lack
> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> 
> For userptr mappings we lack a corresponding gem object, so the expected
> coherency mode is instead implicit and must fall into either 1WAY or
> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY.
> 
> As part of adding pat_index support with vm_bind we also need stop using
> xe_cache_level and instead use the pat_index in various places. We still
> make use of xe_cache_level, but only as a convenience for kernel
> internal objectsi (internally it maps to some reasonable pat_index). For
> now this is just a 1:1 conversion of the existing code, however for
> platforms like MTL+ we might need to give more control through bo_create
> or stop using WB on the CPU side if we need CPU access.

xe_ttm_tt_create() caching should not also take into consideration the PAT index?

> 
> Bspec: 45101, 44235 #xe
> Bspec: 70552, 71582, 59400 #xe2
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>  drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>  drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>  drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>  drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>  drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>  drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>  drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>  drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>  include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>  10 files changed, 134 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 4d6296cdbcfd..bb4c182048c3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -302,7 +302,7 @@ struct i915_address_space {
>  		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>  
>  	u64 (*pte_encode)(dma_addr_t addr,
> -			  unsigned int pat_index,
> +			  u32 pat_index,
>  			  u32 flags); /* Create a valid PTE */
>  #define PTE_READ_ONLY	BIT(0)
>  #define PTE_LM		BIT(1)
> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> index 5c8d5e78d9bc..7a128fd20a29 100644
> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>  	/* First part of the test, are we updating our pagetable bo with a new entry? */
>  	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>  		  0xdeaddeadbeefbeef);
> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>  	if (m->q->vm->flags & XE_VM_FLAG_64K)
>  		expected |= XE_PTE_PS64;
>  	if (xe_bo_is_vram(pt))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> index 209fa053d9fb..4134c26150a5 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>  		pte |= XE_GGTT_PTE_DM;
>  
>  	if ((ggtt->pat_encode).pte_encode)
> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>  
>  	return pte;
>  }
> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>  }
>  
>  static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> index 7e55fac1a8a9..0bc40cb072e3 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> @@ -31,7 +31,7 @@ struct xe_ggtt {
>  
>  	struct {
>  		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache);
> +				  u32 pat_index);
>  	} pat_encode;
>  };
>  
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index a782ea282cb6..54585e98452a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -24,6 +24,7 @@
>  #include "xe_lrc.h"
>  #include "xe_map.h"
>  #include "xe_mocs.h"
> +#include "xe_pat.h"
>  #include "xe_pt.h"
>  #include "xe_res_cursor.h"
>  #include "xe_sched_job.h"
> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>  	u32 map_ofs, level, i;
>  	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>  	u64 entry;
>  	int ret;
>  
> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  		return ret;
>  	}
>  
> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>  	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>  
>  	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>  
>  	/* Map the entire BO in our level 0 pt */
>  	for (i = 0, level = 0; i < num_entries; level++) {
> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>  
>  		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>  
> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  		for (i = 0; i < batch->size;
>  		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>  		     XE_PAGE_SIZE) {
> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>  
>  			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>  				  entry);
> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  			flags = XE_PDE_64K;
>  
>  		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> -					XE_PAGE_SIZE, XE_CACHE_WB);
> +					XE_PAGE_SIZE, pat_index);
>  		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>  			  entry | flags);
>  	}
> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	/* Write PDE's that point to our BO. */
>  	for (i = 0; i < num_entries - num_level; i++) {
>  		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> -				      XE_CACHE_WB);
> +				      pat_index);
>  
>  		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>  			  (i + 1) * 8, u64, entry);
> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>  
>  			XE_WARN_ON(pt_bo->size != SZ_4K);
>  
> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>  			bb->cs[bb->len++] = lower_32_bits(addr);
>  			bb->cs[bb->len++] = upper_32_bits(addr);
>  		}
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 64713f400d94..019af2920078 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -10,6 +10,7 @@
>  #include "xe_gt.h"
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_migrate.h"
> +#include "xe_pat.h"
>  #include "xe_pt_types.h"
>  #include "xe_pt_walk.h"
>  #include "xe_res_cursor.h"
> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>   *
>   * Return: An encoded page directory entry. No errors.
>   */
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> -		  const enum xe_cache_level cache)
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>  {
>  	u64 pde;
>  	struct xe_vm *vm = bo->vm;
>  	struct xe_device *xe = vm->xe;
>  
> -
>  	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>  	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>  
>  	if ((vm->pat_encode).pde_encode)
> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>  
>  	return pde;
>  }
>  
> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> +static u64 __pte_encode(u64 pte, u32 pat_index,
>  			struct xe_vma *vma, u32 pt_level)
>  {
>  	struct xe_vm *vm = xe_vma_vm(vma);
> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>  		pte |= XE_PTE_NULL;
>  
>  	if ((vm->pat_encode).pte_encode)
> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>  
>  	if (pt_level == 1)
>  		pte |= XE_PDE_PS_2M;
> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>   *
>   * Return: An encoded page-table entry. No errors.
>   */
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>  		  u32 pt_level)
>  {
>  	u64 pte;
> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>  	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>  		pte |= XE_PPGTT_PTE_DM;
>  
> -	return __pte_encode(pte, cache, NULL, pt_level);
> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>  }
>  
>  static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>  
>  	if (level == 0) {
>  		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> -					  XE_CACHE_WB, 0);
> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>  
>  		return empty;
>  	} else {
>  		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> -				     XE_CACHE_WB);
> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>  	}
>  }
>  
> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>  	struct xe_vm *vm;
>  	/** @tile: The tile we're building for. */
>  	struct xe_tile *tile;
> -	/** @cache: Desired cache level for the ptes */
> -	enum xe_cache_level cache;
>  	/** @default_pte: PTE flag only template. No address is associated */
>  	u64 default_pte;
>  	/** @dma_offset: DMA offset to add to the PTE. */
> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>  
>  		pte = __pte_encode(is_null ? 0 :
>  				   xe_res_dma(curs) + xe_walk->dma_offset,
> -				   xe_walk->cache, xe_walk->vma, level);
> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>  		pte |= xe_walk->default_pte;
>  
>  		/*
> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>  			xe_child->is_compact = true;
>  		}
>  
> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>  		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>  					 pte);
>  	}
> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>  		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>  			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>  		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> -		xe_walk.cache = XE_CACHE_WB;
> -	} else {
> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> -			xe_walk.cache = XE_CACHE_WT;
> -		else
> -			xe_walk.cache = XE_CACHE_WB;
>  	}
> +
>  	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>  		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>  
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 01be7ab08f87..1d433a5a96b4 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>  
>  bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>  
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> -		  const enum xe_cache_level level);
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>  
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> -		  u32 pt_level);
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 7eeeed0411f3..34603a7e84b0 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -6,6 +6,7 @@
>  #include "xe_vm.h"
>  
>  #include <linux/dma-fence-array.h>
> +#include <linux/nospec.h>
>  
>  #include <drm/drm_print.h>
>  #include <drm/ttm/ttm_execbuf_util.h>
> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>  				    u64 start, u64 end,
>  				    bool read_only,
>  				    bool is_null,
> -				    u8 tile_mask)
> +				    u8 tile_mask,
> +				    u32 pat_index)
>  {
>  	struct xe_vma *vma;
>  	struct xe_tile *tile;
> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>  			vma->tile_mask |= 0x1 << id;
>  	}
>  
> +	vma->pat_index = pat_index;
> +
>  	if (vm->xe->info.platform == XE_PVC)
>  		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>  
> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>  static void vm_destroy_work_func(struct work_struct *w);
>  
>  static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pde_pat &= ~(XELP_PDE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>  }
>  
>  static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache)
> +				     u32 pat_index)
>  {
> -	u32 pat_index = xe_pat_get_index(xe, cache);
> -
>  	pte_pat &= ~(XELP_PTE_PAT_MASK);
>  
>  	if (pat_index & BIT(0))
> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>  u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>  {
>  	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> -			     XE_CACHE_WB);
> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>  }
>  
>  static struct dma_fence *
> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>  static struct drm_gpuva_ops *
>  vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> -			 u32 operation, u8 tile_mask, u32 region)
> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>  {
>  	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>  	struct ww_acquire_ctx ww;
> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>  
>  			op->tile_mask = tile_mask;
> +			op->pat_index = pat_index;
>  			op->map.immediate =
>  				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>  			op->map.read_only =
> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>  
>  			op->tile_mask = tile_mask;
> +			op->pat_index = pat_index;
>  			op->prefetch.region = region;
>  		}
>  		break;
> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>  }
>  
>  static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> -			      u8 tile_mask, bool read_only, bool is_null)
> +			      u8 tile_mask, bool read_only, bool is_null,
> +			      u32 pat_index)
>  {
>  	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>  	struct xe_vma *vma;
> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>  	vma = xe_vma_create(vm, bo, op->gem.offset,
>  			    op->va.addr, op->va.addr +
>  			    op->va.range - 1, read_only, is_null,
> -			    tile_mask);
> +			    tile_mask, pat_index);
>  	if (bo)
>  		xe_bo_unlock(bo, &ww);
>  
> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  				vma = new_vma(vm, &op->base.map,
>  					      op->tile_mask, op->map.read_only,
> -					      op->map.is_null);
> +					      op->map.is_null, op->pat_index);
>  				if (IS_ERR(vma)) {
>  					err = PTR_ERR(vma);
>  					goto free_fence;
> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  					vma = new_vma(vm, op->base.remap.prev,
>  						      op->tile_mask, read_only,
> -						      is_null);
> +						      is_null, op->pat_index);
>  					if (IS_ERR(vma)) {
>  						err = PTR_ERR(vma);
>  						goto free_fence;
> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>  
>  					vma = new_vma(vm, op->base.remap.next,
>  						      op->tile_mask, read_only,
> -						      is_null);
> +						      is_null, op->pat_index);
>  					if (IS_ERR(vma)) {
>  						err = PTR_ERR(vma);
>  						goto free_fence;
> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>  		u32 obj = (*bind_ops)[i].obj;
>  		u64 obj_offset = (*bind_ops)[i].obj_offset;
>  		u32 region = (*bind_ops)[i].region;
> +		u32 pat_index = (*bind_ops)[i].pat_index;
>  		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>  
> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> +			err = -EINVAL;
> +			goto free_bind_ops;
> +		}
> +
>  		if (i == 0) {
>  			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>  		} else if (XE_IOCTL_DBG(xe, !*async) ||
> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  		struct drm_gem_object *gem_obj;
>  		u64 range = bind_ops[i].range;
>  		u64 addr = bind_ops[i].addr;
> +		u32 op = bind_ops[i].op;
>  		u32 obj = bind_ops[i].obj;
>  		u64 obj_offset = bind_ops[i].obj_offset;
> +		u32 pat_index = bind_ops[i].pat_index;
> +		u16 coh_mode;
> +
> +		pat_index = array_index_nospec(pat_index,
> +					       xe->info.pat_table_n_entries);
> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> +			err = -EINVAL;
> +			goto put_obj;
> +		}
> +
> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
> +			err = -EINVAL;
> +			goto put_obj;
> +		}
>  
>  		if (!obj)
>  			continue;
> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  				goto put_obj;
>  			}
>  		}
> +
> +		if (bos[i]->coh_mode) {
> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> +				err = -EINVAL;
> +				goto put_obj;
> +			}
> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> +			/*
> +			 * Imported dma-buf from a different device should
> +			 * require 1way or 2way coherency since we don't know
> +			 * how it was mapped on CPU. Just assume is it
> +			 * potentially cached on CPU side.
> +			 */
> +			err = -EINVAL;
> +			goto put_obj;
> +		}
>  	}
>  
>  	if (args->num_syncs) {
> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  		u64 obj_offset = bind_ops[i].obj_offset;
>  		u8 tile_mask = bind_ops[i].tile_mask;
>  		u32 region = bind_ops[i].region;
> +		u32 pat_index = bind_ops[i].pat_index;
>  
>  		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>  						  addr, range, op, tile_mask,
> -						  region);
> +						  region, pat_index);
>  		if (IS_ERR(ops[i])) {
>  			err = PTR_ERR(ops[i]);
>  			ops[i] = NULL;
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 83a1f87b6537..508679ed3c74 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -111,6 +111,11 @@ struct xe_vma {
>  	 */
>  	u8 tile_present;
>  
> +	/**
> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
> +	 */
> +	u32 pat_index;
> +
>  	struct {
>  		struct list_head rebind_link;
>  	} notifier;
> @@ -338,10 +343,8 @@ struct xe_vm {
>  	bool batch_invalidate_tlb;
>  
>  	struct {
> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> -						enum xe_cache_level cache);
> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> -						enum xe_cache_level cache);
> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>  	} pat_encode;
>  };
>  
> @@ -417,6 +420,8 @@ struct xe_vma_op {
>  	struct async_op_fence *fence;
>  	/** @tile_mask: gt mask for this operation */
>  	u8 tile_mask;
> +	/** @pat_index: The pat index to use for this operation. */
> +	u32 pat_index;
>  	/** @flags: operation flags */
>  	enum xe_vma_op_flags flags;
>  
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 64bc66d4b550..0c15b6f32447 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>  	 */
>  	__u32 obj;
>  
> -	/** @pad: MBZ */
> -	__u32 pad;
> +	/**
> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
> +	 * The index basically maps to some predefined memory attributes,
> +	 * including things like caching, coherency and likely other stuff in
> +	 * the future.  The exact meaning of the pat_index is platform specific
> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
> +	 * the index here is encoded into the ppGTT PTE.
> +	 *
> +	 * For coherency the @pat_index needs to match the
> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> +	 * the coherency mode from the @pat_index and reject if there is a
> +	 * mismatch (see note below for pre-MTL platforms).
> +	 *
> +	 * Note: On pre-MTL platforms there is only a caching mode and no
> +	 * explicit coherency mode, but on such hardware there is always a
> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> +	 * CPU caches even with the caching mode set as uncached.  It's only the
> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
> +	 * consistent with newer platforms the KMD groups the different cache
> +	 * levels into the following coherency buckets on all pre-MTL platforms:
> +	 *
> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> +	 *
> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
> +	 * such platforms since it is only the display engine that is actually
> +	 * incoherent. Everything else should typically use WB given that we
> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
> +	 * @pat_index.
> +	 *
> +	 * Note: For userptr and externally imported dma-buf the kernel expects
> +	 * either 1WAY or 2WAY for the @pat_index.
> +	 */
> +	__u32 pat_index;
>  
>  	union {
>  		/**


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-07 18:56   ` Souza, Jose
@ 2023-09-08  6:51     ` Matthew Auld
  2023-09-13 15:35       ` Souza, Jose
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-09-08  6:51 UTC (permalink / raw)
  To: Souza, Jose, intel-xe@lists.freedesktop.org

On 07/09/2023 19:56, Souza, Jose wrote:
> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
> 
> xe_ttm_tt_create() caching should not also take into consideration the PAT index?

Which PAT index? At least for userspace objects we might not know the 
pat_index at this stage, also the pat_index is more per vma, while the 
CPU caching mode in xe_ttm_tt_create() is per object. At gem create we 
only need to consider the overall coh_mode of every pat_index that will 
map the object to determine if the CPU caching mode is "valid".

> 
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>>   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>   10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>   
>>   	u64 (*pte_encode)(dma_addr_t addr,
>> -			  unsigned int pat_index,
>> +			  u32 pat_index,
>>   			  u32 flags); /* Create a valid PTE */
>>   #define PTE_READ_ONLY	BIT(0)
>>   #define PTE_LM		BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>>   	/* First part of the test, are we updating our pagetable bo with a new entry? */
>>   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>>   		  0xdeaddeadbeefbeef);
>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>   		expected |= XE_PTE_PS64;
>>   	if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>>   		pte |= XE_GGTT_PTE_DM;
>>   
>>   	if ((ggtt->pat_encode).pte_encode)
>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>   
>>   	return pte;
>>   }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>   }
>>   
>>   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>   
>>   	struct {
>>   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> +				  u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>>   #include "xe_lrc.h"
>>   #include "xe_map.h"
>>   #include "xe_mocs.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt.h"
>>   #include "xe_res_cursor.h"
>>   #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>>   	u32 map_ofs, level, i;
>>   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>   	u64 entry;
>>   	int ret;
>>   
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		return ret;
>>   	}
>>   
>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>   
>>   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>   
>>   	/* Map the entire BO in our level 0 pt */
>>   	for (i = 0, level = 0; i < num_entries; level++) {
>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>   
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   		for (i = 0; i < batch->size;
>>   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>   		     XE_PAGE_SIZE) {
>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>   
>>   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>   				  entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   			flags = XE_PDE_64K;
>>   
>>   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> -					XE_PAGE_SIZE, XE_CACHE_WB);
>> +					XE_PAGE_SIZE, pat_index);
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>>   			  entry | flags);
>>   	}
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>   	/* Write PDE's that point to our BO. */
>>   	for (i = 0; i < num_entries - num_level; i++) {
>>   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> -				      XE_CACHE_WB);
>> +				      pat_index);
>>   
>>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>   			  (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>   
>>   			XE_WARN_ON(pt_bo->size != SZ_4K);
>>   
>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>   			bb->cs[bb->len++] = lower_32_bits(addr);
>>   			bb->cs[bb->len++] = upper_32_bits(addr);
>>   		}
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>>   #include "xe_gt.h"
>>   #include "xe_gt_tlb_invalidation.h"
>>   #include "xe_migrate.h"
>> +#include "xe_pat.h"
>>   #include "xe_pt_types.h"
>>   #include "xe_pt_walk.h"
>>   #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>>    *
>>    * Return: An encoded page directory entry. No errors.
>>    */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>   {
>>   	u64 pde;
>>   	struct xe_vm *vm = bo->vm;
>>   	struct xe_device *xe = vm->xe;
>>   
>> -
>>   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>   
>>   	if ((vm->pat_encode).pde_encode)
>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>   
>>   	return pde;
>>   }
>>   
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>   			struct xe_vma *vma, u32 pt_level)
>>   {
>>   	struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>   		pte |= XE_PTE_NULL;
>>   
>>   	if ((vm->pat_encode).pte_encode)
>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>   
>>   	if (pt_level == 1)
>>   		pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>    *
>>    * Return: An encoded page-table entry. No errors.
>>    */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>   		  u32 pt_level)
>>   {
>>   	u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>   		pte |= XE_PPGTT_PTE_DM;
>>   
>> -	return __pte_encode(pte, cache, NULL, pt_level);
>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>   }
>>   
>>   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>   
>>   	if (level == 0) {
>>   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> -					  XE_CACHE_WB, 0);
>> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>   
>>   		return empty;
>>   	} else {
>>   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> -				     XE_CACHE_WB);
>> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   	}
>>   }
>>   
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>   	struct xe_vm *vm;
>>   	/** @tile: The tile we're building for. */
>>   	struct xe_tile *tile;
>> -	/** @cache: Desired cache level for the ptes */
>> -	enum xe_cache_level cache;
>>   	/** @default_pte: PTE flag only template. No address is associated */
>>   	u64 default_pte;
>>   	/** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   
>>   		pte = __pte_encode(is_null ? 0 :
>>   				   xe_res_dma(curs) + xe_walk->dma_offset,
>> -				   xe_walk->cache, xe_walk->vma, level);
>> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>>   		pte |= xe_walk->default_pte;
>>   
>>   		/*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>   			xe_child->is_compact = true;
>>   		}
>>   
>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>>   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>   					 pte);
>>   	}
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>>   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>   		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> -		xe_walk.cache = XE_CACHE_WB;
>> -	} else {
>> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> -			xe_walk.cache = XE_CACHE_WT;
>> -		else
>> -			xe_walk.cache = XE_CACHE_WB;
>>   	}
>> +
>>   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>   		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>   
>>   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>   
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> -		  const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>   
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> -		  u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>   
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>>   #include "xe_vm.h"
>>   
>>   #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>   
>>   #include <drm/drm_print.h>
>>   #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   				    u64 start, u64 end,
>>   				    bool read_only,
>>   				    bool is_null,
>> -				    u8 tile_mask)
>> +				    u8 tile_mask,
>> +				    u32 pat_index)
>>   {
>>   	struct xe_vma *vma;
>>   	struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>   			vma->tile_mask |= 0x1 << id;
>>   	}
>>   
>> +	vma->pat_index = pat_index;
>> +
>>   	if (vm->xe->info.platform == XE_PVC)
>>   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>   
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>>   static void vm_destroy_work_func(struct work_struct *w);
>>   
>>   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>   }
>>   
>>   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache)
>> +				     u32 pat_index)
>>   {
>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>>   	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>   
>>   	if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>>   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>   {
>>   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> -			     XE_CACHE_WB);
>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>   }
>>   
>>   static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>>   static struct drm_gpuva_ops *
>>   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>> -			 u32 operation, u8 tile_mask, u32 region)
>> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>>   {
>>   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>   	struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->map.immediate =
>>   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>   			op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>   
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->prefetch.region = region;
>>   		}
>>   		break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   }
>>   
>>   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> -			      u8 tile_mask, bool read_only, bool is_null)
>> +			      u8 tile_mask, bool read_only, bool is_null,
>> +			      u32 pat_index)
>>   {
>>   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>   	struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>   	vma = xe_vma_create(vm, bo, op->gem.offset,
>>   			    op->va.addr, op->va.addr +
>>   			    op->va.range - 1, read_only, is_null,
>> -			    tile_mask);
>> +			    tile_mask, pat_index);
>>   	if (bo)
>>   		xe_bo_unlock(bo, &ww);
>>   
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   				vma = new_vma(vm, &op->base.map,
>>   					      op->tile_mask, op->map.read_only,
>> -					      op->map.is_null);
>> +					      op->map.is_null, op->pat_index);
>>   				if (IS_ERR(vma)) {
>>   					err = PTR_ERR(vma);
>>   					goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.prev,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>   
>>   					vma = new_vma(vm, op->base.remap.next,
>>   						      op->tile_mask, read_only,
>> -						      is_null);
>> +						      is_null, op->pat_index);
>>   					if (IS_ERR(vma)) {
>>   						err = PTR_ERR(vma);
>>   						goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>>   		u32 obj = (*bind_ops)[i].obj;
>>   		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>   		u32 region = (*bind_ops)[i].region;
>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>   
>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> +			err = -EINVAL;
>> +			goto free_bind_ops;
>> +		}
>> +
>>   		if (i == 0) {
>>   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>   		} else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		struct drm_gem_object *gem_obj;
>>   		u64 range = bind_ops[i].range;
>>   		u64 addr = bind_ops[i].addr;
>> +		u32 op = bind_ops[i].op;
>>   		u32 obj = bind_ops[i].obj;
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>> +		u32 pat_index = bind_ops[i].pat_index;
>> +		u16 coh_mode;
>> +
>> +		pat_index = array_index_nospec(pat_index,
>> +					       xe->info.pat_table_n_entries);
>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>> +
>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   
>>   		if (!obj)
>>   			continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   				goto put_obj;
>>   			}
>>   		}
>> +
>> +		if (bos[i]->coh_mode) {
>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> +				err = -EINVAL;
>> +				goto put_obj;
>> +			}
>> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> +			/*
>> +			 * Imported dma-buf from a different device should
>> +			 * require 1way or 2way coherency since we don't know
>> +			 * how it was mapped on CPU. Just assume is it
>> +			 * potentially cached on CPU side.
>> +			 */
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   	}
>>   
>>   	if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>>   		u8 tile_mask = bind_ops[i].tile_mask;
>>   		u32 region = bind_ops[i].region;
>> +		u32 pat_index = bind_ops[i].pat_index;
>>   
>>   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>   						  addr, range, op, tile_mask,
>> -						  region);
>> +						  region, pat_index);
>>   		if (IS_ERR(ops[i])) {
>>   			err = PTR_ERR(ops[i]);
>>   			ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>>   	 */
>>   	u8 tile_present;
>>   
>> +	/**
>> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> +	 */
>> +	u32 pat_index;
>> +
>>   	struct {
>>   		struct list_head rebind_link;
>>   	} notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>>   	bool batch_invalidate_tlb;
>>   
>>   	struct {
>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> -						enum xe_cache_level cache);
>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> -						enum xe_cache_level cache);
>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>>   	} pat_encode;
>>   };
>>   
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>   	struct async_op_fence *fence;
>>   	/** @tile_mask: gt mask for this operation */
>>   	u8 tile_mask;
>> +	/** @pat_index: The pat index to use for this operation. */
>> +	u32 pat_index;
>>   	/** @flags: operation flags */
>>   	enum xe_vma_op_flags flags;
>>   
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>   	 */
>>   	__u32 obj;
>>   
>> -	/** @pad: MBZ */
>> -	__u32 pad;
>> +	/**
>> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
>> +	 * The index basically maps to some predefined memory attributes,
>> +	 * including things like caching, coherency and likely other stuff in
>> +	 * the future.  The exact meaning of the pat_index is platform specific
>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
>> +	 * the index here is encoded into the ppGTT PTE.
>> +	 *
>> +	 * For coherency the @pat_index needs to match the
>> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> +	 * the coherency mode from the @pat_index and reject if there is a
>> +	 * mismatch (see note below for pre-MTL platforms).
>> +	 *
>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>> +	 * explicit coherency mode, but on such hardware there is always a
>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> +	 * CPU caches even with the caching mode set as uncached.  It's only the
>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> +	 * consistent with newer platforms the KMD groups the different cache
>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>> +	 *
>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>> +	 *
>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>> +	 * such platforms since it is only the display engine that is actually
>> +	 * incoherent. Everything else should typically use WB given that we
>> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
>> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
>> +	 * @pat_index.
>> +	 *
>> +	 * Note: For userptr and externally imported dma-buf the kernel expects
>> +	 * either 1WAY or 2WAY for the @pat_index.
>> +	 */
>> +	__u32 pat_index;
>>   
>>   	union {
>>   		/**
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-08  6:51     ` Matthew Auld
@ 2023-09-13 15:35       ` Souza, Jose
  2023-09-13 15:50         ` Matthew Auld
  0 siblings, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-13 15:35 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Auld,  Matthew

On Fri, 2023-09-08 at 07:51 +0100, Matthew Auld wrote:
> On 07/09/2023 19:56, Souza, Jose wrote:
> > On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> > > Allow userspace to directly control the pat_index for a given vm
> > > binding. This should allow directly controlling the coherency, caching
> > > and potentially other stuff in the future for the ppGTT binding.
> > > 
> > > The exact meaning behind the pat_index is very platform specific (see
> > > BSpec or PRMs) but effectively maps to some predefined memory
> > > attributes. From the KMD pov we only care about the coherency that is
> > > provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> > > The vm_bind coherency mode for the given pat_index needs to match the
> > > given coh_mode that was set at object creation. For platforms that lack
> > > the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> > > 
> > > For userptr mappings we lack a corresponding gem object, so the expected
> > > coherency mode is instead implicit and must fall into either 1WAY or
> > > 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> > > dma-buf (from a different device) the coherency mode is also implicit
> > > and must also be either 1WAY or 2WAY.
> > > 
> > > As part of adding pat_index support with vm_bind we also need stop using
> > > xe_cache_level and instead use the pat_index in various places. We still
> > > make use of xe_cache_level, but only as a convenience for kernel
> > > internal objectsi (internally it maps to some reasonable pat_index). For
> > > now this is just a 1:1 conversion of the existing code, however for
> > > platforms like MTL+ we might need to give more control through bo_create
> > > or stop using WB on the CPU side if we need CPU access.
> > 
> > xe_ttm_tt_create() caching should not also take into consideration the PAT index?
> 
> Which PAT index? At least for userspace objects we might not know the 
> pat_index at this stage, also the pat_index is more per vma, while the 
> CPU caching mode in xe_ttm_tt_create() is per object. At gem create we 
> only need to consider the overall coh_mode of every pat_index that will 
> map the object to determine if the CPU caching mode is "valid".

Yeah it is too late.
I should not consider smem_caching in drm_xe_gem_create? It that what will be used as caching mode for mmap?

> 
> > 
> > > 
> > > Bspec: 45101, 44235 #xe
> > > Bspec: 70552, 71582, 59400 #xe2
> > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: José Roberto de Souza <jose.souza@intel.com>
> > > Cc: Filip Hazubski <filip.hazubski@intel.com>
> > > Cc: Carl Zhang <carl.zhang@intel.com>
> > > Cc: Effie Yu <effie.yu@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> > >   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> > >   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> > >   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> > >   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> > >   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> > >   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> > >   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
> > >   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> > >   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> > >   10 files changed, 134 insertions(+), 58 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > index 4d6296cdbcfd..bb4c182048c3 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > @@ -302,7 +302,7 @@ struct i915_address_space {
> > >   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> > >   
> > >   	u64 (*pte_encode)(dma_addr_t addr,
> > > -			  unsigned int pat_index,
> > > +			  u32 pat_index,
> > >   			  u32 flags); /* Create a valid PTE */
> > >   #define PTE_READ_ONLY	BIT(0)
> > >   #define PTE_LM		BIT(1)
> > > diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > index 5c8d5e78d9bc..7a128fd20a29 100644
> > > --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> > >   	/* First part of the test, are we updating our pagetable bo with a new entry? */
> > >   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> > >   		  0xdeaddeadbeefbeef);
> > > -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> > > +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > >   	if (m->q->vm->flags & XE_VM_FLAG_64K)
> > >   		expected |= XE_PTE_PS64;
> > >   	if (xe_bo_is_vram(pt))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> > > index 209fa053d9fb..4134c26150a5 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt.c
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> > > @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> > >   		pte |= XE_GGTT_PTE_DM;
> > >   
> > >   	if ((ggtt->pat_encode).pte_encode)
> > > -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> > > +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> > > +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
> > >   
> > >   	return pte;
> > >   }
> > > @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> > >   }
> > >   
> > >   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache)
> > > +				     u32 pat_index)
> > >   {
> > > -	u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > >   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> > >   
> > >   	if (pat_index & BIT(0))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > index 7e55fac1a8a9..0bc40cb072e3 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > @@ -31,7 +31,7 @@ struct xe_ggtt {
> > >   
> > >   	struct {
> > >   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache);
> > > +				  u32 pat_index);
> > >   	} pat_encode;
> > >   };
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > index a782ea282cb6..54585e98452a 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > @@ -24,6 +24,7 @@
> > >   #include "xe_lrc.h"
> > >   #include "xe_map.h"
> > >   #include "xe_mocs.h"
> > > +#include "xe_pat.h"
> > >   #include "xe_pt.h"
> > >   #include "xe_res_cursor.h"
> > >   #include "xe_sched_job.h"
> > > @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> > >   	u32 map_ofs, level, i;
> > >   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> > > +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> > >   	u64 entry;
> > >   	int ret;
> > >   
> > > @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   		return ret;
> > >   	}
> > >   
> > > -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> > > +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> > >   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> > >   
> > >   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> > >   
> > >   	/* Map the entire BO in our level 0 pt */
> > >   	for (i = 0, level = 0; i < num_entries; level++) {
> > > -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> > > +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> > >   
> > >   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> > >   
> > > @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   		for (i = 0; i < batch->size;
> > >   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> > >   		     XE_PAGE_SIZE) {
> > > -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> > > +			entry = xe_pte_encode(batch, i, pat_index, 0);
> > >   
> > >   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> > >   				  entry);
> > > @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   			flags = XE_PDE_64K;
> > >   
> > >   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> > > -					XE_PAGE_SIZE, XE_CACHE_WB);
> > > +					XE_PAGE_SIZE, pat_index);
> > >   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> > >   			  entry | flags);
> > >   	}
> > > @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > >   	/* Write PDE's that point to our BO. */
> > >   	for (i = 0; i < num_entries - num_level; i++) {
> > >   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> > > -				      XE_CACHE_WB);
> > > +				      pat_index);
> > >   
> > >   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> > >   			  (i + 1) * 8, u64, entry);
> > > @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
> > >   
> > >   			XE_WARN_ON(pt_bo->size != SZ_4K);
> > >   
> > > -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> > > +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > >   			bb->cs[bb->len++] = lower_32_bits(addr);
> > >   			bb->cs[bb->len++] = upper_32_bits(addr);
> > >   		}
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > index 64713f400d94..019af2920078 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -10,6 +10,7 @@
> > >   #include "xe_gt.h"
> > >   #include "xe_gt_tlb_invalidation.h"
> > >   #include "xe_migrate.h"
> > > +#include "xe_pat.h"
> > >   #include "xe_pt_types.h"
> > >   #include "xe_pt_walk.h"
> > >   #include "xe_res_cursor.h"
> > > @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> > >    *
> > >    * Return: An encoded page directory entry. No errors.
> > >    */
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > -		  const enum xe_cache_level cache)
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> > >   {
> > >   	u64 pde;
> > >   	struct xe_vm *vm = bo->vm;
> > >   	struct xe_device *xe = vm->xe;
> > >   
> > > -
> > >   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> > >   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> > >   
> > >   	if ((vm->pat_encode).pde_encode)
> > > -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> > > +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> > >   
> > >   	return pde;
> > >   }
> > >   
> > > -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > +static u64 __pte_encode(u64 pte, u32 pat_index,
> > >   			struct xe_vma *vma, u32 pt_level)
> > >   {
> > >   	struct xe_vm *vm = xe_vma_vm(vma);
> > > @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > >   		pte |= XE_PTE_NULL;
> > >   
> > >   	if ((vm->pat_encode).pte_encode)
> > > -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> > > +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> > >   
> > >   	if (pt_level == 1)
> > >   		pte |= XE_PDE_PS_2M;
> > > @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > >    *
> > >    * Return: An encoded page-table entry. No errors.
> > >    */
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> > >   		  u32 pt_level)
> > >   {
> > >   	u64 pte;
> > > @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > >   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> > >   		pte |= XE_PPGTT_PTE_DM;
> > >   
> > > -	return __pte_encode(pte, cache, NULL, pt_level);
> > > +	return __pte_encode(pte, pat_index, NULL, pt_level);
> > >   }
> > >   
> > >   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > > @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > >   
> > >   	if (level == 0) {
> > >   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> > > -					  XE_CACHE_WB, 0);
> > > +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
> > >   
> > >   		return empty;
> > >   	} else {
> > >   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> > > -				     XE_CACHE_WB);
> > > +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > >   	}
> > >   }
> > >   
> > > @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> > >   	struct xe_vm *vm;
> > >   	/** @tile: The tile we're building for. */
> > >   	struct xe_tile *tile;
> > > -	/** @cache: Desired cache level for the ptes */
> > > -	enum xe_cache_level cache;
> > >   	/** @default_pte: PTE flag only template. No address is associated */
> > >   	u64 default_pte;
> > >   	/** @dma_offset: DMA offset to add to the PTE. */
> > > @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > >   
> > >   		pte = __pte_encode(is_null ? 0 :
> > >   				   xe_res_dma(curs) + xe_walk->dma_offset,
> > > -				   xe_walk->cache, xe_walk->vma, level);
> > > +				   xe_walk->vma->pat_index, xe_walk->vma, level);
> > >   		pte |= xe_walk->default_pte;
> > >   
> > >   		/*
> > > @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > >   			xe_child->is_compact = true;
> > >   		}
> > >   
> > > -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> > > +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> > >   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> > >   					 pte);
> > >   	}
> > > @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > >   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> > >   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> > >   		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> > > -		xe_walk.cache = XE_CACHE_WB;
> > > -	} else {
> > > -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> > > -			xe_walk.cache = XE_CACHE_WT;
> > > -		else
> > > -			xe_walk.cache = XE_CACHE_WB;
> > >   	}
> > > +
> > >   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> > >   		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> > > index 01be7ab08f87..1d433a5a96b4 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
> > >   
> > >   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > >   
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > -		  const enum xe_cache_level level);
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> > >   
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > -		  u32 pt_level);
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
> > >   
> > >   #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > index 7eeeed0411f3..34603a7e84b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -6,6 +6,7 @@
> > >   #include "xe_vm.h"
> > >   
> > >   #include <linux/dma-fence-array.h>
> > > +#include <linux/nospec.h>
> > >   
> > >   #include <drm/drm_print.h>
> > >   #include <drm/ttm/ttm_execbuf_util.h>
> > > @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > >   				    u64 start, u64 end,
> > >   				    bool read_only,
> > >   				    bool is_null,
> > > -				    u8 tile_mask)
> > > +				    u8 tile_mask,
> > > +				    u32 pat_index)
> > >   {
> > >   	struct xe_vma *vma;
> > >   	struct xe_tile *tile;
> > > @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > >   			vma->tile_mask |= 0x1 << id;
> > >   	}
> > >   
> > > +	vma->pat_index = pat_index;
> > > +
> > >   	if (vm->xe->info.platform == XE_PVC)
> > >   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> > >   
> > > @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> > >   static void vm_destroy_work_func(struct work_struct *w);
> > >   
> > >   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > > -						enum xe_cache_level cache)
> > > +				     u32 pat_index)
> > >   {
> > > -	u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > >   	pde_pat &= ~(XELP_PDE_PAT_MASK);
> > >   
> > >   	if (pat_index & BIT(0))
> > > @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > >   }
> > >   
> > >   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache)
> > > +				     u32 pat_index)
> > >   {
> > > -	u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > >   	pte_pat &= ~(XELP_PTE_PAT_MASK);
> > >   
> > >   	if (pat_index & BIT(0))
> > > @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> > >   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> > >   {
> > >   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> > > -			     XE_CACHE_WB);
> > > +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > >   }
> > >   
> > >   static struct dma_fence *
> > > @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> > >   static struct drm_gpuva_ops *
> > >   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> > > -			 u32 operation, u8 tile_mask, u32 region)
> > > +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> > >   {
> > >   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> > >   	struct ww_acquire_ctx ww;
> > > @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > >   
> > >   			op->tile_mask = tile_mask;
> > > +			op->pat_index = pat_index;
> > >   			op->map.immediate =
> > >   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
> > >   			op->map.read_only =
> > > @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > >   
> > >   			op->tile_mask = tile_mask;
> > > +			op->pat_index = pat_index;
> > >   			op->prefetch.region = region;
> > >   		}
> > >   		break;
> > > @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > >   }
> > >   
> > >   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > > -			      u8 tile_mask, bool read_only, bool is_null)
> > > +			      u8 tile_mask, bool read_only, bool is_null,
> > > +			      u32 pat_index)
> > >   {
> > >   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> > >   	struct xe_vma *vma;
> > > @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > >   	vma = xe_vma_create(vm, bo, op->gem.offset,
> > >   			    op->va.addr, op->va.addr +
> > >   			    op->va.range - 1, read_only, is_null,
> > > -			    tile_mask);
> > > +			    tile_mask, pat_index);
> > >   	if (bo)
> > >   		xe_bo_unlock(bo, &ww);
> > >   
> > > @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >   
> > >   				vma = new_vma(vm, &op->base.map,
> > >   					      op->tile_mask, op->map.read_only,
> > > -					      op->map.is_null);
> > > +					      op->map.is_null, op->pat_index);
> > >   				if (IS_ERR(vma)) {
> > >   					err = PTR_ERR(vma);
> > >   					goto free_fence;
> > > @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >   
> > >   					vma = new_vma(vm, op->base.remap.prev,
> > >   						      op->tile_mask, read_only,
> > > -						      is_null);
> > > +						      is_null, op->pat_index);
> > >   					if (IS_ERR(vma)) {
> > >   						err = PTR_ERR(vma);
> > >   						goto free_fence;
> > > @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >   
> > >   					vma = new_vma(vm, op->base.remap.next,
> > >   						      op->tile_mask, read_only,
> > > -						      is_null);
> > > +						      is_null, op->pat_index);
> > >   					if (IS_ERR(vma)) {
> > >   						err = PTR_ERR(vma);
> > >   						goto free_fence;
> > > @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> > >   		u32 obj = (*bind_ops)[i].obj;
> > >   		u64 obj_offset = (*bind_ops)[i].obj_offset;
> > >   		u32 region = (*bind_ops)[i].region;
> > > +		u32 pat_index = (*bind_ops)[i].pat_index;
> > >   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> > >   
> > > +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> > > +			err = -EINVAL;
> > > +			goto free_bind_ops;
> > > +		}
> > > +
> > >   		if (i == 0) {
> > >   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> > >   		} else if (XE_IOCTL_DBG(xe, !*async) ||
> > > @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > >   		struct drm_gem_object *gem_obj;
> > >   		u64 range = bind_ops[i].range;
> > >   		u64 addr = bind_ops[i].addr;
> > > +		u32 op = bind_ops[i].op;
> > >   		u32 obj = bind_ops[i].obj;
> > >   		u64 obj_offset = bind_ops[i].obj_offset;
> > > +		u32 pat_index = bind_ops[i].pat_index;
> > > +		u16 coh_mode;
> > > +
> > > +		pat_index = array_index_nospec(pat_index,
> > > +					       xe->info.pat_table_n_entries);
> > > +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> > > +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> > > +			err = -EINVAL;
> > > +			goto put_obj;
> > > +		}
> > > +
> > > +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> > > +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > +			err = -EINVAL;
> > > +			goto put_obj;
> > > +		}
> > >   
> > >   		if (!obj)
> > >   			continue;
> > > @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > >   				goto put_obj;
> > >   			}
> > >   		}
> > > +
> > > +		if (bos[i]->coh_mode) {
> > > +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> > > +				err = -EINVAL;
> > > +				goto put_obj;
> > > +			}
> > > +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > +			/*
> > > +			 * Imported dma-buf from a different device should
> > > +			 * require 1way or 2way coherency since we don't know
> > > +			 * how it was mapped on CPU. Just assume is it
> > > +			 * potentially cached on CPU side.
> > > +			 */
> > > +			err = -EINVAL;
> > > +			goto put_obj;
> > > +		}
> > >   	}
> > >   
> > >   	if (args->num_syncs) {
> > > @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > >   		u64 obj_offset = bind_ops[i].obj_offset;
> > >   		u8 tile_mask = bind_ops[i].tile_mask;
> > >   		u32 region = bind_ops[i].region;
> > > +		u32 pat_index = bind_ops[i].pat_index;
> > >   
> > >   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> > >   						  addr, range, op, tile_mask,
> > > -						  region);
> > > +						  region, pat_index);
> > >   		if (IS_ERR(ops[i])) {
> > >   			err = PTR_ERR(ops[i]);
> > >   			ops[i] = NULL;
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index 83a1f87b6537..508679ed3c74 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -111,6 +111,11 @@ struct xe_vma {
> > >   	 */
> > >   	u8 tile_present;
> > >   
> > > +	/**
> > > +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
> > > +	 */
> > > +	u32 pat_index;
> > > +
> > >   	struct {
> > >   		struct list_head rebind_link;
> > >   	} notifier;
> > > @@ -338,10 +343,8 @@ struct xe_vm {
> > >   	bool batch_invalidate_tlb;
> > >   
> > >   	struct {
> > > -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > -						enum xe_cache_level cache);
> > > -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> > > -						enum xe_cache_level cache);
> > > +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> > > +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> > >   	} pat_encode;
> > >   };
> > >   
> > > @@ -417,6 +420,8 @@ struct xe_vma_op {
> > >   	struct async_op_fence *fence;
> > >   	/** @tile_mask: gt mask for this operation */
> > >   	u8 tile_mask;
> > > +	/** @pat_index: The pat index to use for this operation. */
> > > +	u32 pat_index;
> > >   	/** @flags: operation flags */
> > >   	enum xe_vma_op_flags flags;
> > >   
> > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > index 64bc66d4b550..0c15b6f32447 100644
> > > --- a/include/uapi/drm/xe_drm.h
> > > +++ b/include/uapi/drm/xe_drm.h
> > > @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> > >   	 */
> > >   	__u32 obj;
> > >   
> > > -	/** @pad: MBZ */
> > > -	__u32 pad;
> > > +	/**
> > > +	 * @pat_index: The platform defined @pat_index to use for this mapping.
> > > +	 * The index basically maps to some predefined memory attributes,
> > > +	 * including things like caching, coherency and likely other stuff in
> > > +	 * the future.  The exact meaning of the pat_index is platform specific
> > > +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
> > > +	 * the index here is encoded into the ppGTT PTE.
> > > +	 *
> > > +	 * For coherency the @pat_index needs to match the
> > > +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> > > +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> > > +	 * the coherency mode from the @pat_index and reject if there is a
> > > +	 * mismatch (see note below for pre-MTL platforms).
> > > +	 *
> > > +	 * Note: On pre-MTL platforms there is only a caching mode and no
> > > +	 * explicit coherency mode, but on such hardware there is always a
> > > +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> > > +	 * CPU caches even with the caching mode set as uncached.  It's only the
> > > +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> > > +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
> > > +	 * consistent with newer platforms the KMD groups the different cache
> > > +	 * levels into the following coherency buckets on all pre-MTL platforms:
> > > +	 *
> > > +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> > > +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> > > +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> > > +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> > > +	 *
> > > +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
> > > +	 * such platforms since it is only the display engine that is actually
> > > +	 * incoherent. Everything else should typically use WB given that we
> > > +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
> > > +	 * no shared-LLC) and the HW defines the coherency mode as part of the
> > > +	 * @pat_index.
> > > +	 *
> > > +	 * Note: For userptr and externally imported dma-buf the kernel expects
> > > +	 * either 1WAY or 2WAY for the @pat_index.
> > > +	 */
> > > +	__u32 pat_index;
> > >   
> > >   	union {
> > >   		/**
> > 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
  2023-09-13 15:35       ` Souza, Jose
@ 2023-09-13 15:50         ` Matthew Auld
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-09-13 15:50 UTC (permalink / raw)
  To: Souza, Jose, intel-xe@lists.freedesktop.org

On 13/09/2023 16:35, Souza, Jose wrote:
> On Fri, 2023-09-08 at 07:51 +0100, Matthew Auld wrote:
>> On 07/09/2023 19:56, Souza, Jose wrote:
>>> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>>>> Allow userspace to directly control the pat_index for a given vm
>>>> binding. This should allow directly controlling the coherency, caching
>>>> and potentially other stuff in the future for the ppGTT binding.
>>>>
>>>> The exact meaning behind the pat_index is very platform specific (see
>>>> BSpec or PRMs) but effectively maps to some predefined memory
>>>> attributes. From the KMD pov we only care about the coherency that is
>>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>>>> The vm_bind coherency mode for the given pat_index needs to match the
>>>> given coh_mode that was set at object creation. For platforms that lack
>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>>>
>>>> For userptr mappings we lack a corresponding gem object, so the expected
>>>> coherency mode is instead implicit and must fall into either 1WAY or
>>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>>>> dma-buf (from a different device) the coherency mode is also implicit
>>>> and must also be either 1WAY or 2WAY.
>>>>
>>>> As part of adding pat_index support with vm_bind we also need stop using
>>>> xe_cache_level and instead use the pat_index in various places. We still
>>>> make use of xe_cache_level, but only as a convenience for kernel
>>>> internal objectsi (internally it maps to some reasonable pat_index). For
>>>> now this is just a 1:1 conversion of the existing code, however for
>>>> platforms like MTL+ we might need to give more control through bo_create
>>>> or stop using WB on the CPU side if we need CPU access.
>>>
>>> xe_ttm_tt_create() caching should not also take into consideration the PAT index?
>>
>> Which PAT index? At least for userspace objects we might not know the
>> pat_index at this stage, also the pat_index is more per vma, while the
>> CPU caching mode in xe_ttm_tt_create() is per object. At gem create we
>> only need to consider the overall coh_mode of every pat_index that will
>> map the object to determine if the CPU caching mode is "valid".
> 
> Yeah it is too late.
> I should not consider smem_caching in drm_xe_gem_create? It that what will be used as caching mode for mmap?

You will need pick something for smem_caching, unless it's vram-only in 
which case just leave as zero. And yes, the mmap caching mode will be 
whatever you picked for smem_caching.

> 
>>
>>>
>>>>
>>>> Bspec: 45101, 44235 #xe
>>>> Bspec: 70552, 71582, 59400 #xe2
>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
>>>>    drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
>>>>    drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
>>>>    drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
>>>>    drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
>>>>    drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
>>>>    drivers/gpu/drm/xe/xe_pt.h            |  6 +--
>>>>    drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
>>>>    drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
>>>>    include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
>>>>    10 files changed, 134 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>    		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>>>    
>>>>    	u64 (*pte_encode)(dma_addr_t addr,
>>>> -			  unsigned int pat_index,
>>>> +			  u32 pat_index,
>>>>    			  u32 flags); /* Create a valid PTE */
>>>>    #define PTE_READ_ONLY	BIT(0)
>>>>    #define PTE_LM		BIT(1)
>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>>>>    	/* First part of the test, are we updating our pagetable bo with a new entry? */
>>>>    	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>>>>    		  0xdeaddeadbeefbeef);
>>>> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>>>    	if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>    		expected |= XE_PTE_PS64;
>>>>    	if (xe_bo_is_vram(pt))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> index 209fa053d9fb..4134c26150a5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>>>>    		pte |= XE_GGTT_PTE_DM;
>>>>    
>>>>    	if ((ggtt->pat_encode).pte_encode)
>>>> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>>>> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>>>    
>>>>    	return pte;
>>>>    }
>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>>    }
>>>>    
>>>>    static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache)
>>>> +				     u32 pat_index)
>>>>    {
>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>>    	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>    
>>>>    	if (pat_index & BIT(0))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>    
>>>>    	struct {
>>>>    		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache);
>>>> +				  u32 pat_index);
>>>>    	} pat_encode;
>>>>    };
>>>>    
>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>>>> index a782ea282cb6..54585e98452a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>> @@ -24,6 +24,7 @@
>>>>    #include "xe_lrc.h"
>>>>    #include "xe_map.h"
>>>>    #include "xe_mocs.h"
>>>> +#include "xe_pat.h"
>>>>    #include "xe_pt.h"
>>>>    #include "xe_res_cursor.h"
>>>>    #include "xe_sched_job.h"
>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>>    	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>>>>    	u32 map_ofs, level, i;
>>>>    	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>    	u64 entry;
>>>>    	int ret;
>>>>    
>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>>    		return ret;
>>>>    	}
>>>>    
>>>> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>>>> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>>>    	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>    
>>>>    	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>    
>>>>    	/* Map the entire BO in our level 0 pt */
>>>>    	for (i = 0, level = 0; i < num_entries; level++) {
>>>> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>>>> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>>>    
>>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>>>    
>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>>    		for (i = 0; i < batch->size;
>>>>    		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>>>    		     XE_PAGE_SIZE) {
>>>> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>>>> +			entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>    
>>>>    			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>>    				  entry);
>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>>    			flags = XE_PDE_64K;
>>>>    
>>>>    		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>> -					XE_PAGE_SIZE, XE_CACHE_WB);
>>>> +					XE_PAGE_SIZE, pat_index);
>>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>>>>    			  entry | flags);
>>>>    	}
>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>>    	/* Write PDE's that point to our BO. */
>>>>    	for (i = 0; i < num_entries - num_level; i++) {
>>>>    		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>> -				      XE_CACHE_WB);
>>>> +				      pat_index);
>>>>    
>>>>    		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>>>    			  (i + 1) * 8, u64, entry);
>>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>>>    
>>>>    			XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>    
>>>> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>>>> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>>>    			bb->cs[bb->len++] = lower_32_bits(addr);
>>>>    			bb->cs[bb->len++] = upper_32_bits(addr);
>>>>    		}
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>>>> index 64713f400d94..019af2920078 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>> @@ -10,6 +10,7 @@
>>>>    #include "xe_gt.h"
>>>>    #include "xe_gt_tlb_invalidation.h"
>>>>    #include "xe_migrate.h"
>>>> +#include "xe_pat.h"
>>>>    #include "xe_pt_types.h"
>>>>    #include "xe_pt_walk.h"
>>>>    #include "xe_res_cursor.h"
>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>>>>     *
>>>>     * Return: An encoded page directory entry. No errors.
>>>>     */
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> -		  const enum xe_cache_level cache)
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>>>    {
>>>>    	u64 pde;
>>>>    	struct xe_vm *vm = bo->vm;
>>>>    	struct xe_device *xe = vm->xe;
>>>>    
>>>> -
>>>>    	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>    	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>    
>>>>    	if ((vm->pat_encode).pde_encode)
>>>> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>>>    
>>>>    	return pde;
>>>>    }
>>>>    
>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>    			struct xe_vma *vma, u32 pt_level)
>>>>    {
>>>>    	struct xe_vm *vm = xe_vma_vm(vma);
>>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>    		pte |= XE_PTE_NULL;
>>>>    
>>>>    	if ((vm->pat_encode).pte_encode)
>>>> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>>>    
>>>>    	if (pt_level == 1)
>>>>    		pte |= XE_PDE_PS_2M;
>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>     *
>>>>     * Return: An encoded page-table entry. No errors.
>>>>     */
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>    		  u32 pt_level)
>>>>    {
>>>>    	u64 pte;
>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>>>    	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>    		pte |= XE_PPGTT_PTE_DM;
>>>>    
>>>> -	return __pte_encode(pte, cache, NULL, pt_level);
>>>> +	return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>    }
>>>>    
>>>>    static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>>    
>>>>    	if (level == 0) {
>>>>    		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>> -					  XE_CACHE_WB, 0);
>>>> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>>>    
>>>>    		return empty;
>>>>    	} else {
>>>>    		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>>>> -				     XE_CACHE_WB);
>>>> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>>    	}
>>>>    }
>>>>    
>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>    	struct xe_vm *vm;
>>>>    	/** @tile: The tile we're building for. */
>>>>    	struct xe_tile *tile;
>>>> -	/** @cache: Desired cache level for the ptes */
>>>> -	enum xe_cache_level cache;
>>>>    	/** @default_pte: PTE flag only template. No address is associated */
>>>>    	u64 default_pte;
>>>>    	/** @dma_offset: DMA offset to add to the PTE. */
>>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>>>    
>>>>    		pte = __pte_encode(is_null ? 0 :
>>>>    				   xe_res_dma(curs) + xe_walk->dma_offset,
>>>> -				   xe_walk->cache, xe_walk->vma, level);
>>>> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
>>>>    		pte |= xe_walk->default_pte;
>>>>    
>>>>    		/*
>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>>>    			xe_child->is_compact = true;
>>>>    		}
>>>>    
>>>> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>>>> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>>>>    		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>>>    					 pte);
>>>>    	}
>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>>>>    		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>>>    			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>>>    		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>>>> -		xe_walk.cache = XE_CACHE_WB;
>>>> -	} else {
>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>>>> -			xe_walk.cache = XE_CACHE_WT;
>>>> -		else
>>>> -			xe_walk.cache = XE_CACHE_WB;
>>>>    	}
>>>> +
>>>>    	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>    		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>    
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>>>> index 01be7ab08f87..1d433a5a96b4 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>>>    
>>>>    bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>>>    
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> -		  const enum xe_cache_level level);
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>>>    
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>>> -		  u32 pt_level);
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>>>    
>>>>    #endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>> index 7eeeed0411f3..34603a7e84b0 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>> @@ -6,6 +6,7 @@
>>>>    #include "xe_vm.h"
>>>>    
>>>>    #include <linux/dma-fence-array.h>
>>>> +#include <linux/nospec.h>
>>>>    
>>>>    #include <drm/drm_print.h>
>>>>    #include <drm/ttm/ttm_execbuf_util.h>
>>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>>>    				    u64 start, u64 end,
>>>>    				    bool read_only,
>>>>    				    bool is_null,
>>>> -				    u8 tile_mask)
>>>> +				    u8 tile_mask,
>>>> +				    u32 pat_index)
>>>>    {
>>>>    	struct xe_vma *vma;
>>>>    	struct xe_tile *tile;
>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>>>    			vma->tile_mask |= 0x1 << id;
>>>>    	}
>>>>    
>>>> +	vma->pat_index = pat_index;
>>>> +
>>>>    	if (vm->xe->info.platform == XE_PVC)
>>>>    		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>    
>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>>>>    static void vm_destroy_work_func(struct work_struct *w);
>>>>    
>>>>    static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>>> -						enum xe_cache_level cache)
>>>> +				     u32 pat_index)
>>>>    {
>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>>    	pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>    
>>>>    	if (pat_index & BIT(0))
>>>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>>>    }
>>>>    
>>>>    static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache)
>>>> +				     u32 pat_index)
>>>>    {
>>>> -	u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>>    	pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>    
>>>>    	if (pat_index & BIT(0))
>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>>>>    u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>>    {
>>>>    	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>> -			     XE_CACHE_WB);
>>>> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>>    }
>>>>    
>>>>    static struct dma_fence *
>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>>>>    static struct drm_gpuva_ops *
>>>>    vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>    			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>>>> -			 u32 operation, u8 tile_mask, u32 region)
>>>> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>>>>    {
>>>>    	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>    	struct ww_acquire_ctx ww;
>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>    			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>    
>>>>    			op->tile_mask = tile_mask;
>>>> +			op->pat_index = pat_index;
>>>>    			op->map.immediate =
>>>>    				operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>>>    			op->map.read_only =
>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>    			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>    
>>>>    			op->tile_mask = tile_mask;
>>>> +			op->pat_index = pat_index;
>>>>    			op->prefetch.region = region;
>>>>    		}
>>>>    		break;
>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>    }
>>>>    
>>>>    static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>>> -			      u8 tile_mask, bool read_only, bool is_null)
>>>> +			      u8 tile_mask, bool read_only, bool is_null,
>>>> +			      u32 pat_index)
>>>>    {
>>>>    	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>>>    	struct xe_vma *vma;
>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>>>    	vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>    			    op->va.addr, op->va.addr +
>>>>    			    op->va.range - 1, read_only, is_null,
>>>> -			    tile_mask);
>>>> +			    tile_mask, pat_index);
>>>>    	if (bo)
>>>>    		xe_bo_unlock(bo, &ww);
>>>>    
>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>>>    
>>>>    				vma = new_vma(vm, &op->base.map,
>>>>    					      op->tile_mask, op->map.read_only,
>>>> -					      op->map.is_null);
>>>> +					      op->map.is_null, op->pat_index);
>>>>    				if (IS_ERR(vma)) {
>>>>    					err = PTR_ERR(vma);
>>>>    					goto free_fence;
>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>>>    
>>>>    					vma = new_vma(vm, op->base.remap.prev,
>>>>    						      op->tile_mask, read_only,
>>>> -						      is_null);
>>>> +						      is_null, op->pat_index);
>>>>    					if (IS_ERR(vma)) {
>>>>    						err = PTR_ERR(vma);
>>>>    						goto free_fence;
>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>>>    
>>>>    					vma = new_vma(vm, op->base.remap.next,
>>>>    						      op->tile_mask, read_only,
>>>> -						      is_null);
>>>> +						      is_null, op->pat_index);
>>>>    					if (IS_ERR(vma)) {
>>>>    						err = PTR_ERR(vma);
>>>>    						goto free_fence;
>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>>>>    		u32 obj = (*bind_ops)[i].obj;
>>>>    		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>    		u32 region = (*bind_ops)[i].region;
>>>> +		u32 pat_index = (*bind_ops)[i].pat_index;
>>>>    		bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>    
>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>>>> +			err = -EINVAL;
>>>> +			goto free_bind_ops;
>>>> +		}
>>>> +
>>>>    		if (i == 0) {
>>>>    			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>>    		} else if (XE_IOCTL_DBG(xe, !*async) ||
>>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>>>    		struct drm_gem_object *gem_obj;
>>>>    		u64 range = bind_ops[i].range;
>>>>    		u64 addr = bind_ops[i].addr;
>>>> +		u32 op = bind_ops[i].op;
>>>>    		u32 obj = bind_ops[i].obj;
>>>>    		u64 obj_offset = bind_ops[i].obj_offset;
>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>> +		u16 coh_mode;
>>>> +
>>>> +		pat_index = array_index_nospec(pat_index,
>>>> +					       xe->info.pat_table_n_entries);
>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>> +			err = -EINVAL;
>>>> +			goto put_obj;
>>>> +		}
>>>> +
>>>> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>>>> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
>>>> +			err = -EINVAL;
>>>> +			goto put_obj;
>>>> +		}
>>>>    
>>>>    		if (!obj)
>>>>    			continue;
>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>>>    				goto put_obj;
>>>>    			}
>>>>    		}
>>>> +
>>>> +		if (bos[i]->coh_mode) {
>>>> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>>>> +				err = -EINVAL;
>>>> +				goto put_obj;
>>>> +			}
>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>>>> +			/*
>>>> +			 * Imported dma-buf from a different device should
>>>> +			 * require 1way or 2way coherency since we don't know
>>>> +			 * how it was mapped on CPU. Just assume is it
>>>> +			 * potentially cached on CPU side.
>>>> +			 */
>>>> +			err = -EINVAL;
>>>> +			goto put_obj;
>>>> +		}
>>>>    	}
>>>>    
>>>>    	if (args->num_syncs) {
>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>>>    		u64 obj_offset = bind_ops[i].obj_offset;
>>>>    		u8 tile_mask = bind_ops[i].tile_mask;
>>>>    		u32 region = bind_ops[i].region;
>>>> +		u32 pat_index = bind_ops[i].pat_index;
>>>>    
>>>>    		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>>>    						  addr, range, op, tile_mask,
>>>> -						  region);
>>>> +						  region, pat_index);
>>>>    		if (IS_ERR(ops[i])) {
>>>>    			err = PTR_ERR(ops[i]);
>>>>    			ops[i] = NULL;
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> index 83a1f87b6537..508679ed3c74 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>    	 */
>>>>    	u8 tile_present;
>>>>    
>>>> +	/**
>>>> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
>>>> +	 */
>>>> +	u32 pat_index;
>>>> +
>>>>    	struct {
>>>>    		struct list_head rebind_link;
>>>>    	} notifier;
>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>    	bool batch_invalidate_tlb;
>>>>    
>>>>    	struct {
>>>> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> -						enum xe_cache_level cache);
>>>> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>> -						enum xe_cache_level cache);
>>>> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>>>> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>>>>    	} pat_encode;
>>>>    };
>>>>    
>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>    	struct async_op_fence *fence;
>>>>    	/** @tile_mask: gt mask for this operation */
>>>>    	u8 tile_mask;
>>>> +	/** @pat_index: The pat index to use for this operation. */
>>>> +	u32 pat_index;
>>>>    	/** @flags: operation flags */
>>>>    	enum xe_vma_op_flags flags;
>>>>    
>>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>> index 64bc66d4b550..0c15b6f32447 100644
>>>> --- a/include/uapi/drm/xe_drm.h
>>>> +++ b/include/uapi/drm/xe_drm.h
>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>    	 */
>>>>    	__u32 obj;
>>>>    
>>>> -	/** @pad: MBZ */
>>>> -	__u32 pad;
>>>> +	/**
>>>> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
>>>> +	 * The index basically maps to some predefined memory attributes,
>>>> +	 * including things like caching, coherency and likely other stuff in
>>>> +	 * the future.  The exact meaning of the pat_index is platform specific
>>>> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
>>>> +	 * the index here is encoded into the ppGTT PTE.
>>>> +	 *
>>>> +	 * For coherency the @pat_index needs to match the
>>>> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>>>> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>>>> +	 * the coherency mode from the @pat_index and reject if there is a
>>>> +	 * mismatch (see note below for pre-MTL platforms).
>>>> +	 *
>>>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>>>> +	 * explicit coherency mode, but on such hardware there is always a
>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>>>> +	 * CPU caches even with the caching mode set as uncached.  It's only the
>>>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>>>> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
>>>> +	 * consistent with newer platforms the KMD groups the different cache
>>>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>>>> +	 *
>>>> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>> +	 *
>>>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>>>> +	 * such platforms since it is only the display engine that is actually
>>>> +	 * incoherent. Everything else should typically use WB given that we
>>>> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
>>>> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
>>>> +	 * @pat_index.
>>>> +	 *
>>>> +	 * Note: For userptr and externally imported dma-buf the kernel expects
>>>> +	 * either 1WAY or 2WAY for the @pat_index.
>>>> +	 */
>>>> +	__u32 pat_index;
>>>>    
>>>>    	union {
>>>>    		/**
>>>
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Intel-xe] ✗ CI.Patch_applied: failure for PAT and cache coherency support
  2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
                   ` (4 preceding siblings ...)
  2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
@ 2023-08-29 16:40 ` Patchwork
  2023-09-04 20:25 ` [Intel-xe] [RFC 0/5] " Souza, Jose
  6 siblings, 0 replies; 47+ messages in thread
From: Patchwork @ 2023-08-29 16:40 UTC (permalink / raw)
  To: Matthew Auld; +Cc: intel-xe

== Series Details ==

Series: PAT and cache coherency support
URL   : https://patchwork.freedesktop.org/series/123027/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-xe-next' with base: ===
Base commit: d8c08057a drm/xe: Add patch version on guc firmware init
=== git am output follows ===
error: patch failed: drivers/gpu/drm/xe/xe_pat.c:135
error: drivers/gpu/drm/xe/xe_pat.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_pat.h:26
error: drivers/gpu/drm/xe/xe_pat.h: patch does not apply
hint: Use 'git am --show-current-patch' to see the failed patch
Applying: drm/xe/uapi: Add support for cache and coherency mode
Applying: drm/xe: fix has_llc on rkl
Applying: drm/xe: move pat_table into device info
Patch failed at 0003 drm/xe: move pat_table into device info
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 0/5] PAT and cache coherency support
  2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
                   ` (5 preceding siblings ...)
  2023-08-29 16:40 ` [Intel-xe] ✗ CI.Patch_applied: failure for PAT and cache coherency support Patchwork
@ 2023-09-04 20:25 ` Souza, Jose
  2023-09-05  9:16   ` Matthew Auld
  6 siblings, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-04 20:25 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Auld,  Matthew

On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> Series still needs bunch of testing, but sending as RFC to get early feedback
> while the IGT bits are being worked on. Note that the series also directly
> depends on the WIP series here: https://patchwork.freedesktop.org/series/122708/

Left some comments in the patches but can't start experimenting because https://patchwork.freedesktop.org/series/122708/ breaks driver load.
xe_pte_encode() calls __pte_encode() with a NULL vma.

> 
> Goal here is to allow userspace to directly control the pat_index when mapping
> memory via the ppGTT, in addtion to the CPU caching mode for system memory. This
> is very much needed on newer igpu platforms which allow incoherent GT access,
> where the choice over the cache level and expected coherency is best left to
> userspace depending on their usecase.  In the future there may also be other
> stuff encoded in the pat_index, so giving userspace direct control will also be
> needed there.
> 
> To support this we added new gem_create uAPI for selecting the CPU cache
> mode to use for system memory, including the expected GPU coherency mode. There
> are various restrictions here for the selected coherency mode and compatible CPU
> cache modes.  With that in place the actual pat_index can now be provided as
> part of vm_bind. The only restriction is that the coherency mode of the
> pat_index must match the gem_create coherency mode. There are also some special
> cases like with userptr and dma-buf.
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Intel-xe] [RFC 0/5] PAT and cache coherency support
  2023-09-04 20:25 ` [Intel-xe] [RFC 0/5] " Souza, Jose
@ 2023-09-05  9:16   ` Matthew Auld
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-09-05  9:16 UTC (permalink / raw)
  To: Souza, Jose, intel-xe@lists.freedesktop.org

On 04/09/2023 21:25, Souza, Jose wrote:
> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>> Series still needs bunch of testing, but sending as RFC to get early feedback
>> while the IGT bits are being worked on. Note that the series also directly
>> depends on the WIP series here: https://patchwork.freedesktop.org/series/122708/
> 
> Left some comments in the patches but can't start experimenting because https://patchwork.freedesktop.org/series/122708/ breaks driver load.
> xe_pte_encode() calls __pte_encode() with a NULL vma.

Thanks for the comments.

Yeah, mostly just build tested so far. Once it makes it out of RFC and 
the uAPI is mostly stable we will share a branch.

> 
>>
>> Goal here is to allow userspace to directly control the pat_index when mapping
>> memory via the ppGTT, in addtion to the CPU caching mode for system memory. This
>> is very much needed on newer igpu platforms which allow incoherent GT access,
>> where the choice over the cache level and expected coherency is best left to
>> userspace depending on their usecase.  In the future there may also be other
>> stuff encoded in the pat_index, so giving userspace direct control will also be
>> needed there.
>>
>> To support this we added new gem_create uAPI for selecting the CPU cache
>> mode to use for system memory, including the expected GPU coherency mode. There
>> are various restrictions here for the selected coherency mode and compatible CPU
>> cache modes.  With that in place the actual pat_index can now be provided as
>> part of vm_bind. The only restriction is that the coherency mode of the
>> pat_index must match the gem_create coherency mode. There are also some special
>> cases like with userptr and dma-buf.
>>
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2023-09-13 15:50 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-29 16:28 [Intel-xe] [RFC 0/5] PAT and cache coherency support Matthew Auld
2023-08-29 16:28 ` [Intel-xe] [RFC 1/5] drm/xe/uapi: Add support for cache and coherency mode Matthew Auld
2023-08-29 18:09   ` Matt Roper
2023-08-30 11:13     ` Matthew Auld
2023-09-04 20:00   ` Souza, Jose
2023-09-05  9:04     ` Matthew Auld
2023-09-05 15:30       ` Souza, Jose
2023-08-29 16:28 ` [Intel-xe] [RFC 2/5] drm/xe: fix has_llc on rkl Matthew Auld
2023-08-29 18:46   ` Matt Roper
2023-08-30  1:20   ` Mishra, Pallavi
2023-08-29 16:28 ` [Intel-xe] [RFC 3/5] drm/xe: move pat_table into device info Matthew Auld
2023-08-29 19:14   ` Matt Roper
2023-08-29 21:49   ` Lucas De Marchi
2023-08-29 22:20     ` Matt Roper
2023-08-30  9:34       ` Matthew Auld
2023-08-30  9:43     ` Matthew Auld
2023-08-30  5:14   ` Mishra, Pallavi
2023-09-05 20:50   ` Souza, Jose
2023-08-29 16:28 ` [Intel-xe] [RFC 4/5] drm/xe/pat: annotate pat_index with coherency mode Matthew Auld
2023-08-29 21:08   ` Matt Roper
2023-08-30  9:32     ` Matthew Auld
2023-08-30 19:40       ` Matt Roper
2023-08-29 22:02   ` Lucas De Marchi
2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
2023-08-29 21:36   ` Matt Roper
2023-08-30  6:38     ` Thomas Hellström
2023-08-30 19:28       ` Matt Roper
2023-08-30 11:28     ` Matthew Auld
2023-08-30 15:27       ` Zhang, Carl
2023-08-30 16:02         ` Matthew Auld
2023-08-31  8:24           ` Zhang, Carl
2023-08-31 10:44             ` Matthew Auld
2023-09-01  9:34               ` Zhang, Carl
2023-09-04  9:23                 ` Matthew Auld
2023-09-05  9:12                   ` Zhang, Carl
2023-09-05  9:46                     ` Matthew Auld
2023-09-05 13:50                       ` Zhang, Carl
2023-09-05 14:07                         ` Matthew Auld
2023-09-04 20:21   ` Souza, Jose
2023-09-05  9:08     ` Matthew Auld
2023-09-07 18:56   ` Souza, Jose
2023-09-08  6:51     ` Matthew Auld
2023-09-13 15:35       ` Souza, Jose
2023-09-13 15:50         ` Matthew Auld
2023-08-29 16:40 ` [Intel-xe] ✗ CI.Patch_applied: failure for PAT and cache coherency support Patchwork
2023-09-04 20:25 ` [Intel-xe] [RFC 0/5] " Souza, Jose
2023-09-05  9:16   ` Matthew Auld

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox