* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
@ 2023-08-29 21:36 ` Matt Roper
2023-08-30 6:38 ` Thomas Hellström
2023-08-30 11:28 ` Matthew Auld
2023-09-04 20:21 ` Souza, Jose
2023-09-07 18:56 ` Souza, Jose
2 siblings, 2 replies; 47+ messages in thread
From: Matt Roper @ 2023-08-29 21:36 UTC (permalink / raw)
To: Matthew Auld
Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
Effie Yu, intel-xe
On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> Allow userspace to directly control the pat_index for a given vm
> binding. This should allow directly controlling the coherency, caching
> and potentially other stuff in the future for the ppGTT binding.
>
> The exact meaning behind the pat_index is very platform specific (see
> BSpec or PRMs) but effectively maps to some predefined memory
> attributes. From the KMD pov we only care about the coherency that is
> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> The vm_bind coherency mode for the given pat_index needs to match the
> given coh_mode that was set at object creation. For platforms that lack
Is it actually important to match the coherency mode? I think one of
the main goals was to know up front if userspace might be using a
non-snooping PAT setting that would let it bypass the CPU cache (and
potentially read old, stale data from a different process if the
buffer's clear value is still sitting in cache and hasn't landed in
memory yet).
If that's the only concern, then I think it should still be fine to map
with a non-matching PAT as long as it's more coherent than the one
specified at creation, right? E.g., if the buffer was created with
1-way coherency, it would be fine to map it with 2-way because userspace
still can't use that to observe the previous contents of the buffer. Or
if the buffer was created with "non-coherent" then we've already done
the necessary clflushing in kernel before handing to buffer over to
userspace to ensure the clear value landed in memory, so any valid PAT
index should be safe (from a security POV) after that, right? Any other
problems that arise from mismatched coherency would just be contained to
the app possibly shooting itself in the foot, which isn't really our
concern.
> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>
> For userptr mappings we lack a corresponding gem object, so the expected
> coherency mode is instead implicit and must fall into either 1WAY or
> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY.
>
> As part of adding pat_index support with vm_bind we also need stop using
> xe_cache_level and instead use the pat_index in various places. We still
> make use of xe_cache_level, but only as a convenience for kernel
> internal objectsi (internally it maps to some reasonable pat_index). For
Maybe we should kill xe_cache_level completely and just assign
xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
gt->mocs.wb_index)?
> now this is just a 1:1 conversion of the existing code, however for
> platforms like MTL+ we might need to give more control through bo_create
> or stop using WB on the CPU side if we need CPU access.
>
> Bspec: 45101, 44235 #xe
> Bspec: 70552, 71582, 59400 #xe2
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> 10 files changed, 134 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 4d6296cdbcfd..bb4c182048c3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -302,7 +302,7 @@ struct i915_address_space {
> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>
> u64 (*pte_encode)(dma_addr_t addr,
> - unsigned int pat_index,
> + u32 pat_index,
> u32 flags); /* Create a valid PTE */
> #define PTE_READ_ONLY BIT(0)
> #define PTE_LM BIT(1)
> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> index 5c8d5e78d9bc..7a128fd20a29 100644
> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> /* First part of the test, are we updating our pagetable bo with a new entry? */
> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> 0xdeaddeadbeefbeef);
> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> if (m->q->vm->flags & XE_VM_FLAG_64K)
> expected |= XE_PTE_PS64;
> if (xe_bo_is_vram(pt))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> index 209fa053d9fb..4134c26150a5 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> pte |= XE_GGTT_PTE_DM;
>
> if ((ggtt->pat_encode).pte_encode)
> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>
> return pte;
> }
> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> }
>
> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>
> if (pat_index & BIT(0))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> index 7e55fac1a8a9..0bc40cb072e3 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> @@ -31,7 +31,7 @@ struct xe_ggtt {
>
> struct {
> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache);
> + u32 pat_index);
> } pat_encode;
> };
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index a782ea282cb6..54585e98452a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -24,6 +24,7 @@
> #include "xe_lrc.h"
> #include "xe_map.h"
> #include "xe_mocs.h"
> +#include "xe_pat.h"
> #include "xe_pt.h"
> #include "xe_res_cursor.h"
> #include "xe_sched_job.h"
> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> u32 map_ofs, level, i;
> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> u64 entry;
> int ret;
>
> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> return ret;
> }
>
> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>
> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>
> /* Map the entire BO in our level 0 pt */
> for (i = 0, level = 0; i < num_entries; level++) {
> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>
> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> for (i = 0; i < batch->size;
> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> XE_PAGE_SIZE) {
> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> + entry = xe_pte_encode(batch, i, pat_index, 0);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> entry);
> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> flags = XE_PDE_64K;
>
> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> - XE_PAGE_SIZE, XE_CACHE_WB);
> + XE_PAGE_SIZE, pat_index);
> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> entry | flags);
> }
> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> /* Write PDE's that point to our BO. */
> for (i = 0; i < num_entries - num_level; i++) {
> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> - XE_CACHE_WB);
> + pat_index);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> (i + 1) * 8, u64, entry);
> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>
> XE_WARN_ON(pt_bo->size != SZ_4K);
>
> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> bb->cs[bb->len++] = lower_32_bits(addr);
> bb->cs[bb->len++] = upper_32_bits(addr);
> }
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 64713f400d94..019af2920078 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -10,6 +10,7 @@
> #include "xe_gt.h"
> #include "xe_gt_tlb_invalidation.h"
> #include "xe_migrate.h"
> +#include "xe_pat.h"
> #include "xe_pt_types.h"
> #include "xe_pt_walk.h"
> #include "xe_res_cursor.h"
> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> *
> * Return: An encoded page directory entry. No errors.
> */
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> - const enum xe_cache_level cache)
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> {
> u64 pde;
> struct xe_vm *vm = bo->vm;
> struct xe_device *xe = vm->xe;
>
> -
> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>
> if ((vm->pat_encode).pde_encode)
> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>
> return pde;
> }
>
> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> +static u64 __pte_encode(u64 pte, u32 pat_index,
> struct xe_vma *vma, u32 pt_level)
> {
> struct xe_vm *vm = xe_vma_vm(vma);
> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> pte |= XE_PTE_NULL;
>
> if ((vm->pat_encode).pte_encode)
> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>
> if (pt_level == 1)
> pte |= XE_PDE_PS_2M;
> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> *
> * Return: An encoded page-table entry. No errors.
> */
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> u32 pt_level)
> {
> u64 pte;
> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> pte |= XE_PPGTT_PTE_DM;
>
> - return __pte_encode(pte, cache, NULL, pt_level);
> + return __pte_encode(pte, pat_index, NULL, pt_level);
> }
>
> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>
> if (level == 0) {
> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> - XE_CACHE_WB, 0);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>
> return empty;
> } else {
> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> - XE_CACHE_WB);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> }
> }
>
> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> struct xe_vm *vm;
> /** @tile: The tile we're building for. */
> struct xe_tile *tile;
> - /** @cache: Desired cache level for the ptes */
> - enum xe_cache_level cache;
> /** @default_pte: PTE flag only template. No address is associated */
> u64 default_pte;
> /** @dma_offset: DMA offset to add to the PTE. */
> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>
> pte = __pte_encode(is_null ? 0 :
> xe_res_dma(curs) + xe_walk->dma_offset,
> - xe_walk->cache, xe_walk->vma, level);
> + xe_walk->vma->pat_index, xe_walk->vma, level);
> pte |= xe_walk->default_pte;
>
> /*
> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> xe_child->is_compact = true;
> }
>
> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> pte);
> }
> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> - xe_walk.cache = XE_CACHE_WB;
> - } else {
> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> - xe_walk.cache = XE_CACHE_WT;
> - else
> - xe_walk.cache = XE_CACHE_WB;
> }
> +
> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 01be7ab08f87..1d433a5a96b4 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>
> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> - const enum xe_cache_level level);
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> - u32 pt_level);
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 7eeeed0411f3..34603a7e84b0 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -6,6 +6,7 @@
> #include "xe_vm.h"
>
> #include <linux/dma-fence-array.h>
> +#include <linux/nospec.h>
>
> #include <drm/drm_print.h>
> #include <drm/ttm/ttm_execbuf_util.h>
> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> u64 start, u64 end,
> bool read_only,
> bool is_null,
> - u8 tile_mask)
> + u8 tile_mask,
> + u32 pat_index)
> {
> struct xe_vma *vma;
> struct xe_tile *tile;
> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> vma->tile_mask |= 0x1 << id;
> }
>
> + vma->pat_index = pat_index;
> +
> if (vm->xe->info.platform == XE_PVC)
> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>
> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> static void vm_destroy_work_func(struct work_struct *w);
>
> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pde_pat &= ~(XELP_PDE_PAT_MASK);
>
> if (pat_index & BIT(0))
> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> }
>
> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pte_pat &= ~(XELP_PTE_PAT_MASK);
>
> if (pat_index & BIT(0))
> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> {
> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> - XE_CACHE_WB);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> }
>
> static struct dma_fence *
> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> static struct drm_gpuva_ops *
> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> u64 bo_offset_or_userptr, u64 addr, u64 range,
> - u32 operation, u8 tile_mask, u32 region)
> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> {
> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> struct ww_acquire_ctx ww;
> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>
> op->tile_mask = tile_mask;
> + op->pat_index = pat_index;
> op->map.immediate =
> operation & XE_VM_BIND_FLAG_IMMEDIATE;
> op->map.read_only =
> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>
> op->tile_mask = tile_mask;
> + op->pat_index = pat_index;
> op->prefetch.region = region;
> }
> break;
> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> }
>
> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> - u8 tile_mask, bool read_only, bool is_null)
> + u8 tile_mask, bool read_only, bool is_null,
> + u32 pat_index)
> {
> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> struct xe_vma *vma;
> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> vma = xe_vma_create(vm, bo, op->gem.offset,
> op->va.addr, op->va.addr +
> op->va.range - 1, read_only, is_null,
> - tile_mask);
> + tile_mask, pat_index);
> if (bo)
> xe_bo_unlock(bo, &ww);
>
> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, &op->base.map,
> op->tile_mask, op->map.read_only,
> - op->map.is_null);
> + op->map.is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, op->base.remap.prev,
> op->tile_mask, read_only,
> - is_null);
> + is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, op->base.remap.next,
> op->tile_mask, read_only,
> - is_null);
> + is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> u32 obj = (*bind_ops)[i].obj;
> u64 obj_offset = (*bind_ops)[i].obj_offset;
> u32 region = (*bind_ops)[i].region;
> + u32 pat_index = (*bind_ops)[i].pat_index;
> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>
> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> + err = -EINVAL;
> + goto free_bind_ops;
> + }
> +
> if (i == 0) {
> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> } else if (XE_IOCTL_DBG(xe, !*async) ||
> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> struct drm_gem_object *gem_obj;
> u64 range = bind_ops[i].range;
> u64 addr = bind_ops[i].addr;
> + u32 op = bind_ops[i].op;
> u32 obj = bind_ops[i].obj;
> u64 obj_offset = bind_ops[i].obj_offset;
> + u32 pat_index = bind_ops[i].pat_index;
> + u16 coh_mode;
> +
> + pat_index = array_index_nospec(pat_index,
> + xe->info.pat_table_n_entries);
> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
Assuming we drop the unusable entries from the TGL table, this should be
impossible, right? Any index that makes it past the n_entries check at
the top of the function should have a valid, non-zero coh_mode value.
So this should probably be an assertion (to highlight a KMD bug) rather
than just a silent uapi failure return.
> + err = -EINVAL;
> + goto put_obj;
> + }
> +
> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> + coh_mode == XE_GEM_COHERENCY_NONE)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
>
> if (!obj)
> continue;
> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> goto put_obj;
> }
> }
> +
> + if (bos[i]->coh_mode) {
> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> + /*
> + * Imported dma-buf from a different device should
> + * require 1way or 2way coherency since we don't know
> + * how it was mapped on CPU. Just assume is it
> + * potentially cached on CPU side.
> + */
> + err = -EINVAL;
> + goto put_obj;
> + }
> }
>
> if (args->num_syncs) {
> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> u64 obj_offset = bind_ops[i].obj_offset;
> u8 tile_mask = bind_ops[i].tile_mask;
> u32 region = bind_ops[i].region;
> + u32 pat_index = bind_ops[i].pat_index;
>
> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> addr, range, op, tile_mask,
> - region);
> + region, pat_index);
> if (IS_ERR(ops[i])) {
> err = PTR_ERR(ops[i]);
> ops[i] = NULL;
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 83a1f87b6537..508679ed3c74 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -111,6 +111,11 @@ struct xe_vma {
> */
> u8 tile_present;
>
> + /**
> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
> + */
> + u32 pat_index;
> +
> struct {
> struct list_head rebind_link;
> } notifier;
> @@ -338,10 +343,8 @@ struct xe_vm {
> bool batch_invalidate_tlb;
>
> struct {
> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache);
> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> - enum xe_cache_level cache);
> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> } pat_encode;
> };
>
> @@ -417,6 +420,8 @@ struct xe_vma_op {
> struct async_op_fence *fence;
> /** @tile_mask: gt mask for this operation */
> u8 tile_mask;
> + /** @pat_index: The pat index to use for this operation. */
> + u32 pat_index;
> /** @flags: operation flags */
> enum xe_vma_op_flags flags;
>
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 64bc66d4b550..0c15b6f32447 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> */
> __u32 obj;
>
> - /** @pad: MBZ */
> - __u32 pad;
> + /**
> + * @pat_index: The platform defined @pat_index to use for this mapping.
> + * The index basically maps to some predefined memory attributes,
> + * including things like caching, coherency and likely other stuff in
> + * the future. The exact meaning of the pat_index is platform specific
BTW, "other stuff in the future" already includes compression on Xe2, we
just haven't landed the patches for the Xe2 table yet.
> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
> + * the index here is encoded into the ppGTT PTE.
> + *
> + * For coherency the @pat_index needs to match the
> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> + * the coherency mode from the @pat_index and reject if there is a
> + * mismatch (see note below for pre-MTL platforms).
> + *
> + * Note: On pre-MTL platforms there is only a caching mode and no
> + * explicit coherency mode, but on such hardware there is always a
> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> + * CPU caches even with the caching mode set as uncached. It's only the
> + * display engine that is incoherent (on dgpu it must be in VRAM which
> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
> + * consistent with newer platforms the KMD groups the different cache
> + * levels into the following coherency buckets on all pre-MTL platforms:
> + *
> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
As noted on the previous patch, it seems like 2-way is appropriate for
LLC platforms, but 1-way might be a more accurate description of dGPU
behavior.
> + *
> + * In practice UC/WC/WT should only ever used for scanout surfaces on
> + * such platforms since it is only the display engine that is actually
> + * incoherent. Everything else should typically use WB given that we
What if we're sharing our buffers with some other (non-GPU) device? Are
there cases where that other device wouldn't be coherent with the LLC,
so we'd want to use one of these?
Matt
> + * have a shared-LLC. On MTL+ this completely changes (also potentially
> + * no shared-LLC) and the HW defines the coherency mode as part of the
> + * @pat_index.
> + *
> + * Note: For userptr and externally imported dma-buf the kernel expects
> + * either 1WAY or 2WAY for the @pat_index.
> + */
> + __u32 pat_index;
>
> union {
> /**
> --
> 2.41.0
>
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-29 21:36 ` Matt Roper
@ 2023-08-30 6:38 ` Thomas Hellström
2023-08-30 19:28 ` Matt Roper
2023-08-30 11:28 ` Matthew Auld
1 sibling, 1 reply; 47+ messages in thread
From: Thomas Hellström @ 2023-08-30 6:38 UTC (permalink / raw)
To: Matt Roper, Matthew Auld
Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
Effie Yu, intel-xe
On 8/29/23 23:36, Matt Roper wrote:
> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
> Is it actually important to match the coherency mode? I think one of
> the main goals was to know up front if userspace might be using a
> non-snooping PAT setting that would let it bypass the CPU cache (and
> potentially read old, stale data from a different process if the
> buffer's clear value is still sitting in cache and hasn't landed in
> memory yet).
>
> If that's the only concern, then I think it should still be fine to map
> with a non-matching PAT as long as it's more coherent than the one
> specified at creation, right? E.g., if the buffer was created with
> 1-way coherency, it would be fine to map it with 2-way because userspace
> still can't use that to observe the previous contents of the buffer. Or
> if the buffer was created with "non-coherent" then we've already done
> the necessary clflushing in kernel before handing to buffer over to
> userspace to ensure the clear value landed in memory, so any valid PAT
> index should be safe (from a security POV) after that, right? Any other
> problems that arise from mismatched coherency would just be contained to
> the app possibly shooting itself in the foot, which isn't really our
> concern.
We also have shrinking to keep in mind: If content was generated by the
GPU using 2-way coherency, and then a copy from the backing page is done
with the page mapped WC, wouldn't that blow up? I think requiring
consistency here is reasonable also keeping unforeseen side-effects of
future HW in mind. Or do you see it as blocking an important use-case?
Thanks,
Thomas
>
>
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
> Maybe we should kill xe_cache_level completely and just assign
> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> gt->mocs.wb_index)?
>
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>
>> u64 (*pte_encode)(dma_addr_t addr,
>> - unsigned int pat_index,
>> + u32 pat_index,
>> u32 flags); /* Create a valid PTE */
>> #define PTE_READ_ONLY BIT(0)
>> #define PTE_LM BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>> /* First part of the test, are we updating our pagetable bo with a new entry? */
>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>> 0xdeaddeadbeefbeef);
>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>> expected |= XE_PTE_PS64;
>> if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>> pte |= XE_GGTT_PTE_DM;
>>
>> if ((ggtt->pat_encode).pte_encode)
>> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>
>> return pte;
>> }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>> }
>>
>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>
>> struct {
>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> + u32 pat_index);
>> } pat_encode;
>> };
>>
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>> #include "xe_lrc.h"
>> #include "xe_map.h"
>> #include "xe_mocs.h"
>> +#include "xe_pat.h"
>> #include "xe_pt.h"
>> #include "xe_res_cursor.h"
>> #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>> u32 map_ofs, level, i;
>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>> u64 entry;
>> int ret;
>>
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> return ret;
>> }
>>
>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>
>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>
>> /* Map the entire BO in our level 0 pt */
>> for (i = 0, level = 0; i < num_entries; level++) {
>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> for (i = 0; i < batch->size;
>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>> XE_PAGE_SIZE) {
>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>> entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> flags = XE_PDE_64K;
>>
>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> - XE_PAGE_SIZE, XE_CACHE_WB);
>> + XE_PAGE_SIZE, pat_index);
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>> entry | flags);
>> }
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> /* Write PDE's that point to our BO. */
>> for (i = 0; i < num_entries - num_level; i++) {
>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> - XE_CACHE_WB);
>> + pat_index);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>> (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>
>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>
>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> bb->cs[bb->len++] = lower_32_bits(addr);
>> bb->cs[bb->len++] = upper_32_bits(addr);
>> }
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>> #include "xe_gt.h"
>> #include "xe_gt_tlb_invalidation.h"
>> #include "xe_migrate.h"
>> +#include "xe_pat.h"
>> #include "xe_pt_types.h"
>> #include "xe_pt_walk.h"
>> #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>> *
>> * Return: An encoded page directory entry. No errors.
>> */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>> {
>> u64 pde;
>> struct xe_vm *vm = bo->vm;
>> struct xe_device *xe = vm->xe;
>>
>> -
>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>
>> if ((vm->pat_encode).pde_encode)
>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>
>> return pde;
>> }
>>
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>> struct xe_vma *vma, u32 pt_level)
>> {
>> struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> pte |= XE_PTE_NULL;
>>
>> if ((vm->pat_encode).pte_encode)
>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>
>> if (pt_level == 1)
>> pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> *
>> * Return: An encoded page-table entry. No errors.
>> */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>> u32 pt_level)
>> {
>> u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>> pte |= XE_PPGTT_PTE_DM;
>>
>> - return __pte_encode(pte, cache, NULL, pt_level);
>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>> }
>>
>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>
>> if (level == 0) {
>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> - XE_CACHE_WB, 0);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>
>> return empty;
>> } else {
>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>> }
>>
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>> struct xe_vm *vm;
>> /** @tile: The tile we're building for. */
>> struct xe_tile *tile;
>> - /** @cache: Desired cache level for the ptes */
>> - enum xe_cache_level cache;
>> /** @default_pte: PTE flag only template. No address is associated */
>> u64 default_pte;
>> /** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>
>> pte = __pte_encode(is_null ? 0 :
>> xe_res_dma(curs) + xe_walk->dma_offset,
>> - xe_walk->cache, xe_walk->vma, level);
>> + xe_walk->vma->pat_index, xe_walk->vma, level);
>> pte |= xe_walk->default_pte;
>>
>> /*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>> xe_child->is_compact = true;
>> }
>>
>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>> pte);
>> }
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> - xe_walk.cache = XE_CACHE_WB;
>> - } else {
>> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> - xe_walk.cache = XE_CACHE_WT;
>> - else
>> - xe_walk.cache = XE_CACHE_WB;
>> }
>> +
>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>
>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> - u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>> #include "xe_vm.h"
>>
>> #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>
>> #include <drm/drm_print.h>
>> #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> u64 start, u64 end,
>> bool read_only,
>> bool is_null,
>> - u8 tile_mask)
>> + u8 tile_mask,
>> + u32 pat_index)
>> {
>> struct xe_vma *vma;
>> struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> vma->tile_mask |= 0x1 << id;
>> }
>>
>> + vma->pat_index = pat_index;
>> +
>> if (vm->xe->info.platform == XE_PVC)
>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>> static void vm_destroy_work_func(struct work_struct *w);
>>
>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> }
>>
>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>> {
>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>>
>> static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>> static struct drm_gpuva_ops *
>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> u64 bo_offset_or_userptr, u64 addr, u64 range,
>> - u32 operation, u8 tile_mask, u32 region)
>> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>> {
>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>> struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->map.immediate =
>> operation & XE_VM_BIND_FLAG_IMMEDIATE;
>> op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->prefetch.region = region;
>> }
>> break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> }
>>
>> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> - u8 tile_mask, bool read_only, bool is_null)
>> + u8 tile_mask, bool read_only, bool is_null,
>> + u32 pat_index)
>> {
>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>> struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> vma = xe_vma_create(vm, bo, op->gem.offset,
>> op->va.addr, op->va.addr +
>> op->va.range - 1, read_only, is_null,
>> - tile_mask);
>> + tile_mask, pat_index);
>> if (bo)
>> xe_bo_unlock(bo, &ww);
>>
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, &op->base.map,
>> op->tile_mask, op->map.read_only,
>> - op->map.is_null);
>> + op->map.is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.prev,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.next,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>> u32 obj = (*bind_ops)[i].obj;
>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>> u32 region = (*bind_ops)[i].region;
>> + u32 pat_index = (*bind_ops)[i].pat_index;
>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>
>> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> + err = -EINVAL;
>> + goto free_bind_ops;
>> + }
>> +
>> if (i == 0) {
>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>> } else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> struct drm_gem_object *gem_obj;
>> u64 range = bind_ops[i].range;
>> u64 addr = bind_ops[i].addr;
>> + u32 op = bind_ops[i].op;
>> u32 obj = bind_ops[i].obj;
>> u64 obj_offset = bind_ops[i].obj_offset;
>> + u32 pat_index = bind_ops[i].pat_index;
>> + u16 coh_mode;
>> +
>> + pat_index = array_index_nospec(pat_index,
>> + xe->info.pat_table_n_entries);
>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> Assuming we drop the unusable entries from the TGL table, this should be
> impossible, right? Any index that makes it past the n_entries check at
> the top of the function should have a valid, non-zero coh_mode value.
> So this should probably be an assertion (to highlight a KMD bug) rather
> than just a silent uapi failure return.
>
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> +
>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> + coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>>
>> if (!obj)
>> continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> goto put_obj;
>> }
>> }
>> +
>> + if (bos[i]->coh_mode) {
>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + /*
>> + * Imported dma-buf from a different device should
>> + * require 1way or 2way coherency since we don't know
>> + * how it was mapped on CPU. Just assume is it
>> + * potentially cached on CPU side.
>> + */
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> }
>>
>> if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> u64 obj_offset = bind_ops[i].obj_offset;
>> u8 tile_mask = bind_ops[i].tile_mask;
>> u32 region = bind_ops[i].region;
>> + u32 pat_index = bind_ops[i].pat_index;
>>
>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>> addr, range, op, tile_mask,
>> - region);
>> + region, pat_index);
>> if (IS_ERR(ops[i])) {
>> err = PTR_ERR(ops[i]);
>> ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>> */
>> u8 tile_present;
>>
>> + /**
>> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> + */
>> + u32 pat_index;
>> +
>> struct {
>> struct list_head rebind_link;
>> } notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>> bool batch_invalidate_tlb;
>>
>> struct {
>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache);
>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>> } pat_encode;
>> };
>>
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>> struct async_op_fence *fence;
>> /** @tile_mask: gt mask for this operation */
>> u8 tile_mask;
>> + /** @pat_index: The pat index to use for this operation. */
>> + u32 pat_index;
>> /** @flags: operation flags */
>> enum xe_vma_op_flags flags;
>>
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>> */
>> __u32 obj;
>>
>> - /** @pad: MBZ */
>> - __u32 pad;
>> + /**
>> + * @pat_index: The platform defined @pat_index to use for this mapping.
>> + * The index basically maps to some predefined memory attributes,
>> + * including things like caching, coherency and likely other stuff in
>> + * the future. The exact meaning of the pat_index is platform specific
> BTW, "other stuff in the future" already includes compression on Xe2, we
> just haven't landed the patches for the Xe2 table yet.
>
>> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
>> + * the index here is encoded into the ppGTT PTE.
>> + *
>> + * For coherency the @pat_index needs to match the
>> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> + * the coherency mode from the @pat_index and reject if there is a
>> + * mismatch (see note below for pre-MTL platforms).
>> + *
>> + * Note: On pre-MTL platforms there is only a caching mode and no
>> + * explicit coherency mode, but on such hardware there is always a
>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> + * CPU caches even with the caching mode set as uncached. It's only the
>> + * display engine that is incoherent (on dgpu it must be in VRAM which
>> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> + * consistent with newer platforms the KMD groups the different cache
>> + * levels into the following coherency buckets on all pre-MTL platforms:
>> + *
>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> As noted on the previous patch, it seems like 2-way is appropriate for
> LLC platforms, but 1-way might be a more accurate description of dGPU
> behavior.
>
>> + *
>> + * In practice UC/WC/WT should only ever used for scanout surfaces on
>> + * such platforms since it is only the display engine that is actually
>> + * incoherent. Everything else should typically use WB given that we
> What if we're sharing our buffers with some other (non-GPU) device? Are
> there cases where that other device wouldn't be coherent with the LLC,
> so we'd want to use one of these?
>
>
> Matt
>
>> + * have a shared-LLC. On MTL+ this completely changes (also potentially
>> + * no shared-LLC) and the HW defines the coherency mode as part of the
>> + * @pat_index.
>> + *
>> + * Note: For userptr and externally imported dma-buf the kernel expects
>> + * either 1WAY or 2WAY for the @pat_index.
>> + */
>> + __u32 pat_index;
>>
>> union {
>> /**
>> --
>> 2.41.0
>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-30 6:38 ` Thomas Hellström
@ 2023-08-30 19:28 ` Matt Roper
0 siblings, 0 replies; 47+ messages in thread
From: Matt Roper @ 2023-08-30 19:28 UTC (permalink / raw)
To: Thomas Hellström
Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
Effie Yu, Matthew Auld, intel-xe
On Wed, Aug 30, 2023 at 08:38:13AM +0200, Thomas Hellström wrote:
>
> On 8/29/23 23:36, Matt Roper wrote:
> > On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> > > Allow userspace to directly control the pat_index for a given vm
> > > binding. This should allow directly controlling the coherency, caching
> > > and potentially other stuff in the future for the ppGTT binding.
> > >
> > > The exact meaning behind the pat_index is very platform specific (see
> > > BSpec or PRMs) but effectively maps to some predefined memory
> > > attributes. From the KMD pov we only care about the coherency that is
> > > provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> > > The vm_bind coherency mode for the given pat_index needs to match the
> > > given coh_mode that was set at object creation. For platforms that lack
> > Is it actually important to match the coherency mode? I think one of
> > the main goals was to know up front if userspace might be using a
> > non-snooping PAT setting that would let it bypass the CPU cache (and
> > potentially read old, stale data from a different process if the
> > buffer's clear value is still sitting in cache and hasn't landed in
> > memory yet).
> >
> > If that's the only concern, then I think it should still be fine to map
> > with a non-matching PAT as long as it's more coherent than the one
> > specified at creation, right? E.g., if the buffer was created with
> > 1-way coherency, it would be fine to map it with 2-way because userspace
> > still can't use that to observe the previous contents of the buffer. Or
> > if the buffer was created with "non-coherent" then we've already done
> > the necessary clflushing in kernel before handing to buffer over to
> > userspace to ensure the clear value landed in memory, so any valid PAT
> > index should be safe (from a security POV) after that, right? Any other
> > problems that arise from mismatched coherency would just be contained to
> > the app possibly shooting itself in the foot, which isn't really our
> > concern.
>
> We also have shrinking to keep in mind: If content was generated by the GPU
> using 2-way coherency, and then a copy from the backing page is done with
> the page mapped WC, wouldn't that blow up? I think requiring consistency
My understanding (which could easily be wrong since this isn't an area
I've done much work in) is that on dGPU and non-LLC igpu (MTL):
- A GPU update done with a 1-way coherent PAT setting will invalidate
the CPU cache. Any type of CPU mapping should read the correct
content at that point, as long as the GPU's caches have been flushed.
- A GPU update done with a non-coherent PAT setting won't invalidate
the CPU cache, so CPU reads will need to be CPU:UC or CPU:WC to read
the latest content from memory (assuming the GPU caches were
flushed and/or the update bypassed the GPU caches).
- 2-way coherency isn't actually supported (even though there's a PAT
index labelled as 2-way on the MTL table, it's misleading and still
behaves as 1-way for everything except atomic operations). Maybe
this changes on LNL; I don't remember off the top of my head right
now.
I believe we always take care of the necessary GPU flushing
(PIPE_CONTROL or MI_FLUSH_DW) as part of our post-bb handling, so it
seems to me like using a CPU:WC mapping would always be safe on these
platforms, regardless of which PAT setting is selected at vm_bind.
I'm less sure about the exact semantics of the older LLC platforms; it's
a bit hard to find the details for that in the bspec. But I'm not sure
the PAT really matters much at all on those platforms since the
important cache behavior was mostly defined in the MOCS on those
platforms, and the MOCS would usually override the PAT in cases where
there was overlap. No matter what vm_bind PAT is chosen, the CPU
handling needs to be conservative enough to work properly no matter what
the MOCS might have done.
> here is reasonable also keeping unforeseen side-effects of future HW in
> mind. Or do you see it as blocking an important use-case?
I don't know of any specific use cases; the userspace guys would
probably have a much better feel for that. I think the current proposal
here at least avoids the biggest problems we had in the past with i915
(where 3D/media content generation into a buffer needed very different
cache behavior [in the PPGTT] than the display scanout wanted [in the
GGTT] since the two now remain unrelated).
I suppose even if we don't need to be so strict about restricting the
vm_bind-time PAT selection, it doesn't hurt anything as long as the
userspace guys can live with it. We can always loosen the restrictions
down the road if/when we find it necessary.
Matt
>
> Thanks,
>
> Thomas
>
>
> >
> >
> > > the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> > >
> > > For userptr mappings we lack a corresponding gem object, so the expected
> > > coherency mode is instead implicit and must fall into either 1WAY or
> > > 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> > > dma-buf (from a different device) the coherency mode is also implicit
> > > and must also be either 1WAY or 2WAY.
> > >
> > > As part of adding pat_index support with vm_bind we also need stop using
> > > xe_cache_level and instead use the pat_index in various places. We still
> > > make use of xe_cache_level, but only as a convenience for kernel
> > > internal objectsi (internally it maps to some reasonable pat_index). For
> > Maybe we should kill xe_cache_level completely and just assign
> > xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> > appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> > gt->mocs.wb_index)?
> >
> > > now this is just a 1:1 conversion of the existing code, however for
> > > platforms like MTL+ we might need to give more control through bo_create
> > > or stop using WB on the CPU side if we need CPU access.
> > >
> > > Bspec: 45101, 44235 #xe
> > > Bspec: 70552, 71582, 59400 #xe2
> > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: José Roberto de Souza <jose.souza@intel.com>
> > > Cc: Filip Hazubski <filip.hazubski@intel.com>
> > > Cc: Carl Zhang <carl.zhang@intel.com>
> > > Cc: Effie Yu <effie.yu@intel.com>
> > > ---
> > > drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> > > drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> > > drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> > > drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> > > drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> > > drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> > > drivers/gpu/drm/xe/xe_pt.h | 6 +--
> > > drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
> > > drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> > > include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> > > 10 files changed, 134 insertions(+), 58 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > index 4d6296cdbcfd..bb4c182048c3 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > @@ -302,7 +302,7 @@ struct i915_address_space {
> > > (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> > > u64 (*pte_encode)(dma_addr_t addr,
> > > - unsigned int pat_index,
> > > + u32 pat_index,
> > > u32 flags); /* Create a valid PTE */
> > > #define PTE_READ_ONLY BIT(0)
> > > #define PTE_LM BIT(1)
> > > diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > index 5c8d5e78d9bc..7a128fd20a29 100644
> > > --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> > > /* First part of the test, are we updating our pagetable bo with a new entry? */
> > > xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> > > 0xdeaddeadbeefbeef);
> > > - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> > > + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > > if (m->q->vm->flags & XE_VM_FLAG_64K)
> > > expected |= XE_PTE_PS64;
> > > if (xe_bo_is_vram(pt))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> > > index 209fa053d9fb..4134c26150a5 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt.c
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> > > @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> > > pte |= XE_GGTT_PTE_DM;
> > > if ((ggtt->pat_encode).pte_encode)
> > > - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> > > + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> > > + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
> > > return pte;
> > > }
> > > @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> > > }
> > > static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache)
> > > + u32 pat_index)
> > > {
> > > - u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > > pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> > > if (pat_index & BIT(0))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > index 7e55fac1a8a9..0bc40cb072e3 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > @@ -31,7 +31,7 @@ struct xe_ggtt {
> > > struct {
> > > u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache);
> > > + u32 pat_index);
> > > } pat_encode;
> > > };
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > index a782ea282cb6..54585e98452a 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > @@ -24,6 +24,7 @@
> > > #include "xe_lrc.h"
> > > #include "xe_map.h"
> > > #include "xe_mocs.h"
> > > +#include "xe_pat.h"
> > > #include "xe_pt.h"
> > > #include "xe_res_cursor.h"
> > > #include "xe_sched_job.h"
> > > @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> > > u32 map_ofs, level, i;
> > > struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> > > + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> > > u64 entry;
> > > int ret;
> > > @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > return ret;
> > > }
> > > - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> > > + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> > > xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> > > map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> > > /* Map the entire BO in our level 0 pt */
> > > for (i = 0, level = 0; i < num_entries; level++) {
> > > - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> > > + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> > > xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> > > @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > for (i = 0; i < batch->size;
> > > i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> > > XE_PAGE_SIZE) {
> > > - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> > > + entry = xe_pte_encode(batch, i, pat_index, 0);
> > > xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> > > entry);
> > > @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > flags = XE_PDE_64K;
> > > entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> > > - XE_PAGE_SIZE, XE_CACHE_WB);
> > > + XE_PAGE_SIZE, pat_index);
> > > xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> > > entry | flags);
> > > }
> > > @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > /* Write PDE's that point to our BO. */
> > > for (i = 0; i < num_entries - num_level; i++) {
> > > entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> > > - XE_CACHE_WB);
> > > + pat_index);
> > > xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> > > (i + 1) * 8, u64, entry);
> > > @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
> > > XE_WARN_ON(pt_bo->size != SZ_4K);
> > > - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> > > + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > > bb->cs[bb->len++] = lower_32_bits(addr);
> > > bb->cs[bb->len++] = upper_32_bits(addr);
> > > }
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > index 64713f400d94..019af2920078 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -10,6 +10,7 @@
> > > #include "xe_gt.h"
> > > #include "xe_gt_tlb_invalidation.h"
> > > #include "xe_migrate.h"
> > > +#include "xe_pat.h"
> > > #include "xe_pt_types.h"
> > > #include "xe_pt_walk.h"
> > > #include "xe_res_cursor.h"
> > > @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> > > *
> > > * Return: An encoded page directory entry. No errors.
> > > */
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > - const enum xe_cache_level cache)
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> > > {
> > > u64 pde;
> > > struct xe_vm *vm = bo->vm;
> > > struct xe_device *xe = vm->xe;
> > > -
> > > pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> > > pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> > > if ((vm->pat_encode).pde_encode)
> > > - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> > > + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> > > return pde;
> > > }
> > > -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > +static u64 __pte_encode(u64 pte, u32 pat_index,
> > > struct xe_vma *vma, u32 pt_level)
> > > {
> > > struct xe_vm *vm = xe_vma_vm(vma);
> > > @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > pte |= XE_PTE_NULL;
> > > if ((vm->pat_encode).pte_encode)
> > > - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> > > + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> > > if (pt_level == 1)
> > > pte |= XE_PDE_PS_2M;
> > > @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > *
> > > * Return: An encoded page-table entry. No errors.
> > > */
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> > > u32 pt_level)
> > > {
> > > u64 pte;
> > > @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> > > pte |= XE_PPGTT_PTE_DM;
> > > - return __pte_encode(pte, cache, NULL, pt_level);
> > > + return __pte_encode(pte, pat_index, NULL, pt_level);
> > > }
> > > static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > > @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > > if (level == 0) {
> > > u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> > > - XE_CACHE_WB, 0);
> > > + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
> > > return empty;
> > > } else {
> > > return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> > > - XE_CACHE_WB);
> > > + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > > }
> > > }
> > > @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> > > struct xe_vm *vm;
> > > /** @tile: The tile we're building for. */
> > > struct xe_tile *tile;
> > > - /** @cache: Desired cache level for the ptes */
> > > - enum xe_cache_level cache;
> > > /** @default_pte: PTE flag only template. No address is associated */
> > > u64 default_pte;
> > > /** @dma_offset: DMA offset to add to the PTE. */
> > > @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > > pte = __pte_encode(is_null ? 0 :
> > > xe_res_dma(curs) + xe_walk->dma_offset,
> > > - xe_walk->cache, xe_walk->vma, level);
> > > + xe_walk->vma->pat_index, xe_walk->vma, level);
> > > pte |= xe_walk->default_pte;
> > > /*
> > > @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > > xe_child->is_compact = true;
> > > }
> > > - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> > > + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> > > ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> > > pte);
> > > }
> > > @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > > if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> > > xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> > > xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> > > - xe_walk.cache = XE_CACHE_WB;
> > > - } else {
> > > - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> > > - xe_walk.cache = XE_CACHE_WT;
> > > - else
> > > - xe_walk.cache = XE_CACHE_WB;
> > > }
> > > +
> > > if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> > > xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> > > index 01be7ab08f87..1d433a5a96b4 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
> > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > - const enum xe_cache_level level);
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > - u32 pt_level);
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > index 7eeeed0411f3..34603a7e84b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -6,6 +6,7 @@
> > > #include "xe_vm.h"
> > > #include <linux/dma-fence-array.h>
> > > +#include <linux/nospec.h>
> > > #include <drm/drm_print.h>
> > > #include <drm/ttm/ttm_execbuf_util.h>
> > > @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > > u64 start, u64 end,
> > > bool read_only,
> > > bool is_null,
> > > - u8 tile_mask)
> > > + u8 tile_mask,
> > > + u32 pat_index)
> > > {
> > > struct xe_vma *vma;
> > > struct xe_tile *tile;
> > > @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > > vma->tile_mask |= 0x1 << id;
> > > }
> > > + vma->pat_index = pat_index;
> > > +
> > > if (vm->xe->info.platform == XE_PVC)
> > > vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> > > @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> > > static void vm_destroy_work_func(struct work_struct *w);
> > > static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > > - enum xe_cache_level cache)
> > > + u32 pat_index)
> > > {
> > > - u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > > pde_pat &= ~(XELP_PDE_PAT_MASK);
> > > if (pat_index & BIT(0))
> > > @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > > }
> > > static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache)
> > > + u32 pat_index)
> > > {
> > > - u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > > pte_pat &= ~(XELP_PTE_PAT_MASK);
> > > if (pat_index & BIT(0))
> > > @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> > > u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> > > {
> > > return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> > > - XE_CACHE_WB);
> > > + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > > }
> > > static struct dma_fence *
> > > @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> > > static struct drm_gpuva_ops *
> > > vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > u64 bo_offset_or_userptr, u64 addr, u64 range,
> > > - u32 operation, u8 tile_mask, u32 region)
> > > + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> > > {
> > > struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> > > struct ww_acquire_ctx ww;
> > > @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > > op->tile_mask = tile_mask;
> > > + op->pat_index = pat_index;
> > > op->map.immediate =
> > > operation & XE_VM_BIND_FLAG_IMMEDIATE;
> > > op->map.read_only =
> > > @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > > op->tile_mask = tile_mask;
> > > + op->pat_index = pat_index;
> > > op->prefetch.region = region;
> > > }
> > > break;
> > > @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > }
> > > static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > > - u8 tile_mask, bool read_only, bool is_null)
> > > + u8 tile_mask, bool read_only, bool is_null,
> > > + u32 pat_index)
> > > {
> > > struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> > > struct xe_vma *vma;
> > > @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > > vma = xe_vma_create(vm, bo, op->gem.offset,
> > > op->va.addr, op->va.addr +
> > > op->va.range - 1, read_only, is_null,
> > > - tile_mask);
> > > + tile_mask, pat_index);
> > > if (bo)
> > > xe_bo_unlock(bo, &ww);
> > > @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > > vma = new_vma(vm, &op->base.map,
> > > op->tile_mask, op->map.read_only,
> > > - op->map.is_null);
> > > + op->map.is_null, op->pat_index);
> > > if (IS_ERR(vma)) {
> > > err = PTR_ERR(vma);
> > > goto free_fence;
> > > @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > > vma = new_vma(vm, op->base.remap.prev,
> > > op->tile_mask, read_only,
> > > - is_null);
> > > + is_null, op->pat_index);
> > > if (IS_ERR(vma)) {
> > > err = PTR_ERR(vma);
> > > goto free_fence;
> > > @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > > vma = new_vma(vm, op->base.remap.next,
> > > op->tile_mask, read_only,
> > > - is_null);
> > > + is_null, op->pat_index);
> > > if (IS_ERR(vma)) {
> > > err = PTR_ERR(vma);
> > > goto free_fence;
> > > @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> > > u32 obj = (*bind_ops)[i].obj;
> > > u64 obj_offset = (*bind_ops)[i].obj_offset;
> > > u32 region = (*bind_ops)[i].region;
> > > + u32 pat_index = (*bind_ops)[i].pat_index;
> > > bool is_null = op & XE_VM_BIND_FLAG_NULL;
> > > + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> > > + err = -EINVAL;
> > > + goto free_bind_ops;
> > > + }
> > > +
> > > if (i == 0) {
> > > *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> > > } else if (XE_IOCTL_DBG(xe, !*async) ||
> > > @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > > struct drm_gem_object *gem_obj;
> > > u64 range = bind_ops[i].range;
> > > u64 addr = bind_ops[i].addr;
> > > + u32 op = bind_ops[i].op;
> > > u32 obj = bind_ops[i].obj;
> > > u64 obj_offset = bind_ops[i].obj_offset;
> > > + u32 pat_index = bind_ops[i].pat_index;
> > > + u16 coh_mode;
> > > +
> > > + pat_index = array_index_nospec(pat_index,
> > > + xe->info.pat_table_n_entries);
> > > + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> > > + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> > Assuming we drop the unusable entries from the TGL table, this should be
> > impossible, right? Any index that makes it past the n_entries check at
> > the top of the function should have a valid, non-zero coh_mode value.
> > So this should probably be an assertion (to highlight a KMD bug) rather
> > than just a silent uapi failure return.
> >
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > > +
> > > + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> > > + coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > > if (!obj)
> > > continue;
> > > @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > > goto put_obj;
> > > }
> > > }
> > > +
> > > + if (bos[i]->coh_mode) {
> > > + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > > + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > + /*
> > > + * Imported dma-buf from a different device should
> > > + * require 1way or 2way coherency since we don't know
> > > + * how it was mapped on CPU. Just assume is it
> > > + * potentially cached on CPU side.
> > > + */
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > > }
> > > if (args->num_syncs) {
> > > @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > > u64 obj_offset = bind_ops[i].obj_offset;
> > > u8 tile_mask = bind_ops[i].tile_mask;
> > > u32 region = bind_ops[i].region;
> > > + u32 pat_index = bind_ops[i].pat_index;
> > > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> > > addr, range, op, tile_mask,
> > > - region);
> > > + region, pat_index);
> > > if (IS_ERR(ops[i])) {
> > > err = PTR_ERR(ops[i]);
> > > ops[i] = NULL;
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index 83a1f87b6537..508679ed3c74 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -111,6 +111,11 @@ struct xe_vma {
> > > */
> > > u8 tile_present;
> > > + /**
> > > + * @pat_index: The pat index to use when encoding the PTEs for this vma.
> > > + */
> > > + u32 pat_index;
> > > +
> > > struct {
> > > struct list_head rebind_link;
> > > } notifier;
> > > @@ -338,10 +343,8 @@ struct xe_vm {
> > > bool batch_invalidate_tlb;
> > > struct {
> > > - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache);
> > > - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> > > - enum xe_cache_level cache);
> > > + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> > > + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> > > } pat_encode;
> > > };
> > > @@ -417,6 +420,8 @@ struct xe_vma_op {
> > > struct async_op_fence *fence;
> > > /** @tile_mask: gt mask for this operation */
> > > u8 tile_mask;
> > > + /** @pat_index: The pat index to use for this operation. */
> > > + u32 pat_index;
> > > /** @flags: operation flags */
> > > enum xe_vma_op_flags flags;
> > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > index 64bc66d4b550..0c15b6f32447 100644
> > > --- a/include/uapi/drm/xe_drm.h
> > > +++ b/include/uapi/drm/xe_drm.h
> > > @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> > > */
> > > __u32 obj;
> > > - /** @pad: MBZ */
> > > - __u32 pad;
> > > + /**
> > > + * @pat_index: The platform defined @pat_index to use for this mapping.
> > > + * The index basically maps to some predefined memory attributes,
> > > + * including things like caching, coherency and likely other stuff in
> > > + * the future. The exact meaning of the pat_index is platform specific
> > BTW, "other stuff in the future" already includes compression on Xe2, we
> > just haven't landed the patches for the Xe2 table yet.
> >
> > > + * and defined in the Bspec and PRMs. When the KMD sets up the binding
> > > + * the index here is encoded into the ppGTT PTE.
> > > + *
> > > + * For coherency the @pat_index needs to match the
> > > + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> > > + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> > > + * the coherency mode from the @pat_index and reject if there is a
> > > + * mismatch (see note below for pre-MTL platforms).
> > > + *
> > > + * Note: On pre-MTL platforms there is only a caching mode and no
> > > + * explicit coherency mode, but on such hardware there is always a
> > > + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> > > + * CPU caches even with the caching mode set as uncached. It's only the
> > > + * display engine that is incoherent (on dgpu it must be in VRAM which
> > > + * is always mapped as WC on the CPU). However to keep the uapi somewhat
> > > + * consistent with newer platforms the KMD groups the different cache
> > > + * levels into the following coherency buckets on all pre-MTL platforms:
> > > + *
> > > + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> > > + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> > > + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> > > + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> > As noted on the previous patch, it seems like 2-way is appropriate for
> > LLC platforms, but 1-way might be a more accurate description of dGPU
> > behavior.
> >
> > > + *
> > > + * In practice UC/WC/WT should only ever used for scanout surfaces on
> > > + * such platforms since it is only the display engine that is actually
> > > + * incoherent. Everything else should typically use WB given that we
> > What if we're sharing our buffers with some other (non-GPU) device? Are
> > there cases where that other device wouldn't be coherent with the LLC,
> > so we'd want to use one of these?
> >
> >
> > Matt
> >
> > > + * have a shared-LLC. On MTL+ this completely changes (also potentially
> > > + * no shared-LLC) and the HW defines the coherency mode as part of the
> > > + * @pat_index.
> > > + *
> > > + * Note: For userptr and externally imported dma-buf the kernel expects
> > > + * either 1WAY or 2WAY for the @pat_index.
> > > + */
> > > + __u32 pat_index;
> > > union {
> > > /**
> > > --
> > > 2.41.0
> > >
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-29 21:36 ` Matt Roper
2023-08-30 6:38 ` Thomas Hellström
@ 2023-08-30 11:28 ` Matthew Auld
2023-08-30 15:27 ` Zhang, Carl
1 sibling, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-08-30 11:28 UTC (permalink / raw)
To: Matt Roper
Cc: Filip Hazubski, Joonas Lahtinen, Lucas De Marchi, Carl Zhang,
Effie Yu, intel-xe
On 29/08/2023 22:36, Matt Roper wrote:
> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
>
> Is it actually important to match the coherency mode? I think one of
> the main goals was to know up front if userspace might be using a
> non-snooping PAT setting that would let it bypass the CPU cache (and
> potentially read old, stale data from a different process if the
> buffer's clear value is still sitting in cache and hasn't landed in
> memory yet).
>
> If that's the only concern, then I think it should still be fine to map
> with a non-matching PAT as long as it's more coherent than the one
> specified at creation, right? E.g., if the buffer was created with
> 1-way coherency, it would be fine to map it with 2-way because userspace
> still can't use that to observe the previous contents of the buffer. Or
Yeah, I guess we could in theory do something that.
> if the buffer was created with "non-coherent" then we've already done
> the necessary clflushing in kernel before handing to buffer over to
> userspace to ensure the clear value landed in memory, so any valid PAT
> index should be safe (from a security POV) after that, right? Any other
> problems that arise from mismatched coherency would just be contained to
> the app possibly shooting itself in the foot, which isn't really our
> concern.
That is also my understanding, at least from the KMD security pov. If
you allocate as wb then you must use at least 1way, since there is no
flushing for clearing or swap-in. For uc/wc you could in theory use
whatever you want.
>
>
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
>
> Maybe we should kill xe_cache_level completely and just assign
> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> gt->mocs.wb_index)?
OK, seems reasonable to me.
>
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>
>> u64 (*pte_encode)(dma_addr_t addr,
>> - unsigned int pat_index,
>> + u32 pat_index,
>> u32 flags); /* Create a valid PTE */
>> #define PTE_READ_ONLY BIT(0)
>> #define PTE_LM BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>> /* First part of the test, are we updating our pagetable bo with a new entry? */
>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>> 0xdeaddeadbeefbeef);
>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>> expected |= XE_PTE_PS64;
>> if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>> pte |= XE_GGTT_PTE_DM;
>>
>> if ((ggtt->pat_encode).pte_encode)
>> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>
>> return pte;
>> }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>> }
>>
>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>
>> struct {
>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> + u32 pat_index);
>> } pat_encode;
>> };
>>
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>> #include "xe_lrc.h"
>> #include "xe_map.h"
>> #include "xe_mocs.h"
>> +#include "xe_pat.h"
>> #include "xe_pt.h"
>> #include "xe_res_cursor.h"
>> #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>> u32 map_ofs, level, i;
>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>> u64 entry;
>> int ret;
>>
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> return ret;
>> }
>>
>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>
>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>
>> /* Map the entire BO in our level 0 pt */
>> for (i = 0, level = 0; i < num_entries; level++) {
>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> for (i = 0; i < batch->size;
>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>> XE_PAGE_SIZE) {
>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>> entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> flags = XE_PDE_64K;
>>
>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> - XE_PAGE_SIZE, XE_CACHE_WB);
>> + XE_PAGE_SIZE, pat_index);
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>> entry | flags);
>> }
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> /* Write PDE's that point to our BO. */
>> for (i = 0; i < num_entries - num_level; i++) {
>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> - XE_CACHE_WB);
>> + pat_index);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>> (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>
>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>
>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> bb->cs[bb->len++] = lower_32_bits(addr);
>> bb->cs[bb->len++] = upper_32_bits(addr);
>> }
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>> #include "xe_gt.h"
>> #include "xe_gt_tlb_invalidation.h"
>> #include "xe_migrate.h"
>> +#include "xe_pat.h"
>> #include "xe_pt_types.h"
>> #include "xe_pt_walk.h"
>> #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>> *
>> * Return: An encoded page directory entry. No errors.
>> */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>> {
>> u64 pde;
>> struct xe_vm *vm = bo->vm;
>> struct xe_device *xe = vm->xe;
>>
>> -
>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>
>> if ((vm->pat_encode).pde_encode)
>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>
>> return pde;
>> }
>>
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>> struct xe_vma *vma, u32 pt_level)
>> {
>> struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> pte |= XE_PTE_NULL;
>>
>> if ((vm->pat_encode).pte_encode)
>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>
>> if (pt_level == 1)
>> pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> *
>> * Return: An encoded page-table entry. No errors.
>> */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>> u32 pt_level)
>> {
>> u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>> pte |= XE_PPGTT_PTE_DM;
>>
>> - return __pte_encode(pte, cache, NULL, pt_level);
>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>> }
>>
>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>
>> if (level == 0) {
>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> - XE_CACHE_WB, 0);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>
>> return empty;
>> } else {
>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>> }
>>
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>> struct xe_vm *vm;
>> /** @tile: The tile we're building for. */
>> struct xe_tile *tile;
>> - /** @cache: Desired cache level for the ptes */
>> - enum xe_cache_level cache;
>> /** @default_pte: PTE flag only template. No address is associated */
>> u64 default_pte;
>> /** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>
>> pte = __pte_encode(is_null ? 0 :
>> xe_res_dma(curs) + xe_walk->dma_offset,
>> - xe_walk->cache, xe_walk->vma, level);
>> + xe_walk->vma->pat_index, xe_walk->vma, level);
>> pte |= xe_walk->default_pte;
>>
>> /*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>> xe_child->is_compact = true;
>> }
>>
>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>> pte);
>> }
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> - xe_walk.cache = XE_CACHE_WB;
>> - } else {
>> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> - xe_walk.cache = XE_CACHE_WT;
>> - else
>> - xe_walk.cache = XE_CACHE_WB;
>> }
>> +
>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>
>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> - u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>> #include "xe_vm.h"
>>
>> #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>
>> #include <drm/drm_print.h>
>> #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> u64 start, u64 end,
>> bool read_only,
>> bool is_null,
>> - u8 tile_mask)
>> + u8 tile_mask,
>> + u32 pat_index)
>> {
>> struct xe_vma *vma;
>> struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> vma->tile_mask |= 0x1 << id;
>> }
>>
>> + vma->pat_index = pat_index;
>> +
>> if (vm->xe->info.platform == XE_PVC)
>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>> static void vm_destroy_work_func(struct work_struct *w);
>>
>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> }
>>
>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>> {
>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>>
>> static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>> static struct drm_gpuva_ops *
>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> u64 bo_offset_or_userptr, u64 addr, u64 range,
>> - u32 operation, u8 tile_mask, u32 region)
>> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>> {
>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>> struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->map.immediate =
>> operation & XE_VM_BIND_FLAG_IMMEDIATE;
>> op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->prefetch.region = region;
>> }
>> break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> }
>>
>> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> - u8 tile_mask, bool read_only, bool is_null)
>> + u8 tile_mask, bool read_only, bool is_null,
>> + u32 pat_index)
>> {
>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>> struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> vma = xe_vma_create(vm, bo, op->gem.offset,
>> op->va.addr, op->va.addr +
>> op->va.range - 1, read_only, is_null,
>> - tile_mask);
>> + tile_mask, pat_index);
>> if (bo)
>> xe_bo_unlock(bo, &ww);
>>
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, &op->base.map,
>> op->tile_mask, op->map.read_only,
>> - op->map.is_null);
>> + op->map.is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.prev,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.next,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>> u32 obj = (*bind_ops)[i].obj;
>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>> u32 region = (*bind_ops)[i].region;
>> + u32 pat_index = (*bind_ops)[i].pat_index;
>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>
>> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> + err = -EINVAL;
>> + goto free_bind_ops;
>> + }
>> +
>> if (i == 0) {
>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>> } else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> struct drm_gem_object *gem_obj;
>> u64 range = bind_ops[i].range;
>> u64 addr = bind_ops[i].addr;
>> + u32 op = bind_ops[i].op;
>> u32 obj = bind_ops[i].obj;
>> u64 obj_offset = bind_ops[i].obj_offset;
>> + u32 pat_index = bind_ops[i].pat_index;
>> + u16 coh_mode;
>> +
>> + pat_index = array_index_nospec(pat_index,
>> + xe->info.pat_table_n_entries);
>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>
> Assuming we drop the unusable entries from the TGL table, this should be
> impossible, right? Any index that makes it past the n_entries check at
> the top of the function should have a valid, non-zero coh_mode value.
> So this should probably be an assertion (to highlight a KMD bug) rather
> than just a silent uapi failure return.
Makes sense.
>
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> +
>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> + coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>>
>> if (!obj)
>> continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> goto put_obj;
>> }
>> }
>> +
>> + if (bos[i]->coh_mode) {
>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + /*
>> + * Imported dma-buf from a different device should
>> + * require 1way or 2way coherency since we don't know
>> + * how it was mapped on CPU. Just assume is it
>> + * potentially cached on CPU side.
>> + */
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> }
>>
>> if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> u64 obj_offset = bind_ops[i].obj_offset;
>> u8 tile_mask = bind_ops[i].tile_mask;
>> u32 region = bind_ops[i].region;
>> + u32 pat_index = bind_ops[i].pat_index;
>>
>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>> addr, range, op, tile_mask,
>> - region);
>> + region, pat_index);
>> if (IS_ERR(ops[i])) {
>> err = PTR_ERR(ops[i]);
>> ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>> */
>> u8 tile_present;
>>
>> + /**
>> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> + */
>> + u32 pat_index;
>> +
>> struct {
>> struct list_head rebind_link;
>> } notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>> bool batch_invalidate_tlb;
>>
>> struct {
>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache);
>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>> } pat_encode;
>> };
>>
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>> struct async_op_fence *fence;
>> /** @tile_mask: gt mask for this operation */
>> u8 tile_mask;
>> + /** @pat_index: The pat index to use for this operation. */
>> + u32 pat_index;
>> /** @flags: operation flags */
>> enum xe_vma_op_flags flags;
>>
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>> */
>> __u32 obj;
>>
>> - /** @pad: MBZ */
>> - __u32 pad;
>> + /**
>> + * @pat_index: The platform defined @pat_index to use for this mapping.
>> + * The index basically maps to some predefined memory attributes,
>> + * including things like caching, coherency and likely other stuff in
>> + * the future. The exact meaning of the pat_index is platform specific
>
> BTW, "other stuff in the future" already includes compression on Xe2, we
> just haven't landed the patches for the Xe2 table yet.
Ok, good to know.
>
>> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
>> + * the index here is encoded into the ppGTT PTE.
>> + *
>> + * For coherency the @pat_index needs to match the
>> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> + * the coherency mode from the @pat_index and reject if there is a
>> + * mismatch (see note below for pre-MTL platforms).
>> + *
>> + * Note: On pre-MTL platforms there is only a caching mode and no
>> + * explicit coherency mode, but on such hardware there is always a
>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> + * CPU caches even with the caching mode set as uncached. It's only the
>> + * display engine that is incoherent (on dgpu it must be in VRAM which
>> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> + * consistent with newer platforms the KMD groups the different cache
>> + * levels into the following coherency buckets on all pre-MTL platforms:
>> + *
>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>
> As noted on the previous patch, it seems like 2-way is appropriate for
> LLC platforms, but 1-way might be a more accurate description of dGPU
> behavior.
>
>> + *
>> + * In practice UC/WC/WT should only ever used for scanout surfaces on
>> + * such platforms since it is only the display engine that is actually
>> + * incoherent. Everything else should typically use WB given that we
>
> What if we're sharing our buffers with some other (non-GPU) device? Are
> there cases where that other device wouldn't be coherent with the LLC,
> so we'd want to use one of these?
Yeah, I guess there might be cases like that. I'll reword.
>
>
> Matt
>
>> + * have a shared-LLC. On MTL+ this completely changes (also potentially
>> + * no shared-LLC) and the HW defines the coherency mode as part of the
>> + * @pat_index.
>> + *
>> + * Note: For userptr and externally imported dma-buf the kernel expects
>> + * either 1WAY or 2WAY for the @pat_index.
>> + */
>> + __u32 pat_index;
>>
>> union {
>> /**
>> --
>> 2.41.0
>>
>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-30 11:28 ` Matthew Auld
@ 2023-08-30 15:27 ` Zhang, Carl
2023-08-30 16:02 ` Matthew Auld
0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-08-30 15:27 UTC (permalink / raw)
To: Auld, Matthew, Roper, Matthew D
Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
intel-xe@lists.freedesktop.org
Several questions:
1. the pat_index from vm_bind will override the setting from bo_create?
How to keep the value from bo_create unchanged?
2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset, will it be handled by KMD automatically?
For example: if set 1-way, it means GPU could snoop CPU cache, we could use WB in mmap offset
If it is COHERENCY_NONE, we could only use UC, all these logic is handled by KMD automatically?
3. about " For imported dma-buf (from a different device) the coherency mode is also implicit
and must also be either 1WAY or 2WAY"
it means it must be 1way or 2way, and UMD need not to set it?
> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Wednesday, August 30, 2023 7:28 PM
> To: Roper, Matthew D <matthew.d.roper@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi <pallavi.mishra@intel.com>;
> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>; Hazubski,
> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu, Effie
> <effie.yu@intel.com>
> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
>
> On 29/08/2023 22:36, Matt Roper wrote:
> > On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >> Allow userspace to directly control the pat_index for a given vm
> >> binding. This should allow directly controlling the coherency,
> >> caching and potentially other stuff in the future for the ppGTT binding.
> >>
> >> The exact meaning behind the pat_index is very platform specific (see
> >> BSpec or PRMs) but effectively maps to some predefined memory
> >> attributes. From the KMD pov we only care about the coherency that is
> >> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> >> The vm_bind coherency mode for the given pat_index needs to match the
> >> given coh_mode that was set at object creation. For platforms that
> >> lack
> >
> > Is it actually important to match the coherency mode? I think one of
> > the main goals was to know up front if userspace might be using a
> > non-snooping PAT setting that would let it bypass the CPU cache (and
> > potentially read old, stale data from a different process if the
> > buffer's clear value is still sitting in cache and hasn't landed in
> > memory yet).
> >
> > If that's the only concern, then I think it should still be fine to
> > map with a non-matching PAT as long as it's more coherent than the one
> > specified at creation, right? E.g., if the buffer was created with
> > 1-way coherency, it would be fine to map it with 2-way because
> > userspace still can't use that to observe the previous contents of the
> > buffer. Or
>
> Yeah, I guess we could in theory do something that.
>
> > if the buffer was created with "non-coherent" then we've already done
> > the necessary clflushing in kernel before handing to buffer over to
> > userspace to ensure the clear value landed in memory, so any valid PAT
> > index should be safe (from a security POV) after that, right? Any
> > other problems that arise from mismatched coherency would just be
> > contained to the app possibly shooting itself in the foot, which isn't
> > really our concern.
>
> That is also my understanding, at least from the KMD security pov. If you
> allocate as wb then you must use at least 1way, since there is no flushing for
> clearing or swap-in. For uc/wc you could in theory use whatever you want.
>
> >
> >
> >> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
> 2WAY.
> >>
> >> For userptr mappings we lack a corresponding gem object, so the expected
> >> coherency mode is instead implicit and must fall into either 1WAY or
> >> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> >> dma-buf (from a different device) the coherency mode is also implicit
> >> and must also be either 1WAY or 2WAY.
> >>
> >> As part of adding pat_index support with vm_bind we also need stop using
> >> xe_cache_level and instead use the pat_index in various places. We still
> >> make use of xe_cache_level, but only as a convenience for kernel
> >> internal objectsi (internally it maps to some reasonable pat_index). For
> >
> > Maybe we should kill xe_cache_level completely and just assign
> > xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> > appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> > gt->mocs.wb_index)?
>
> OK, seems reasonable to me.
>
> >
> >> now this is just a 1:1 conversion of the existing code, however for
> >> platforms like MTL+ we might need to give more control through bo_create
> >> or stop using WB on the CPU side if we need CPU access.
> >>
> >> Bspec: 45101, 44235 #xe
> >> Bspec: 70552, 71582, 59400 #xe2
> >> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >> Cc: Matt Roper <matthew.d.roper@intel.com>
> >> Cc: José Roberto de Souza <jose.souza@intel.com>
> >> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >> Cc: Carl Zhang <carl.zhang@intel.com>
> >> Cc: Effie Yu <effie.yu@intel.com>
> >> ---
> >> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> >> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> >> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> >> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> >> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> >> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> >> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> >> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
> >> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> >> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> >> 10 files changed, 134 insertions(+), 58 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> index 4d6296cdbcfd..bb4c182048c3 100644
> >> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> @@ -302,7 +302,7 @@ struct i915_address_space {
> >> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> >>
> >> u64 (*pte_encode)(dma_addr_t addr,
> >> - unsigned int pat_index,
> >> + u32 pat_index,
> >> u32 flags); /* Create a valid PTE */
> >> #define PTE_READ_ONLY BIT(0)
> >> #define PTE_LM BIT(1)
> >> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> index 5c8d5e78d9bc..7a128fd20a29 100644
> >> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate
> *m, struct kunit *test)
> >> /* First part of the test, are we updating our pagetable bo with a new
> entry? */
> >> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1),
> u64,
> >> 0xdeaddeadbeefbeef);
> >> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> XE_CACHE_WB), 0);
> >> if (m->q->vm->flags & XE_VM_FLAG_64K)
> >> expected |= XE_PTE_PS64;
> >> if (xe_bo_is_vram(pt))
> >> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> >> index 209fa053d9fb..4134c26150a5 100644
> >> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
> bo_offset)
> >> pte |= XE_GGTT_PTE_DM;
> >>
> >> if ((ggtt->pat_encode).pte_encode)
> >> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
> XE_CACHE_WB_1_WAY);
> >> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >> + xe_pat_get_index(xe,
> XE_CACHE_WB_1_WAY));
> >>
> >> return pte;
> >> }
> >> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> >> }
> >>
> >> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> >> - enum xe_cache_level cache)
> >> + u32 pat_index)
> >> {
> >> - u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>
> >> if (pat_index & BIT(0))
> >> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> index 7e55fac1a8a9..0bc40cb072e3 100644
> >> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>
> >> struct {
> >> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> - enum xe_cache_level cache);
> >> + u32 pat_index);
> >> } pat_encode;
> >> };
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> b/drivers/gpu/drm/xe/xe_migrate.c
> >> index a782ea282cb6..54585e98452a 100644
> >> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >> @@ -24,6 +24,7 @@
> >> #include "xe_lrc.h"
> >> #include "xe_map.h"
> >> #include "xe_mocs.h"
> >> +#include "xe_pat.h"
> >> #include "xe_pt.h"
> >> #include "xe_res_cursor.h"
> >> #include "xe_sched_job.h"
> >> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
> >level;
> >> u32 map_ofs, level, i;
> >> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >> u64 entry;
> >> int ret;
> >>
> >> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile, struct xe_migrate *m,
> >> return ret;
> >> }
> >>
> >> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> >> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> >> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>
> >> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>
> >> /* Map the entire BO in our level 0 pt */
> >> for (i = 0, level = 0; i < num_entries; level++) {
> >> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
> 0);
> >> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> >>
> >> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> >>
> >> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >> for (i = 0; i < batch->size;
> >> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> >> XE_PAGE_SIZE) {
> >> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> >> + entry = xe_pte_encode(batch, i, pat_index, 0);
> >>
> >> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >> entry);
> >> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >> flags = XE_PDE_64K;
> >>
> >> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >> - XE_PAGE_SIZE, XE_CACHE_WB);
> >> + XE_PAGE_SIZE, pat_index);
> >> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
> u64,
> >> entry | flags);
> >> }
> >> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
> struct xe_migrate *m,
> >> /* Write PDE's that point to our BO. */
> >> for (i = 0; i < num_entries - num_level; i++) {
> >> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >> - XE_CACHE_WB);
> >> + pat_index);
> >>
> >> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> >> (i + 1) * 8, u64, entry);
> >> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate
> *m,
> >>
> >> XE_WARN_ON(pt_bo->size != SZ_4K);
> >>
> >> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> >> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
> XE_CACHE_WB), 0);
> >> bb->cs[bb->len++] = lower_32_bits(addr);
> >> bb->cs[bb->len++] = upper_32_bits(addr);
> >> }
> >> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> >> index 64713f400d94..019af2920078 100644
> >> --- a/drivers/gpu/drm/xe/xe_pt.c
> >> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >> @@ -10,6 +10,7 @@
> >> #include "xe_gt.h"
> >> #include "xe_gt_tlb_invalidation.h"
> >> #include "xe_migrate.h"
> >> +#include "xe_pat.h"
> >> #include "xe_pt_types.h"
> >> #include "xe_pt_walk.h"
> >> #include "xe_res_cursor.h"
> >> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
> *pt_dir, unsigned int index)
> >> *
> >> * Return: An encoded page directory entry. No errors.
> >> */
> >> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >> - const enum xe_cache_level cache)
> >> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> >> {
> >> u64 pde;
> >> struct xe_vm *vm = bo->vm;
> >> struct xe_device *xe = vm->xe;
> >>
> >> -
> >> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>
> >> if ((vm->pat_encode).pde_encode)
> >> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> >>
> >> return pde;
> >> }
> >>
> >> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >> struct xe_vma *vma, u32 pt_level)
> >> {
> >> struct xe_vm *vm = xe_vma_vm(vma);
> >> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
> xe_cache_level cache,
> >> pte |= XE_PTE_NULL;
> >>
> >> if ((vm->pat_encode).pte_encode)
> >> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> >>
> >> if (pt_level == 1)
> >> pte |= XE_PDE_PS_2M;
> >> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> xe_cache_level cache,
> >> *
> >> * Return: An encoded page-table entry. No errors.
> >> */
> >> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> cache,
> >> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >> u32 pt_level)
> >> {
> >> u64 pte;
> >> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset,
> enum xe_cache_level cache,
> >> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >> pte |= XE_PPGTT_PTE_DM;
> >>
> >> - return __pte_encode(pte, cache, NULL, pt_level);
> >> + return __pte_encode(pte, pat_index, NULL, pt_level);
> >> }
> >>
> >> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> >> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile,
> struct xe_vm *vm,
> >>
> >> if (level == 0) {
> >> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >> - XE_CACHE_WB, 0);
> >> + xe_pat_get_index(vm->xe,
> XE_CACHE_WB), 0);
> >>
> >> return empty;
> >> } else {
> >> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> >> - XE_CACHE_WB);
> >> + xe_pat_get_index(vm->xe,
> XE_CACHE_WB));
> >> }
> >> }
> >>
> >> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >> struct xe_vm *vm;
> >> /** @tile: The tile we're building for. */
> >> struct xe_tile *tile;
> >> - /** @cache: Desired cache level for the ptes */
> >> - enum xe_cache_level cache;
> >> /** @default_pte: PTE flag only template. No address is associated */
> >> u64 default_pte;
> >> /** @dma_offset: DMA offset to add to the PTE. */
> >> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> pgoff_t offset,
> >>
> >> pte = __pte_encode(is_null ? 0 :
> >> xe_res_dma(curs) + xe_walk->dma_offset,
> >> - xe_walk->cache, xe_walk->vma, level);
> >> + xe_walk->vma->pat_index, xe_walk->vma,
> level);
> >> pte |= xe_walk->default_pte;
> >>
> >> /*
> >> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> pgoff_t offset,
> >> xe_child->is_compact = true;
> >> }
> >>
> >> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> >> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >pat_index) | flags;
> >> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> >> pte);
> >> }
> >> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma
> *vma,
> >> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> >> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> >> xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >ttm.resource);
> >> - xe_walk.cache = XE_CACHE_WB;
> >> - } else {
> >> - if (!xe_vma_has_no_bo(vma) && bo->flags &
> XE_BO_SCANOUT_BIT)
> >> - xe_walk.cache = XE_CACHE_WT;
> >> - else
> >> - xe_walk.cache = XE_CACHE_WB;
> >> }
> >> +
> >> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >> xe_walk.dma_offset =
> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> >> index 01be7ab08f87..1d433a5a96b4 100644
> >> --- a/drivers/gpu/drm/xe/xe_pt.h
> >> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
> xe_vma *vma, struct xe_exec_queu
> >>
> >> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> >>
> >> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >> - const enum xe_cache_level level);
> >> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> >>
> >> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> cache,
> >> - u32 pt_level);
> >> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
> pt_level);
> >>
> >> #endif
> >> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> >> index 7eeeed0411f3..34603a7e84b0 100644
> >> --- a/drivers/gpu/drm/xe/xe_vm.c
> >> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >> @@ -6,6 +6,7 @@
> >> #include "xe_vm.h"
> >>
> >> #include <linux/dma-fence-array.h>
> >> +#include <linux/nospec.h>
> >>
> >> #include <drm/drm_print.h>
> >> #include <drm/ttm/ttm_execbuf_util.h>
> >> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> >> u64 start, u64 end,
> >> bool read_only,
> >> bool is_null,
> >> - u8 tile_mask)
> >> + u8 tile_mask,
> >> + u32 pat_index)
> >> {
> >> struct xe_vma *vma;
> >> struct xe_tile *tile;
> >> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> >> vma->tile_mask |= 0x1 << id;
> >> }
> >>
> >> + vma->pat_index = pat_index;
> >> +
> >> if (vm->xe->info.platform == XE_PVC)
> >> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>
> >> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
> work_struct *w);
> >> static void vm_destroy_work_func(struct work_struct *w);
> >>
> >> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> >> - enum xe_cache_level cache)
> >> + u32 pat_index)
> >> {
> >> - u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >> pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>
> >> if (pat_index & BIT(0))
> >> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct
> xe_device *xe, u64 pde_pat,
> >> }
> >>
> >> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> >> - enum xe_cache_level cache)
> >> + u32 pat_index)
> >> {
> >> - u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >> pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>
> >> if (pat_index & BIT(0))
> >> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef,
> u32 id)
> >> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> >> {
> >> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >> - XE_CACHE_WB);
> >> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> >> }
> >>
> >> static struct dma_fence *
> >> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct
> drm_gpuva_op *op)
> >> static struct drm_gpuva_ops *
> >> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >> u64 bo_offset_or_userptr, u64 addr, u64 range,
> >> - u32 operation, u8 tile_mask, u32 region)
> >> + u32 operation, u8 tile_mask, u32 region, u32
> pat_index)
> >> {
> >> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >> struct ww_acquire_ctx ww;
> >> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> struct xe_bo *bo,
> >> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>
> >> op->tile_mask = tile_mask;
> >> + op->pat_index = pat_index;
> >> op->map.immediate =
> >> operation & XE_VM_BIND_FLAG_IMMEDIATE;
> >> op->map.read_only =
> >> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> struct xe_bo *bo,
> >> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>
> >> op->tile_mask = tile_mask;
> >> + op->pat_index = pat_index;
> >> op->prefetch.region = region;
> >> }
> >> break;
> >> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> struct xe_bo *bo,
> >> }
> >>
> >> static struct xe_vma *new_vma(struct xe_vm *vm, struct
> drm_gpuva_op_map *op,
> >> - u8 tile_mask, bool read_only, bool is_null)
> >> + u8 tile_mask, bool read_only, bool is_null,
> >> + u32 pat_index)
> >> {
> >> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> >> struct xe_vma *vma;
> >> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
> *vm, struct drm_gpuva_op_map *op,
> >> vma = xe_vma_create(vm, bo, op->gem.offset,
> >> op->va.addr, op->va.addr +
> >> op->va.range - 1, read_only, is_null,
> >> - tile_mask);
> >> + tile_mask, pat_index);
> >> if (bo)
> >> xe_bo_unlock(bo, &ww);
> >>
> >> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct xe_exec_queue *q,
> >>
> >> vma = new_vma(vm, &op->base.map,
> >> op->tile_mask, op->map.read_only,
> >> - op->map.is_null);
> >> + op->map.is_null, op->pat_index);
> >> if (IS_ERR(vma)) {
> >> err = PTR_ERR(vma);
> >> goto free_fence;
> >> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct xe_exec_queue *q,
> >>
> >> vma = new_vma(vm, op-
> >base.remap.prev,
> >> op->tile_mask, read_only,
> >> - is_null);
> >> + is_null, op->pat_index);
> >> if (IS_ERR(vma)) {
> >> err = PTR_ERR(vma);
> >> goto free_fence;
> >> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct xe_exec_queue *q,
> >>
> >> vma = new_vma(vm, op-
> >base.remap.next,
> >> op->tile_mask, read_only,
> >> - is_null);
> >> + is_null, op->pat_index);
> >> if (IS_ERR(vma)) {
> >> err = PTR_ERR(vma);
> >> goto free_fence;
> >> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
> xe_device *xe,
> >> u32 obj = (*bind_ops)[i].obj;
> >> u64 obj_offset = (*bind_ops)[i].obj_offset;
> >> u32 region = (*bind_ops)[i].region;
> >> + u32 pat_index = (*bind_ops)[i].pat_index;
> >> bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>
> >> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >info.pat_table_n_entries)) {
> >> + err = -EINVAL;
> >> + goto free_bind_ops;
> >> + }
> >> +
> >> if (i == 0) {
> >> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> >> } else if (XE_IOCTL_DBG(xe, !*async) ||
> >> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> void *data, struct drm_file *file)
> >> struct drm_gem_object *gem_obj;
> >> u64 range = bind_ops[i].range;
> >> u64 addr = bind_ops[i].addr;
> >> + u32 op = bind_ops[i].op;
> >> u32 obj = bind_ops[i].obj;
> >> u64 obj_offset = bind_ops[i].obj_offset;
> >> + u32 pat_index = bind_ops[i].pat_index;
> >> + u16 coh_mode;
> >> +
> >> + pat_index = array_index_nospec(pat_index,
> >> + xe->info.pat_table_n_entries);
> >> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> >> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >
> > Assuming we drop the unusable entries from the TGL table, this should be
> > impossible, right? Any index that makes it past the n_entries check at
> > the top of the function should have a valid, non-zero coh_mode value.
> > So this should probably be an assertion (to highlight a KMD bug) rather
> > than just a silent uapi failure return.
>
> Makes sense.
>
> >
> >> + err = -EINVAL;
> >> + goto put_obj;
> >> + }
> >> +
> >> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> XE_VM_BIND_OP_MAP_USERPTR &&
> >> + coh_mode == XE_GEM_COHERENCY_NONE)) {
> >> + err = -EINVAL;
> >> + goto put_obj;
> >> + }
> >>
> >> if (!obj)
> >> continue;
> >> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> void *data, struct drm_file *file)
> >> goto put_obj;
> >> }
> >> }
> >> +
> >> + if (bos[i]->coh_mode) {
> >> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> coh_mode)) {
> >> + err = -EINVAL;
> >> + goto put_obj;
> >> + }
> >> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
> XE_GEM_COHERENCY_NONE)) {
> >> + /*
> >> + * Imported dma-buf from a different device should
> >> + * require 1way or 2way coherency since we don't
> know
> >> + * how it was mapped on CPU. Just assume is it
> >> + * potentially cached on CPU side.
> >> + */
> >> + err = -EINVAL;
> >> + goto put_obj;
> >> + }
> >> }
> >>
> >> if (args->num_syncs) {
> >> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> void *data, struct drm_file *file)
> >> u64 obj_offset = bind_ops[i].obj_offset;
> >> u8 tile_mask = bind_ops[i].tile_mask;
> >> u32 region = bind_ops[i].region;
> >> + u32 pat_index = bind_ops[i].pat_index;
> >>
> >> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> >> addr, range, op, tile_mask,
> >> - region);
> >> + region, pat_index);
> >> if (IS_ERR(ops[i])) {
> >> err = PTR_ERR(ops[i]);
> >> ops[i] = NULL;
> >> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> b/drivers/gpu/drm/xe/xe_vm_types.h
> >> index 83a1f87b6537..508679ed3c74 100644
> >> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >> @@ -111,6 +111,11 @@ struct xe_vma {
> >> */
> >> u8 tile_present;
> >>
> >> + /**
> >> + * @pat_index: The pat index to use when encoding the PTEs for this
> vma.
> >> + */
> >> + u32 pat_index;
> >> +
> >> struct {
> >> struct list_head rebind_link;
> >> } notifier;
> >> @@ -338,10 +343,8 @@ struct xe_vm {
> >> bool batch_invalidate_tlb;
> >>
> >> struct {
> >> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> - enum xe_cache_level cache);
> >> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> >> - enum xe_cache_level cache);
> >> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
> pat_index);
> >> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
> pat_index);
> >> } pat_encode;
> >> };
> >>
> >> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >> struct async_op_fence *fence;
> >> /** @tile_mask: gt mask for this operation */
> >> u8 tile_mask;
> >> + /** @pat_index: The pat index to use for this operation. */
> >> + u32 pat_index;
> >> /** @flags: operation flags */
> >> enum xe_vma_op_flags flags;
> >>
> >> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> >> index 64bc66d4b550..0c15b6f32447 100644
> >> --- a/include/uapi/drm/xe_drm.h
> >> +++ b/include/uapi/drm/xe_drm.h
> >> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >> */
> >> __u32 obj;
> >>
> >> - /** @pad: MBZ */
> >> - __u32 pad;
> >> + /**
> >> + * @pat_index: The platform defined @pat_index to use for this
> mapping.
> >> + * The index basically maps to some predefined memory attributes,
> >> + * including things like caching, coherency and likely other stuff in
> >> + * the future. The exact meaning of the pat_index is platform specific
> >
> > BTW, "other stuff in the future" already includes compression on Xe2, we
> > just haven't landed the patches for the Xe2 table yet.
>
> Ok, good to know.
>
> >
> >> + * and defined in the Bspec and PRMs. When the KMD sets up the
> binding
> >> + * the index here is encoded into the ppGTT PTE.
> >> + *
> >> + * For coherency the @pat_index needs to match the
> >> + * drm_xe_gem_create.coh_mode, so either
> XE_GEM_COHERENCY_NONE,
> >> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
> KMD will extract
> >> + * the coherency mode from the @pat_index and reject if there is a
> >> + * mismatch (see note below for pre-MTL platforms).
> >> + *
> >> + * Note: On pre-MTL platforms there is only a caching mode and no
> >> + * explicit coherency mode, but on such hardware there is always a
> >> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> >> + * CPU caches even with the caching mode set as uncached. It's only
> the
> >> + * display engine that is incoherent (on dgpu it must be in VRAM which
> >> + * is always mapped as WC on the CPU). However to keep the uapi
> somewhat
> >> + * consistent with newer platforms the KMD groups the different cache
> >> + * levels into the following coherency buckets on all pre-MTL platforms:
> >> + *
> >> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> >> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> >> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> >> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >
> > As noted on the previous patch, it seems like 2-way is appropriate for
> > LLC platforms, but 1-way might be a more accurate description of dGPU
> > behavior.
> >
> >> + *
> >> + * In practice UC/WC/WT should only ever used for scanout surfaces on
> >> + * such platforms since it is only the display engine that is actually
> >> + * incoherent. Everything else should typically use WB given that we
> >
> > What if we're sharing our buffers with some other (non-GPU) device? Are
> > there cases where that other device wouldn't be coherent with the LLC,
> > so we'd want to use one of these?
>
> Yeah, I guess there might be cases like that. I'll reword.
>
> >
> >
> > Matt
> >
> >> + * have a shared-LLC. On MTL+ this completely changes (also
> potentially
> >> + * no shared-LLC) and the HW defines the coherency mode as part of
> the
> >> + * @pat_index.
> >> + *
> >> + * Note: For userptr and externally imported dma-buf the kernel
> expects
> >> + * either 1WAY or 2WAY for the @pat_index.
> >> + */
> >> + __u32 pat_index;
> >>
> >> union {
> >> /**
> >> --
> >> 2.41.0
> >>
> >
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-30 15:27 ` Zhang, Carl
@ 2023-08-30 16:02 ` Matthew Auld
2023-08-31 8:24 ` Zhang, Carl
0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-08-30 16:02 UTC (permalink / raw)
To: Zhang, Carl, Roper, Matthew D
Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
intel-xe@lists.freedesktop.org
On 30/08/2023 16:27, Zhang, Carl wrote:
> Several questions:
> 1. the pat_index from vm_bind will override the setting from bo_create?
> How to keep the value from bo_create unchanged?
You only set the pat_index during vm_bind. At gem_create you just need
tell the kernel what CPU side caching mode to use for system memory
(wb/wc/uc), plus the expected GPU coherency mode of the pat_index. The
current expectation is that the coherency mode of the pat_index should
match the one at gem_create.
> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset, will it be handled by KMD automatically?
> For example: if set 1-way, it means GPU could snoop CPU cache, we could use WB in mmap offset
> If it is COHERENCY_NONE, we could only use UC, all these logic is handled by KMD automatically?
Yes, the mmap will use the same CPU side caching mode that you set at
gem_create with smem_caching (wb/wc/uc).
> 3. about " For imported dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY"
> it means it must be 1way or 2way, and UMD need not to set it?
If you are going to vm_bind it you need to supply the pat_index and in
this case it needs to be either 1way or 2way. The object comes from a
different device so we might not have smem_caching/coh_mode like we do
for native objects.
>
>
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Wednesday, August 30, 2023 7:28 PM
>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi <pallavi.mishra@intel.com>;
>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
>> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>; Hazubski,
>> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu, Effie
>> <effie.yu@intel.com>
>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
>>
>> On 29/08/2023 22:36, Matt Roper wrote:
>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>> Allow userspace to directly control the pat_index for a given vm
>>>> binding. This should allow directly controlling the coherency,
>>>> caching and potentially other stuff in the future for the ppGTT binding.
>>>>
>>>> The exact meaning behind the pat_index is very platform specific (see
>>>> BSpec or PRMs) but effectively maps to some predefined memory
>>>> attributes. From the KMD pov we only care about the coherency that is
>>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>>>> The vm_bind coherency mode for the given pat_index needs to match the
>>>> given coh_mode that was set at object creation. For platforms that
>>>> lack
>>>
>>> Is it actually important to match the coherency mode? I think one of
>>> the main goals was to know up front if userspace might be using a
>>> non-snooping PAT setting that would let it bypass the CPU cache (and
>>> potentially read old, stale data from a different process if the
>>> buffer's clear value is still sitting in cache and hasn't landed in
>>> memory yet).
>>>
>>> If that's the only concern, then I think it should still be fine to
>>> map with a non-matching PAT as long as it's more coherent than the one
>>> specified at creation, right? E.g., if the buffer was created with
>>> 1-way coherency, it would be fine to map it with 2-way because
>>> userspace still can't use that to observe the previous contents of the
>>> buffer. Or
>>
>> Yeah, I guess we could in theory do something that.
>>
>>> if the buffer was created with "non-coherent" then we've already done
>>> the necessary clflushing in kernel before handing to buffer over to
>>> userspace to ensure the clear value landed in memory, so any valid PAT
>>> index should be safe (from a security POV) after that, right? Any
>>> other problems that arise from mismatched coherency would just be
>>> contained to the app possibly shooting itself in the foot, which isn't
>>> really our concern.
>>
>> That is also my understanding, at least from the KMD security pov. If you
>> allocate as wb then you must use at least 1way, since there is no flushing for
>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
>>
>>>
>>>
>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
>> 2WAY.
>>>>
>>>> For userptr mappings we lack a corresponding gem object, so the expected
>>>> coherency mode is instead implicit and must fall into either 1WAY or
>>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>>>> dma-buf (from a different device) the coherency mode is also implicit
>>>> and must also be either 1WAY or 2WAY.
>>>>
>>>> As part of adding pat_index support with vm_bind we also need stop using
>>>> xe_cache_level and instead use the pat_index in various places. We still
>>>> make use of xe_cache_level, but only as a convenience for kernel
>>>> internal objectsi (internally it maps to some reasonable pat_index). For
>>>
>>> Maybe we should kill xe_cache_level completely and just assign
>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
>>> appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
>>> gt->mocs.wb_index)?
>>
>> OK, seems reasonable to me.
>>
>>>
>>>> now this is just a 1:1 conversion of the existing code, however for
>>>> platforms like MTL+ we might need to give more control through bo_create
>>>> or stop using WB on the CPU side if we need CPU access.
>>>>
>>>> Bspec: 45101, 44235 #xe
>>>> Bspec: 70552, 71582, 59400 #xe2
>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>>>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>>>
>>>> u64 (*pte_encode)(dma_addr_t addr,
>>>> - unsigned int pat_index,
>>>> + u32 pat_index,
>>>> u32 flags); /* Create a valid PTE */
>>>> #define PTE_READ_ONLY BIT(0)
>>>> #define PTE_LM BIT(1)
>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate
>> *m, struct kunit *test)
>>>> /* First part of the test, are we updating our pagetable bo with a new
>> entry? */
>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1),
>> u64,
>>>> 0xdeaddeadbeefbeef);
>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>> XE_CACHE_WB), 0);
>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>> expected |= XE_PTE_PS64;
>>>> if (xe_bo_is_vram(pt))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> index 209fa053d9fb..4134c26150a5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
>> bo_offset)
>>>> pte |= XE_GGTT_PTE_DM;
>>>>
>>>> if ((ggtt->pat_encode).pte_encode)
>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> XE_CACHE_WB_1_WAY);
>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>> + xe_pat_get_index(xe,
>> XE_CACHE_WB_1_WAY));
>>>>
>>>> return pte;
>>>> }
>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>> }
>>>>
>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache)
>>>> + u32 pat_index)
>>>> {
>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>
>>>> if (pat_index & BIT(0))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>
>>>> struct {
>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache);
>>>> + u32 pat_index);
>>>> } pat_encode;
>>>> };
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>> index a782ea282cb6..54585e98452a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>> @@ -24,6 +24,7 @@
>>>> #include "xe_lrc.h"
>>>> #include "xe_map.h"
>>>> #include "xe_mocs.h"
>>>> +#include "xe_pat.h"
>>>> #include "xe_pt.h"
>>>> #include "xe_res_cursor.h"
>>>> #include "xe_sched_job.h"
>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
>>> level;
>>>> u32 map_ofs, level, i;
>>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>> u64 entry;
>>>> int ret;
>>>>
>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile, struct xe_migrate *m,
>>>> return ret;
>>>> }
>>>>
>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>
>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>
>>>> /* Map the entire BO in our level 0 pt */
>>>> for (i = 0, level = 0; i < num_entries; level++) {
>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
>> 0);
>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>>>
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>>>
>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>> for (i = 0; i < batch->size;
>>>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>>> XE_PAGE_SIZE) {
>>>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>> entry);
>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>> flags = XE_PDE_64K;
>>>>
>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>> - XE_PAGE_SIZE, XE_CACHE_WB);
>>>> + XE_PAGE_SIZE, pat_index);
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
>> u64,
>>>> entry | flags);
>>>> }
>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile,
>> struct xe_migrate *m,
>>>> /* Write PDE's that point to our BO. */
>>>> for (i = 0; i < num_entries - num_level; i++) {
>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>> - XE_CACHE_WB);
>>>> + pat_index);
>>>>
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>>> (i + 1) * 8, u64, entry);
>>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate
>> *m,
>>>>
>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>
>>>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>>>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
>> XE_CACHE_WB), 0);
>>>> bb->cs[bb->len++] = lower_32_bits(addr);
>>>> bb->cs[bb->len++] = upper_32_bits(addr);
>>>> }
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>>>> index 64713f400d94..019af2920078 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>> @@ -10,6 +10,7 @@
>>>> #include "xe_gt.h"
>>>> #include "xe_gt_tlb_invalidation.h"
>>>> #include "xe_migrate.h"
>>>> +#include "xe_pat.h"
>>>> #include "xe_pt_types.h"
>>>> #include "xe_pt_walk.h"
>>>> #include "xe_res_cursor.h"
>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
>> *pt_dir, unsigned int index)
>>>> *
>>>> * Return: An encoded page directory entry. No errors.
>>>> */
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> - const enum xe_cache_level cache)
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>>> {
>>>> u64 pde;
>>>> struct xe_vm *vm = bo->vm;
>>>> struct xe_device *xe = vm->xe;
>>>>
>>>> -
>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>
>>>> if ((vm->pat_encode).pde_encode)
>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>>>
>>>> return pde;
>>>> }
>>>>
>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>> struct xe_vma *vma, u32 pt_level)
>>>> {
>>>> struct xe_vm *vm = xe_vma_vm(vma);
>>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
>> xe_cache_level cache,
>>>> pte |= XE_PTE_NULL;
>>>>
>>>> if ((vm->pat_encode).pte_encode)
>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>>>
>>>> if (pt_level == 1)
>>>> pte |= XE_PDE_PS_2M;
>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>> xe_cache_level cache,
>>>> *
>>>> * Return: An encoded page-table entry. No errors.
>>>> */
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>> cache,
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>> u32 pt_level)
>>>> {
>>>> u64 pte;
>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset,
>> enum xe_cache_level cache,
>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>> pte |= XE_PPGTT_PTE_DM;
>>>>
>>>> - return __pte_encode(pte, cache, NULL, pt_level);
>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>>>> }
>>>>
>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile,
>> struct xe_vm *vm,
>>>>
>>>> if (level == 0) {
>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>> - XE_CACHE_WB, 0);
>>>> + xe_pat_get_index(vm->xe,
>> XE_CACHE_WB), 0);
>>>>
>>>> return empty;
>>>> } else {
>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>>>> - XE_CACHE_WB);
>>>> + xe_pat_get_index(vm->xe,
>> XE_CACHE_WB));
>>>> }
>>>> }
>>>>
>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>> struct xe_vm *vm;
>>>> /** @tile: The tile we're building for. */
>>>> struct xe_tile *tile;
>>>> - /** @cache: Desired cache level for the ptes */
>>>> - enum xe_cache_level cache;
>>>> /** @default_pte: PTE flag only template. No address is associated */
>>>> u64 default_pte;
>>>> /** @dma_offset: DMA offset to add to the PTE. */
>>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>> pgoff_t offset,
>>>>
>>>> pte = __pte_encode(is_null ? 0 :
>>>> xe_res_dma(curs) + xe_walk->dma_offset,
>>>> - xe_walk->cache, xe_walk->vma, level);
>>>> + xe_walk->vma->pat_index, xe_walk->vma,
>> level);
>>>> pte |= xe_walk->default_pte;
>>>>
>>>> /*
>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>> pgoff_t offset,
>>>> xe_child->is_compact = true;
>>>> }
>>>>
>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>> pat_index) | flags;
>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>>> pte);
>>>> }
>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma
>> *vma,
>>>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>> ttm.resource);
>>>> - xe_walk.cache = XE_CACHE_WB;
>>>> - } else {
>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
>> XE_BO_SCANOUT_BIT)
>>>> - xe_walk.cache = XE_CACHE_WT;
>>>> - else
>>>> - xe_walk.cache = XE_CACHE_WB;
>>>> }
>>>> +
>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>> xe_walk.dma_offset =
>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>>>> index 01be7ab08f87..1d433a5a96b4 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
>> xe_vma *vma, struct xe_exec_queu
>>>>
>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>>>
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> - const enum xe_cache_level level);
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>>>
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>> cache,
>>>> - u32 pt_level);
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
>> pt_level);
>>>>
>>>> #endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>> index 7eeeed0411f3..34603a7e84b0 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>> @@ -6,6 +6,7 @@
>>>> #include "xe_vm.h"
>>>>
>>>> #include <linux/dma-fence-array.h>
>>>> +#include <linux/nospec.h>
>>>>
>>>> #include <drm/drm_print.h>
>>>> #include <drm/ttm/ttm_execbuf_util.h>
>>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
>> *vm,
>>>> u64 start, u64 end,
>>>> bool read_only,
>>>> bool is_null,
>>>> - u8 tile_mask)
>>>> + u8 tile_mask,
>>>> + u32 pat_index)
>>>> {
>>>> struct xe_vma *vma;
>>>> struct xe_tile *tile;
>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
>> *vm,
>>>> vma->tile_mask |= 0x1 << id;
>>>> }
>>>>
>>>> + vma->pat_index = pat_index;
>>>> +
>>>> if (vm->xe->info.platform == XE_PVC)
>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>
>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
>> work_struct *w);
>>>> static void vm_destroy_work_func(struct work_struct *w);
>>>>
>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>>> - enum xe_cache_level cache)
>>>> + u32 pat_index)
>>>> {
>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>
>>>> if (pat_index & BIT(0))
>>>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct
>> xe_device *xe, u64 pde_pat,
>>>> }
>>>>
>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache)
>>>> + u32 pat_index)
>>>> {
>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>
>>>> if (pat_index & BIT(0))
>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef,
>> u32 id)
>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>> {
>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>> - XE_CACHE_WB);
>>>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>> }
>>>>
>>>> static struct dma_fence *
>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct
>> drm_gpuva_op *op)
>>>> static struct drm_gpuva_ops *
>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>> u64 bo_offset_or_userptr, u64 addr, u64 range,
>>>> - u32 operation, u8 tile_mask, u32 region)
>>>> + u32 operation, u8 tile_mask, u32 region, u32
>> pat_index)
>>>> {
>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>> struct ww_acquire_ctx ww;
>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>> struct xe_bo *bo,
>>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>
>>>> op->tile_mask = tile_mask;
>>>> + op->pat_index = pat_index;
>>>> op->map.immediate =
>>>> operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>>> op->map.read_only =
>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>> struct xe_bo *bo,
>>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>
>>>> op->tile_mask = tile_mask;
>>>> + op->pat_index = pat_index;
>>>> op->prefetch.region = region;
>>>> }
>>>> break;
>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>> struct xe_bo *bo,
>>>> }
>>>>
>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
>> drm_gpuva_op_map *op,
>>>> - u8 tile_mask, bool read_only, bool is_null)
>>>> + u8 tile_mask, bool read_only, bool is_null,
>>>> + u32 pat_index)
>>>> {
>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>>> struct xe_vma *vma;
>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
>> *vm, struct drm_gpuva_op_map *op,
>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
>>>> op->va.addr, op->va.addr +
>>>> op->va.range - 1, read_only, is_null,
>>>> - tile_mask);
>>>> + tile_mask, pat_index);
>>>> if (bo)
>>>> xe_bo_unlock(bo, &ww);
>>>>
>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>>>
>>>> vma = new_vma(vm, &op->base.map,
>>>> op->tile_mask, op->map.read_only,
>>>> - op->map.is_null);
>>>> + op->map.is_null, op->pat_index);
>>>> if (IS_ERR(vma)) {
>>>> err = PTR_ERR(vma);
>>>> goto free_fence;
>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>>>
>>>> vma = new_vma(vm, op-
>>> base.remap.prev,
>>>> op->tile_mask, read_only,
>>>> - is_null);
>>>> + is_null, op->pat_index);
>>>> if (IS_ERR(vma)) {
>>>> err = PTR_ERR(vma);
>>>> goto free_fence;
>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>>>
>>>> vma = new_vma(vm, op-
>>> base.remap.next,
>>>> op->tile_mask, read_only,
>>>> - is_null);
>>>> + is_null, op->pat_index);
>>>> if (IS_ERR(vma)) {
>>>> err = PTR_ERR(vma);
>>>> goto free_fence;
>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
>> xe_device *xe,
>>>> u32 obj = (*bind_ops)[i].obj;
>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>> u32 region = (*bind_ops)[i].region;
>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>
>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>> info.pat_table_n_entries)) {
>>>> + err = -EINVAL;
>>>> + goto free_bind_ops;
>>>> + }
>>>> +
>>>> if (i == 0) {
>>>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>> } else if (XE_IOCTL_DBG(xe, !*async) ||
>>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
>> void *data, struct drm_file *file)
>>>> struct drm_gem_object *gem_obj;
>>>> u64 range = bind_ops[i].range;
>>>> u64 addr = bind_ops[i].addr;
>>>> + u32 op = bind_ops[i].op;
>>>> u32 obj = bind_ops[i].obj;
>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>> + u16 coh_mode;
>>>> +
>>>> + pat_index = array_index_nospec(pat_index,
>>>> + xe->info.pat_table_n_entries);
>>>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>
>>> Assuming we drop the unusable entries from the TGL table, this should be
>>> impossible, right? Any index that makes it past the n_entries check at
>>> the top of the function should have a valid, non-zero coh_mode value.
>>> So this should probably be an assertion (to highlight a KMD bug) rather
>>> than just a silent uapi failure return.
>>
>> Makes sense.
>>
>>>
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>> +
>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>> + coh_mode == XE_GEM_COHERENCY_NONE)) {
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>>
>>>> if (!obj)
>>>> continue;
>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
>> void *data, struct drm_file *file)
>>>> goto put_obj;
>>>> }
>>>> }
>>>> +
>>>> + if (bos[i]->coh_mode) {
>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>> coh_mode)) {
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
>> XE_GEM_COHERENCY_NONE)) {
>>>> + /*
>>>> + * Imported dma-buf from a different device should
>>>> + * require 1way or 2way coherency since we don't
>> know
>>>> + * how it was mapped on CPU. Just assume is it
>>>> + * potentially cached on CPU side.
>>>> + */
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>> }
>>>>
>>>> if (args->num_syncs) {
>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
>> void *data, struct drm_file *file)
>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>> u8 tile_mask = bind_ops[i].tile_mask;
>>>> u32 region = bind_ops[i].region;
>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>
>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>>> addr, range, op, tile_mask,
>>>> - region);
>>>> + region, pat_index);
>>>> if (IS_ERR(ops[i])) {
>>>> err = PTR_ERR(ops[i]);
>>>> ops[i] = NULL;
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> index 83a1f87b6537..508679ed3c74 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>> */
>>>> u8 tile_present;
>>>>
>>>> + /**
>>>> + * @pat_index: The pat index to use when encoding the PTEs for this
>> vma.
>>>> + */
>>>> + u32 pat_index;
>>>> +
>>>> struct {
>>>> struct list_head rebind_link;
>>>> } notifier;
>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>> bool batch_invalidate_tlb;
>>>>
>>>> struct {
>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache);
>>>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>> - enum xe_cache_level cache);
>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
>> pat_index);
>>>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
>> pat_index);
>>>> } pat_encode;
>>>> };
>>>>
>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>> struct async_op_fence *fence;
>>>> /** @tile_mask: gt mask for this operation */
>>>> u8 tile_mask;
>>>> + /** @pat_index: The pat index to use for this operation. */
>>>> + u32 pat_index;
>>>> /** @flags: operation flags */
>>>> enum xe_vma_op_flags flags;
>>>>
>>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>> index 64bc66d4b550..0c15b6f32447 100644
>>>> --- a/include/uapi/drm/xe_drm.h
>>>> +++ b/include/uapi/drm/xe_drm.h
>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>> */
>>>> __u32 obj;
>>>>
>>>> - /** @pad: MBZ */
>>>> - __u32 pad;
>>>> + /**
>>>> + * @pat_index: The platform defined @pat_index to use for this
>> mapping.
>>>> + * The index basically maps to some predefined memory attributes,
>>>> + * including things like caching, coherency and likely other stuff in
>>>> + * the future. The exact meaning of the pat_index is platform specific
>>>
>>> BTW, "other stuff in the future" already includes compression on Xe2, we
>>> just haven't landed the patches for the Xe2 table yet.
>>
>> Ok, good to know.
>>
>>>
>>>> + * and defined in the Bspec and PRMs. When the KMD sets up the
>> binding
>>>> + * the index here is encoded into the ppGTT PTE.
>>>> + *
>>>> + * For coherency the @pat_index needs to match the
>>>> + * drm_xe_gem_create.coh_mode, so either
>> XE_GEM_COHERENCY_NONE,
>>>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
>> KMD will extract
>>>> + * the coherency mode from the @pat_index and reject if there is a
>>>> + * mismatch (see note below for pre-MTL platforms).
>>>> + *
>>>> + * Note: On pre-MTL platforms there is only a caching mode and no
>>>> + * explicit coherency mode, but on such hardware there is always a
>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>>>> + * CPU caches even with the caching mode set as uncached. It's only
>> the
>>>> + * display engine that is incoherent (on dgpu it must be in VRAM which
>>>> + * is always mapped as WC on the CPU). However to keep the uapi
>> somewhat
>>>> + * consistent with newer platforms the KMD groups the different cache
>>>> + * levels into the following coherency buckets on all pre-MTL platforms:
>>>> + *
>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>
>>> As noted on the previous patch, it seems like 2-way is appropriate for
>>> LLC platforms, but 1-way might be a more accurate description of dGPU
>>> behavior.
>>>
>>>> + *
>>>> + * In practice UC/WC/WT should only ever used for scanout surfaces on
>>>> + * such platforms since it is only the display engine that is actually
>>>> + * incoherent. Everything else should typically use WB given that we
>>>
>>> What if we're sharing our buffers with some other (non-GPU) device? Are
>>> there cases where that other device wouldn't be coherent with the LLC,
>>> so we'd want to use one of these?
>>
>> Yeah, I guess there might be cases like that. I'll reword.
>>
>>>
>>>
>>> Matt
>>>
>>>> + * have a shared-LLC. On MTL+ this completely changes (also
>> potentially
>>>> + * no shared-LLC) and the HW defines the coherency mode as part of
>> the
>>>> + * @pat_index.
>>>> + *
>>>> + * Note: For userptr and externally imported dma-buf the kernel
>> expects
>>>> + * either 1WAY or 2WAY for the @pat_index.
>>>> + */
>>>> + __u32 pat_index;
>>>>
>>>> union {
>>>> /**
>>>> --
>>>> 2.41.0
>>>>
>>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-30 16:02 ` Matthew Auld
@ 2023-08-31 8:24 ` Zhang, Carl
2023-08-31 10:44 ` Matthew Auld
0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-08-31 8:24 UTC (permalink / raw)
To: Auld, Matthew, Roper, Matthew D
Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
intel-xe@lists.freedesktop.org
> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Thursday, August 31, 2023 12:02 AM
>
> On 30/08/2023 16:27, Zhang, Carl wrote:
> > Several questions:
> > 1. the pat_index from vm_bind will override the setting from bo_create?
> > How to keep the value from bo_create unchanged?
>
> You only set the pat_index during vm_bind. At gem_create you just need tell
> the kernel what CPU side caching mode to use for system memory
> (wb/wc/uc), plus the expected GPU coherency mode of the pat_index. The
> current expectation is that the coherency mode of the pat_index should match
> the one at gem_create.
>
It means that we set the incomplete pat_index at gem_create (coherency only)
And set another part at vm_bind (gpu cache)
Why we separate them? Why not set a full value at vm_bind?
> > 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
> will it be handled by KMD automatically?
> > For example: if set 1-way, it means GPU could snoop CPU cache, we
> > could use WB in mmap offset If it is COHERENCY_NONE, we could only use
> UC, all these logic is handled by KMD automatically?
>
> Yes, the mmap will use the same CPU side caching mode that you set at
> gem_create with smem_caching (wb/wc/uc).
>
It means it could not be changed after gem_create. Sometimes, UMD doesn’t know
Whether the data need to be accessed by CPU at gem_create . only application
Knows the purpose of allocation & the surface usage. So, may need change UMD api
to let applications set whether the resource is "accessible", or it will hurt the perf.
> > 3. about " For imported dma-buf (from a different device) the
> > coherency mode is also implicit and must also be either 1WAY or 2WAY"
> > it means it must be 1way or 2way, and UMD need not to set it?
>
> If you are going to vm_bind it you need to supply the pat_index and in this
> case it needs to be either 1way or 2way. The object comes from a different
> device so we might not have smem_caching/coh_mode like we do for native
> objects.
>
How about the GPU cache setting of imported dma-buf?
Is there possible conflict between dma-buf producer and consumer?
For example:
In producer side , one bo is set GPU L3 cache, also CPU cacheable, it should be 2Way
In client side, we know the coherency should be 1-way or 2-way, if it suppose no GPU cache
Used. It just set 1-way.
It may cause some problems?
Also , when a bo is created, UMD does not know whether it will be exported.
It maybe be set as COHERENCY_NONE, right?
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Wednesday, August 30, 2023 7:28 PM
> >> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> <pallavi.mishra@intel.com>;
> >> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
> >> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> Hazubski,
> >> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu,
> Effie
> >> <effie.yu@intel.com>
> >> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with
> vm_bind
> >>
> >> On 29/08/2023 22:36, Matt Roper wrote:
> >>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>> Allow userspace to directly control the pat_index for a given vm
> >>>> binding. This should allow directly controlling the coherency,
> >>>> caching and potentially other stuff in the future for the ppGTT binding.
> >>>>
> >>>> The exact meaning behind the pat_index is very platform specific (see
> >>>> BSpec or PRMs) but effectively maps to some predefined memory
> >>>> attributes. From the KMD pov we only care about the coherency that is
> >>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> >>>> The vm_bind coherency mode for the given pat_index needs to match
> the
> >>>> given coh_mode that was set at object creation. For platforms that
> >>>> lack
> >>>
> >>> Is it actually important to match the coherency mode? I think one of
> >>> the main goals was to know up front if userspace might be using a
> >>> non-snooping PAT setting that would let it bypass the CPU cache (and
> >>> potentially read old, stale data from a different process if the
> >>> buffer's clear value is still sitting in cache and hasn't landed in
> >>> memory yet).
> >>>
> >>> If that's the only concern, then I think it should still be fine to
> >>> map with a non-matching PAT as long as it's more coherent than the one
> >>> specified at creation, right? E.g., if the buffer was created with
> >>> 1-way coherency, it would be fine to map it with 2-way because
> >>> userspace still can't use that to observe the previous contents of the
> >>> buffer. Or
> >>
> >> Yeah, I guess we could in theory do something that.
> >>
> >>> if the buffer was created with "non-coherent" then we've already done
> >>> the necessary clflushing in kernel before handing to buffer over to
> >>> userspace to ensure the clear value landed in memory, so any valid PAT
> >>> index should be safe (from a security POV) after that, right? Any
> >>> other problems that arise from mismatched coherency would just be
> >>> contained to the app possibly shooting itself in the foot, which isn't
> >>> really our concern.
> >>
> >> That is also my understanding, at least from the KMD security pov. If you
> >> allocate as wb then you must use at least 1way, since there is no flushing
> for
> >> clearing or swap-in. For uc/wc you could in theory use whatever you want.
> >>
> >>>
> >>>
> >>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
> >> 2WAY.
> >>>>
> >>>> For userptr mappings we lack a corresponding gem object, so the
> expected
> >>>> coherency mode is instead implicit and must fall into either 1WAY or
> >>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> >>>> dma-buf (from a different device) the coherency mode is also implicit
> >>>> and must also be either 1WAY or 2WAY.
> >>>>
> >>>> As part of adding pat_index support with vm_bind we also need stop
> using
> >>>> xe_cache_level and instead use the pat_index in various places. We still
> >>>> make use of xe_cache_level, but only as a convenience for kernel
> >>>> internal objectsi (internally it maps to some reasonable pat_index). For
> >>>
> >>> Maybe we should kill xe_cache_level completely and just assign
> >>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> >>> appropriate places, similar to what we do with MOCS (gt-
> >mocs.uc_index,
> >>> gt->mocs.wb_index)?
> >>
> >> OK, seems reasonable to me.
> >>
> >>>
> >>>> now this is just a 1:1 conversion of the existing code, however for
> >>>> platforms like MTL+ we might need to give more control through
> bo_create
> >>>> or stop using WB on the CPU side if we need CPU access.
> >>>>
> >>>> Bspec: 45101, 44235 #xe
> >>>> Bspec: 70552, 71582, 59400 #xe2
> >>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>> ---
> >>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> >>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> >>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> >>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> >>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> >>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> >>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> >>>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++-----
> -
> >>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> >>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> >>>> 10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> >>>>
> >>>> u64 (*pte_encode)(dma_addr_t addr,
> >>>> - unsigned int pat_index,
> >>>> + u32 pat_index,
> >>>> u32 flags); /* Create a valid PTE */
> >>>> #define PTE_READ_ONLY BIT(0)
> >>>> #define PTE_LM BIT(1)
> >>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> xe_migrate
> >> *m, struct kunit *test)
> >>>> /* First part of the test, are we updating our pagetable bo with a new
> >> entry? */
> >>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE -
> 1),
> >> u64,
> >>>> 0xdeaddeadbeefbeef);
> >>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >> XE_CACHE_WB), 0);
> >>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>> expected |= XE_PTE_PS64;
> >>>> if (xe_bo_is_vram(pt))
> >>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> index 209fa053d9fb..4134c26150a5 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
> >> bo_offset)
> >>>> pte |= XE_GGTT_PTE_DM;
> >>>>
> >>>> if ((ggtt->pat_encode).pte_encode)
> >>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >> XE_CACHE_WB_1_WAY);
> >>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>> + xe_pat_get_index(xe,
> >> XE_CACHE_WB_1_WAY));
> >>>>
> >>>> return pte;
> >>>> }
> >>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> >>>> }
> >>>>
> >>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64
> pte_pat,
> >>>> - enum xe_cache_level cache)
> >>>> + u32 pat_index)
> >>>> {
> >>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>> -
> >>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>
> >>>> if (pat_index & BIT(0))
> >>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>
> >>>> struct {
> >>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>> - enum xe_cache_level cache);
> >>>> + u32 pat_index);
> >>>> } pat_encode;
> >>>> };
> >>>>
> >>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>> index a782ea282cb6..54585e98452a 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>> @@ -24,6 +24,7 @@
> >>>> #include "xe_lrc.h"
> >>>> #include "xe_map.h"
> >>>> #include "xe_mocs.h"
> >>>> +#include "xe_pat.h"
> >>>> #include "xe_pt.h"
> >>>> #include "xe_res_cursor.h"
> >>>> #include "xe_sched_job.h"
> >>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
> >>> level;
> >>>> u32 map_ofs, level, i;
> >>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>> u64 entry;
> >>>> int ret;
> >>>>
> >>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> xe_tile
> >> *tile, struct xe_migrate *m,
> >>>> return ret;
> >>>> }
> >>>>
> >>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> XE_CACHE_WB);
> >>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> >>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>
> >>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>
> >>>> /* Map the entire BO in our level 0 pt */
> >>>> for (i = 0, level = 0; i < num_entries; level++) {
> >>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
> >> 0);
> >>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> >>>>
> >>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> >>>>
> >>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>> for (i = 0; i < batch->size;
> >>>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> >>>> XE_PAGE_SIZE) {
> >>>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> >>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>
> >>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >>>> entry);
> >>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>> flags = XE_PDE_64K;
> >>>>
> >>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>> - XE_PAGE_SIZE, XE_CACHE_WB);
> >>>> + XE_PAGE_SIZE, pat_index);
> >>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
> >> u64,
> >>>> entry | flags);
> >>>> }
> >>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
> *tile,
> >> struct xe_migrate *m,
> >>>> /* Write PDE's that point to our BO. */
> >>>> for (i = 0; i < num_entries - num_level; i++) {
> >>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>> - XE_CACHE_WB);
> >>>> + pat_index);
> >>>>
> >>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> >>>> (i + 1) * 8, u64, entry);
> >>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct
> xe_migrate
> >> *m,
> >>>>
> >>>> XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>
> >>>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> >>>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
> >> XE_CACHE_WB), 0);
> >>>> bb->cs[bb->len++] = lower_32_bits(addr);
> >>>> bb->cs[bb->len++] = upper_32_bits(addr);
> >>>> }
> >>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> >>>> index 64713f400d94..019af2920078 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>> @@ -10,6 +10,7 @@
> >>>> #include "xe_gt.h"
> >>>> #include "xe_gt_tlb_invalidation.h"
> >>>> #include "xe_migrate.h"
> >>>> +#include "xe_pat.h"
> >>>> #include "xe_pt_types.h"
> >>>> #include "xe_pt_walk.h"
> >>>> #include "xe_res_cursor.h"
> >>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
> >> *pt_dir, unsigned int index)
> >>>> *
> >>>> * Return: An encoded page directory entry. No errors.
> >>>> */
> >>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>> - const enum xe_cache_level cache)
> >>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> >>>> {
> >>>> u64 pde;
> >>>> struct xe_vm *vm = bo->vm;
> >>>> struct xe_device *xe = vm->xe;
> >>>>
> >>>> -
> >>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>
> >>>> if ((vm->pat_encode).pde_encode)
> >>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> >>>>
> >>>> return pde;
> >>>> }
> >>>>
> >>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>> struct xe_vma *vma, u32 pt_level)
> >>>> {
> >>>> struct xe_vm *vm = xe_vma_vm(vma);
> >>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
> >> xe_cache_level cache,
> >>>> pte |= XE_PTE_NULL;
> >>>>
> >>>> if ((vm->pat_encode).pte_encode)
> >>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> >>>>
> >>>> if (pt_level == 1)
> >>>> pte |= XE_PDE_PS_2M;
> >>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >> xe_cache_level cache,
> >>>> *
> >>>> * Return: An encoded page-table entry. No errors.
> >>>> */
> >>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> >> cache,
> >>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>> u32 pt_level)
> >>>> {
> >>>> u64 pte;
> >>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> offset,
> >> enum xe_cache_level cache,
> >>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>> pte |= XE_PPGTT_PTE_DM;
> >>>>
> >>>> - return __pte_encode(pte, cache, NULL, pt_level);
> >>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>> }
> >>>>
> >>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> >>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile
> *tile,
> >> struct xe_vm *vm,
> >>>>
> >>>> if (level == 0) {
> >>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>> - XE_CACHE_WB, 0);
> >>>> + xe_pat_get_index(vm->xe,
> >> XE_CACHE_WB), 0);
> >>>>
> >>>> return empty;
> >>>> } else {
> >>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> >>>> - XE_CACHE_WB);
> >>>> + xe_pat_get_index(vm->xe,
> >> XE_CACHE_WB));
> >>>> }
> >>>> }
> >>>>
> >>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>> struct xe_vm *vm;
> >>>> /** @tile: The tile we're building for. */
> >>>> struct xe_tile *tile;
> >>>> - /** @cache: Desired cache level for the ptes */
> >>>> - enum xe_cache_level cache;
> >>>> /** @default_pte: PTE flag only template. No address is associated */
> >>>> u64 default_pte;
> >>>> /** @dma_offset: DMA offset to add to the PTE. */
> >>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >> pgoff_t offset,
> >>>>
> >>>> pte = __pte_encode(is_null ? 0 :
> >>>> xe_res_dma(curs) + xe_walk->dma_offset,
> >>>> - xe_walk->cache, xe_walk->vma, level);
> >>>> + xe_walk->vma->pat_index, xe_walk->vma,
> >> level);
> >>>> pte |= xe_walk->default_pte;
> >>>>
> >>>> /*
> >>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >> pgoff_t offset,
> >>>> xe_child->is_compact = true;
> >>>> }
> >>>>
> >>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) |
> flags;
> >>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>> pat_index) | flags;
> >>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> >>>> pte);
> >>>> }
> >>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma
> >> *vma,
> >>>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> >>>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> >>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>> ttm.resource);
> >>>> - xe_walk.cache = XE_CACHE_WB;
> >>>> - } else {
> >>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
> >> XE_BO_SCANOUT_BIT)
> >>>> - xe_walk.cache = XE_CACHE_WT;
> >>>> - else
> >>>> - xe_walk.cache = XE_CACHE_WB;
> >>>> }
> >>>> +
> >>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>> xe_walk.dma_offset =
> >> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>
> >>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> >>>> index 01be7ab08f87..1d433a5a96b4 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
> >> xe_vma *vma, struct xe_exec_queu
> >>>>
> >>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> >>>>
> >>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>> - const enum xe_cache_level level);
> >>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> >>>>
> >>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
> >> cache,
> >>>> - u32 pt_level);
> >>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
> >> pt_level);
> >>>>
> >>>> #endif
> >>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> >>>> index 7eeeed0411f3..34603a7e84b0 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>> @@ -6,6 +6,7 @@
> >>>> #include "xe_vm.h"
> >>>>
> >>>> #include <linux/dma-fence-array.h>
> >>>> +#include <linux/nospec.h>
> >>>>
> >>>> #include <drm/drm_print.h>
> >>>> #include <drm/ttm/ttm_execbuf_util.h>
> >>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct
> xe_vm
> >> *vm,
> >>>> u64 start, u64 end,
> >>>> bool read_only,
> >>>> bool is_null,
> >>>> - u8 tile_mask)
> >>>> + u8 tile_mask,
> >>>> + u32 pat_index)
> >>>> {
> >>>> struct xe_vma *vma;
> >>>> struct xe_tile *tile;
> >>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
> xe_vm
> >> *vm,
> >>>> vma->tile_mask |= 0x1 << id;
> >>>> }
> >>>>
> >>>> + vma->pat_index = pat_index;
> >>>> +
> >>>> if (vm->xe->info.platform == XE_PVC)
> >>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>
> >>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
> >> work_struct *w);
> >>>> static void vm_destroy_work_func(struct work_struct *w);
> >>>>
> >>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64
> pde_pat,
> >>>> - enum xe_cache_level cache)
> >>>> + u32 pat_index)
> >>>> {
> >>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>> -
> >>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>
> >>>> if (pat_index & BIT(0))
> >>>> @@ -1213,10 +1215,8 @@ static u64
> xelp_ppgtt_pde_encode_pat(struct
> >> xe_device *xe, u64 pde_pat,
> >>>> }
> >>>>
> >>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64
> pte_pat,
> >>>> - enum xe_cache_level cache)
> >>>> + u32 pat_index)
> >>>> {
> >>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>> -
> >>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>
> >>>> if (pat_index & BIT(0))
> >>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
> *xef,
> >> u32 id)
> >>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> >>>> {
> >>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>> - XE_CACHE_WB);
> >>>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> >>>> }
> >>>>
> >>>> static struct dma_fence *
> >>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
> struct
> >> drm_gpuva_op *op)
> >>>> static struct drm_gpuva_ops *
> >>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>>> u64 bo_offset_or_userptr, u64 addr, u64 range,
> >>>> - u32 operation, u8 tile_mask, u32 region)
> >>>> + u32 operation, u8 tile_mask, u32 region, u32
> >> pat_index)
> >>>> {
> >>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>> struct ww_acquire_ctx ww;
> >>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> >> struct xe_bo *bo,
> >>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>>>
> >>>> op->tile_mask = tile_mask;
> >>>> + op->pat_index = pat_index;
> >>>> op->map.immediate =
> >>>> operation &
> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>> op->map.read_only =
> >>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> >> struct xe_bo *bo,
> >>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>>>
> >>>> op->tile_mask = tile_mask;
> >>>> + op->pat_index = pat_index;
> >>>> op->prefetch.region = region;
> >>>> }
> >>>> break;
> >>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> >> struct xe_bo *bo,
> >>>> }
> >>>>
> >>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >> drm_gpuva_op_map *op,
> >>>> - u8 tile_mask, bool read_only, bool is_null)
> >>>> + u8 tile_mask, bool read_only, bool is_null,
> >>>> + u32 pat_index)
> >>>> {
> >>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) :
> NULL;
> >>>> struct xe_vma *vma;
> >>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
> >> *vm, struct drm_gpuva_op_map *op,
> >>>> vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>> op->va.addr, op->va.addr +
> >>>> op->va.range - 1, read_only, is_null,
> >>>> - tile_mask);
> >>>> + tile_mask, pat_index);
> >>>> if (bo)
> >>>> xe_bo_unlock(bo, &ww);
> >>>>
> >>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm
> >> *vm, struct xe_exec_queue *q,
> >>>>
> >>>> vma = new_vma(vm, &op->base.map,
> >>>> op->tile_mask, op-
> >map.read_only,
> >>>> - op->map.is_null);
> >>>> + op->map.is_null, op->pat_index);
> >>>> if (IS_ERR(vma)) {
> >>>> err = PTR_ERR(vma);
> >>>> goto free_fence;
> >>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm
> >> *vm, struct xe_exec_queue *q,
> >>>>
> >>>> vma = new_vma(vm, op-
> >>> base.remap.prev,
> >>>> op->tile_mask, read_only,
> >>>> - is_null);
> >>>> + is_null, op->pat_index);
> >>>> if (IS_ERR(vma)) {
> >>>> err = PTR_ERR(vma);
> >>>> goto free_fence;
> >>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm
> >> *vm, struct xe_exec_queue *q,
> >>>>
> >>>> vma = new_vma(vm, op-
> >>> base.remap.next,
> >>>> op->tile_mask, read_only,
> >>>> - is_null);
> >>>> + is_null, op->pat_index);
> >>>> if (IS_ERR(vma)) {
> >>>> err = PTR_ERR(vma);
> >>>> goto free_fence;
> >>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
> >> xe_device *xe,
> >>>> u32 obj = (*bind_ops)[i].obj;
> >>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>> u32 region = (*bind_ops)[i].region;
> >>>> + u32 pat_index = (*bind_ops)[i].pat_index;
> >>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>
> >>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>> info.pat_table_n_entries)) {
> >>>> + err = -EINVAL;
> >>>> + goto free_bind_ops;
> >>>> + }
> >>>> +
> >>>> if (i == 0) {
> >>>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> >>>> } else if (XE_IOCTL_DBG(xe, !*async) ||
> >>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device
> *dev,
> >> void *data, struct drm_file *file)
> >>>> struct drm_gem_object *gem_obj;
> >>>> u64 range = bind_ops[i].range;
> >>>> u64 addr = bind_ops[i].addr;
> >>>> + u32 op = bind_ops[i].op;
> >>>> u32 obj = bind_ops[i].obj;
> >>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>> + u16 coh_mode;
> >>>> +
> >>>> + pat_index = array_index_nospec(pat_index,
> >>>> + xe->info.pat_table_n_entries);
> >>>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> >>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>
> >>> Assuming we drop the unusable entries from the TGL table, this should be
> >>> impossible, right? Any index that makes it past the n_entries check at
> >>> the top of the function should have a valid, non-zero coh_mode value.
> >>> So this should probably be an assertion (to highlight a KMD bug) rather
> >>> than just a silent uapi failure return.
> >>
> >> Makes sense.
> >>
> >>>
> >>>> + err = -EINVAL;
> >>>> + goto put_obj;
> >>>> + }
> >>>> +
> >>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>> + coh_mode == XE_GEM_COHERENCY_NONE))
> {
> >>>> + err = -EINVAL;
> >>>> + goto put_obj;
> >>>> + }
> >>>>
> >>>> if (!obj)
> >>>> continue;
> >>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
> *dev,
> >> void *data, struct drm_file *file)
> >>>> goto put_obj;
> >>>> }
> >>>> }
> >>>> +
> >>>> + if (bos[i]->coh_mode) {
> >>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >> coh_mode)) {
> >>>> + err = -EINVAL;
> >>>> + goto put_obj;
> >>>> + }
> >>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
> >> XE_GEM_COHERENCY_NONE)) {
> >>>> + /*
> >>>> + * Imported dma-buf from a different device should
> >>>> + * require 1way or 2way coherency since we don't
> >> know
> >>>> + * how it was mapped on CPU. Just assume is it
> >>>> + * potentially cached on CPU side.
> >>>> + */
> >>>> + err = -EINVAL;
> >>>> + goto put_obj;
> >>>> + }
> >>>> }
> >>>>
> >>>> if (args->num_syncs) {
> >>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
> *dev,
> >> void *data, struct drm_file *file)
> >>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>> u8 tile_mask = bind_ops[i].tile_mask;
> >>>> u32 region = bind_ops[i].region;
> >>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>>
> >>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> >>>> addr, range, op, tile_mask,
> >>>> - region);
> >>>> + region, pat_index);
> >>>> if (IS_ERR(ops[i])) {
> >>>> err = PTR_ERR(ops[i]);
> >>>> ops[i] = NULL;
> >>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> index 83a1f87b6537..508679ed3c74 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>> */
> >>>> u8 tile_present;
> >>>>
> >>>> + /**
> >>>> + * @pat_index: The pat index to use when encoding the PTEs for this
> >> vma.
> >>>> + */
> >>>> + u32 pat_index;
> >>>> +
> >>>> struct {
> >>>> struct list_head rebind_link;
> >>>> } notifier;
> >>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>> bool batch_invalidate_tlb;
> >>>>
> >>>> struct {
> >>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>> - enum xe_cache_level cache);
> >>>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> >>>> - enum xe_cache_level cache);
> >>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
> >> pat_index);
> >>>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
> >> pat_index);
> >>>> } pat_encode;
> >>>> };
> >>>>
> >>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>> struct async_op_fence *fence;
> >>>> /** @tile_mask: gt mask for this operation */
> >>>> u8 tile_mask;
> >>>> + /** @pat_index: The pat index to use for this operation. */
> >>>> + u32 pat_index;
> >>>> /** @flags: operation flags */
> >>>> enum xe_vma_op_flags flags;
> >>>>
> >>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> >>>> index 64bc66d4b550..0c15b6f32447 100644
> >>>> --- a/include/uapi/drm/xe_drm.h
> >>>> +++ b/include/uapi/drm/xe_drm.h
> >>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>> */
> >>>> __u32 obj;
> >>>>
> >>>> - /** @pad: MBZ */
> >>>> - __u32 pad;
> >>>> + /**
> >>>> + * @pat_index: The platform defined @pat_index to use for this
> >> mapping.
> >>>> + * The index basically maps to some predefined memory attributes,
> >>>> + * including things like caching, coherency and likely other stuff in
> >>>> + * the future. The exact meaning of the pat_index is platform specific
> >>>
> >>> BTW, "other stuff in the future" already includes compression on Xe2, we
> >>> just haven't landed the patches for the Xe2 table yet.
> >>
> >> Ok, good to know.
> >>
> >>>
> >>>> + * and defined in the Bspec and PRMs. When the KMD sets up the
> >> binding
> >>>> + * the index here is encoded into the ppGTT PTE.
> >>>> + *
> >>>> + * For coherency the @pat_index needs to match the
> >>>> + * drm_xe_gem_create.coh_mode, so either
> >> XE_GEM_COHERENCY_NONE,
> >>>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
> >> KMD will extract
> >>>> + * the coherency mode from the @pat_index and reject if there is a
> >>>> + * mismatch (see note below for pre-MTL platforms).
> >>>> + *
> >>>> + * Note: On pre-MTL platforms there is only a caching mode and no
> >>>> + * explicit coherency mode, but on such hardware there is always a
> >>>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent
> with
> >>>> + * CPU caches even with the caching mode set as uncached. It's only
> >> the
> >>>> + * display engine that is incoherent (on dgpu it must be in VRAM which
> >>>> + * is always mapped as WC on the CPU). However to keep the uapi
> >> somewhat
> >>>> + * consistent with newer platforms the KMD groups the different
> cache
> >>>> + * levels into the following coherency buckets on all pre-MTL
> platforms:
> >>>> + *
> >>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>
> >>> As noted on the previous patch, it seems like 2-way is appropriate for
> >>> LLC platforms, but 1-way might be a more accurate description of dGPU
> >>> behavior.
> >>>
> >>>> + *
> >>>> + * In practice UC/WC/WT should only ever used for scanout surfaces
> on
> >>>> + * such platforms since it is only the display engine that is actually
> >>>> + * incoherent. Everything else should typically use WB given that we
> >>>
> >>> What if we're sharing our buffers with some other (non-GPU) device? Are
> >>> there cases where that other device wouldn't be coherent with the LLC,
> >>> so we'd want to use one of these?
> >>
> >> Yeah, I guess there might be cases like that. I'll reword.
> >>
> >>>
> >>>
> >>> Matt
> >>>
> >>>> + * have a shared-LLC. On MTL+ this completely changes (also
> >> potentially
> >>>> + * no shared-LLC) and the HW defines the coherency mode as part of
> >> the
> >>>> + * @pat_index.
> >>>> + *
> >>>> + * Note: For userptr and externally imported dma-buf the kernel
> >> expects
> >>>> + * either 1WAY or 2WAY for the @pat_index.
> >>>> + */
> >>>> + __u32 pat_index;
> >>>>
> >>>> union {
> >>>> /**
> >>>> --
> >>>> 2.41.0
> >>>>
> >>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-31 8:24 ` Zhang, Carl
@ 2023-08-31 10:44 ` Matthew Auld
2023-09-01 9:34 ` Zhang, Carl
0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-08-31 10:44 UTC (permalink / raw)
To: Zhang, Carl, Roper, Matthew D
Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
intel-xe@lists.freedesktop.org
On 31/08/2023 09:24, Zhang, Carl wrote:
>
>
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Thursday, August 31, 2023 12:02 AM
>>
>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>> Several questions:
>>> 1. the pat_index from vm_bind will override the setting from bo_create?
>>> How to keep the value from bo_create unchanged?
>>
>> You only set the pat_index during vm_bind. At gem_create you just need tell
>> the kernel what CPU side caching mode to use for system memory
>> (wb/wc/uc), plus the expected GPU coherency mode of the pat_index. The
>> current expectation is that the coherency mode of the pat_index should match
>> the one at gem_create.
>>
> It means that we set the incomplete pat_index at gem_create (coherency only)
> And set another part at vm_bind (gpu cache)
> Why we separate them? Why not set a full value at vm_bind?
You always give the full pat_index at vm_bind. It's platform specific
but the BSpec for the most part tells you what the coherency mode of the
pat_index is.
For example if you look at the MTL pat_index table it tells you the
coh_mode:
pat_index 0-2 = coh_mode "No snoop" = COH_NONE
pat_index 3 = coh_mode "1way" = COH_1WAY
pat_index 4 = coh_mode "2way" = COH_2WAY
On future platforms the table is larger and also encodes stuff like
compression, but KMD really only cares about the coh_mode, since that
will restrict the allowed CPU smem_caching values that userspace can
select.
From KMD pov this matters since we always give userspace zeroed memory,
which is useful for normal applications but is also needed for security
reasons. However the actual clearing might result in the writes only
being visible in the CPU cache (not flushed to main memory), and if the
GPU doesn't snoop the CPU cache it can read directly from main memory,
which effectively bypasses the clearing. To prevent that the KMD needs
to know how userspace will map the memory via the GPU i.e what is the
coh_mode for the pat_index. The gist is that if you select COH_NONE then
it needs to clflush the pages before giving them to userspace, which is
done by forcing you to use uc/wc for the smem_caching (the kernel will
internally issue cflush on x86 when marking the pages as uc/wc). And if
you select wb there is no flushing, but here you need to use 1way or 2way.
AFAIK separating them is just to prevent userspace from mixing different
coherency modes for the same object by having it immutable (you can
still use different pat_index but the coherency must be compatible).
Also at creation time we can determine if the smem_caching is compatible
with the coh_mode.
Note that it might be that we ditch the COH_2WAY and just have COH_NONE
and COH_AT_LEAST_1WAY, in which case this might be slightly different.
>
>>> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
>> will it be handled by KMD automatically?
>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
>>> could use WB in mmap offset If it is COHERENCY_NONE, we could only use
>> UC, all these logic is handled by KMD automatically?
>>
>> Yes, the mmap will use the same CPU side caching mode that you set at
>> gem_create with smem_caching (wb/wc/uc).
>>
> It means it could not be changed after gem_create. Sometimes, UMD doesn’t know
> Whether the data need to be accessed by CPU at gem_create . only application
> Knows the purpose of allocation & the surface usage. So, may need change UMD api
> to let applications set whether the resource is "accessible", or it will hurt the perf.
Right, the smem_caching is immutable.
>
>>> 3. about " For imported dma-buf (from a different device) the
>>> coherency mode is also implicit and must also be either 1WAY or 2WAY"
>>> it means it must be 1way or 2way, and UMD need not to set it?
>>
>> If you are going to vm_bind it you need to supply the pat_index and in this
>> case it needs to be either 1way or 2way. The object comes from a different
>> device so we might not have smem_caching/coh_mode like we do for native
>> objects.
>>
> How about the GPU cache setting of imported dma-buf?
> Is there possible conflict between dma-buf producer and consumer?
> For example:
> In producer side , one bo is set GPU L3 cache, also CPU cacheable, it should be 2Way
> In client side, we know the coherency should be 1-way or 2-way, if it suppose no GPU cache
> Used. It just set 1-way.
> It may cause some problems?
If the producer and consumer are the same device, then I think the
original gem_create.coh_mode is considered, since it's effectively still
a native object. In which case the normal rules apply.
If it comes from a different device then we don't know what
gem_create.coh_mode is (or if that even exists) since we don't have the
usual native object, but just some opaque dma-buf object. In this case
it must be at least 1way or 2way.
From KMD pov we only care about userspace not being able to bypass the
clearing. There might still be other coherency issues though.
>
> Also , when a bo is created, UMD does not know whether it will be exported.
> It maybe be set as COHERENCY_NONE, right?
Yeah, you can use whatever you want. Just that your smem_caching will
need to be wc/uc if using COH_NONE. On the importer side, if it comes
from the same device then it's just a normal native object and the
normal rules apply. Currently this would mean also using pat_index with
COH_NONE. If it is imported from a different device then you need to use
a pat_index with 1way or 2way, since KMD doesn't really know how the
object is mapped on the CPU or if the CPU caches are dirty (potentially
with clearing).
Is this going to be problematic?
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>> <pallavi.mishra@intel.com>;
>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas Lahtinen
>>>> <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>> Hazubski,
>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl <carl.zhang@intel.com>; Yu,
>> Effie
>>>> <effie.yu@intel.com>
>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with
>> vm_bind
>>>>
>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>> Allow userspace to directly control the pat_index for a given vm
>>>>>> binding. This should allow directly controlling the coherency,
>>>>>> caching and potentially other stuff in the future for the ppGTT binding.
>>>>>>
>>>>>> The exact meaning behind the pat_index is very platform specific (see
>>>>>> BSpec or PRMs) but effectively maps to some predefined memory
>>>>>> attributes. From the KMD pov we only care about the coherency that is
>>>>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>>>>>> The vm_bind coherency mode for the given pat_index needs to match
>> the
>>>>>> given coh_mode that was set at object creation. For platforms that
>>>>>> lack
>>>>>
>>>>> Is it actually important to match the coherency mode? I think one of
>>>>> the main goals was to know up front if userspace might be using a
>>>>> non-snooping PAT setting that would let it bypass the CPU cache (and
>>>>> potentially read old, stale data from a different process if the
>>>>> buffer's clear value is still sitting in cache and hasn't landed in
>>>>> memory yet).
>>>>>
>>>>> If that's the only concern, then I think it should still be fine to
>>>>> map with a non-matching PAT as long as it's more coherent than the one
>>>>> specified at creation, right? E.g., if the buffer was created with
>>>>> 1-way coherency, it would be fine to map it with 2-way because
>>>>> userspace still can't use that to observe the previous contents of the
>>>>> buffer. Or
>>>>
>>>> Yeah, I guess we could in theory do something that.
>>>>
>>>>> if the buffer was created with "non-coherent" then we've already done
>>>>> the necessary clflushing in kernel before handing to buffer over to
>>>>> userspace to ensure the clear value landed in memory, so any valid PAT
>>>>> index should be safe (from a security POV) after that, right? Any
>>>>> other problems that arise from mismatched coherency would just be
>>>>> contained to the app possibly shooting itself in the foot, which isn't
>>>>> really our concern.
>>>>
>>>> That is also my understanding, at least from the KMD security pov. If you
>>>> allocate as wb then you must use at least 1way, since there is no flushing
>> for
>>>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
>>>>
>>>>>
>>>>>
>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
>>>> 2WAY.
>>>>>>
>>>>>> For userptr mappings we lack a corresponding gem object, so the
>> expected
>>>>>> coherency mode is instead implicit and must fall into either 1WAY or
>>>>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>>>>>> dma-buf (from a different device) the coherency mode is also implicit
>>>>>> and must also be either 1WAY or 2WAY.
>>>>>>
>>>>>> As part of adding pat_index support with vm_bind we also need stop
>> using
>>>>>> xe_cache_level and instead use the pat_index in various places. We still
>>>>>> make use of xe_cache_level, but only as a convenience for kernel
>>>>>> internal objectsi (internally it maps to some reasonable pat_index). For
>>>>>
>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
>>>>> appropriate places, similar to what we do with MOCS (gt-
>>> mocs.uc_index,
>>>>> gt->mocs.wb_index)?
>>>>
>>>> OK, seems reasonable to me.
>>>>
>>>>>
>>>>>> now this is just a 1:1 conversion of the existing code, however for
>>>>>> platforms like MTL+ we might need to give more control through
>> bo_create
>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>
>>>>>> Bspec: 45101, 44235 #xe
>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>>>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>>>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>>>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>>>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>>>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>>>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>>>>>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++-----
>> -
>>>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>>>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>>>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>>>>>
>>>>>> u64 (*pte_encode)(dma_addr_t addr,
>>>>>> - unsigned int pat_index,
>>>>>> + u32 pat_index,
>>>>>> u32 flags); /* Create a valid PTE */
>>>>>> #define PTE_READ_ONLY BIT(0)
>>>>>> #define PTE_LM BIT(1)
>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>> xe_migrate
>>>> *m, struct kunit *test)
>>>>>> /* First part of the test, are we updating our pagetable bo with a new
>>>> entry? */
>>>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE -
>> 1),
>>>> u64,
>>>>>> 0xdeaddeadbeefbeef);
>>>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>> XE_CACHE_WB), 0);
>>>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>> expected |= XE_PTE_PS64;
>>>>>> if (xe_bo_is_vram(pt))
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
>>>> bo_offset)
>>>>>> pte |= XE_GGTT_PTE_DM;
>>>>>>
>>>>>> if ((ggtt->pat_encode).pte_encode)
>>>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>> XE_CACHE_WB_1_WAY);
>>>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>> + xe_pat_get_index(xe,
>>>> XE_CACHE_WB_1_WAY));
>>>>>>
>>>>>> return pte;
>>>>>> }
>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>>>> }
>>>>>>
>>>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64
>> pte_pat,
>>>>>> - enum xe_cache_level cache)
>>>>>> + u32 pat_index)
>>>>>> {
>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>> -
>>>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>
>>>>>> if (pat_index & BIT(0))
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>
>>>>>> struct {
>>>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>> - enum xe_cache_level cache);
>>>>>> + u32 pat_index);
>>>>>> } pat_encode;
>>>>>> };
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> @@ -24,6 +24,7 @@
>>>>>> #include "xe_lrc.h"
>>>>>> #include "xe_map.h"
>>>>>> #include "xe_mocs.h"
>>>>>> +#include "xe_pat.h"
>>>>>> #include "xe_pt.h"
>>>>>> #include "xe_res_cursor.h"
>>>>>> #include "xe_sched_job.h"
>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]-
>>>>> level;
>>>>>> u32 map_ofs, level, i;
>>>>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>> u64 entry;
>>>>>> int ret;
>>>>>>
>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>> xe_tile
>>>> *tile, struct xe_migrate *m,
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>> XE_CACHE_WB);
>>>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>
>>>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>
>>>>>> /* Map the entire BO in our level 0 pt */
>>>>>> for (i = 0, level = 0; i < num_entries; level++) {
>>>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB,
>>>> 0);
>>>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>>>>>
>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>>>>>
>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>> for (i = 0; i < batch->size;
>>>>>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>>>>> XE_PAGE_SIZE) {
>>>>>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>>>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>
>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>>>> entry);
>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>> flags = XE_PDE_64K;
>>>>>>
>>>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>> - XE_PAGE_SIZE, XE_CACHE_WB);
>>>>>> + XE_PAGE_SIZE, pat_index);
>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level,
>>>> u64,
>>>>>> entry | flags);
>>>>>> }
>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile
>> *tile,
>>>> struct xe_migrate *m,
>>>>>> /* Write PDE's that point to our BO. */
>>>>>> for (i = 0; i < num_entries - num_level; i++) {
>>>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>> - XE_CACHE_WB);
>>>>>> + pat_index);
>>>>>>
>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>>>>> (i + 1) * 8, u64, entry);
>>>>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct
>> xe_migrate
>>>> *m,
>>>>>>
>>>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>
>>>>>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>>>>>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe,
>>>> XE_CACHE_WB), 0);
>>>>>> bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>> bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>> }
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>>>>>> index 64713f400d94..019af2920078 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>> @@ -10,6 +10,7 @@
>>>>>> #include "xe_gt.h"
>>>>>> #include "xe_gt_tlb_invalidation.h"
>>>>>> #include "xe_migrate.h"
>>>>>> +#include "xe_pat.h"
>>>>>> #include "xe_pt_types.h"
>>>>>> #include "xe_pt_walk.h"
>>>>>> #include "xe_res_cursor.h"
>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir
>>>> *pt_dir, unsigned int index)
>>>>>> *
>>>>>> * Return: An encoded page directory entry. No errors.
>>>>>> */
>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>> - const enum xe_cache_level cache)
>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>>>>> {
>>>>>> u64 pde;
>>>>>> struct xe_vm *vm = bo->vm;
>>>>>> struct xe_device *xe = vm->xe;
>>>>>>
>>>>>> -
>>>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>
>>>>>> if ((vm->pat_encode).pde_encode)
>>>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>>>>>
>>>>>> return pde;
>>>>>> }
>>>>>>
>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>> struct xe_vma *vma, u32 pt_level)
>>>>>> {
>>>>>> struct xe_vm *vm = xe_vma_vm(vma);
>>>>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum
>>>> xe_cache_level cache,
>>>>>> pte |= XE_PTE_NULL;
>>>>>>
>>>>>> if ((vm->pat_encode).pte_encode)
>>>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>>>>>
>>>>>> if (pt_level == 1)
>>>>>> pte |= XE_PDE_PS_2M;
>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>> xe_cache_level cache,
>>>>>> *
>>>>>> * Return: An encoded page-table entry. No errors.
>>>>>> */
>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>>>> cache,
>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>> u32 pt_level)
>>>>>> {
>>>>>> u64 pte;
>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>> offset,
>>>> enum xe_cache_level cache,
>>>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>> pte |= XE_PPGTT_PTE_DM;
>>>>>>
>>>>>> - return __pte_encode(pte, cache, NULL, pt_level);
>>>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>> }
>>>>>>
>>>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile
>> *tile,
>>>> struct xe_vm *vm,
>>>>>>
>>>>>> if (level == 0) {
>>>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>> - XE_CACHE_WB, 0);
>>>>>> + xe_pat_get_index(vm->xe,
>>>> XE_CACHE_WB), 0);
>>>>>>
>>>>>> return empty;
>>>>>> } else {
>>>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>>>>>> - XE_CACHE_WB);
>>>>>> + xe_pat_get_index(vm->xe,
>>>> XE_CACHE_WB));
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>> struct xe_vm *vm;
>>>>>> /** @tile: The tile we're building for. */
>>>>>> struct xe_tile *tile;
>>>>>> - /** @cache: Desired cache level for the ptes */
>>>>>> - enum xe_cache_level cache;
>>>>>> /** @default_pte: PTE flag only template. No address is associated */
>>>>>> u64 default_pte;
>>>>>> /** @dma_offset: DMA offset to add to the PTE. */
>>>>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>> pgoff_t offset,
>>>>>>
>>>>>> pte = __pte_encode(is_null ? 0 :
>>>>>> xe_res_dma(curs) + xe_walk->dma_offset,
>>>>>> - xe_walk->cache, xe_walk->vma, level);
>>>>>> + xe_walk->vma->pat_index, xe_walk->vma,
>>>> level);
>>>>>> pte |= xe_walk->default_pte;
>>>>>>
>>>>>> /*
>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>> pgoff_t offset,
>>>>>> xe_child->is_compact = true;
>>>>>> }
>>>>>>
>>>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) |
>> flags;
>>>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>> pat_index) | flags;
>>>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>>>>> pte);
>>>>>> }
>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
>> xe_vma
>>>> *vma,
>>>>>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>>>>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>> ttm.resource);
>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>> - } else {
>>>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>> XE_BO_SCANOUT_BIT)
>>>>>> - xe_walk.cache = XE_CACHE_WT;
>>>>>> - else
>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>> }
>>>>>> +
>>>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>> xe_walk.dma_offset =
>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>>>>>> index 01be7ab08f87..1d433a5a96b4 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct
>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>
>>>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>>>>>
>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>> - const enum xe_cache_level level);
>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>>>>>
>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level
>>>> cache,
>>>>>> - u32 pt_level);
>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32
>>>> pt_level);
>>>>>>
>>>>>> #endif
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>>>> index 7eeeed0411f3..34603a7e84b0 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>> @@ -6,6 +6,7 @@
>>>>>> #include "xe_vm.h"
>>>>>>
>>>>>> #include <linux/dma-fence-array.h>
>>>>>> +#include <linux/nospec.h>
>>>>>>
>>>>>> #include <drm/drm_print.h>
>>>>>> #include <drm/ttm/ttm_execbuf_util.h>
>>>>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct
>> xe_vm
>>>> *vm,
>>>>>> u64 start, u64 end,
>>>>>> bool read_only,
>>>>>> bool is_null,
>>>>>> - u8 tile_mask)
>>>>>> + u8 tile_mask,
>>>>>> + u32 pat_index)
>>>>>> {
>>>>>> struct xe_vma *vma;
>>>>>> struct xe_tile *tile;
>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
>> xe_vm
>>>> *vm,
>>>>>> vma->tile_mask |= 0x1 << id;
>>>>>> }
>>>>>>
>>>>>> + vma->pat_index = pat_index;
>>>>>> +
>>>>>> if (vm->xe->info.platform == XE_PVC)
>>>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>
>>>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
>>>> work_struct *w);
>>>>>> static void vm_destroy_work_func(struct work_struct *w);
>>>>>>
>>>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64
>> pde_pat,
>>>>>> - enum xe_cache_level cache)
>>>>>> + u32 pat_index)
>>>>>> {
>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>> -
>>>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>
>>>>>> if (pat_index & BIT(0))
>>>>>> @@ -1213,10 +1215,8 @@ static u64
>> xelp_ppgtt_pde_encode_pat(struct
>>>> xe_device *xe, u64 pde_pat,
>>>>>> }
>>>>>>
>>>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64
>> pte_pat,
>>>>>> - enum xe_cache_level cache)
>>>>>> + u32 pat_index)
>>>>>> {
>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>> -
>>>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>
>>>>>> if (pat_index & BIT(0))
>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
>> *xef,
>>>> u32 id)
>>>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>>>> {
>>>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>> - XE_CACHE_WB);
>>>>>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>>>> }
>>>>>>
>>>>>> static struct dma_fence *
>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
>> struct
>>>> drm_gpuva_op *op)
>>>>>> static struct drm_gpuva_ops *
>>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>>> u64 bo_offset_or_userptr, u64 addr, u64 range,
>>>>>> - u32 operation, u8 tile_mask, u32 region)
>>>>>> + u32 operation, u8 tile_mask, u32 region, u32
>>>> pat_index)
>>>>>> {
>>>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>> struct ww_acquire_ctx ww;
>>>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>> struct xe_bo *bo,
>>>>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>>>
>>>>>> op->tile_mask = tile_mask;
>>>>>> + op->pat_index = pat_index;
>>>>>> op->map.immediate =
>>>>>> operation &
>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>> op->map.read_only =
>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>> struct xe_bo *bo,
>>>>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>>>
>>>>>> op->tile_mask = tile_mask;
>>>>>> + op->pat_index = pat_index;
>>>>>> op->prefetch.region = region;
>>>>>> }
>>>>>> break;
>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>> struct xe_bo *bo,
>>>>>> }
>>>>>>
>>>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>> drm_gpuva_op_map *op,
>>>>>> - u8 tile_mask, bool read_only, bool is_null)
>>>>>> + u8 tile_mask, bool read_only, bool is_null,
>>>>>> + u32 pat_index)
>>>>>> {
>>>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) :
>> NULL;
>>>>>> struct xe_vma *vma;
>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>> op->va.addr, op->va.addr +
>>>>>> op->va.range - 1, read_only, is_null,
>>>>>> - tile_mask);
>>>>>> + tile_mask, pat_index);
>>>>>> if (bo)
>>>>>> xe_bo_unlock(bo, &ww);
>>>>>>
>>>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
>> xe_vm
>>>> *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>> vma = new_vma(vm, &op->base.map,
>>>>>> op->tile_mask, op-
>>> map.read_only,
>>>>>> - op->map.is_null);
>>>>>> + op->map.is_null, op->pat_index);
>>>>>> if (IS_ERR(vma)) {
>>>>>> err = PTR_ERR(vma);
>>>>>> goto free_fence;
>>>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
>> xe_vm
>>>> *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>> vma = new_vma(vm, op-
>>>>> base.remap.prev,
>>>>>> op->tile_mask, read_only,
>>>>>> - is_null);
>>>>>> + is_null, op->pat_index);
>>>>>> if (IS_ERR(vma)) {
>>>>>> err = PTR_ERR(vma);
>>>>>> goto free_fence;
>>>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
>> xe_vm
>>>> *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>> vma = new_vma(vm, op-
>>>>> base.remap.next,
>>>>>> op->tile_mask, read_only,
>>>>>> - is_null);
>>>>>> + is_null, op->pat_index);
>>>>>> if (IS_ERR(vma)) {
>>>>>> err = PTR_ERR(vma);
>>>>>> goto free_fence;
>>>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
>>>> xe_device *xe,
>>>>>> u32 obj = (*bind_ops)[i].obj;
>>>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>> u32 region = (*bind_ops)[i].region;
>>>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>
>>>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>> info.pat_table_n_entries)) {
>>>>>> + err = -EINVAL;
>>>>>> + goto free_bind_ops;
>>>>>> + }
>>>>>> +
>>>>>> if (i == 0) {
>>>>>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>>>> } else if (XE_IOCTL_DBG(xe, !*async) ||
>>>>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device
>> *dev,
>>>> void *data, struct drm_file *file)
>>>>>> struct drm_gem_object *gem_obj;
>>>>>> u64 range = bind_ops[i].range;
>>>>>> u64 addr = bind_ops[i].addr;
>>>>>> + u32 op = bind_ops[i].op;
>>>>>> u32 obj = bind_ops[i].obj;
>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>> + u16 coh_mode;
>>>>>> +
>>>>>> + pat_index = array_index_nospec(pat_index,
>>>>>> + xe->info.pat_table_n_entries);
>>>>>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>
>>>>> Assuming we drop the unusable entries from the TGL table, this should be
>>>>> impossible, right? Any index that makes it past the n_entries check at
>>>>> the top of the function should have a valid, non-zero coh_mode value.
>>>>> So this should probably be an assertion (to highlight a KMD bug) rather
>>>>> than just a silent uapi failure return.
>>>>
>>>> Makes sense.
>>>>
>>>>>
>>>>>> + err = -EINVAL;
>>>>>> + goto put_obj;
>>>>>> + }
>>>>>> +
>>>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>> + coh_mode == XE_GEM_COHERENCY_NONE))
>> {
>>>>>> + err = -EINVAL;
>>>>>> + goto put_obj;
>>>>>> + }
>>>>>>
>>>>>> if (!obj)
>>>>>> continue;
>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
>> *dev,
>>>> void *data, struct drm_file *file)
>>>>>> goto put_obj;
>>>>>> }
>>>>>> }
>>>>>> +
>>>>>> + if (bos[i]->coh_mode) {
>>>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>> coh_mode)) {
>>>>>> + err = -EINVAL;
>>>>>> + goto put_obj;
>>>>>> + }
>>>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>> + /*
>>>>>> + * Imported dma-buf from a different device should
>>>>>> + * require 1way or 2way coherency since we don't
>>>> know
>>>>>> + * how it was mapped on CPU. Just assume is it
>>>>>> + * potentially cached on CPU side.
>>>>>> + */
>>>>>> + err = -EINVAL;
>>>>>> + goto put_obj;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> if (args->num_syncs) {
>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
>> *dev,
>>>> void *data, struct drm_file *file)
>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>> u8 tile_mask = bind_ops[i].tile_mask;
>>>>>> u32 region = bind_ops[i].region;
>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>>
>>>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>>>>> addr, range, op, tile_mask,
>>>>>> - region);
>>>>>> + region, pat_index);
>>>>>> if (IS_ERR(ops[i])) {
>>>>>> err = PTR_ERR(ops[i]);
>>>>>> ops[i] = NULL;
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>> */
>>>>>> u8 tile_present;
>>>>>>
>>>>>> + /**
>>>>>> + * @pat_index: The pat index to use when encoding the PTEs for this
>>>> vma.
>>>>>> + */
>>>>>> + u32 pat_index;
>>>>>> +
>>>>>> struct {
>>>>>> struct list_head rebind_link;
>>>>>> } notifier;
>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>> bool batch_invalidate_tlb;
>>>>>>
>>>>>> struct {
>>>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>> - enum xe_cache_level cache);
>>>>>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>>>> - enum xe_cache_level cache);
>>>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32
>>>> pat_index);
>>>>>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32
>>>> pat_index);
>>>>>> } pat_encode;
>>>>>> };
>>>>>>
>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>> struct async_op_fence *fence;
>>>>>> /** @tile_mask: gt mask for this operation */
>>>>>> u8 tile_mask;
>>>>>> + /** @pat_index: The pat index to use for this operation. */
>>>>>> + u32 pat_index;
>>>>>> /** @flags: operation flags */
>>>>>> enum xe_vma_op_flags flags;
>>>>>>
>>>>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>>>> index 64bc66d4b550..0c15b6f32447 100644
>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>> */
>>>>>> __u32 obj;
>>>>>>
>>>>>> - /** @pad: MBZ */
>>>>>> - __u32 pad;
>>>>>> + /**
>>>>>> + * @pat_index: The platform defined @pat_index to use for this
>>>> mapping.
>>>>>> + * The index basically maps to some predefined memory attributes,
>>>>>> + * including things like caching, coherency and likely other stuff in
>>>>>> + * the future. The exact meaning of the pat_index is platform specific
>>>>>
>>>>> BTW, "other stuff in the future" already includes compression on Xe2, we
>>>>> just haven't landed the patches for the Xe2 table yet.
>>>>
>>>> Ok, good to know.
>>>>
>>>>>
>>>>>> + * and defined in the Bspec and PRMs. When the KMD sets up the
>>>> binding
>>>>>> + * the index here is encoded into the ppGTT PTE.
>>>>>> + *
>>>>>> + * For coherency the @pat_index needs to match the
>>>>>> + * drm_xe_gem_create.coh_mode, so either
>>>> XE_GEM_COHERENCY_NONE,
>>>>>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The
>>>> KMD will extract
>>>>>> + * the coherency mode from the @pat_index and reject if there is a
>>>>>> + * mismatch (see note below for pre-MTL platforms).
>>>>>> + *
>>>>>> + * Note: On pre-MTL platforms there is only a caching mode and no
>>>>>> + * explicit coherency mode, but on such hardware there is always a
>>>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent
>> with
>>>>>> + * CPU caches even with the caching mode set as uncached. It's only
>>>> the
>>>>>> + * display engine that is incoherent (on dgpu it must be in VRAM which
>>>>>> + * is always mapped as WC on the CPU). However to keep the uapi
>>>> somewhat
>>>>>> + * consistent with newer platforms the KMD groups the different
>> cache
>>>>>> + * levels into the following coherency buckets on all pre-MTL
>> platforms:
>>>>>> + *
>>>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>
>>>>> As noted on the previous patch, it seems like 2-way is appropriate for
>>>>> LLC platforms, but 1-way might be a more accurate description of dGPU
>>>>> behavior.
>>>>>
>>>>>> + *
>>>>>> + * In practice UC/WC/WT should only ever used for scanout surfaces
>> on
>>>>>> + * such platforms since it is only the display engine that is actually
>>>>>> + * incoherent. Everything else should typically use WB given that we
>>>>>
>>>>> What if we're sharing our buffers with some other (non-GPU) device? Are
>>>>> there cases where that other device wouldn't be coherent with the LLC,
>>>>> so we'd want to use one of these?
>>>>
>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>> + * have a shared-LLC. On MTL+ this completely changes (also
>>>> potentially
>>>>>> + * no shared-LLC) and the HW defines the coherency mode as part of
>>>> the
>>>>>> + * @pat_index.
>>>>>> + *
>>>>>> + * Note: For userptr and externally imported dma-buf the kernel
>>>> expects
>>>>>> + * either 1WAY or 2WAY for the @pat_index.
>>>>>> + */
>>>>>> + __u32 pat_index;
>>>>>>
>>>>>> union {
>>>>>> /**
>>>>>> --
>>>>>> 2.41.0
>>>>>>
>>>>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-31 10:44 ` Matthew Auld
@ 2023-09-01 9:34 ` Zhang, Carl
2023-09-04 9:23 ` Matthew Auld
0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-09-01 9:34 UTC (permalink / raw)
To: Auld, Matthew, Roper, Matthew D
Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
intel-xe@lists.freedesktop.org
> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Thursday, August 31, 2023 6:44 PM
>
> On 31/08/2023 09:24, Zhang, Carl wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Thursday, August 31, 2023 12:02 AM
> >>
> >> On 30/08/2023 16:27, Zhang, Carl wrote:
> >>> Several questions:
> >>> 1. the pat_index from vm_bind will override the setting from bo_create?
> >>> How to keep the value from bo_create unchanged?
> >>
> >> You only set the pat_index during vm_bind. At gem_create you just
> >> need tell the kernel what CPU side caching mode to use for system
> >> memory (wb/wc/uc), plus the expected GPU coherency mode of the
> >> pat_index. The current expectation is that the coherency mode of the
> >> pat_index should match the one at gem_create.
> >>
> > It means that we set the incomplete pat_index at gem_create (coherency
> > only) And set another part at vm_bind (gpu cache) Why we separate
> > them? Why not set a full value at vm_bind?
>
> You always give the full pat_index at vm_bind. It's platform specific but the
> BSpec for the most part tells you what the coherency mode of the pat_index is.
>
> For example if you look at the MTL pat_index table it tells you the
> coh_mode:
>
> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
> pat_index 3 = coh_mode "1way" = COH_1WAY
> pat_index 4 = coh_mode "2way" = COH_2WAY
>
> On future platforms the table is larger and also encodes stuff like compression,
> but KMD really only cares about the coh_mode, since that will restrict the
> allowed CPU smem_caching values that userspace can select.
>
> From KMD pov this matters since we always give userspace zeroed memory,
> which is useful for normal applications but is also needed for security reasons.
> However the actual clearing might result in the writes only being visible in the
> CPU cache (not flushed to main memory), and if the GPU doesn't snoop the
> CPU cache it can read directly from main memory, which effectively bypasses
> the clearing. To prevent that the KMD needs to know how userspace will map
> the memory via the GPU i.e what is the coh_mode for the pat_index. The gist is
> that if you select COH_NONE then it needs to clflush the pages before giving
> them to userspace, which is done by forcing you to use uc/wc for the
> smem_caching (the kernel will internally issue cflush on x86 when marking the
> pages as uc/wc). And if you select wb there is no flushing, but here you need to
> use 1way or 2way.
>
It sounds like a perf optimization, if no considering the perf, you could always call
Clflush to flush the cpu cache to main memory before giving it to user space
> AFAIK separating them is just to prevent userspace from mixing different
> coherency modes for the same object by having it immutable (you can still use
> different pat_index but the coherency must be compatible).
> Also at creation time we can determine if the smem_caching is compatible with
> the coh_mode.
>
> Note that it might be that we ditch the COH_2WAY and just have COH_NONE
> and COH_AT_LEAST_1WAY, in which case this might be slightly different.
>
> >
> >>> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
> >> will it be handled by KMD automatically?
> >>> For example: if set 1-way, it means GPU could snoop CPU cache, we
> >>> could use WB in mmap offset If it is COHERENCY_NONE, we could only
> >>> use
> >> UC, all these logic is handled by KMD automatically?
> >>
> >> Yes, the mmap will use the same CPU side caching mode that you set at
> >> gem_create with smem_caching (wb/wc/uc).
> >>
> > It means it could not be changed after gem_create. Sometimes, UMD
> > doesn’t know Whether the data need to be accessed by CPU at gem_create
> > . only application Knows the purpose of allocation & the surface
> > usage. So, may need change UMD api to let applications set whether the
> resource is "accessible", or it will hurt the perf.
>
> Right, the smem_caching is immutable.
>
> >
> >>> 3. about " For imported dma-buf (from a different device) the
> >>> coherency mode is also implicit and must also be either 1WAY or 2WAY"
> >>> it means it must be 1way or 2way, and UMD need not to set it?
> >>
> >> If you are going to vm_bind it you need to supply the pat_index and
> >> in this case it needs to be either 1way or 2way. The object comes
> >> from a different device so we might not have smem_caching/coh_mode
> >> like we do for native objects.
> >>
> > How about the GPU cache setting of imported dma-buf?
> > Is there possible conflict between dma-buf producer and consumer?
> > For example:
> > In producer side , one bo is set GPU L3 cache, also CPU cacheable, it
> > should be 2Way In client side, we know the coherency should be 1-way
> > or 2-way, if it suppose no GPU cache Used. It just set 1-way.
> > It may cause some problems?
>
> If the producer and consumer are the same device, then I think the original
> gem_create.coh_mode is considered, since it's effectively still a native object. In
> which case the normal rules apply.
But the consumer does not know the coh_mode, so , it is possible that it set will set
Another value, then it will failed.
And because consumer does not know it is 1-way or 2-way, consumer could not
Set the gpu cache correctly. Because if it is 1-way, cpu could not snoop gpu cache
>
> If it comes from a different device then we don't know what
> gem_create.coh_mode is (or if that even exists) since we don't have the usual
> native object, but just some opaque dma-buf object. In this case it must be at
> least 1way or 2way.
>
> From KMD pov we only care about userspace not being able to bypass the
> clearing. There might still be other coherency issues though.
>
> >
> > Also , when a bo is created, UMD does not know whether it will be exported.
> > It maybe be set as COHERENCY_NONE, right?
>
> Yeah, you can use whatever you want. Just that your smem_caching will need
> to be wc/uc if using COH_NONE. On the importer side, if it comes from the
> same device then it's just a normal native object and the normal rules apply.
> Currently this would mean also using pat_index with COH_NONE. If it is
> imported from a different device then you need to use a pat_index with 1way
> or 2way, since KMD doesn't really know how the object is mapped on the CPU
> or if the CPU caches are dirty (potentially with clearing).
>
> Is this going to be problematic?
>
If imported surfaces is from same device, how consumer know it is original coh_non
Or 1-way or 2-way?
And it also means producer and consumer could have different pat_index for same bo.
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>> Sent: Wednesday, August 30, 2023 7:28 PM
> >>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> >> <pallavi.mishra@intel.com>;
> >>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
> >>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> >> Hazubski,
> >>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
> >>>> <carl.zhang@intel.com>; Yu,
> >> Effie
> >>>> <effie.yu@intel.com>
> >>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
> >>>> with
> >> vm_bind
> >>>>
> >>>> On 29/08/2023 22:36, Matt Roper wrote:
> >>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>>>> Allow userspace to directly control the pat_index for a given vm
> >>>>>> binding. This should allow directly controlling the coherency,
> >>>>>> caching and potentially other stuff in the future for the ppGTT binding.
> >>>>>>
> >>>>>> The exact meaning behind the pat_index is very platform specific
> >>>>>> (see BSpec or PRMs) but effectively maps to some predefined
> >>>>>> memory attributes. From the KMD pov we only care about the
> >>>>>> coherency that is provided by the pat_index, which falls into either
> NONE, 1WAY or 2WAY.
> >>>>>> The vm_bind coherency mode for the given pat_index needs to match
> >> the
> >>>>>> given coh_mode that was set at object creation. For platforms
> >>>>>> that lack
> >>>>>
> >>>>> Is it actually important to match the coherency mode? I think one
> >>>>> of the main goals was to know up front if userspace might be using
> >>>>> a non-snooping PAT setting that would let it bypass the CPU cache
> >>>>> (and potentially read old, stale data from a different process if
> >>>>> the buffer's clear value is still sitting in cache and hasn't
> >>>>> landed in memory yet).
> >>>>>
> >>>>> If that's the only concern, then I think it should still be fine
> >>>>> to map with a non-matching PAT as long as it's more coherent than
> >>>>> the one specified at creation, right? E.g., if the buffer was
> >>>>> created with 1-way coherency, it would be fine to map it with
> >>>>> 2-way because userspace still can't use that to observe the
> >>>>> previous contents of the buffer. Or
> >>>>
> >>>> Yeah, I guess we could in theory do something that.
> >>>>
> >>>>> if the buffer was created with "non-coherent" then we've already
> >>>>> done the necessary clflushing in kernel before handing to buffer
> >>>>> over to userspace to ensure the clear value landed in memory, so
> >>>>> any valid PAT index should be safe (from a security POV) after
> >>>>> that, right? Any other problems that arise from mismatched
> >>>>> coherency would just be contained to the app possibly shooting
> >>>>> itself in the foot, which isn't really our concern.
> >>>>
> >>>> That is also my understanding, at least from the KMD security pov.
> >>>> If you allocate as wb then you must use at least 1way, since there
> >>>> is no flushing
> >> for
> >>>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
> >>>>
> >>>>>
> >>>>>
> >>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
> >>>> 2WAY.
> >>>>>>
> >>>>>> For userptr mappings we lack a corresponding gem object, so the
> >> expected
> >>>>>> coherency mode is instead implicit and must fall into either 1WAY
> >>>>>> or 2WAY. Trying to use NONE will be rejected by the kernel. For
> >>>>>> imported dma-buf (from a different device) the coherency mode is
> >>>>>> also implicit and must also be either 1WAY or 2WAY.
> >>>>>>
> >>>>>> As part of adding pat_index support with vm_bind we also need
> >>>>>> stop
> >> using
> >>>>>> xe_cache_level and instead use the pat_index in various places.
> >>>>>> We still make use of xe_cache_level, but only as a convenience
> >>>>>> for kernel internal objectsi (internally it maps to some
> >>>>>> reasonable pat_index). For
> >>>>>
> >>>>> Maybe we should kill xe_cache_level completely and just assign
> >>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
> >>>>> in appropriate places, similar to what we do with MOCS (gt-
> >>> mocs.uc_index,
> >>>>> gt->mocs.wb_index)?
> >>>>
> >>>> OK, seems reasonable to me.
> >>>>
> >>>>>
> >>>>>> now this is just a 1:1 conversion of the existing code, however
> >>>>>> for platforms like MTL+ we might need to give more control
> >>>>>> through
> >> bo_create
> >>>>>> or stop using WB on the CPU side if we need CPU access.
> >>>>>>
> >>>>>> Bspec: 45101, 44235 #xe
> >>>>>> Bspec: 70552, 71582, 59400 #xe2
> >>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>>>> ---
> >>>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> >>>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> >>>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> >>>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> >>>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> >>>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> >>>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> >>>>>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++----
> -
> >> -
> >>>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> >>>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> >>>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>>>> (*alloc_scratch_dma)(struct i915_address_space *vm,
> int
> >>>>>> sz);
> >>>>>>
> >>>>>> u64 (*pte_encode)(dma_addr_t addr,
> >>>>>> - unsigned int pat_index,
> >>>>>> + u32 pat_index,
> >>>>>> u32 flags); /* Create a valid PTE */
> >>>>>> #define PTE_READ_ONLY BIT(0)
> >>>>>> #define PTE_LM BIT(1)
> >>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> >> xe_migrate
> >>>> *m, struct kunit *test)
> >>>>>> /* First part of the test, are we updating our pagetable bo
> >>>>>> with a new
> >>>> entry? */
> >>>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
> (NUM_KERNEL_PDE -
> >> 1),
> >>>> u64,
> >>>>>> 0xdeaddeadbeefbeef);
> >>>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >>>> XE_CACHE_WB), 0);
> >>>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>>>> expected |= XE_PTE_PS64;
> >>>>>> if (xe_bo_is_vram(pt))
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
> >> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> index 209fa053d9fb..4134c26150a5 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
> >>>> bo_offset)
> >>>>>> pte |= XE_GGTT_PTE_DM;
> >>>>>>
> >>>>>> if ((ggtt->pat_encode).pte_encode)
> >>>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>> XE_CACHE_WB_1_WAY);
> >>>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>> +
> xe_pat_get_index(xe,
> >>>> XE_CACHE_WB_1_WAY));
> >>>>>>
> >>>>>> return pte;
> >>>>>> }
> >>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> >>>>>> }
> >>>>>>
> >>>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
> >>>>>> u64
> >> pte_pat,
> >>>>>> - enum xe_cache_level
> cache)
> >>>>>> + u32 pat_index)
> >>>>>> {
> >>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>> -
> >>>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>>>
> >>>>>> if (pat_index & BIT(0))
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>>>
> >>>>>> struct {
> >>>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>> - enum xe_cache_level
> cache);
> >>>>>> + u32 pat_index);
> >>>>>> } pat_encode;
> >>>>>> };
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >>>> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> index a782ea282cb6..54585e98452a 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> @@ -24,6 +24,7 @@
> >>>>>> #include "xe_lrc.h"
> >>>>>> #include "xe_map.h"
> >>>>>> #include "xe_mocs.h"
> >>>>>> +#include "xe_pat.h"
> >>>>>> #include "xe_pt.h"
> >>>>>> #include "xe_res_cursor.h"
> >>>>>> #include "xe_sched_job.h"
> >>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm-
> >pt_root[id]-
> >>>>> level;
> >>>>>> u32 map_ofs, level, i;
> >>>>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >>>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>>>> u64 entry;
> >>>>>> int ret;
> >>>>>>
> >>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> >> xe_tile
> >>>> *tile, struct xe_migrate *m,
> >>>>>> return ret;
> >>>>>> }
> >>>>>>
> >>>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >> XE_CACHE_WB);
> >>>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> pat_index);
> >>>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>>>
> >>>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>>>
> >>>>>> /* Map the entire BO in our level 0 pt */
> >>>>>> for (i = 0, level = 0; i < num_entries; level++) {
> >>>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> XE_CACHE_WB,
> >>>> 0);
> >>>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> pat_index, 0);
> >>>>>>
> >>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> entry);
> >>>>>>
> >>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>> for (i = 0; i < batch->size;
> >>>>>> i += vm->flags & XE_VM_FLAG_64K ?
> XE_64K_PAGE_SIZE :
> >>>>>> XE_PAGE_SIZE) {
> >>>>>> - entry = xe_pte_encode(batch, i,
> XE_CACHE_WB, 0);
> >>>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>>>
> >>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level *
> 8, u64,
> >>>>>> entry);
> >>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>> flags = XE_PDE_64K;
> >>>>>>
> >>>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>>>> - XE_PAGE_SIZE,
> XE_CACHE_WB);
> >>>>>> + XE_PAGE_SIZE, pat_index);
> >>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
> * level,
> >>>> u64,
> >>>>>> entry | flags);
> >>>>>> }
> >>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >> *tile,
> >>>> struct xe_migrate *m,
> >>>>>> /* Write PDE's that point to our BO. */
> >>>>>> for (i = 0; i < num_entries - num_level; i++) {
> >>>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>>>> - XE_CACHE_WB);
> >>>>>> + pat_index);
> >>>>>>
> >>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
> +
> >>>>>> (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
> @@
> >>>>>> xe_migrate_update_pgtables(struct
> >> xe_migrate
> >>>> *m,
> >>>>>>
> >>>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>>>
> >>>>>> - addr = xe_pte_encode(pt_bo, 0,
> XE_CACHE_WB, 0);
> >>>>>> + addr = xe_pte_encode(pt_bo, 0,
> xe_pat_get_index(xe,
> >>>> XE_CACHE_WB), 0);
> >>>>>> bb->cs[bb->len++] = lower_32_bits(addr);
> >>>>>> bb->cs[bb->len++] = upper_32_bits(addr);
> >>>>>> }
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
> >>>>>> 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>>>> @@ -10,6 +10,7 @@
> >>>>>> #include "xe_gt.h"
> >>>>>> #include "xe_gt_tlb_invalidation.h"
> >>>>>> #include "xe_migrate.h"
> >>>>>> +#include "xe_pat.h"
> >>>>>> #include "xe_pt_types.h"
> >>>>>> #include "xe_pt_walk.h"
> >>>>>> #include "xe_res_cursor.h"
> >>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
> >>>>>> xe_pt_dir
> >>>> *pt_dir, unsigned int index)
> >>>>>> *
> >>>>>> * Return: An encoded page directory entry. No errors.
> >>>>>> */
> >>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>> - const enum xe_cache_level cache)
> >>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>> +pat_index)
> >>>>>> {
> >>>>>> u64 pde;
> >>>>>> struct xe_vm *vm = bo->vm;
> >>>>>> struct xe_device *xe = vm->xe;
> >>>>>>
> >>>>>> -
> >>>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>>>
> >>>>>> if ((vm->pat_encode).pde_encode)
> >>>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>>>> + pde = (vm->pat_encode).pde_encode(xe, pde,
> pat_index);
> >>>>>>
> >>>>>> return pde;
> >>>>>> }
> >>>>>>
> >>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>>>> struct xe_vma *vma, u32 pt_level)
> >>>>>> {
> >>>>>> struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
> static
> >>>>>> u64 __pte_encode(u64 pte, enum
> >>>> xe_cache_level cache,
> >>>>>> pte |= XE_PTE_NULL;
> >>>>>>
> >>>>>> if ((vm->pat_encode).pte_encode)
> >>>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>>>> + pte = (vm->pat_encode).pte_encode(xe, pte,
> pat_index);
> >>>>>>
> >>>>>> if (pt_level == 1)
> >>>>>> pte |= XE_PDE_PS_2M;
> >>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >>>> xe_cache_level cache,
> >>>>>> *
> >>>>>> * Return: An encoded page-table entry. No errors.
> >>>>>> */
> >>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>> xe_cache_level
> >>>> cache,
> >>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>> u32 pt_level)
> >>>>>> {
> >>>>>> u64 pte;
> >>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> >> offset,
> >>>> enum xe_cache_level cache,
> >>>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>>>> pte |= XE_PPGTT_PTE_DM;
> >>>>>>
> >>>>>> - return __pte_encode(pte, cache, NULL, pt_level);
> >>>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>>>> }
> >>>>>>
> >>>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
> >>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
> >>>>>> __xe_pt_empty_pte(struct xe_tile
> >> *tile,
> >>>> struct xe_vm *vm,
> >>>>>>
> >>>>>> if (level == 0) {
> >>>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>>>> - XE_CACHE_WB, 0);
> >>>>>> + xe_pat_get_index(vm->xe,
> >>>> XE_CACHE_WB), 0);
> >>>>>>
> >>>>>> return empty;
> >>>>>> } else {
> >>>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]-
> >bo, 0,
> >>>>>> - XE_CACHE_WB);
> >>>>>> + xe_pat_get_index(vm->xe,
> >>>> XE_CACHE_WB));
> >>>>>> }
> >>>>>> }
> >>>>>>
> >>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>>>> struct xe_vm *vm;
> >>>>>> /** @tile: The tile we're building for. */
> >>>>>> struct xe_tile *tile;
> >>>>>> - /** @cache: Desired cache level for the ptes */
> >>>>>> - enum xe_cache_level cache;
> >>>>>> /** @default_pte: PTE flag only template. No address is
> associated */
> >>>>>> u64 default_pte;
> >>>>>> /** @dma_offset: DMA offset to add to the PTE. */ @@ -604,7
> >>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>> pgoff_t offset,
> >>>>>>
> >>>>>> pte = __pte_encode(is_null ? 0 :
> >>>>>> xe_res_dma(curs) + xe_walk-
> >dma_offset,
> >>>>>> - xe_walk->cache, xe_walk->vma,
> level);
> >>>>>> + xe_walk->vma->pat_index, xe_walk-
> >vma,
> >>>> level);
> >>>>>> pte |= xe_walk->default_pte;
> >>>>>>
> >>>>>> /*
> >>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>> pgoff_t offset,
> >>>>>> xe_child->is_compact = true;
> >>>>>> }
> >>>>>>
> >>>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache)
> |
> >> flags;
> >>>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>>>> pat_index) | flags;
> >>>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
> xe_child,
> >>>>>> pte);
> >>>>>> }
> >>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
> >>>>>> struct
> >> xe_vma
> >>>> *vma,
> >>>>>> if (vma && vma->gpuva.flags &
> XE_VMA_ATOMIC_PTE_BIT)
> >>>>>> xe_walk.default_pte |=
> XE_USM_PPGTT_PTE_AE;
> >>>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>>>> ttm.resource);
> >>>>>> - xe_walk.cache = XE_CACHE_WB;
> >>>>>> - } else {
> >>>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
> >>>> XE_BO_SCANOUT_BIT)
> >>>>>> - xe_walk.cache = XE_CACHE_WT;
> >>>>>> - else
> >>>>>> - xe_walk.cache = XE_CACHE_WB;
> >>>>>> }
> >>>>>> +
> >>>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>>>> xe_walk.dma_offset =
> >>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
> >>>>>> 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
> >>>>>> struct
> >>>> xe_vma *vma, struct xe_exec_queu
> >>>>>>
> >>>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
> >>>>>> *vma);
> >>>>>>
> >>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>> - const enum xe_cache_level level);
> >>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>> +pat_index);
> >>>>>>
> >>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>> xe_cache_level
> >>>> cache,
> >>>>>> - u32 pt_level);
> >>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>> +u32
> >>>> pt_level);
> >>>>>>
> >>>>>> #endif
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>> b/drivers/gpu/drm/xe/xe_vm.c index 7eeeed0411f3..34603a7e84b0
> >>>>>> 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>>>> @@ -6,6 +6,7 @@
> >>>>>> #include "xe_vm.h"
> >>>>>>
> >>>>>> #include <linux/dma-fence-array.h>
> >>>>>> +#include <linux/nospec.h>
> >>>>>>
> >>>>>> #include <drm/drm_print.h>
> >>>>>> #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
> >>>>>> static struct xe_vma *xe_vma_create(struct
> >> xe_vm
> >>>> *vm,
> >>>>>> u64 start, u64 end,
> >>>>>> bool read_only,
> >>>>>> bool is_null,
> >>>>>> - u8 tile_mask)
> >>>>>> + u8 tile_mask,
> >>>>>> + u32 pat_index)
> >>>>>> {
> >>>>>> struct xe_vma *vma;
> >>>>>> struct xe_tile *tile;
> >>>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
> >> xe_vm
> >>>> *vm,
> >>>>>> vma->tile_mask |= 0x1 << id;
> >>>>>> }
> >>>>>>
> >>>>>> + vma->pat_index = pat_index;
> >>>>>> +
> >>>>>> if (vm->xe->info.platform == XE_PVC)
> >>>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>>>
> >>>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
> >>>> work_struct *w);
> >>>>>> static void vm_destroy_work_func(struct work_struct *w);
> >>>>>>
> >>>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
> >>>>>> u64
> >> pde_pat,
> >>>>>> - enum xe_cache_level
> cache)
> >>>>>> + u32 pat_index)
> >>>>>> {
> >>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>> -
> >>>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>>>
> >>>>>> if (pat_index & BIT(0))
> >>>>>> @@ -1213,10 +1215,8 @@ static u64
> >> xelp_ppgtt_pde_encode_pat(struct
> >>>> xe_device *xe, u64 pde_pat,
> >>>>>> }
> >>>>>>
> >>>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
> >>>>>> u64
> >> pte_pat,
> >>>>>> - enum xe_cache_level
> cache)
> >>>>>> + u32 pat_index)
> >>>>>> {
> >>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>> -
> >>>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>>>
> >>>>>> if (pat_index & BIT(0))
> >>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
> >> *xef,
> >>>> u32 id)
> >>>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> >>>>>> {
> >>>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>>>> - XE_CACHE_WB);
> >>>>>> + xe_pat_get_index(vm->xe,
> XE_CACHE_WB));
> >>>>>> }
> >>>>>>
> >>>>>> static struct dma_fence *
> >>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
> >> struct
> >>>> drm_gpuva_op *op)
> >>>>>> static struct drm_gpuva_ops *
> >>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>>>>> u64 bo_offset_or_userptr, u64 addr, u64
> range,
> >>>>>> - u32 operation, u8 tile_mask, u32 region)
> >>>>>> + u32 operation, u8 tile_mask, u32 region, u32
> >>>> pat_index)
> >>>>>> {
> >>>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>>>> struct ww_acquire_ctx ww;
> >>>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> *vm,
> >>>> struct xe_bo *bo,
> >>>>>> struct xe_vma_op *op =
> gpuva_op_to_vma_op(__op);
> >>>>>>
> >>>>>> op->tile_mask = tile_mask;
> >>>>>> + op->pat_index = pat_index;
> >>>>>> op->map.immediate =
> >>>>>> operation &
> >> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>>>> op->map.read_only =
> >>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> *vm,
> >>>> struct xe_bo *bo,
> >>>>>> struct xe_vma_op *op =
> gpuva_op_to_vma_op(__op);
> >>>>>>
> >>>>>> op->tile_mask = tile_mask;
> >>>>>> + op->pat_index = pat_index;
> >>>>>> op->prefetch.region = region;
> >>>>>> }
> >>>>>> break;
> >>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
> *vm,
> >>>> struct xe_bo *bo,
> >>>>>> }
> >>>>>>
> >>>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >>>> drm_gpuva_op_map *op,
> >>>>>> - u8 tile_mask, bool read_only, bool is_null)
> >>>>>> + u8 tile_mask, bool read_only, bool is_null,
> >>>>>> + u32 pat_index)
> >>>>>> {
> >>>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
> >gem.obj) :
> >> NULL;
> >>>>>> struct xe_vma *vma;
> >>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
> >>>> *vm, struct drm_gpuva_op_map *op,
> >>>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>>>> op->va.addr, op->va.addr +
> >>>>>> op->va.range - 1, read_only, is_null,
> >>>>>> - tile_mask);
> >>>>>> + tile_mask, pat_index);
> >>>>>> if (bo)
> >>>>>> xe_bo_unlock(bo, &ww);
> >>>>>>
> >>>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
> >> xe_vm
> >>>> *vm, struct xe_exec_queue *q,
> >>>>>>
> >>>>>> vma = new_vma(vm, &op->base.map,
> >>>>>> op->tile_mask, op-
> >>> map.read_only,
> >>>>>> - op->map.is_null);
> >>>>>> + op->map.is_null, op-
> >pat_index);
> >>>>>> if (IS_ERR(vma)) {
> >>>>>> err = PTR_ERR(vma);
> >>>>>> goto free_fence;
> >>>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
> >> xe_vm
> >>>> *vm, struct xe_exec_queue *q,
> >>>>>>
> >>>>>> vma = new_vma(vm, op-
> >>>>> base.remap.prev,
> >>>>>> op->tile_mask,
> read_only,
> >>>>>> - is_null);
> >>>>>> + is_null, op-
> >pat_index);
> >>>>>> if (IS_ERR(vma)) {
> >>>>>> err = PTR_ERR(vma);
> >>>>>> goto free_fence;
> >>>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
> >> xe_vm
> >>>> *vm, struct xe_exec_queue *q,
> >>>>>>
> >>>>>> vma = new_vma(vm, op-
> >>>>> base.remap.next,
> >>>>>> op->tile_mask,
> read_only,
> >>>>>> - is_null);
> >>>>>> + is_null, op-
> >pat_index);
> >>>>>> if (IS_ERR(vma)) {
> >>>>>> err = PTR_ERR(vma);
> >>>>>> goto free_fence;
> >>>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
> >>>> xe_device *xe,
> >>>>>> u32 obj = (*bind_ops)[i].obj;
> >>>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>>>> u32 region = (*bind_ops)[i].region;
> >>>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
> >>>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>>>
> >>>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>>>> info.pat_table_n_entries)) {
> >>>>>> + err = -EINVAL;
> >>>>>> + goto free_bind_ops;
> >>>>>> + }
> >>>>>> +
> >>>>>> if (i == 0) {
> >>>>>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> >>>>>> } else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
> +3355,25
> >>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
> >> *dev,
> >>>> void *data, struct drm_file *file)
> >>>>>> struct drm_gem_object *gem_obj;
> >>>>>> u64 range = bind_ops[i].range;
> >>>>>> u64 addr = bind_ops[i].addr;
> >>>>>> + u32 op = bind_ops[i].op;
> >>>>>> u32 obj = bind_ops[i].obj;
> >>>>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>>>> + u16 coh_mode;
> >>>>>> +
> >>>>>> + pat_index = array_index_nospec(pat_index,
> >>>>>> + xe-
> >info.pat_table_n_entries);
> >>>>>> + coh_mode = xe_pat_index_get_coh_mode(xe,
> pat_index);
> >>>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>>>
> >>>>> Assuming we drop the unusable entries from the TGL table, this
> >>>>> should be impossible, right? Any index that makes it past the
> >>>>> n_entries check at the top of the function should have a valid, non-zero
> coh_mode value.
> >>>>> So this should probably be an assertion (to highlight a KMD bug)
> >>>>> rather than just a silent uapi failure return.
> >>>>
> >>>> Makes sense.
> >>>>
> >>>>>
> >>>>>> + err = -EINVAL;
> >>>>>> + goto put_obj;
> >>>>>> + }
> >>>>>> +
> >>>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >>>> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>>>> + coh_mode ==
> XE_GEM_COHERENCY_NONE))
> >> {
> >>>>>> + err = -EINVAL;
> >>>>>> + goto put_obj;
> >>>>>> + }
> >>>>>>
> >>>>>> if (!obj)
> >>>>>> continue;
> >>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
> >> *dev,
> >>>> void *data, struct drm_file *file)
> >>>>>> goto put_obj;
> >>>>>> }
> >>>>>> }
> >>>>>> +
> >>>>>> + if (bos[i]->coh_mode) {
> >>>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >>>> coh_mode)) {
> >>>>>> + err = -EINVAL;
> >>>>>> + goto put_obj;
> >>>>>> + }
> >>>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
> >>>> XE_GEM_COHERENCY_NONE)) {
> >>>>>> + /*
> >>>>>> + * Imported dma-buf from a different device
> should
> >>>>>> + * require 1way or 2way coherency since we
> don't
> >>>> know
> >>>>>> + * how it was mapped on CPU. Just assume is it
> >>>>>> + * potentially cached on CPU side.
> >>>>>> + */
> >>>>>> + err = -EINVAL;
> >>>>>> + goto put_obj;
> >>>>>> + }
> >>>>>> }
> >>>>>>
> >>>>>> if (args->num_syncs) {
> >>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
> >> *dev,
> >>>> void *data, struct drm_file *file)
> >>>>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>> u8 tile_mask = bind_ops[i].tile_mask;
> >>>>>> u32 region = bind_ops[i].region;
> >>>>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>>>>
> >>>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
> obj_offset,
> >>>>>> addr, range, op,
> tile_mask,
> >>>>>> - region);
> >>>>>> + region, pat_index);
> >>>>>> if (IS_ERR(ops[i])) {
> >>>>>> err = PTR_ERR(ops[i]);
> >>>>>> ops[i] = NULL;
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> index 83a1f87b6537..508679ed3c74 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>>>> */
> >>>>>> u8 tile_present;
> >>>>>>
> >>>>>> + /**
> >>>>>> + * @pat_index: The pat index to use when encoding the PTEs
> for
> >>>>>> +this
> >>>> vma.
> >>>>>> + */
> >>>>>> + u32 pat_index;
> >>>>>> +
> >>>>>> struct {
> >>>>>> struct list_head rebind_link;
> >>>>>> } notifier;
> >>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>>>> bool batch_invalidate_tlb;
> >>>>>>
> >>>>>> struct {
> >>>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>> - enum xe_cache_level
> cache);
> >>>>>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> >>>>>> - enum xe_cache_level
> cache);
> >>>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> u32
> >>>> pat_index);
> >>>>>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> u32
> >>>> pat_index);
> >>>>>> } pat_encode;
> >>>>>> };
> >>>>>>
> >>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>>>> struct async_op_fence *fence;
> >>>>>> /** @tile_mask: gt mask for this operation */
> >>>>>> u8 tile_mask;
> >>>>>> + /** @pat_index: The pat index to use for this operation. */
> >>>>>> + u32 pat_index;
> >>>>>> /** @flags: operation flags */
> >>>>>> enum xe_vma_op_flags flags;
> >>>>>>
> >>>>>> diff --git a/include/uapi/drm/xe_drm.h
> >>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
> >>>>>> 100644
> >>>>>> --- a/include/uapi/drm/xe_drm.h
> >>>>>> +++ b/include/uapi/drm/xe_drm.h
> >>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>>>> */
> >>>>>> __u32 obj;
> >>>>>>
> >>>>>> - /** @pad: MBZ */
> >>>>>> - __u32 pad;
> >>>>>> + /**
> >>>>>> + * @pat_index: The platform defined @pat_index to use for
> this
> >>>> mapping.
> >>>>>> + * The index basically maps to some predefined memory
> attributes,
> >>>>>> + * including things like caching, coherency and likely other stuff
> in
> >>>>>> + * the future. The exact meaning of the pat_index is platform
> >>>>>> +specific
> >>>>>
> >>>>> BTW, "other stuff in the future" already includes compression on
> >>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
> >>>>
> >>>> Ok, good to know.
> >>>>
> >>>>>
> >>>>>> + * and defined in the Bspec and PRMs. When the KMD sets up
> the
> >>>> binding
> >>>>>> + * the index here is encoded into the ppGTT PTE.
> >>>>>> + *
> >>>>>> + * For coherency the @pat_index needs to match the
> >>>>>> + * drm_xe_gem_create.coh_mode, so either
> >>>> XE_GEM_COHERENCY_NONE,
> >>>>>> + * XE_GEM_COHERENCY_1WAY or
> XE_GEM_COHERENCY_2WAY. The
> >>>> KMD will extract
> >>>>>> + * the coherency mode from the @pat_index and reject if
> there is a
> >>>>>> + * mismatch (see note below for pre-MTL platforms).
> >>>>>> + *
> >>>>>> + * Note: On pre-MTL platforms there is only a caching mode
> and no
> >>>>>> + * explicit coherency mode, but on such hardware there is
> always a
> >>>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are
> >>>>>> +coherent
> >> with
> >>>>>> + * CPU caches even with the caching mode set as uncached.
> It's
> >>>>>> +only
> >>>> the
> >>>>>> + * display engine that is incoherent (on dgpu it must be in
> VRAM which
> >>>>>> + * is always mapped as WC on the CPU). However to keep the
> uapi
> >>>> somewhat
> >>>>>> + * consistent with newer platforms the KMD groups the
> different
> >> cache
> >>>>>> + * levels into the following coherency buckets on all pre-MTL
> >> platforms:
> >>>>>> + *
> >>>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>>>
> >>>>> As noted on the previous patch, it seems like 2-way is appropriate
> >>>>> for LLC platforms, but 1-way might be a more accurate description
> >>>>> of dGPU behavior.
> >>>>>
> >>>>>> + *
> >>>>>> + * In practice UC/WC/WT should only ever used for scanout
> >>>>>> +surfaces
> >> on
> >>>>>> + * such platforms since it is only the display engine that is
> actually
> >>>>>> + * incoherent. Everything else should typically use WB given
> >>>>>> +that we
> >>>>>
> >>>>> What if we're sharing our buffers with some other (non-GPU)
> >>>>> device? Are there cases where that other device wouldn't be
> >>>>> coherent with the LLC, so we'd want to use one of these?
> >>>>
> >>>> Yeah, I guess there might be cases like that. I'll reword.
> >>>>
> >>>>>
> >>>>>
> >>>>> Matt
> >>>>>
> >>>>>> + * have a shared-LLC. On MTL+ this completely changes (also
> >>>> potentially
> >>>>>> + * no shared-LLC) and the HW defines the coherency mode as
> part
> >>>>>> +of
> >>>> the
> >>>>>> + * @pat_index.
> >>>>>> + *
> >>>>>> + * Note: For userptr and externally imported dma-buf the
> kernel
> >>>> expects
> >>>>>> + * either 1WAY or 2WAY for the @pat_index.
> >>>>>> + */
> >>>>>> + __u32 pat_index;
> >>>>>>
> >>>>>> union {
> >>>>>> /**
> >>>>>> --
> >>>>>> 2.41.0
> >>>>>>
> >>>>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-01 9:34 ` Zhang, Carl
@ 2023-09-04 9:23 ` Matthew Auld
2023-09-05 9:12 ` Zhang, Carl
0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-09-04 9:23 UTC (permalink / raw)
To: Zhang, Carl, Roper, Matthew D
Cc: Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas, Yu, Effie,
intel-xe@lists.freedesktop.org
On 01/09/2023 10:34, Zhang, Carl wrote:
>
>
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Thursday, August 31, 2023 6:44 PM
>>
>> On 31/08/2023 09:24, Zhang, Carl wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Thursday, August 31, 2023 12:02 AM
>>>>
>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>>>> Several questions:
>>>>> 1. the pat_index from vm_bind will override the setting from bo_create?
>>>>> How to keep the value from bo_create unchanged?
>>>>
>>>> You only set the pat_index during vm_bind. At gem_create you just
>>>> need tell the kernel what CPU side caching mode to use for system
>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
>>>> pat_index. The current expectation is that the coherency mode of the
>>>> pat_index should match the one at gem_create.
>>>>
>>> It means that we set the incomplete pat_index at gem_create (coherency
>>> only) And set another part at vm_bind (gpu cache) Why we separate
>>> them? Why not set a full value at vm_bind?
>>
>> You always give the full pat_index at vm_bind. It's platform specific but the
>> BSpec for the most part tells you what the coherency mode of the pat_index is.
>>
>> For example if you look at the MTL pat_index table it tells you the
>> coh_mode:
>>
>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
>> pat_index 3 = coh_mode "1way" = COH_1WAY
>> pat_index 4 = coh_mode "2way" = COH_2WAY
>>
>> On future platforms the table is larger and also encodes stuff like compression,
>> but KMD really only cares about the coh_mode, since that will restrict the
>> allowed CPU smem_caching values that userspace can select.
>>
>> From KMD pov this matters since we always give userspace zeroed memory,
>> which is useful for normal applications but is also needed for security reasons.
>> However the actual clearing might result in the writes only being visible in the
>> CPU cache (not flushed to main memory), and if the GPU doesn't snoop the
>> CPU cache it can read directly from main memory, which effectively bypasses
>> the clearing. To prevent that the KMD needs to know how userspace will map
>> the memory via the GPU i.e what is the coh_mode for the pat_index. The gist is
>> that if you select COH_NONE then it needs to clflush the pages before giving
>> them to userspace, which is done by forcing you to use uc/wc for the
>> smem_caching (the kernel will internally issue cflush on x86 when marking the
>> pages as uc/wc). And if you select wb there is no flushing, but here you need to
>> use 1way or 2way.
>>
>
> It sounds like a perf optimization, if no considering the perf, you could always call
> Clflush to flush the cpu cache to main memory before giving it to user space
I think the other thing is that the KMD is no longer doing any manual
flushing, but instead the core kernel will now do that for us when
marking the pages as wc/uc on x86.
>
>
>> AFAIK separating them is just to prevent userspace from mixing different
>> coherency modes for the same object by having it immutable (you can still use
>> different pat_index but the coherency must be compatible).
>> Also at creation time we can determine if the smem_caching is compatible with
>> the coh_mode.
>>
>> Note that it might be that we ditch the COH_2WAY and just have COH_NONE
>> and COH_AT_LEAST_1WAY, in which case this might be slightly different.
>>
>>>
>>>>> 2. no UC/WB/WC definition (CPU cachable) in drm_xe_gem_mmap_offset,
>>>> will it be handled by KMD automatically?
>>>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
>>>>> could use WB in mmap offset If it is COHERENCY_NONE, we could only
>>>>> use
>>>> UC, all these logic is handled by KMD automatically?
>>>>
>>>> Yes, the mmap will use the same CPU side caching mode that you set at
>>>> gem_create with smem_caching (wb/wc/uc).
>>>>
>>> It means it could not be changed after gem_create. Sometimes, UMD
>>> doesn’t know Whether the data need to be accessed by CPU at gem_create
>>> . only application Knows the purpose of allocation & the surface
>>> usage. So, may need change UMD api to let applications set whether the
>> resource is "accessible", or it will hurt the perf.
>>
>> Right, the smem_caching is immutable.
>>
>>>
>>>>> 3. about " For imported dma-buf (from a different device) the
>>>>> coherency mode is also implicit and must also be either 1WAY or 2WAY"
>>>>> it means it must be 1way or 2way, and UMD need not to set it?
>>>>
>>>> If you are going to vm_bind it you need to supply the pat_index and
>>>> in this case it needs to be either 1way or 2way. The object comes
>>>> from a different device so we might not have smem_caching/coh_mode
>>>> like we do for native objects.
>>>>
>>> How about the GPU cache setting of imported dma-buf?
>>> Is there possible conflict between dma-buf producer and consumer?
>>> For example:
>>> In producer side , one bo is set GPU L3 cache, also CPU cacheable, it
>>> should be 2Way In client side, we know the coherency should be 1-way
>>> or 2-way, if it suppose no GPU cache Used. It just set 1-way.
>>> It may cause some problems?
>>
>> If the producer and consumer are the same device, then I think the original
>> gem_create.coh_mode is considered, since it's effectively still a native object. In
>> which case the normal rules apply.
>
> But the consumer does not know the coh_mode, so , it is possible that it set will set
> Another value, then it will failed.
> And because consumer does not know it is 1-way or 2-way, consumer could not
> Set the gpu cache correctly. Because if it is 1-way, cpu could not snoop gpu cache
>
>>
>> If it comes from a different device then we don't know what
>> gem_create.coh_mode is (or if that even exists) since we don't have the usual
>> native object, but just some opaque dma-buf object. In this case it must be at
>> least 1way or 2way.
>>
>> From KMD pov we only care about userspace not being able to bypass the
>> clearing. There might still be other coherency issues though.
>>
>>>
>>> Also , when a bo is created, UMD does not know whether it will be exported.
>>> It maybe be set as COHERENCY_NONE, right?
>>
>> Yeah, you can use whatever you want. Just that your smem_caching will need
>> to be wc/uc if using COH_NONE. On the importer side, if it comes from the
>> same device then it's just a normal native object and the normal rules apply.
>> Currently this would mean also using pat_index with COH_NONE. If it is
>> imported from a different device then you need to use a pat_index with 1way
>> or 2way, since KMD doesn't really know how the object is mapped on the CPU
>> or if the CPU caches are dirty (potentially with clearing).
>>
>> Is this going to be problematic?
>>
> If imported surfaces is from same device, how consumer know it is original coh_non
> Or 1-way or 2-way?
>
> And it also means producer and consumer could have different pat_index for same bo.
Ok, if KMD instead allows 1way or 2way for dma-buf when it's external or
from the same device? Would that help here? i.e you can always select
1way or 2way, even if it came from the same device and was coh_none?
Other option is you could call the vm_bind ioctl to check if the
coh_mode is supported by checking if it returns an error? But maybe that
is too nasy.
>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>>>> <pallavi.mishra@intel.com>;
>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>>>> Hazubski,
>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
>>>>>> <carl.zhang@intel.com>; Yu,
>>>> Effie
>>>>>> <effie.yu@intel.com>
>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
>>>>>> with
>>>> vm_bind
>>>>>>
>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>>>> Allow userspace to directly control the pat_index for a given vm
>>>>>>>> binding. This should allow directly controlling the coherency,
>>>>>>>> caching and potentially other stuff in the future for the ppGTT binding.
>>>>>>>>
>>>>>>>> The exact meaning behind the pat_index is very platform specific
>>>>>>>> (see BSpec or PRMs) but effectively maps to some predefined
>>>>>>>> memory attributes. From the KMD pov we only care about the
>>>>>>>> coherency that is provided by the pat_index, which falls into either
>> NONE, 1WAY or 2WAY.
>>>>>>>> The vm_bind coherency mode for the given pat_index needs to match
>>>> the
>>>>>>>> given coh_mode that was set at object creation. For platforms
>>>>>>>> that lack
>>>>>>>
>>>>>>> Is it actually important to match the coherency mode? I think one
>>>>>>> of the main goals was to know up front if userspace might be using
>>>>>>> a non-snooping PAT setting that would let it bypass the CPU cache
>>>>>>> (and potentially read old, stale data from a different process if
>>>>>>> the buffer's clear value is still sitting in cache and hasn't
>>>>>>> landed in memory yet).
>>>>>>>
>>>>>>> If that's the only concern, then I think it should still be fine
>>>>>>> to map with a non-matching PAT as long as it's more coherent than
>>>>>>> the one specified at creation, right? E.g., if the buffer was
>>>>>>> created with 1-way coherency, it would be fine to map it with
>>>>>>> 2-way because userspace still can't use that to observe the
>>>>>>> previous contents of the buffer. Or
>>>>>>
>>>>>> Yeah, I guess we could in theory do something that.
>>>>>>
>>>>>>> if the buffer was created with "non-coherent" then we've already
>>>>>>> done the necessary clflushing in kernel before handing to buffer
>>>>>>> over to userspace to ensure the clear value landed in memory, so
>>>>>>> any valid PAT index should be safe (from a security POV) after
>>>>>>> that, right? Any other problems that arise from mismatched
>>>>>>> coherency would just be contained to the app possibly shooting
>>>>>>> itself in the foot, which isn't really our concern.
>>>>>>
>>>>>> That is also my understanding, at least from the KMD security pov.
>>>>>> If you allocate as wb then you must use at least 1way, since there
>>>>>> is no flushing
>>>> for
>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever you want.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as
>>>>>> 2WAY.
>>>>>>>>
>>>>>>>> For userptr mappings we lack a corresponding gem object, so the
>>>> expected
>>>>>>>> coherency mode is instead implicit and must fall into either 1WAY
>>>>>>>> or 2WAY. Trying to use NONE will be rejected by the kernel. For
>>>>>>>> imported dma-buf (from a different device) the coherency mode is
>>>>>>>> also implicit and must also be either 1WAY or 2WAY.
>>>>>>>>
>>>>>>>> As part of adding pat_index support with vm_bind we also need
>>>>>>>> stop
>>>> using
>>>>>>>> xe_cache_level and instead use the pat_index in various places.
>>>>>>>> We still make use of xe_cache_level, but only as a convenience
>>>>>>>> for kernel internal objectsi (internally it maps to some
>>>>>>>> reasonable pat_index). For
>>>>>>>
>>>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
>>>>>>> in appropriate places, similar to what we do with MOCS (gt-
>>>>> mocs.uc_index,
>>>>>>> gt->mocs.wb_index)?
>>>>>>
>>>>>> OK, seems reasonable to me.
>>>>>>
>>>>>>>
>>>>>>>> now this is just a 1:1 conversion of the existing code, however
>>>>>>>> for platforms like MTL+ we might need to give more control
>>>>>>>> through
>>>> bo_create
>>>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>>>
>>>>>>>> Bspec: 45101, 44235 #xe
>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>>>>>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>>>>>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>>>>>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>>>>>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>>>>>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>>>>>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>>>>>>>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++----
>> -
>>>> -
>>>>>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>>>>>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>>>>>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>>>> (*alloc_scratch_dma)(struct i915_address_space *vm,
>> int
>>>>>>>> sz);
>>>>>>>>
>>>>>>>> u64 (*pte_encode)(dma_addr_t addr,
>>>>>>>> - unsigned int pat_index,
>>>>>>>> + u32 pat_index,
>>>>>>>> u32 flags); /* Create a valid PTE */
>>>>>>>> #define PTE_READ_ONLY BIT(0)
>>>>>>>> #define PTE_LM BIT(1)
>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>>>> xe_migrate
>>>>>> *m, struct kunit *test)
>>>>>>>> /* First part of the test, are we updating our pagetable bo
>>>>>>>> with a new
>>>>>> entry? */
>>>>>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
>> (NUM_KERNEL_PDE -
>>>> 1),
>>>>>> u64,
>>>>>>>> 0xdeaddeadbeefbeef);
>>>>>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>>>> XE_CACHE_WB), 0);
>>>>>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>>>> expected |= XE_PTE_PS64;
>>>>>>>> if (xe_bo_is_vram(pt))
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64
>>>>>> bo_offset)
>>>>>>>> pte |= XE_GGTT_PTE_DM;
>>>>>>>>
>>>>>>>> if ((ggtt->pat_encode).pte_encode)
>>>>>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>> XE_CACHE_WB_1_WAY);
>>>>>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>> +
>> xe_pat_get_index(xe,
>>>>>> XE_CACHE_WB_1_WAY));
>>>>>>>>
>>>>>>>> return pte;
>>>>>>>> }
>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>>>>>> }
>>>>>>>>
>>>>>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>> u64
>>>> pte_pat,
>>>>>>>> - enum xe_cache_level
>> cache)
>>>>>>>> + u32 pat_index)
>>>>>>>> {
>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>> -
>>>>>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>>>
>>>>>>>> if (pat_index & BIT(0))
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>>>
>>>>>>>> struct {
>>>>>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>> - enum xe_cache_level
>> cache);
>>>>>>>> + u32 pat_index);
>>>>>>>> } pat_encode;
>>>>>>>> };
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> @@ -24,6 +24,7 @@
>>>>>>>> #include "xe_lrc.h"
>>>>>>>> #include "xe_map.h"
>>>>>>>> #include "xe_mocs.h"
>>>>>>>> +#include "xe_pat.h"
>>>>>>>> #include "xe_pt.h"
>>>>>>>> #include "xe_res_cursor.h"
>>>>>>>> #include "xe_sched_job.h"
>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm-
>>> pt_root[id]-
>>>>>>> level;
>>>>>>>> u32 map_ofs, level, i;
>>>>>>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>>>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>>>> u64 entry;
>>>>>>>> int ret;
>>>>>>>>
>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>>>> xe_tile
>>>>>> *tile, struct xe_migrate *m,
>>>>>>>> return ret;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>> XE_CACHE_WB);
>>>>>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>> pat_index);
>>>>>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>>>
>>>>>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>>>
>>>>>>>> /* Map the entire BO in our level 0 pt */
>>>>>>>> for (i = 0, level = 0; i < num_entries; level++) {
>>>>>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>> XE_CACHE_WB,
>>>>>> 0);
>>>>>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>> pat_index, 0);
>>>>>>>>
>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>> entry);
>>>>>>>>
>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>> for (i = 0; i < batch->size;
>>>>>>>> i += vm->flags & XE_VM_FLAG_64K ?
>> XE_64K_PAGE_SIZE :
>>>>>>>> XE_PAGE_SIZE) {
>>>>>>>> - entry = xe_pte_encode(batch, i,
>> XE_CACHE_WB, 0);
>>>>>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>>>
>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level *
>> 8, u64,
>>>>>>>> entry);
>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>> flags = XE_PDE_64K;
>>>>>>>>
>>>>>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>>>> - XE_PAGE_SIZE,
>> XE_CACHE_WB);
>>>>>>>> + XE_PAGE_SIZE, pat_index);
>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
>> * level,
>>>>>> u64,
>>>>>>>> entry | flags);
>>>>>>>> }
>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>> *tile,
>>>>>> struct xe_migrate *m,
>>>>>>>> /* Write PDE's that point to our BO. */
>>>>>>>> for (i = 0; i < num_entries - num_level; i++) {
>>>>>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>>>> - XE_CACHE_WB);
>>>>>>>> + pat_index);
>>>>>>>>
>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE
>> +
>>>>>>>> (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
>> @@
>>>>>>>> xe_migrate_update_pgtables(struct
>>>> xe_migrate
>>>>>> *m,
>>>>>>>>
>>>>>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>>>
>>>>>>>> - addr = xe_pte_encode(pt_bo, 0,
>> XE_CACHE_WB, 0);
>>>>>>>> + addr = xe_pte_encode(pt_bo, 0,
>> xe_pat_get_index(xe,
>>>>>> XE_CACHE_WB), 0);
>>>>>>>> bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>>>> bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>>>> }
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
>>>>>>>> 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>> @@ -10,6 +10,7 @@
>>>>>>>> #include "xe_gt.h"
>>>>>>>> #include "xe_gt_tlb_invalidation.h"
>>>>>>>> #include "xe_migrate.h"
>>>>>>>> +#include "xe_pat.h"
>>>>>>>> #include "xe_pt_types.h"
>>>>>>>> #include "xe_pt_walk.h"
>>>>>>>> #include "xe_res_cursor.h"
>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
>>>>>>>> xe_pt_dir
>>>>>> *pt_dir, unsigned int index)
>>>>>>>> *
>>>>>>>> * Return: An encoded page directory entry. No errors.
>>>>>>>> */
>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>> - const enum xe_cache_level cache)
>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>> +pat_index)
>>>>>>>> {
>>>>>>>> u64 pde;
>>>>>>>> struct xe_vm *vm = bo->vm;
>>>>>>>> struct xe_device *xe = vm->xe;
>>>>>>>>
>>>>>>>> -
>>>>>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>>>
>>>>>>>> if ((vm->pat_encode).pde_encode)
>>>>>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>>>> + pde = (vm->pat_encode).pde_encode(xe, pde,
>> pat_index);
>>>>>>>>
>>>>>>>> return pde;
>>>>>>>> }
>>>>>>>>
>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>>>> struct xe_vma *vma, u32 pt_level)
>>>>>>>> {
>>>>>>>> struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
>> static
>>>>>>>> u64 __pte_encode(u64 pte, enum
>>>>>> xe_cache_level cache,
>>>>>>>> pte |= XE_PTE_NULL;
>>>>>>>>
>>>>>>>> if ((vm->pat_encode).pte_encode)
>>>>>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>>>> + pte = (vm->pat_encode).pte_encode(xe, pte,
>> pat_index);
>>>>>>>>
>>>>>>>> if (pt_level == 1)
>>>>>>>> pte |= XE_PDE_PS_2M;
>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>>>> xe_cache_level cache,
>>>>>>>> *
>>>>>>>> * Return: An encoded page-table entry. No errors.
>>>>>>>> */
>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>> xe_cache_level
>>>>>> cache,
>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>> u32 pt_level)
>>>>>>>> {
>>>>>>>> u64 pte;
>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>>>> offset,
>>>>>> enum xe_cache_level cache,
>>>>>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>>>> pte |= XE_PPGTT_PTE_DM;
>>>>>>>>
>>>>>>>> - return __pte_encode(pte, cache, NULL, pt_level);
>>>>>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>>>> }
>>>>>>>>
>>>>>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
>>>>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
>>>>>>>> __xe_pt_empty_pte(struct xe_tile
>>>> *tile,
>>>>>> struct xe_vm *vm,
>>>>>>>>
>>>>>>>> if (level == 0) {
>>>>>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>>>> - XE_CACHE_WB, 0);
>>>>>>>> + xe_pat_get_index(vm->xe,
>>>>>> XE_CACHE_WB), 0);
>>>>>>>>
>>>>>>>> return empty;
>>>>>>>> } else {
>>>>>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]-
>>> bo, 0,
>>>>>>>> - XE_CACHE_WB);
>>>>>>>> + xe_pat_get_index(vm->xe,
>>>>>> XE_CACHE_WB));
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>>> struct xe_vm *vm;
>>>>>>>> /** @tile: The tile we're building for. */
>>>>>>>> struct xe_tile *tile;
>>>>>>>> - /** @cache: Desired cache level for the ptes */
>>>>>>>> - enum xe_cache_level cache;
>>>>>>>> /** @default_pte: PTE flag only template. No address is
>> associated */
>>>>>>>> u64 default_pte;
>>>>>>>> /** @dma_offset: DMA offset to add to the PTE. */ @@ -604,7
>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>> pgoff_t offset,
>>>>>>>>
>>>>>>>> pte = __pte_encode(is_null ? 0 :
>>>>>>>> xe_res_dma(curs) + xe_walk-
>>> dma_offset,
>>>>>>>> - xe_walk->cache, xe_walk->vma,
>> level);
>>>>>>>> + xe_walk->vma->pat_index, xe_walk-
>>> vma,
>>>>>> level);
>>>>>>>> pte |= xe_walk->default_pte;
>>>>>>>>
>>>>>>>> /*
>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>> pgoff_t offset,
>>>>>>>> xe_child->is_compact = true;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache)
>> |
>>>> flags;
>>>>>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>>>> pat_index) | flags;
>>>>>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
>> xe_child,
>>>>>>>> pte);
>>>>>>>> }
>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
>>>>>>>> struct
>>>> xe_vma
>>>>>> *vma,
>>>>>>>> if (vma && vma->gpuva.flags &
>> XE_VMA_ATOMIC_PTE_BIT)
>>>>>>>> xe_walk.default_pte |=
>> XE_USM_PPGTT_PTE_AE;
>>>>>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>>>> ttm.resource);
>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>>>> - } else {
>>>>>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>>>> XE_BO_SCANOUT_BIT)
>>>>>>>> - xe_walk.cache = XE_CACHE_WT;
>>>>>>>> - else
>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>>>> }
>>>>>>>> +
>>>>>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>>> xe_walk.dma_offset =
>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
>>>>>>>> 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
>>>>>>>> struct
>>>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>>>
>>>>>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
>>>>>>>> *vma);
>>>>>>>>
>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>> - const enum xe_cache_level level);
>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>> +pat_index);
>>>>>>>>
>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>> xe_cache_level
>>>>>> cache,
>>>>>>>> - u32 pt_level);
>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>> +u32
>>>>>> pt_level);
>>>>>>>>
>>>>>>>> #endif
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index 7eeeed0411f3..34603a7e84b0
>>>>>>>> 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>> #include "xe_vm.h"
>>>>>>>>
>>>>>>>> #include <linux/dma-fence-array.h>
>>>>>>>> +#include <linux/nospec.h>
>>>>>>>>
>>>>>>>> #include <drm/drm_print.h>
>>>>>>>> #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
>>>>>>>> static struct xe_vma *xe_vma_create(struct
>>>> xe_vm
>>>>>> *vm,
>>>>>>>> u64 start, u64 end,
>>>>>>>> bool read_only,
>>>>>>>> bool is_null,
>>>>>>>> - u8 tile_mask)
>>>>>>>> + u8 tile_mask,
>>>>>>>> + u32 pat_index)
>>>>>>>> {
>>>>>>>> struct xe_vma *vma;
>>>>>>>> struct xe_tile *tile;
>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct
>>>> xe_vm
>>>>>> *vm,
>>>>>>>> vma->tile_mask |= 0x1 << id;
>>>>>>>> }
>>>>>>>>
>>>>>>>> + vma->pat_index = pat_index;
>>>>>>>> +
>>>>>>>> if (vm->xe->info.platform == XE_PVC)
>>>>>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>>>
>>>>>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct
>>>>>> work_struct *w);
>>>>>>>> static void vm_destroy_work_func(struct work_struct *w);
>>>>>>>>
>>>>>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
>>>>>>>> u64
>>>> pde_pat,
>>>>>>>> - enum xe_cache_level
>> cache)
>>>>>>>> + u32 pat_index)
>>>>>>>> {
>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>> -
>>>>>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>>>
>>>>>>>> if (pat_index & BIT(0))
>>>>>>>> @@ -1213,10 +1215,8 @@ static u64
>>>> xelp_ppgtt_pde_encode_pat(struct
>>>>>> xe_device *xe, u64 pde_pat,
>>>>>>>> }
>>>>>>>>
>>>>>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>> u64
>>>> pte_pat,
>>>>>>>> - enum xe_cache_level
>> cache)
>>>>>>>> + u32 pat_index)
>>>>>>>> {
>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>> -
>>>>>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>>>
>>>>>>>> if (pat_index & BIT(0))
>>>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file
>>>> *xef,
>>>>>> u32 id)
>>>>>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>>>>>> {
>>>>>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>>>> - XE_CACHE_WB);
>>>>>>>> + xe_pat_get_index(vm->xe,
>> XE_CACHE_WB));
>>>>>>>> }
>>>>>>>>
>>>>>>>> static struct dma_fence *
>>>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe,
>>>> struct
>>>>>> drm_gpuva_op *op)
>>>>>>>> static struct drm_gpuva_ops *
>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>>>>> u64 bo_offset_or_userptr, u64 addr, u64
>> range,
>>>>>>>> - u32 operation, u8 tile_mask, u32 region)
>>>>>>>> + u32 operation, u8 tile_mask, u32 region, u32
>>>>>> pat_index)
>>>>>>>> {
>>>>>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>>> struct ww_acquire_ctx ww;
>>>>>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>> *vm,
>>>>>> struct xe_bo *bo,
>>>>>>>> struct xe_vma_op *op =
>> gpuva_op_to_vma_op(__op);
>>>>>>>>
>>>>>>>> op->tile_mask = tile_mask;
>>>>>>>> + op->pat_index = pat_index;
>>>>>>>> op->map.immediate =
>>>>>>>> operation &
>>>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>>> op->map.read_only =
>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>> *vm,
>>>>>> struct xe_bo *bo,
>>>>>>>> struct xe_vma_op *op =
>> gpuva_op_to_vma_op(__op);
>>>>>>>>
>>>>>>>> op->tile_mask = tile_mask;
>>>>>>>> + op->pat_index = pat_index;
>>>>>>>> op->prefetch.region = region;
>>>>>>>> }
>>>>>>>> break;
>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
>> *vm,
>>>>>> struct xe_bo *bo,
>>>>>>>> }
>>>>>>>>
>>>>>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>>>> drm_gpuva_op_map *op,
>>>>>>>> - u8 tile_mask, bool read_only, bool is_null)
>>>>>>>> + u8 tile_mask, bool read_only, bool is_null,
>>>>>>>> + u32 pat_index)
>>>>>>>> {
>>>>>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
>>> gem.obj) :
>>>> NULL;
>>>>>>>> struct xe_vma *vma;
>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm
>>>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>>> op->va.addr, op->va.addr +
>>>>>>>> op->va.range - 1, read_only, is_null,
>>>>>>>> - tile_mask);
>>>>>>>> + tile_mask, pat_index);
>>>>>>>> if (bo)
>>>>>>>> xe_bo_unlock(bo, &ww);
>>>>>>>>
>>>>>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>> xe_vm
>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>
>>>>>>>> vma = new_vma(vm, &op->base.map,
>>>>>>>> op->tile_mask, op-
>>>>> map.read_only,
>>>>>>>> - op->map.is_null);
>>>>>>>> + op->map.is_null, op-
>>> pat_index);
>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>> err = PTR_ERR(vma);
>>>>>>>> goto free_fence;
>>>>>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>> xe_vm
>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>
>>>>>>>> vma = new_vma(vm, op-
>>>>>>> base.remap.prev,
>>>>>>>> op->tile_mask,
>> read_only,
>>>>>>>> - is_null);
>>>>>>>> + is_null, op-
>>> pat_index);
>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>> err = PTR_ERR(vma);
>>>>>>>> goto free_fence;
>>>>>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>> xe_vm
>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>
>>>>>>>> vma = new_vma(vm, op-
>>>>>>> base.remap.next,
>>>>>>>> op->tile_mask,
>> read_only,
>>>>>>>> - is_null);
>>>>>>>> + is_null, op-
>>> pat_index);
>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>> err = PTR_ERR(vma);
>>>>>>>> goto free_fence;
>>>>>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct
>>>>>> xe_device *xe,
>>>>>>>> u32 obj = (*bind_ops)[i].obj;
>>>>>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>>> u32 region = (*bind_ops)[i].region;
>>>>>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>>>
>>>>>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>>>> info.pat_table_n_entries)) {
>>>>>>>> + err = -EINVAL;
>>>>>>>> + goto free_bind_ops;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> if (i == 0) {
>>>>>>>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>>>>>> } else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
>> +3355,25
>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
>>>> *dev,
>>>>>> void *data, struct drm_file *file)
>>>>>>>> struct drm_gem_object *gem_obj;
>>>>>>>> u64 range = bind_ops[i].range;
>>>>>>>> u64 addr = bind_ops[i].addr;
>>>>>>>> + u32 op = bind_ops[i].op;
>>>>>>>> u32 obj = bind_ops[i].obj;
>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>>>> + u16 coh_mode;
>>>>>>>> +
>>>>>>>> + pat_index = array_index_nospec(pat_index,
>>>>>>>> + xe-
>>> info.pat_table_n_entries);
>>>>>>>> + coh_mode = xe_pat_index_get_coh_mode(xe,
>> pat_index);
>>>>>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>>>
>>>>>>> Assuming we drop the unusable entries from the TGL table, this
>>>>>>> should be impossible, right? Any index that makes it past the
>>>>>>> n_entries check at the top of the function should have a valid, non-zero
>> coh_mode value.
>>>>>>> So this should probably be an assertion (to highlight a KMD bug)
>>>>>>> rather than just a silent uapi failure return.
>>>>>>
>>>>>> Makes sense.
>>>>>>
>>>>>>>
>>>>>>>> + err = -EINVAL;
>>>>>>>> + goto put_obj;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>>>> + coh_mode ==
>> XE_GEM_COHERENCY_NONE))
>>>> {
>>>>>>>> + err = -EINVAL;
>>>>>>>> + goto put_obj;
>>>>>>>> + }
>>>>>>>>
>>>>>>>> if (!obj)
>>>>>>>> continue;
>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device
>>>> *dev,
>>>>>> void *data, struct drm_file *file)
>>>>>>>> goto put_obj;
>>>>>>>> }
>>>>>>>> }
>>>>>>>> +
>>>>>>>> + if (bos[i]->coh_mode) {
>>>>>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>>>> coh_mode)) {
>>>>>>>> + err = -EINVAL;
>>>>>>>> + goto put_obj;
>>>>>>>> + }
>>>>>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>>>> + /*
>>>>>>>> + * Imported dma-buf from a different device
>> should
>>>>>>>> + * require 1way or 2way coherency since we
>> don't
>>>>>> know
>>>>>>>> + * how it was mapped on CPU. Just assume is it
>>>>>>>> + * potentially cached on CPU side.
>>>>>>>> + */
>>>>>>>> + err = -EINVAL;
>>>>>>>> + goto put_obj;
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> if (args->num_syncs) {
>>>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device
>>>> *dev,
>>>>>> void *data, struct drm_file *file)
>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>> u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>>> u32 region = bind_ops[i].region;
>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>
>>>>>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
>> obj_offset,
>>>>>>>> addr, range, op,
>> tile_mask,
>>>>>>>> - region);
>>>>>>>> + region, pat_index);
>>>>>>>> if (IS_ERR(ops[i])) {
>>>>>>>> err = PTR_ERR(ops[i]);
>>>>>>>> ops[i] = NULL;
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>>>> */
>>>>>>>> u8 tile_present;
>>>>>>>>
>>>>>>>> + /**
>>>>>>>> + * @pat_index: The pat index to use when encoding the PTEs
>> for
>>>>>>>> +this
>>>>>> vma.
>>>>>>>> + */
>>>>>>>> + u32 pat_index;
>>>>>>>> +
>>>>>>>> struct {
>>>>>>>> struct list_head rebind_link;
>>>>>>>> } notifier;
>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>>>> bool batch_invalidate_tlb;
>>>>>>>>
>>>>>>>> struct {
>>>>>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>> - enum xe_cache_level
>> cache);
>>>>>>>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>>>>>> - enum xe_cache_level
>> cache);
>>>>>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> u32
>>>>>> pat_index);
>>>>>>>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> u32
>>>>>> pat_index);
>>>>>>>> } pat_encode;
>>>>>>>> };
>>>>>>>>
>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>>>> struct async_op_fence *fence;
>>>>>>>> /** @tile_mask: gt mask for this operation */
>>>>>>>> u8 tile_mask;
>>>>>>>> + /** @pat_index: The pat index to use for this operation. */
>>>>>>>> + u32 pat_index;
>>>>>>>> /** @flags: operation flags */
>>>>>>>> enum xe_vma_op_flags flags;
>>>>>>>>
>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
>>>>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
>>>>>>>> 100644
>>>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>>>> */
>>>>>>>> __u32 obj;
>>>>>>>>
>>>>>>>> - /** @pad: MBZ */
>>>>>>>> - __u32 pad;
>>>>>>>> + /**
>>>>>>>> + * @pat_index: The platform defined @pat_index to use for
>> this
>>>>>> mapping.
>>>>>>>> + * The index basically maps to some predefined memory
>> attributes,
>>>>>>>> + * including things like caching, coherency and likely other stuff
>> in
>>>>>>>> + * the future. The exact meaning of the pat_index is platform
>>>>>>>> +specific
>>>>>>>
>>>>>>> BTW, "other stuff in the future" already includes compression on
>>>>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
>>>>>>
>>>>>> Ok, good to know.
>>>>>>
>>>>>>>
>>>>>>>> + * and defined in the Bspec and PRMs. When the KMD sets up
>> the
>>>>>> binding
>>>>>>>> + * the index here is encoded into the ppGTT PTE.
>>>>>>>> + *
>>>>>>>> + * For coherency the @pat_index needs to match the
>>>>>>>> + * drm_xe_gem_create.coh_mode, so either
>>>>>> XE_GEM_COHERENCY_NONE,
>>>>>>>> + * XE_GEM_COHERENCY_1WAY or
>> XE_GEM_COHERENCY_2WAY. The
>>>>>> KMD will extract
>>>>>>>> + * the coherency mode from the @pat_index and reject if
>> there is a
>>>>>>>> + * mismatch (see note below for pre-MTL platforms).
>>>>>>>> + *
>>>>>>>> + * Note: On pre-MTL platforms there is only a caching mode
>> and no
>>>>>>>> + * explicit coherency mode, but on such hardware there is
>> always a
>>>>>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are
>>>>>>>> +coherent
>>>> with
>>>>>>>> + * CPU caches even with the caching mode set as uncached.
>> It's
>>>>>>>> +only
>>>>>> the
>>>>>>>> + * display engine that is incoherent (on dgpu it must be in
>> VRAM which
>>>>>>>> + * is always mapped as WC on the CPU). However to keep the
>> uapi
>>>>>> somewhat
>>>>>>>> + * consistent with newer platforms the KMD groups the
>> different
>>>> cache
>>>>>>>> + * levels into the following coherency buckets on all pre-MTL
>>>> platforms:
>>>>>>>> + *
>>>>>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>>>
>>>>>>> As noted on the previous patch, it seems like 2-way is appropriate
>>>>>>> for LLC platforms, but 1-way might be a more accurate description
>>>>>>> of dGPU behavior.
>>>>>>>
>>>>>>>> + *
>>>>>>>> + * In practice UC/WC/WT should only ever used for scanout
>>>>>>>> +surfaces
>>>> on
>>>>>>>> + * such platforms since it is only the display engine that is
>> actually
>>>>>>>> + * incoherent. Everything else should typically use WB given
>>>>>>>> +that we
>>>>>>>
>>>>>>> What if we're sharing our buffers with some other (non-GPU)
>>>>>>> device? Are there cases where that other device wouldn't be
>>>>>>> coherent with the LLC, so we'd want to use one of these?
>>>>>>
>>>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> + * have a shared-LLC. On MTL+ this completely changes (also
>>>>>> potentially
>>>>>>>> + * no shared-LLC) and the HW defines the coherency mode as
>> part
>>>>>>>> +of
>>>>>> the
>>>>>>>> + * @pat_index.
>>>>>>>> + *
>>>>>>>> + * Note: For userptr and externally imported dma-buf the
>> kernel
>>>>>> expects
>>>>>>>> + * either 1WAY or 2WAY for the @pat_index.
>>>>>>>> + */
>>>>>>>> + __u32 pat_index;
>>>>>>>>
>>>>>>>> union {
>>>>>>>> /**
>>>>>>>> --
>>>>>>>> 2.41.0
>>>>>>>>
>>>>>>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-04 9:23 ` Matthew Auld
@ 2023-09-05 9:12 ` Zhang, Carl
2023-09-05 9:46 ` Matthew Auld
0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-09-05 9:12 UTC (permalink / raw)
To: Auld, Matthew, Roper, Matthew D
Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
Yu, Effie, intel-xe@lists.freedesktop.org
> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Monday, September 4, 2023 5:24 PM
>
> On 01/09/2023 10:34, Zhang, Carl wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Thursday, August 31, 2023 6:44 PM
> >>
> >> On 31/08/2023 09:24, Zhang, Carl wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>> Sent: Thursday, August 31, 2023 12:02 AM
> >>>>
> >>>> On 30/08/2023 16:27, Zhang, Carl wrote:
> >>>>> Several questions:
> >>>>> 1. the pat_index from vm_bind will override the setting from
> bo_create?
> >>>>> How to keep the value from bo_create unchanged?
> >>>>
> >>>> You only set the pat_index during vm_bind. At gem_create you just
> >>>> need tell the kernel what CPU side caching mode to use for system
> >>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
> >>>> pat_index. The current expectation is that the coherency mode of
> >>>> the pat_index should match the one at gem_create.
> >>>>
> >>> It means that we set the incomplete pat_index at gem_create
> >>> (coherency
> >>> only) And set another part at vm_bind (gpu cache) Why we separate
> >>> them? Why not set a full value at vm_bind?
> >>
> >> You always give the full pat_index at vm_bind. It's platform specific
> >> but the BSpec for the most part tells you what the coherency mode of the
> pat_index is.
> >>
> >> For example if you look at the MTL pat_index table it tells you the
> >> coh_mode:
> >>
> >> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
> >> pat_index 3 = coh_mode "1way" = COH_1WAY
> >> pat_index 4 = coh_mode "2way" = COH_2WAY
> >>
> >> On future platforms the table is larger and also encodes stuff like
> >> compression, but KMD really only cares about the coh_mode, since that
> >> will restrict the allowed CPU smem_caching values that userspace can
> select.
> >>
> >> From KMD pov this matters since we always give userspace zeroed
> >> memory, which is useful for normal applications but is also needed for
> security reasons.
> >> However the actual clearing might result in the writes only being
> >> visible in the CPU cache (not flushed to main memory), and if the GPU
> >> doesn't snoop the CPU cache it can read directly from main memory,
> >> which effectively bypasses the clearing. To prevent that the KMD
> >> needs to know how userspace will map the memory via the GPU i.e what
> >> is the coh_mode for the pat_index. The gist is that if you select
> >> COH_NONE then it needs to clflush the pages before giving them to
> >> userspace, which is done by forcing you to use uc/wc for the
> >> smem_caching (the kernel will internally issue cflush on x86 when
> >> marking the pages as uc/wc). And if you select wb there is no flushing, but
> here you need to use 1way or 2way.
> >>
> >
> > It sounds like a perf optimization, if no considering the perf, you
> > could always call Clflush to flush the cpu cache to main memory before
> > giving it to user space
>
> I think the other thing is that the KMD is no longer doing any manual flushing,
> but instead the core kernel will now do that for us when marking the pages as
> wc/uc on x86.
>
> >
> >
> >> AFAIK separating them is just to prevent userspace from mixing
> >> different coherency modes for the same object by having it immutable
> >> (you can still use different pat_index but the coherency must be
> compatible).
> >> Also at creation time we can determine if the smem_caching is
> >> compatible with the coh_mode.
> >>
> >> Note that it might be that we ditch the COH_2WAY and just have
> >> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be slightly
> different.
> >>
> >>>
> >>>>> 2. no UC/WB/WC definition (CPU cachable) in
> >>>>> drm_xe_gem_mmap_offset,
> >>>> will it be handled by KMD automatically?
> >>>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
> >>>>> could use WB in mmap offset If it is COHERENCY_NONE, we could
> >>>>> only use
> >>>> UC, all these logic is handled by KMD automatically?
> >>>>
> >>>> Yes, the mmap will use the same CPU side caching mode that you set
> >>>> at gem_create with smem_caching (wb/wc/uc).
> >>>>
> >>> It means it could not be changed after gem_create. Sometimes, UMD
> >>> doesn’t know Whether the data need to be accessed by CPU at
> >>> gem_create . only application Knows the purpose of allocation & the
> >>> surface usage. So, may need change UMD api to let applications set
> >>> whether the
> >> resource is "accessible", or it will hurt the perf.
> >>
> >> Right, the smem_caching is immutable.
> >>
> >>>
> >>>>> 3. about " For imported dma-buf (from a different device) the
> >>>>> coherency mode is also implicit and must also be either 1WAY or
> 2WAY"
> >>>>> it means it must be 1way or 2way, and UMD need not to set it?
> >>>>
> >>>> If you are going to vm_bind it you need to supply the pat_index and
> >>>> in this case it needs to be either 1way or 2way. The object comes
> >>>> from a different device so we might not have smem_caching/coh_mode
> >>>> like we do for native objects.
> >>>>
> >>> How about the GPU cache setting of imported dma-buf?
> >>> Is there possible conflict between dma-buf producer and consumer?
> >>> For example:
> >>> In producer side , one bo is set GPU L3 cache, also CPU cacheable,
> >>> it should be 2Way In client side, we know the coherency should be
> >>> 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-way.
> >>> It may cause some problems?
> >>
> >> If the producer and consumer are the same device, then I think the
> >> original gem_create.coh_mode is considered, since it's effectively
> >> still a native object. In which case the normal rules apply.
> >
> > But the consumer does not know the coh_mode, so , it is possible that
> > it set will set Another value, then it will failed.
> > And because consumer does not know it is 1-way or 2-way, consumer
> > could not Set the gpu cache correctly. Because if it is 1-way, cpu
> > could not snoop gpu cache
> >
> >>
> >> If it comes from a different device then we don't know what
> >> gem_create.coh_mode is (or if that even exists) since we don't have
> >> the usual native object, but just some opaque dma-buf object. In this
> >> case it must be at least 1way or 2way.
> >>
> >> From KMD pov we only care about userspace not being able to bypass
> >> the clearing. There might still be other coherency issues though.
> >>
> >>>
> >>> Also , when a bo is created, UMD does not know whether it will be
> exported.
> >>> It maybe be set as COHERENCY_NONE, right?
> >>
> >> Yeah, you can use whatever you want. Just that your smem_caching will
> >> need to be wc/uc if using COH_NONE. On the importer side, if it comes
> >> from the same device then it's just a normal native object and the normal
> rules apply.
> >> Currently this would mean also using pat_index with COH_NONE. If it
> >> is imported from a different device then you need to use a pat_index
> >> with 1way or 2way, since KMD doesn't really know how the object is
> >> mapped on the CPU or if the CPU caches are dirty (potentially with
> clearing).
> >>
> >> Is this going to be problematic?
> >>
> > If imported surfaces is from same device, how consumer know it is
> > original coh_non Or 1-way or 2-way?
> >
> > And it also means producer and consumer could have different pat_index for
> same bo.
>
> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external or
> from the same device? Would that help here? i.e you can always select 1way
> or 2way, even if it came from the same device and was coh_none?
>
> Other option is you could call the vm_bind ioctl to check if the coh_mode is
> supported by checking if it returns an error? But maybe that is too nasy.
>
Still a bit confuse, I create a bo with coh_none, then export it to a PRIME fd.
In consumer side, it does not know it is coh_none, and will call vm_bind to set pat_index.
You mean it will return a error when I set different coh value though vm_bind?
And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.
My concern is: if it is coh_non or 1-way, it means I could not use GPU cache if the bo also is CPU accessible.
Because CPU could not snoop GPU cache.
So, GPU cache specified by pat_index also will be rejected?
Could we add some query interface to query the CPU cache and coherency setting of a bo?
>
> >
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
> >>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> >>>> <pallavi.mishra@intel.com>;
> >>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
> >>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> >>>> Hazubski,
> >>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
> >>>>>> <carl.zhang@intel.com>; Yu,
> >>>> Effie
> >>>>>> <effie.yu@intel.com>
> >>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
> >>>>>> with
> >>>> vm_bind
> >>>>>>
> >>>>>> On 29/08/2023 22:36, Matt Roper wrote:
> >>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>>>>>> Allow userspace to directly control the pat_index for a given
> >>>>>>>> vm binding. This should allow directly controlling the
> >>>>>>>> coherency, caching and potentially other stuff in the future for the
> ppGTT binding.
> >>>>>>>>
> >>>>>>>> The exact meaning behind the pat_index is very platform
> >>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
> >>>>>>>> predefined memory attributes. From the KMD pov we only care
> >>>>>>>> about the coherency that is provided by the pat_index, which
> >>>>>>>> falls into either
> >> NONE, 1WAY or 2WAY.
> >>>>>>>> The vm_bind coherency mode for the given pat_index needs to
> >>>>>>>> match
> >>>> the
> >>>>>>>> given coh_mode that was set at object creation. For platforms
> >>>>>>>> that lack
> >>>>>>>
> >>>>>>> Is it actually important to match the coherency mode? I think
> >>>>>>> one of the main goals was to know up front if userspace might be
> >>>>>>> using a non-snooping PAT setting that would let it bypass the
> >>>>>>> CPU cache (and potentially read old, stale data from a different
> >>>>>>> process if the buffer's clear value is still sitting in cache
> >>>>>>> and hasn't landed in memory yet).
> >>>>>>>
> >>>>>>> If that's the only concern, then I think it should still be fine
> >>>>>>> to map with a non-matching PAT as long as it's more coherent
> >>>>>>> than the one specified at creation, right? E.g., if the buffer
> >>>>>>> was created with 1-way coherency, it would be fine to map it
> >>>>>>> with 2-way because userspace still can't use that to observe the
> >>>>>>> previous contents of the buffer. Or
> >>>>>>
> >>>>>> Yeah, I guess we could in theory do something that.
> >>>>>>
> >>>>>>> if the buffer was created with "non-coherent" then we've already
> >>>>>>> done the necessary clflushing in kernel before handing to buffer
> >>>>>>> over to userspace to ensure the clear value landed in memory, so
> >>>>>>> any valid PAT index should be safe (from a security POV) after
> >>>>>>> that, right? Any other problems that arise from mismatched
> >>>>>>> coherency would just be contained to the app possibly shooting
> >>>>>>> itself in the foot, which isn't really our concern.
> >>>>>>
> >>>>>> That is also my understanding, at least from the KMD security pov.
> >>>>>> If you allocate as wb then you must use at least 1way, since
> >>>>>> there is no flushing
> >>>> for
> >>>>>> clearing or swap-in. For uc/wc you could in theory use whatever you
> want.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
> WB
> >>>>>>>> as
> >>>>>> 2WAY.
> >>>>>>>>
> >>>>>>>> For userptr mappings we lack a corresponding gem object, so the
> >>>> expected
> >>>>>>>> coherency mode is instead implicit and must fall into either
> >>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
> >>>>>>>> kernel. For imported dma-buf (from a different device) the
> >>>>>>>> coherency mode is also implicit and must also be either 1WAY or
> 2WAY.
> >>>>>>>>
> >>>>>>>> As part of adding pat_index support with vm_bind we also need
> >>>>>>>> stop
> >>>> using
> >>>>>>>> xe_cache_level and instead use the pat_index in various places.
> >>>>>>>> We still make use of xe_cache_level, but only as a convenience
> >>>>>>>> for kernel internal objectsi (internally it maps to some
> >>>>>>>> reasonable pat_index). For
> >>>>>>>
> >>>>>>> Maybe we should kill xe_cache_level completely and just assign
> >>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
> >>>>>>> in appropriate places, similar to what we do with MOCS (gt-
> >>>>> mocs.uc_index,
> >>>>>>> gt->mocs.wb_index)?
> >>>>>>
> >>>>>> OK, seems reasonable to me.
> >>>>>>
> >>>>>>>
> >>>>>>>> now this is just a 1:1 conversion of the existing code, however
> >>>>>>>> for platforms like MTL+ we might need to give more control
> >>>>>>>> through
> >>>> bo_create
> >>>>>>>> or stop using WB on the CPU side if we need CPU access.
> >>>>>>>>
> >>>>>>>> Bspec: 45101, 44235 #xe
> >>>>>>>> Bspec: 70552, 71582, 59400 #xe2
> >>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>>>>>> ---
> >>>>>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> >>>>>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> >>>>>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> >>>>>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> >>>>>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> >>>>>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> >>>>>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> >>>>>>>> drivers/gpu/drm/xe/xe_vm.c | 73
> +++++++++++++++++++++----
> >> -
> >>>> -
> >>>>>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> >>>>>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> >>>>>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>>>>>> (*alloc_scratch_dma)(struct i915_address_space
> *vm,
> >> int
> >>>>>>>> sz);
> >>>>>>>>
> >>>>>>>> u64 (*pte_encode)(dma_addr_t addr,
> >>>>>>>> - unsigned int pat_index,
> >>>>>>>> + u32 pat_index,
> >>>>>>>> u32 flags); /* Create a valid PTE */
> >>>>>>>> #define PTE_READ_ONLY BIT(0)
> >>>>>>>> #define PTE_LM BIT(1)
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> >>>> xe_migrate
> >>>>>> *m, struct kunit *test)
> >>>>>>>> /* First part of the test, are we updating our pagetable
> >>>>>>>> bo with a new
> >>>>>> entry? */
> >>>>>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
> >> (NUM_KERNEL_PDE -
> >>>> 1),
> >>>>>> u64,
> >>>>>>>> 0xdeaddeadbeefbeef);
> >>>>>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>>>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >>>>>> XE_CACHE_WB), 0);
> >>>>>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>>>>>> expected |= XE_PTE_PS64;
> >>>>>>>> if (xe_bo_is_vram(pt))
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>> index 209fa053d9fb..4134c26150a5 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
> u64
> >>>>>> bo_offset)
> >>>>>>>> pte |= XE_GGTT_PTE_DM;
> >>>>>>>>
> >>>>>>>> if ((ggtt->pat_encode).pte_encode)
> >>>>>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>> XE_CACHE_WB_1_WAY);
> >>>>>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>>>> +
> >> xe_pat_get_index(xe,
> >>>>>> XE_CACHE_WB_1_WAY));
> >>>>>>>>
> >>>>>>>> return pte;
> >>>>>>>> }
> >>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
> *ggtt)
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
> >>>>>>>> u64
> >>>> pte_pat,
> >>>>>>>> - enum xe_cache_level
> >> cache)
> >>>>>>>> + u32 pat_index)
> >>>>>>>> {
> >>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>> -
> >>>>>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>>>>>
> >>>>>>>> if (pat_index & BIT(0))
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>>>>>
> >>>>>>>> struct {
> >>>>>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>> - enum xe_cache_level
> >> cache);
> >>>>>>>> + u32 pat_index);
> >>>>>>>> } pat_encode;
> >>>>>>>> };
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> index a782ea282cb6..54585e98452a 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> @@ -24,6 +24,7 @@
> >>>>>>>> #include "xe_lrc.h"
> >>>>>>>> #include "xe_map.h"
> >>>>>>>> #include "xe_mocs.h"
> >>>>>>>> +#include "xe_pat.h"
> >>>>>>>> #include "xe_pt.h"
> >>>>>>>> #include "xe_res_cursor.h"
> >>>>>>>> #include "xe_sched_job.h"
> >>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm-
> >>> pt_root[id]-
> >>>>>>> level;
> >>>>>>>> u32 map_ofs, level, i;
> >>>>>>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >>>>>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>>>>>> u64 entry;
> >>>>>>>> int ret;
> >>>>>>>>
> >>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> >>>> xe_tile
> >>>>>> *tile, struct xe_migrate *m,
> >>>>>>>> return ret;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >>>> XE_CACHE_WB);
> >>>>>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >> pat_index);
> >>>>>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>>>>>
> >>>>>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>>>>>
> >>>>>>>> /* Map the entire BO in our level 0 pt */
> >>>>>>>> for (i = 0, level = 0; i < num_entries; level++) {
> >>>>>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >> XE_CACHE_WB,
> >>>>>> 0);
> >>>>>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >> pat_index, 0);
> >>>>>>>>
> >>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >> entry);
> >>>>>>>>
> >>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>> for (i = 0; i < batch->size;
> >>>>>>>> i += vm->flags & XE_VM_FLAG_64K ?
> >> XE_64K_PAGE_SIZE :
> >>>>>>>> XE_PAGE_SIZE) {
> >>>>>>>> - entry = xe_pte_encode(batch, i,
> >> XE_CACHE_WB, 0);
> >>>>>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>>>>>
> >>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level
> *
> >> 8, u64,
> >>>>>>>> entry);
> >>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>> flags = XE_PDE_64K;
> >>>>>>>>
> >>>>>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>>>>>> - XE_PAGE_SIZE,
> >> XE_CACHE_WB);
> >>>>>>>> + XE_PAGE_SIZE, pat_index);
> >>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
> XE_PAGE_SIZE
> >> * level,
> >>>>>> u64,
> >>>>>>>> entry | flags);
> >>>>>>>> }
> >>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>> xe_tile
> >>>> *tile,
> >>>>>> struct xe_migrate *m,
> >>>>>>>> /* Write PDE's that point to our BO. */
> >>>>>>>> for (i = 0; i < num_entries - num_level; i++) {
> >>>>>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>>>>>> - XE_CACHE_WB);
> >>>>>>>> + pat_index);
> >>>>>>>>
> >>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
> XE_PAGE_SIZE
> >> +
> >>>>>>>> (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
> >> @@
> >>>>>>>> xe_migrate_update_pgtables(struct
> >>>> xe_migrate
> >>>>>> *m,
> >>>>>>>>
> >>>>>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>>>>>
> >>>>>>>> - addr = xe_pte_encode(pt_bo, 0,
> >> XE_CACHE_WB, 0);
> >>>>>>>> + addr = xe_pte_encode(pt_bo, 0,
> >> xe_pat_get_index(xe,
> >>>>>> XE_CACHE_WB), 0);
> >>>>>>>> bb->cs[bb->len++] = lower_32_bits(addr);
> >>>>>>>> bb->cs[bb->len++] = upper_32_bits(addr);
> >>>>>>>> }
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
> >>>>>>>> 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>> @@ -10,6 +10,7 @@
> >>>>>>>> #include "xe_gt.h"
> >>>>>>>> #include "xe_gt_tlb_invalidation.h"
> >>>>>>>> #include "xe_migrate.h"
> >>>>>>>> +#include "xe_pat.h"
> >>>>>>>> #include "xe_pt_types.h"
> >>>>>>>> #include "xe_pt_walk.h"
> >>>>>>>> #include "xe_res_cursor.h"
> >>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
> >>>>>>>> xe_pt_dir
> >>>>>> *pt_dir, unsigned int index)
> >>>>>>>> *
> >>>>>>>> * Return: An encoded page directory entry. No errors.
> >>>>>>>> */
> >>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>> - const enum xe_cache_level cache)
> >>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>> +pat_index)
> >>>>>>>> {
> >>>>>>>> u64 pde;
> >>>>>>>> struct xe_vm *vm = bo->vm;
> >>>>>>>> struct xe_device *xe = vm->xe;
> >>>>>>>>
> >>>>>>>> -
> >>>>>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>>>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>>>>>
> >>>>>>>> if ((vm->pat_encode).pde_encode)
> >>>>>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>>>>>> + pde = (vm->pat_encode).pde_encode(xe, pde,
> >> pat_index);
> >>>>>>>>
> >>>>>>>> return pde;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>>>>>> struct xe_vma *vma, u32 pt_level)
> >>>>>>>> {
> >>>>>>>> struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
> >> static
> >>>>>>>> u64 __pte_encode(u64 pte, enum
> >>>>>> xe_cache_level cache,
> >>>>>>>> pte |= XE_PTE_NULL;
> >>>>>>>>
> >>>>>>>> if ((vm->pat_encode).pte_encode)
> >>>>>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>>>>>> + pte = (vm->pat_encode).pte_encode(xe, pte,
> >> pat_index);
> >>>>>>>>
> >>>>>>>> if (pt_level == 1)
> >>>>>>>> pte |= XE_PDE_PS_2M;
> >>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >>>>>> xe_cache_level cache,
> >>>>>>>> *
> >>>>>>>> * Return: An encoded page-table entry. No errors.
> >>>>>>>> */
> >>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>> xe_cache_level
> >>>>>> cache,
> >>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>>>> u32 pt_level)
> >>>>>>>> {
> >>>>>>>> u64 pte;
> >>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> >>>> offset,
> >>>>>> enum xe_cache_level cache,
> >>>>>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>>>>>> pte |= XE_PPGTT_PTE_DM;
> >>>>>>>>
> >>>>>>>> - return __pte_encode(pte, cache, NULL, pt_level);
> >>>>>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
> >>>>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
> >>>>>>>> __xe_pt_empty_pte(struct xe_tile
> >>>> *tile,
> >>>>>> struct xe_vm *vm,
> >>>>>>>>
> >>>>>>>> if (level == 0) {
> >>>>>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>>>>>> - XE_CACHE_WB, 0);
> >>>>>>>> + xe_pat_get_index(vm->xe,
> >>>>>> XE_CACHE_WB), 0);
> >>>>>>>>
> >>>>>>>> return empty;
> >>>>>>>> } else {
> >>>>>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]-
> >>> bo, 0,
> >>>>>>>> - XE_CACHE_WB);
> >>>>>>>> + xe_pat_get_index(vm->xe,
> >>>>>> XE_CACHE_WB));
> >>>>>>>> }
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>>>>>> struct xe_vm *vm;
> >>>>>>>> /** @tile: The tile we're building for. */
> >>>>>>>> struct xe_tile *tile;
> >>>>>>>> - /** @cache: Desired cache level for the ptes */
> >>>>>>>> - enum xe_cache_level cache;
> >>>>>>>> /** @default_pte: PTE flag only template. No address is
> >> associated */
> >>>>>>>> u64 default_pte;
> >>>>>>>> /** @dma_offset: DMA offset to add to the PTE. */ @@
> >>>>>>>> -604,7
> >>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>>>> pgoff_t offset,
> >>>>>>>>
> >>>>>>>> pte = __pte_encode(is_null ? 0 :
> >>>>>>>> xe_res_dma(curs) + xe_walk-
> >>> dma_offset,
> >>>>>>>> - xe_walk->cache, xe_walk->vma,
> >> level);
> >>>>>>>> + xe_walk->vma->pat_index,
> xe_walk-
> >>> vma,
> >>>>>> level);
> >>>>>>>> pte |= xe_walk->default_pte;
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
> >>>>>>>> *parent,
> >>>>>> pgoff_t offset,
> >>>>>>>> xe_child->is_compact = true;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
> >cache)
> >> |
> >>>> flags;
> >>>>>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>>>>>> pat_index) | flags;
> >>>>>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
> >> xe_child,
> >>>>>>>> pte);
> >>>>>>>> }
> >>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
> >>>>>>>> struct
> >>>> xe_vma
> >>>>>> *vma,
> >>>>>>>> if (vma && vma->gpuva.flags &
> >> XE_VMA_ATOMIC_PTE_BIT)
> >>>>>>>> xe_walk.default_pte |=
> >> XE_USM_PPGTT_PTE_AE;
> >>>>>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>>>>>> ttm.resource);
> >>>>>>>> - xe_walk.cache = XE_CACHE_WB;
> >>>>>>>> - } else {
> >>>>>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
> >>>>>> XE_BO_SCANOUT_BIT)
> >>>>>>>> - xe_walk.cache = XE_CACHE_WT;
> >>>>>>>> - else
> >>>>>>>> - xe_walk.cache = XE_CACHE_WB;
> >>>>>>>> }
> >>>>>>>> +
> >>>>>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>>>>>> xe_walk.dma_offset =
> >>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
> >>>>>>>> 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
> >>>>>>>> struct
> >>>>>> xe_vma *vma, struct xe_exec_queu
> >>>>>>>>
> >>>>>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
> >>>>>>>> *vma);
> >>>>>>>>
> >>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>> - const enum xe_cache_level level);
> >>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>> +pat_index);
> >>>>>>>>
> >>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>> xe_cache_level
> >>>>>> cache,
> >>>>>>>> - u32 pt_level);
> >>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>>>>>>> +u32
> >>>>>> pt_level);
> >>>>>>>>
> >>>>>>>> #endif
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
> 7eeeed0411f3..34603a7e84b0
> >>>>>>>> 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>> @@ -6,6 +6,7 @@
> >>>>>>>> #include "xe_vm.h"
> >>>>>>>>
> >>>>>>>> #include <linux/dma-fence-array.h>
> >>>>>>>> +#include <linux/nospec.h>
> >>>>>>>>
> >>>>>>>> #include <drm/drm_print.h>
> >>>>>>>> #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
> >>>>>>>> static struct xe_vma *xe_vma_create(struct
> >>>> xe_vm
> >>>>>> *vm,
> >>>>>>>> u64 start, u64 end,
> >>>>>>>> bool read_only,
> >>>>>>>> bool is_null,
> >>>>>>>> - u8 tile_mask)
> >>>>>>>> + u8 tile_mask,
> >>>>>>>> + u32 pat_index)
> >>>>>>>> {
> >>>>>>>> struct xe_vma *vma;
> >>>>>>>> struct xe_tile *tile;
> >>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
> *xe_vma_create(struct
> >>>> xe_vm
> >>>>>> *vm,
> >>>>>>>> vma->tile_mask |= 0x1 << id;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> + vma->pat_index = pat_index;
> >>>>>>>> +
> >>>>>>>> if (vm->xe->info.platform == XE_PVC)
> >>>>>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>>>>>
> >>>>>>>> @@ -1194,10 +1198,8 @@ static void
> xe_vma_op_work_func(struct
> >>>>>> work_struct *w);
> >>>>>>>> static void vm_destroy_work_func(struct work_struct *w);
> >>>>>>>>
> >>>>>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
> >>>>>>>> u64
> >>>> pde_pat,
> >>>>>>>> - enum xe_cache_level
> >> cache)
> >>>>>>>> + u32 pat_index)
> >>>>>>>> {
> >>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>> -
> >>>>>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>>>>>
> >>>>>>>> if (pat_index & BIT(0))
> >>>>>>>> @@ -1213,10 +1215,8 @@ static u64
> >>>> xelp_ppgtt_pde_encode_pat(struct
> >>>>>> xe_device *xe, u64 pde_pat,
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
> >>>>>>>> u64
> >>>> pte_pat,
> >>>>>>>> - enum xe_cache_level
> >> cache)
> >>>>>>>> + u32 pat_index)
> >>>>>>>> {
> >>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>> -
> >>>>>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>>>>>
> >>>>>>>> if (pat_index & BIT(0))
> >>>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct
> xe_file
> >>>> *xef,
> >>>>>> u32 id)
> >>>>>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile
> *tile)
> >>>>>>>> {
> >>>>>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>>>>>> - XE_CACHE_WB);
> >>>>>>>> + xe_pat_get_index(vm->xe,
> >> XE_CACHE_WB));
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> static struct dma_fence *
> >>>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device
> >>>>>>>> *xe,
> >>>> struct
> >>>>>> drm_gpuva_op *op)
> >>>>>>>> static struct drm_gpuva_ops *
> >>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>>>>>>> u64 bo_offset_or_userptr, u64 addr, u64
> >> range,
> >>>>>>>> - u32 operation, u8 tile_mask, u32 region)
> >>>>>>>> + u32 operation, u8 tile_mask, u32 region, u32
> >>>>>> pat_index)
> >>>>>>>> {
> >>>>>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>>>>>> struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
> >>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
> >> *vm,
> >>>>>> struct xe_bo *bo,
> >>>>>>>> struct xe_vma_op *op =
> >> gpuva_op_to_vma_op(__op);
> >>>>>>>>
> >>>>>>>> op->tile_mask = tile_mask;
> >>>>>>>> + op->pat_index = pat_index;
> >>>>>>>> op->map.immediate =
> >>>>>>>> operation &
> >>>> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>>>>>> op->map.read_only =
> >>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >> *vm,
> >>>>>> struct xe_bo *bo,
> >>>>>>>> struct xe_vma_op *op =
> >> gpuva_op_to_vma_op(__op);
> >>>>>>>>
> >>>>>>>> op->tile_mask = tile_mask;
> >>>>>>>> + op->pat_index = pat_index;
> >>>>>>>> op->prefetch.region = region;
> >>>>>>>> }
> >>>>>>>> break;
> >>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >> *vm,
> >>>>>> struct xe_bo *bo,
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >>>>>> drm_gpuva_op_map *op,
> >>>>>>>> - u8 tile_mask, bool read_only, bool is_null)
> >>>>>>>> + u8 tile_mask, bool read_only, bool is_null,
> >>>>>>>> + u32 pat_index)
> >>>>>>>> {
> >>>>>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
> >>> gem.obj) :
> >>>> NULL;
> >>>>>>>> struct xe_vma *vma;
> >>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
> >>>>>>>> xe_vm
> >>>>>> *vm, struct drm_gpuva_op_map *op,
> >>>>>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>>>>>> op->va.addr, op->va.addr +
> >>>>>>>> op->va.range - 1, read_only, is_null,
> >>>>>>>> - tile_mask);
> >>>>>>>> + tile_mask, pat_index);
> >>>>>>>> if (bo)
> >>>>>>>> xe_bo_unlock(bo, &ww);
> >>>>>>>>
> >>>>>>>> @@ -2539,7 +2542,7 @@ static int
> vm_bind_ioctl_ops_parse(struct
> >>>> xe_vm
> >>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>
> >>>>>>>> vma = new_vma(vm, &op-
> >base.map,
> >>>>>>>> op->tile_mask, op-
> >>>>> map.read_only,
> >>>>>>>> - op->map.is_null);
> >>>>>>>> + op->map.is_null, op-
> >>> pat_index);
> >>>>>>>> if (IS_ERR(vma)) {
> >>>>>>>> err = PTR_ERR(vma);
> >>>>>>>> goto free_fence;
> >>>>>>>> @@ -2567,7 +2570,7 @@ static int
> vm_bind_ioctl_ops_parse(struct
> >>>> xe_vm
> >>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>
> >>>>>>>> vma = new_vma(vm, op-
> >>>>>>> base.remap.prev,
> >>>>>>>> op->tile_mask,
> >> read_only,
> >>>>>>>> - is_null);
> >>>>>>>> + is_null, op-
> >>> pat_index);
> >>>>>>>> if (IS_ERR(vma)) {
> >>>>>>>> err = PTR_ERR(vma);
> >>>>>>>> goto free_fence;
> >>>>>>>> @@ -2603,7 +2606,7 @@ static int
> vm_bind_ioctl_ops_parse(struct
> >>>> xe_vm
> >>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>
> >>>>>>>> vma = new_vma(vm, op-
> >>>>>>> base.remap.next,
> >>>>>>>> op->tile_mask,
> >> read_only,
> >>>>>>>> - is_null);
> >>>>>>>> + is_null, op-
> >>> pat_index);
> >>>>>>>> if (IS_ERR(vma)) {
> >>>>>>>> err = PTR_ERR(vma);
> >>>>>>>> goto free_fence;
> >>>>>>>> @@ -3158,8 +3161,14 @@ static int
> >>>>>>>> vm_bind_ioctl_check_args(struct
> >>>>>> xe_device *xe,
> >>>>>>>> u32 obj = (*bind_ops)[i].obj;
> >>>>>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>>>>>> u32 region = (*bind_ops)[i].region;
> >>>>>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
> >>>>>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>>>>>
> >>>>>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>>>>>> info.pat_table_n_entries)) {
> >>>>>>>> + err = -EINVAL;
> >>>>>>>> + goto free_bind_ops;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> if (i == 0) {
> >>>>>>>> *async = !!(op &
> XE_VM_BIND_FLAG_ASYNC);
> >>>>>>>> } else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
> >> +3355,25
> >>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
> >>>> *dev,
> >>>>>> void *data, struct drm_file *file)
> >>>>>>>> struct drm_gem_object *gem_obj;
> >>>>>>>> u64 range = bind_ops[i].range;
> >>>>>>>> u64 addr = bind_ops[i].addr;
> >>>>>>>> + u32 op = bind_ops[i].op;
> >>>>>>>> u32 obj = bind_ops[i].obj;
> >>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>> + u16 coh_mode;
> >>>>>>>> +
> >>>>>>>> + pat_index = array_index_nospec(pat_index,
> >>>>>>>> + xe-
> >>> info.pat_table_n_entries);
> >>>>>>>> + coh_mode = xe_pat_index_get_coh_mode(xe,
> >> pat_index);
> >>>>>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>>>>>
> >>>>>>> Assuming we drop the unusable entries from the TGL table, this
> >>>>>>> should be impossible, right? Any index that makes it past the
> >>>>>>> n_entries check at the top of the function should have a valid,
> >>>>>>> non-zero
> >> coh_mode value.
> >>>>>>> So this should probably be an assertion (to highlight a KMD bug)
> >>>>>>> rather than just a silent uapi failure return.
> >>>>>>
> >>>>>> Makes sense.
> >>>>>>
> >>>>>>>
> >>>>>>>> + err = -EINVAL;
> >>>>>>>> + goto put_obj;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>>>>>> + coh_mode ==
> >> XE_GEM_COHERENCY_NONE))
> >>>> {
> >>>>>>>> + err = -EINVAL;
> >>>>>>>> + goto put_obj;
> >>>>>>>> + }
> >>>>>>>>
> >>>>>>>> if (!obj)
> >>>>>>>> continue;
> >>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
> drm_device
> >>>> *dev,
> >>>>>> void *data, struct drm_file *file)
> >>>>>>>> goto put_obj;
> >>>>>>>> }
> >>>>>>>> }
> >>>>>>>> +
> >>>>>>>> + if (bos[i]->coh_mode) {
> >>>>>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >>>>>> coh_mode)) {
> >>>>>>>> + err = -EINVAL;
> >>>>>>>> + goto put_obj;
> >>>>>>>> + }
> >>>>>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
> >>>>>> XE_GEM_COHERENCY_NONE)) {
> >>>>>>>> + /*
> >>>>>>>> + * Imported dma-buf from a different device
> >> should
> >>>>>>>> + * require 1way or 2way coherency since we
> >> don't
> >>>>>> know
> >>>>>>>> + * how it was mapped on CPU. Just assume is
> it
> >>>>>>>> + * potentially cached on CPU side.
> >>>>>>>> + */
> >>>>>>>> + err = -EINVAL;
> >>>>>>>> + goto put_obj;
> >>>>>>>> + }
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> if (args->num_syncs) {
> >>>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct
> drm_device
> >>>> *dev,
> >>>>>> void *data, struct drm_file *file)
> >>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>> u8 tile_mask = bind_ops[i].tile_mask;
> >>>>>>>> u32 region = bind_ops[i].region;
> >>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>>
> >>>>>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
> >> obj_offset,
> >>>>>>>> addr, range, op,
> >> tile_mask,
> >>>>>>>> - region);
> >>>>>>>> + region, pat_index);
> >>>>>>>> if (IS_ERR(ops[i])) {
> >>>>>>>> err = PTR_ERR(ops[i]);
> >>>>>>>> ops[i] = NULL;
> >>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> index 83a1f87b6537..508679ed3c74 100644
> >>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>>>>>> */
> >>>>>>>> u8 tile_present;
> >>>>>>>>
> >>>>>>>> + /**
> >>>>>>>> + * @pat_index: The pat index to use when encoding the PTEs
> >> for
> >>>>>>>> +this
> >>>>>> vma.
> >>>>>>>> + */
> >>>>>>>> + u32 pat_index;
> >>>>>>>> +
> >>>>>>>> struct {
> >>>>>>>> struct list_head rebind_link;
> >>>>>>>> } notifier;
> >>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>>>>>> bool batch_invalidate_tlb;
> >>>>>>>>
> >>>>>>>> struct {
> >>>>>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>> - enum xe_cache_level
> >> cache);
> >>>>>>>> - u64 (*pde_encode)(struct xe_device *xe, u64
> pde_pat,
> >>>>>>>> - enum xe_cache_level
> >> cache);
> >>>>>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> u32
> >>>>>> pat_index);
> >>>>>>>> + u64 (*pde_encode)(struct xe_device *xe, u64
> pde_pat,
> >> u32
> >>>>>> pat_index);
> >>>>>>>> } pat_encode;
> >>>>>>>> };
> >>>>>>>>
> >>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>>>>>> struct async_op_fence *fence;
> >>>>>>>> /** @tile_mask: gt mask for this operation */
> >>>>>>>> u8 tile_mask;
> >>>>>>>> + /** @pat_index: The pat index to use for this operation. */
> >>>>>>>> + u32 pat_index;
> >>>>>>>> /** @flags: operation flags */
> >>>>>>>> enum xe_vma_op_flags flags;
> >>>>>>>>
> >>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
> >>>>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
> >>>>>>>> 100644
> >>>>>>>> --- a/include/uapi/drm/xe_drm.h
> >>>>>>>> +++ b/include/uapi/drm/xe_drm.h
> >>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>>>>>> */
> >>>>>>>> __u32 obj;
> >>>>>>>>
> >>>>>>>> - /** @pad: MBZ */
> >>>>>>>> - __u32 pad;
> >>>>>>>> + /**
> >>>>>>>> + * @pat_index: The platform defined @pat_index to use for
> >> this
> >>>>>> mapping.
> >>>>>>>> + * The index basically maps to some predefined memory
> >> attributes,
> >>>>>>>> + * including things like caching, coherency and likely other
> >>>>>>>> +stuff
> >> in
> >>>>>>>> + * the future. The exact meaning of the pat_index is
> >>>>>>>> +platform specific
> >>>>>>>
> >>>>>>> BTW, "other stuff in the future" already includes compression on
> >>>>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
> >>>>>>
> >>>>>> Ok, good to know.
> >>>>>>
> >>>>>>>
> >>>>>>>> + * and defined in the Bspec and PRMs. When the KMD sets
> up
> >> the
> >>>>>> binding
> >>>>>>>> + * the index here is encoded into the ppGTT PTE.
> >>>>>>>> + *
> >>>>>>>> + * For coherency the @pat_index needs to match the
> >>>>>>>> + * drm_xe_gem_create.coh_mode, so either
> >>>>>> XE_GEM_COHERENCY_NONE,
> >>>>>>>> + * XE_GEM_COHERENCY_1WAY or
> >> XE_GEM_COHERENCY_2WAY. The
> >>>>>> KMD will extract
> >>>>>>>> + * the coherency mode from the @pat_index and reject if
> >> there is a
> >>>>>>>> + * mismatch (see note below for pre-MTL platforms).
> >>>>>>>> + *
> >>>>>>>> + * Note: On pre-MTL platforms there is only a caching mode
> >> and no
> >>>>>>>> + * explicit coherency mode, but on such hardware there is
> >> always a
> >>>>>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are
> >>>>>>>> +coherent
> >>>> with
> >>>>>>>> + * CPU caches even with the caching mode set as uncached.
> >> It's
> >>>>>>>> +only
> >>>>>> the
> >>>>>>>> + * display engine that is incoherent (on dgpu it must be in
> >> VRAM which
> >>>>>>>> + * is always mapped as WC on the CPU). However to keep the
> >> uapi
> >>>>>> somewhat
> >>>>>>>> + * consistent with newer platforms the KMD groups the
> >> different
> >>>> cache
> >>>>>>>> + * levels into the following coherency buckets on all pre-MTL
> >>>> platforms:
> >>>>>>>> + *
> >>>>>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>>>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>>>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>>>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>>>>>
> >>>>>>> As noted on the previous patch, it seems like 2-way is
> >>>>>>> appropriate for LLC platforms, but 1-way might be a more
> >>>>>>> accurate description of dGPU behavior.
> >>>>>>>
> >>>>>>>> + *
> >>>>>>>> + * In practice UC/WC/WT should only ever used for scanout
> >>>>>>>> +surfaces
> >>>> on
> >>>>>>>> + * such platforms since it is only the display engine that is
> >> actually
> >>>>>>>> + * incoherent. Everything else should typically use WB given
> >>>>>>>> +that we
> >>>>>>>
> >>>>>>> What if we're sharing our buffers with some other (non-GPU)
> >>>>>>> device? Are there cases where that other device wouldn't be
> >>>>>>> coherent with the LLC, so we'd want to use one of these?
> >>>>>>
> >>>>>> Yeah, I guess there might be cases like that. I'll reword.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>>> + * have a shared-LLC. On MTL+ this completely changes (also
> >>>>>> potentially
> >>>>>>>> + * no shared-LLC) and the HW defines the coherency mode as
> >> part
> >>>>>>>> +of
> >>>>>> the
> >>>>>>>> + * @pat_index.
> >>>>>>>> + *
> >>>>>>>> + * Note: For userptr and externally imported dma-buf the
> >> kernel
> >>>>>> expects
> >>>>>>>> + * either 1WAY or 2WAY for the @pat_index.
> >>>>>>>> + */
> >>>>>>>> + __u32 pat_index;
> >>>>>>>>
> >>>>>>>> union {
> >>>>>>>> /**
> >>>>>>>> --
> >>>>>>>> 2.41.0
> >>>>>>>>
> >>>>>>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-05 9:12 ` Zhang, Carl
@ 2023-09-05 9:46 ` Matthew Auld
2023-09-05 13:50 ` Zhang, Carl
0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-09-05 9:46 UTC (permalink / raw)
To: Zhang, Carl, Roper, Matthew D
Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
Yu, Effie, intel-xe@lists.freedesktop.org
On 05/09/2023 10:12, Zhang, Carl wrote:
>
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Monday, September 4, 2023 5:24 PM
>>
>> On 01/09/2023 10:34, Zhang, Carl wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Thursday, August 31, 2023 6:44 PM
>>>>
>>>> On 31/08/2023 09:24, Zhang, Carl wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>> Sent: Thursday, August 31, 2023 12:02 AM
>>>>>>
>>>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>>>>>> Several questions:
>>>>>>> 1. the pat_index from vm_bind will override the setting from
>> bo_create?
>>>>>>> How to keep the value from bo_create unchanged?
>>>>>>
>>>>>> You only set the pat_index during vm_bind. At gem_create you just
>>>>>> need tell the kernel what CPU side caching mode to use for system
>>>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
>>>>>> pat_index. The current expectation is that the coherency mode of
>>>>>> the pat_index should match the one at gem_create.
>>>>>>
>>>>> It means that we set the incomplete pat_index at gem_create
>>>>> (coherency
>>>>> only) And set another part at vm_bind (gpu cache) Why we separate
>>>>> them? Why not set a full value at vm_bind?
>>>>
>>>> You always give the full pat_index at vm_bind. It's platform specific
>>>> but the BSpec for the most part tells you what the coherency mode of the
>> pat_index is.
>>>>
>>>> For example if you look at the MTL pat_index table it tells you the
>>>> coh_mode:
>>>>
>>>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
>>>> pat_index 3 = coh_mode "1way" = COH_1WAY
>>>> pat_index 4 = coh_mode "2way" = COH_2WAY
>>>>
>>>> On future platforms the table is larger and also encodes stuff like
>>>> compression, but KMD really only cares about the coh_mode, since that
>>>> will restrict the allowed CPU smem_caching values that userspace can
>> select.
>>>>
>>>> From KMD pov this matters since we always give userspace zeroed
>>>> memory, which is useful for normal applications but is also needed for
>> security reasons.
>>>> However the actual clearing might result in the writes only being
>>>> visible in the CPU cache (not flushed to main memory), and if the GPU
>>>> doesn't snoop the CPU cache it can read directly from main memory,
>>>> which effectively bypasses the clearing. To prevent that the KMD
>>>> needs to know how userspace will map the memory via the GPU i.e what
>>>> is the coh_mode for the pat_index. The gist is that if you select
>>>> COH_NONE then it needs to clflush the pages before giving them to
>>>> userspace, which is done by forcing you to use uc/wc for the
>>>> smem_caching (the kernel will internally issue cflush on x86 when
>>>> marking the pages as uc/wc). And if you select wb there is no flushing, but
>> here you need to use 1way or 2way.
>>>>
>>>
>>> It sounds like a perf optimization, if no considering the perf, you
>>> could always call Clflush to flush the cpu cache to main memory before
>>> giving it to user space
>>
>> I think the other thing is that the KMD is no longer doing any manual flushing,
>> but instead the core kernel will now do that for us when marking the pages as
>> wc/uc on x86.
>>
>>>
>>>
>>>> AFAIK separating them is just to prevent userspace from mixing
>>>> different coherency modes for the same object by having it immutable
>>>> (you can still use different pat_index but the coherency must be
>> compatible).
>>>> Also at creation time we can determine if the smem_caching is
>>>> compatible with the coh_mode.
>>>>
>>>> Note that it might be that we ditch the COH_2WAY and just have
>>>> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be slightly
>> different.
>>>>
>>>>>
>>>>>>> 2. no UC/WB/WC definition (CPU cachable) in
>>>>>>> drm_xe_gem_mmap_offset,
>>>>>> will it be handled by KMD automatically?
>>>>>>> For example: if set 1-way, it means GPU could snoop CPU cache, we
>>>>>>> could use WB in mmap offset If it is COHERENCY_NONE, we could
>>>>>>> only use
>>>>>> UC, all these logic is handled by KMD automatically?
>>>>>>
>>>>>> Yes, the mmap will use the same CPU side caching mode that you set
>>>>>> at gem_create with smem_caching (wb/wc/uc).
>>>>>>
>>>>> It means it could not be changed after gem_create. Sometimes, UMD
>>>>> doesn’t know Whether the data need to be accessed by CPU at
>>>>> gem_create . only application Knows the purpose of allocation & the
>>>>> surface usage. So, may need change UMD api to let applications set
>>>>> whether the
>>>> resource is "accessible", or it will hurt the perf.
>>>>
>>>> Right, the smem_caching is immutable.
>>>>
>>>>>
>>>>>>> 3. about " For imported dma-buf (from a different device) the
>>>>>>> coherency mode is also implicit and must also be either 1WAY or
>> 2WAY"
>>>>>>> it means it must be 1way or 2way, and UMD need not to set it?
>>>>>>
>>>>>> If you are going to vm_bind it you need to supply the pat_index and
>>>>>> in this case it needs to be either 1way or 2way. The object comes
>>>>>> from a different device so we might not have smem_caching/coh_mode
>>>>>> like we do for native objects.
>>>>>>
>>>>> How about the GPU cache setting of imported dma-buf?
>>>>> Is there possible conflict between dma-buf producer and consumer?
>>>>> For example:
>>>>> In producer side , one bo is set GPU L3 cache, also CPU cacheable,
>>>>> it should be 2Way In client side, we know the coherency should be
>>>>> 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-way.
>>>>> It may cause some problems?
>>>>
>>>> If the producer and consumer are the same device, then I think the
>>>> original gem_create.coh_mode is considered, since it's effectively
>>>> still a native object. In which case the normal rules apply.
>>>
>>> But the consumer does not know the coh_mode, so , it is possible that
>>> it set will set Another value, then it will failed.
>>> And because consumer does not know it is 1-way or 2-way, consumer
>>> could not Set the gpu cache correctly. Because if it is 1-way, cpu
>>> could not snoop gpu cache
>>>
>>>>
>>>> If it comes from a different device then we don't know what
>>>> gem_create.coh_mode is (or if that even exists) since we don't have
>>>> the usual native object, but just some opaque dma-buf object. In this
>>>> case it must be at least 1way or 2way.
>>>>
>>>> From KMD pov we only care about userspace not being able to bypass
>>>> the clearing. There might still be other coherency issues though.
>>>>
>>>>>
>>>>> Also , when a bo is created, UMD does not know whether it will be
>> exported.
>>>>> It maybe be set as COHERENCY_NONE, right?
>>>>
>>>> Yeah, you can use whatever you want. Just that your smem_caching will
>>>> need to be wc/uc if using COH_NONE. On the importer side, if it comes
>>>> from the same device then it's just a normal native object and the normal
>> rules apply.
>>>> Currently this would mean also using pat_index with COH_NONE. If it
>>>> is imported from a different device then you need to use a pat_index
>>>> with 1way or 2way, since KMD doesn't really know how the object is
>>>> mapped on the CPU or if the CPU caches are dirty (potentially with
>> clearing).
>>>>
>>>> Is this going to be problematic?
>>>>
>>> If imported surfaces is from same device, how consumer know it is
>>> original coh_non Or 1-way or 2-way?
>>>
>>> And it also means producer and consumer could have different pat_index for
>> same bo.
>>
>> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external or
>> from the same device? Would that help here? i.e you can always select 1way
>> or 2way, even if it came from the same device and was coh_none?
>>
>> Other option is you could call the vm_bind ioctl to check if the coh_mode is
>> supported by checking if it returns an error? But maybe that is too nasy.
>>
>
> Still a bit confuse, I create a bo with coh_none, then export it to a PRIME fd.
> In consumer side, it does not know it is coh_none, and will call vm_bind to set pat_index.
>
> You mean it will return a error when I set different coh value though vm_bind?
> And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.
Yeah, it is nasty.
>
> My concern is: if it is coh_non or 1-way, it means I could not use GPU cache if the bo also is CPU accessible.
> Because CPU could not snoop GPU cache.
> So, GPU cache specified by pat_index also will be rejected?
>
> Could we add some query interface to query the CPU cache and coherency setting of a bo?
Say if the buffer comes from a different device + driver, what pat_index
would you choose for that? There is no such thing as
coh_mode/smem_caching for such an object. If you have to handle that
case can't you just treat all imported dma-buf the same as that? The new
proposal was to allow 1way or 2way for any dma-buf object, even if it
comes from the same device and has different BO coh_mode. Do you need
more than that? AFAICT the query would only really work for objects
imported from the same device.
>
>>
>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>>>>>> <pallavi.mishra@intel.com>;
>>>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
>>>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>>>>>> Hazubski,
>>>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
>>>>>>>> <carl.zhang@intel.com>; Yu,
>>>>>> Effie
>>>>>>>> <effie.yu@intel.com>
>>>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
>>>>>>>> with
>>>>>> vm_bind
>>>>>>>>
>>>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>>>>>> Allow userspace to directly control the pat_index for a given
>>>>>>>>>> vm binding. This should allow directly controlling the
>>>>>>>>>> coherency, caching and potentially other stuff in the future for the
>> ppGTT binding.
>>>>>>>>>>
>>>>>>>>>> The exact meaning behind the pat_index is very platform
>>>>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
>>>>>>>>>> predefined memory attributes. From the KMD pov we only care
>>>>>>>>>> about the coherency that is provided by the pat_index, which
>>>>>>>>>> falls into either
>>>> NONE, 1WAY or 2WAY.
>>>>>>>>>> The vm_bind coherency mode for the given pat_index needs to
>>>>>>>>>> match
>>>>>> the
>>>>>>>>>> given coh_mode that was set at object creation. For platforms
>>>>>>>>>> that lack
>>>>>>>>>
>>>>>>>>> Is it actually important to match the coherency mode? I think
>>>>>>>>> one of the main goals was to know up front if userspace might be
>>>>>>>>> using a non-snooping PAT setting that would let it bypass the
>>>>>>>>> CPU cache (and potentially read old, stale data from a different
>>>>>>>>> process if the buffer's clear value is still sitting in cache
>>>>>>>>> and hasn't landed in memory yet).
>>>>>>>>>
>>>>>>>>> If that's the only concern, then I think it should still be fine
>>>>>>>>> to map with a non-matching PAT as long as it's more coherent
>>>>>>>>> than the one specified at creation, right? E.g., if the buffer
>>>>>>>>> was created with 1-way coherency, it would be fine to map it
>>>>>>>>> with 2-way because userspace still can't use that to observe the
>>>>>>>>> previous contents of the buffer. Or
>>>>>>>>
>>>>>>>> Yeah, I guess we could in theory do something that.
>>>>>>>>
>>>>>>>>> if the buffer was created with "non-coherent" then we've already
>>>>>>>>> done the necessary clflushing in kernel before handing to buffer
>>>>>>>>> over to userspace to ensure the clear value landed in memory, so
>>>>>>>>> any valid PAT index should be safe (from a security POV) after
>>>>>>>>> that, right? Any other problems that arise from mismatched
>>>>>>>>> coherency would just be contained to the app possibly shooting
>>>>>>>>> itself in the foot, which isn't really our concern.
>>>>>>>>
>>>>>>>> That is also my understanding, at least from the KMD security pov.
>>>>>>>> If you allocate as wb then you must use at least 1way, since
>>>>>>>> there is no flushing
>>>>>> for
>>>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever you
>> want.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
>> WB
>>>>>>>>>> as
>>>>>>>> 2WAY.
>>>>>>>>>>
>>>>>>>>>> For userptr mappings we lack a corresponding gem object, so the
>>>>>> expected
>>>>>>>>>> coherency mode is instead implicit and must fall into either
>>>>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
>>>>>>>>>> kernel. For imported dma-buf (from a different device) the
>>>>>>>>>> coherency mode is also implicit and must also be either 1WAY or
>> 2WAY.
>>>>>>>>>>
>>>>>>>>>> As part of adding pat_index support with vm_bind we also need
>>>>>>>>>> stop
>>>>>> using
>>>>>>>>>> xe_cache_level and instead use the pat_index in various places.
>>>>>>>>>> We still make use of xe_cache_level, but only as a convenience
>>>>>>>>>> for kernel internal objectsi (internally it maps to some
>>>>>>>>>> reasonable pat_index). For
>>>>>>>>>
>>>>>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used
>>>>>>>>> in appropriate places, similar to what we do with MOCS (gt-
>>>>>>> mocs.uc_index,
>>>>>>>>> gt->mocs.wb_index)?
>>>>>>>>
>>>>>>>> OK, seems reasonable to me.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> now this is just a 1:1 conversion of the existing code, however
>>>>>>>>>> for platforms like MTL+ we might need to give more control
>>>>>>>>>> through
>>>>>> bo_create
>>>>>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>>>>>
>>>>>>>>>> Bspec: 45101, 44235 #xe
>>>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>>>>>>>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>>>>>>>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>>>>>>>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>>>>>>>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>>>>>>>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>>>>>>>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>>>>>>>>>> drivers/gpu/drm/xe/xe_vm.c | 73
>> +++++++++++++++++++++----
>>>> -
>>>>>> -
>>>>>>>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>>>>>>>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>>>>>>>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>>>>>> (*alloc_scratch_dma)(struct i915_address_space
>> *vm,
>>>> int
>>>>>>>>>> sz);
>>>>>>>>>>
>>>>>>>>>> u64 (*pte_encode)(dma_addr_t addr,
>>>>>>>>>> - unsigned int pat_index,
>>>>>>>>>> + u32 pat_index,
>>>>>>>>>> u32 flags); /* Create a valid PTE */
>>>>>>>>>> #define PTE_READ_ONLY BIT(0)
>>>>>>>>>> #define PTE_LM BIT(1)
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>>>>>> xe_migrate
>>>>>>>> *m, struct kunit *test)
>>>>>>>>>> /* First part of the test, are we updating our pagetable
>>>>>>>>>> bo with a new
>>>>>>>> entry? */
>>>>>>>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
>>>> (NUM_KERNEL_PDE -
>>>>>> 1),
>>>>>>>> u64,
>>>>>>>>>> 0xdeaddeadbeefbeef);
>>>>>>>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>>>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>>>>>> expected |= XE_PTE_PS64;
>>>>>>>>>> if (xe_bo_is_vram(pt))
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
>> u64
>>>>>>>> bo_offset)
>>>>>>>>>> pte |= XE_GGTT_PTE_DM;
>>>>>>>>>>
>>>>>>>>>> if ((ggtt->pat_encode).pte_encode)
>>>>>>>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>> XE_CACHE_WB_1_WAY);
>>>>>>>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>>>> +
>>>> xe_pat_get_index(xe,
>>>>>>>> XE_CACHE_WB_1_WAY));
>>>>>>>>>>
>>>>>>>>>> return pte;
>>>>>>>>>> }
>>>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
>> *ggtt)
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>>>> u64
>>>>>> pte_pat,
>>>>>>>>>> - enum xe_cache_level
>>>> cache)
>>>>>>>>>> + u32 pat_index)
>>>>>>>>>> {
>>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>> -
>>>>>>>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>>>>>
>>>>>>>>>> if (pat_index & BIT(0))
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>>>>>
>>>>>>>>>> struct {
>>>>>>>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>> - enum xe_cache_level
>>>> cache);
>>>>>>>>>> + u32 pat_index);
>>>>>>>>>> } pat_encode;
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> @@ -24,6 +24,7 @@
>>>>>>>>>> #include "xe_lrc.h"
>>>>>>>>>> #include "xe_map.h"
>>>>>>>>>> #include "xe_mocs.h"
>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>> #include "xe_pt.h"
>>>>>>>>>> #include "xe_res_cursor.h"
>>>>>>>>>> #include "xe_sched_job.h"
>>>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm-
>>>>> pt_root[id]-
>>>>>>>>> level;
>>>>>>>>>> u32 map_ofs, level, i;
>>>>>>>>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>>>>>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>>>>>> u64 entry;
>>>>>>>>>> int ret;
>>>>>>>>>>
>>>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>>>>>> xe_tile
>>>>>>>> *tile, struct xe_migrate *m,
>>>>>>>>>> return ret;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>>>> XE_CACHE_WB);
>>>>>>>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>> pat_index);
>>>>>>>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>>>>>
>>>>>>>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>>>>>
>>>>>>>>>> /* Map the entire BO in our level 0 pt */
>>>>>>>>>> for (i = 0, level = 0; i < num_entries; level++) {
>>>>>>>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>> XE_CACHE_WB,
>>>>>>>> 0);
>>>>>>>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>> pat_index, 0);
>>>>>>>>>>
>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>> entry);
>>>>>>>>>>
>>>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>> for (i = 0; i < batch->size;
>>>>>>>>>> i += vm->flags & XE_VM_FLAG_64K ?
>>>> XE_64K_PAGE_SIZE :
>>>>>>>>>> XE_PAGE_SIZE) {
>>>>>>>>>> - entry = xe_pte_encode(batch, i,
>>>> XE_CACHE_WB, 0);
>>>>>>>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>>>>>
>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level
>> *
>>>> 8, u64,
>>>>>>>>>> entry);
>>>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>> flags = XE_PDE_64K;
>>>>>>>>>>
>>>>>>>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>>>>>> - XE_PAGE_SIZE,
>>>> XE_CACHE_WB);
>>>>>>>>>> + XE_PAGE_SIZE, pat_index);
>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
>> XE_PAGE_SIZE
>>>> * level,
>>>>>>>> u64,
>>>>>>>>>> entry | flags);
>>>>>>>>>> }
>>>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>> xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_migrate *m,
>>>>>>>>>> /* Write PDE's that point to our BO. */
>>>>>>>>>> for (i = 0; i < num_entries - num_level; i++) {
>>>>>>>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>>>>>> - XE_CACHE_WB);
>>>>>>>>>> + pat_index);
>>>>>>>>>>
>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
>> XE_PAGE_SIZE
>>>> +
>>>>>>>>>> (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
>>>> @@
>>>>>>>>>> xe_migrate_update_pgtables(struct
>>>>>> xe_migrate
>>>>>>>> *m,
>>>>>>>>>>
>>>>>>>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>>>>>
>>>>>>>>>> - addr = xe_pte_encode(pt_bo, 0,
>>>> XE_CACHE_WB, 0);
>>>>>>>>>> + addr = xe_pte_encode(pt_bo, 0,
>>>> xe_pat_get_index(xe,
>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>> bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>>>>>> bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>>>>>> }
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
>>>>>>>>>> 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>> @@ -10,6 +10,7 @@
>>>>>>>>>> #include "xe_gt.h"
>>>>>>>>>> #include "xe_gt_tlb_invalidation.h"
>>>>>>>>>> #include "xe_migrate.h"
>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>> #include "xe_pt_types.h"
>>>>>>>>>> #include "xe_pt_walk.h"
>>>>>>>>>> #include "xe_res_cursor.h"
>>>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
>>>>>>>>>> xe_pt_dir
>>>>>>>> *pt_dir, unsigned int index)
>>>>>>>>>> *
>>>>>>>>>> * Return: An encoded page directory entry. No errors.
>>>>>>>>>> */
>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>> - const enum xe_cache_level cache)
>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>> +pat_index)
>>>>>>>>>> {
>>>>>>>>>> u64 pde;
>>>>>>>>>> struct xe_vm *vm = bo->vm;
>>>>>>>>>> struct xe_device *xe = vm->xe;
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>>>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>>>>>
>>>>>>>>>> if ((vm->pat_encode).pde_encode)
>>>>>>>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>>>>>> + pde = (vm->pat_encode).pde_encode(xe, pde,
>>>> pat_index);
>>>>>>>>>>
>>>>>>>>>> return pde;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>>>>>> struct xe_vma *vma, u32 pt_level)
>>>>>>>>>> {
>>>>>>>>>> struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
>>>> static
>>>>>>>>>> u64 __pte_encode(u64 pte, enum
>>>>>>>> xe_cache_level cache,
>>>>>>>>>> pte |= XE_PTE_NULL;
>>>>>>>>>>
>>>>>>>>>> if ((vm->pat_encode).pte_encode)
>>>>>>>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>>>>>> + pte = (vm->pat_encode).pte_encode(xe, pte,
>>>> pat_index);
>>>>>>>>>>
>>>>>>>>>> if (pt_level == 1)
>>>>>>>>>> pte |= XE_PDE_PS_2M;
>>>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>>>>>> xe_cache_level cache,
>>>>>>>>>> *
>>>>>>>>>> * Return: An encoded page-table entry. No errors.
>>>>>>>>>> */
>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>> xe_cache_level
>>>>>>>> cache,
>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>>>> u32 pt_level)
>>>>>>>>>> {
>>>>>>>>>> u64 pte;
>>>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>>>>>> offset,
>>>>>>>> enum xe_cache_level cache,
>>>>>>>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>>>>>> pte |= XE_PPGTT_PTE_DM;
>>>>>>>>>>
>>>>>>>>>> - return __pte_encode(pte, cache, NULL, pt_level);
>>>>>>>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct
>>>>>>>>>> xe_vm *vm, @@ -134,12 +133,12 @@ static u64
>>>>>>>>>> __xe_pt_empty_pte(struct xe_tile
>>>>>> *tile,
>>>>>>>> struct xe_vm *vm,
>>>>>>>>>>
>>>>>>>>>> if (level == 0) {
>>>>>>>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>>>>>> - XE_CACHE_WB, 0);
>>>>>>>>>> + xe_pat_get_index(vm->xe,
>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>
>>>>>>>>>> return empty;
>>>>>>>>>> } else {
>>>>>>>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]-
>>>>> bo, 0,
>>>>>>>>>> - XE_CACHE_WB);
>>>>>>>>>> + xe_pat_get_index(vm->xe,
>>>>>>>> XE_CACHE_WB));
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>>>>> struct xe_vm *vm;
>>>>>>>>>> /** @tile: The tile we're building for. */
>>>>>>>>>> struct xe_tile *tile;
>>>>>>>>>> - /** @cache: Desired cache level for the ptes */
>>>>>>>>>> - enum xe_cache_level cache;
>>>>>>>>>> /** @default_pte: PTE flag only template. No address is
>>>> associated */
>>>>>>>>>> u64 default_pte;
>>>>>>>>>> /** @dma_offset: DMA offset to add to the PTE. */ @@
>>>>>>>>>> -604,7
>>>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>>>> pgoff_t offset,
>>>>>>>>>>
>>>>>>>>>> pte = __pte_encode(is_null ? 0 :
>>>>>>>>>> xe_res_dma(curs) + xe_walk-
>>>>> dma_offset,
>>>>>>>>>> - xe_walk->cache, xe_walk->vma,
>>>> level);
>>>>>>>>>> + xe_walk->vma->pat_index,
>> xe_walk-
>>>>> vma,
>>>>>>>> level);
>>>>>>>>>> pte |= xe_walk->default_pte;
>>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
>>>>>>>>>> *parent,
>>>>>>>> pgoff_t offset,
>>>>>>>>>> xe_child->is_compact = true;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
>>> cache)
>>>> |
>>>>>> flags;
>>>>>>>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>>>>>> pat_index) | flags;
>>>>>>>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
>>>> xe_child,
>>>>>>>>>> pte);
>>>>>>>>>> }
>>>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
>>>>>>>>>> struct
>>>>>> xe_vma
>>>>>>>> *vma,
>>>>>>>>>> if (vma && vma->gpuva.flags &
>>>> XE_VMA_ATOMIC_PTE_BIT)
>>>>>>>>>> xe_walk.default_pte |=
>>>> XE_USM_PPGTT_PTE_AE;
>>>>>>>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>>>>>> ttm.resource);
>>>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>> - } else {
>>>>>>>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>>>>>> XE_BO_SCANOUT_BIT)
>>>>>>>>>> - xe_walk.cache = XE_CACHE_WT;
>>>>>>>>>> - else
>>>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>> }
>>>>>>>>>> +
>>>>>>>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>>>>> xe_walk.dma_offset =
>>>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index 01be7ab08f87..1d433a5a96b4
>>>>>>>>>> 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
>>>>>>>>>> struct
>>>>>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>>>>>
>>>>>>>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
>>>>>>>>>> *vma);
>>>>>>>>>>
>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>> - const enum xe_cache_level level);
>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>> +pat_index);
>>>>>>>>>>
>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>> xe_cache_level
>>>>>>>> cache,
>>>>>>>>>> - u32 pt_level);
>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>>>>>>>> +u32
>>>>>>>> pt_level);
>>>>>>>>>>
>>>>>>>>>> #endif
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
>> 7eeeed0411f3..34603a7e84b0
>>>>>>>>>> 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>>>> #include "xe_vm.h"
>>>>>>>>>>
>>>>>>>>>> #include <linux/dma-fence-array.h>
>>>>>>>>>> +#include <linux/nospec.h>
>>>>>>>>>>
>>>>>>>>>> #include <drm/drm_print.h>
>>>>>>>>>> #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8 @@
>>>>>>>>>> static struct xe_vma *xe_vma_create(struct
>>>>>> xe_vm
>>>>>>>> *vm,
>>>>>>>>>> u64 start, u64 end,
>>>>>>>>>> bool read_only,
>>>>>>>>>> bool is_null,
>>>>>>>>>> - u8 tile_mask)
>>>>>>>>>> + u8 tile_mask,
>>>>>>>>>> + u32 pat_index)
>>>>>>>>>> {
>>>>>>>>>> struct xe_vma *vma;
>>>>>>>>>> struct xe_tile *tile;
>>>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
>> *xe_vma_create(struct
>>>>>> xe_vm
>>>>>>>> *vm,
>>>>>>>>>> vma->tile_mask |= 0x1 << id;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> + vma->pat_index = pat_index;
>>>>>>>>>> +
>>>>>>>>>> if (vm->xe->info.platform == XE_PVC)
>>>>>>>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>>>>>
>>>>>>>>>> @@ -1194,10 +1198,8 @@ static void
>> xe_vma_op_work_func(struct
>>>>>>>> work_struct *w);
>>>>>>>>>> static void vm_destroy_work_func(struct work_struct *w);
>>>>>>>>>>
>>>>>>>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe,
>>>>>>>>>> u64
>>>>>> pde_pat,
>>>>>>>>>> - enum xe_cache_level
>>>> cache)
>>>>>>>>>> + u32 pat_index)
>>>>>>>>>> {
>>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>> -
>>>>>>>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>>>>>
>>>>>>>>>> if (pat_index & BIT(0))
>>>>>>>>>> @@ -1213,10 +1215,8 @@ static u64
>>>>>> xelp_ppgtt_pde_encode_pat(struct
>>>>>>>> xe_device *xe, u64 pde_pat,
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe,
>>>>>>>>>> u64
>>>>>> pte_pat,
>>>>>>>>>> - enum xe_cache_level
>>>> cache)
>>>>>>>>>> + u32 pat_index)
>>>>>>>>>> {
>>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>> -
>>>>>>>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>>>>>
>>>>>>>>>> if (pat_index & BIT(0))
>>>>>>>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct
>> xe_file
>>>>>> *xef,
>>>>>>>> u32 id)
>>>>>>>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile
>> *tile)
>>>>>>>>>> {
>>>>>>>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>>>>>> - XE_CACHE_WB);
>>>>>>>>>> + xe_pat_get_index(vm->xe,
>>>> XE_CACHE_WB));
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> static struct dma_fence *
>>>>>>>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device
>>>>>>>>>> *xe,
>>>>>> struct
>>>>>>>> drm_gpuva_op *op)
>>>>>>>>>> static struct drm_gpuva_ops *
>>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>>>>>>> u64 bo_offset_or_userptr, u64 addr, u64
>>>> range,
>>>>>>>>>> - u32 operation, u8 tile_mask, u32 region)
>>>>>>>>>> + u32 operation, u8 tile_mask, u32 region, u32
>>>>>>>> pat_index)
>>>>>>>>>> {
>>>>>>>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>>>>> struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
>>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
>>>> *vm,
>>>>>>>> struct xe_bo *bo,
>>>>>>>>>> struct xe_vma_op *op =
>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>
>>>>>>>>>> op->tile_mask = tile_mask;
>>>>>>>>>> + op->pat_index = pat_index;
>>>>>>>>>> op->map.immediate =
>>>>>>>>>> operation &
>>>>>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>>>>> op->map.read_only =
>>>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>> *vm,
>>>>>>>> struct xe_bo *bo,
>>>>>>>>>> struct xe_vma_op *op =
>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>
>>>>>>>>>> op->tile_mask = tile_mask;
>>>>>>>>>> + op->pat_index = pat_index;
>>>>>>>>>> op->prefetch.region = region;
>>>>>>>>>> }
>>>>>>>>>> break;
>>>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>> *vm,
>>>>>>>> struct xe_bo *bo,
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>>>>>> drm_gpuva_op_map *op,
>>>>>>>>>> - u8 tile_mask, bool read_only, bool is_null)
>>>>>>>>>> + u8 tile_mask, bool read_only, bool is_null,
>>>>>>>>>> + u32 pat_index)
>>>>>>>>>> {
>>>>>>>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
>>>>> gem.obj) :
>>>>>> NULL;
>>>>>>>>>> struct xe_vma *vma;
>>>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
>>>>>>>>>> xe_vm
>>>>>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>>>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>>>>> op->va.addr, op->va.addr +
>>>>>>>>>> op->va.range - 1, read_only, is_null,
>>>>>>>>>> - tile_mask);
>>>>>>>>>> + tile_mask, pat_index);
>>>>>>>>>> if (bo)
>>>>>>>>>> xe_bo_unlock(bo, &ww);
>>>>>>>>>>
>>>>>>>>>> @@ -2539,7 +2542,7 @@ static int
>> vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm
>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>
>>>>>>>>>> vma = new_vma(vm, &op-
>>> base.map,
>>>>>>>>>> op->tile_mask, op-
>>>>>>> map.read_only,
>>>>>>>>>> - op->map.is_null);
>>>>>>>>>> + op->map.is_null, op-
>>>>> pat_index);
>>>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>>>> err = PTR_ERR(vma);
>>>>>>>>>> goto free_fence;
>>>>>>>>>> @@ -2567,7 +2570,7 @@ static int
>> vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm
>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>
>>>>>>>>>> vma = new_vma(vm, op-
>>>>>>>>> base.remap.prev,
>>>>>>>>>> op->tile_mask,
>>>> read_only,
>>>>>>>>>> - is_null);
>>>>>>>>>> + is_null, op-
>>>>> pat_index);
>>>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>>>> err = PTR_ERR(vma);
>>>>>>>>>> goto free_fence;
>>>>>>>>>> @@ -2603,7 +2606,7 @@ static int
>> vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm
>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>
>>>>>>>>>> vma = new_vma(vm, op-
>>>>>>>>> base.remap.next,
>>>>>>>>>> op->tile_mask,
>>>> read_only,
>>>>>>>>>> - is_null);
>>>>>>>>>> + is_null, op-
>>>>> pat_index);
>>>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>>>> err = PTR_ERR(vma);
>>>>>>>>>> goto free_fence;
>>>>>>>>>> @@ -3158,8 +3161,14 @@ static int
>>>>>>>>>> vm_bind_ioctl_check_args(struct
>>>>>>>> xe_device *xe,
>>>>>>>>>> u32 obj = (*bind_ops)[i].obj;
>>>>>>>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>>>>> u32 region = (*bind_ops)[i].region;
>>>>>>>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>>>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>>>>>
>>>>>>>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>>>>>> info.pat_table_n_entries)) {
>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>> + goto free_bind_ops;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> if (i == 0) {
>>>>>>>>>> *async = !!(op &
>> XE_VM_BIND_FLAG_ASYNC);
>>>>>>>>>> } else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
>>>> +3355,25
>>>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
>>>>>> *dev,
>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>> struct drm_gem_object *gem_obj;
>>>>>>>>>> u64 range = bind_ops[i].range;
>>>>>>>>>> u64 addr = bind_ops[i].addr;
>>>>>>>>>> + u32 op = bind_ops[i].op;
>>>>>>>>>> u32 obj = bind_ops[i].obj;
>>>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>> + u16 coh_mode;
>>>>>>>>>> +
>>>>>>>>>> + pat_index = array_index_nospec(pat_index,
>>>>>>>>>> + xe-
>>>>> info.pat_table_n_entries);
>>>>>>>>>> + coh_mode = xe_pat_index_get_coh_mode(xe,
>>>> pat_index);
>>>>>>>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>>>>>
>>>>>>>>> Assuming we drop the unusable entries from the TGL table, this
>>>>>>>>> should be impossible, right? Any index that makes it past the
>>>>>>>>> n_entries check at the top of the function should have a valid,
>>>>>>>>> non-zero
>>>> coh_mode value.
>>>>>>>>> So this should probably be an assertion (to highlight a KMD bug)
>>>>>>>>> rather than just a silent uapi failure return.
>>>>>>>>
>>>>>>>> Makes sense.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>> + goto put_obj;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>>>>>> + coh_mode ==
>>>> XE_GEM_COHERENCY_NONE))
>>>>>> {
>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>> + goto put_obj;
>>>>>>>>>> + }
>>>>>>>>>>
>>>>>>>>>> if (!obj)
>>>>>>>>>> continue;
>>>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
>> drm_device
>>>>>> *dev,
>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>> goto put_obj;
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> +
>>>>>>>>>> + if (bos[i]->coh_mode) {
>>>>>>>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>>>>>> coh_mode)) {
>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>> + goto put_obj;
>>>>>>>>>> + }
>>>>>>>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>>>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>>>>>> + /*
>>>>>>>>>> + * Imported dma-buf from a different device
>>>> should
>>>>>>>>>> + * require 1way or 2way coherency since we
>>>> don't
>>>>>>>> know
>>>>>>>>>> + * how it was mapped on CPU. Just assume is
>> it
>>>>>>>>>> + * potentially cached on CPU side.
>>>>>>>>>> + */
>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>> + goto put_obj;
>>>>>>>>>> + }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> if (args->num_syncs) {
>>>>>>>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct
>> drm_device
>>>>>> *dev,
>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>> u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>>>>> u32 region = bind_ops[i].region;
>>>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>>
>>>>>>>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
>>>> obj_offset,
>>>>>>>>>> addr, range, op,
>>>> tile_mask,
>>>>>>>>>> - region);
>>>>>>>>>> + region, pat_index);
>>>>>>>>>> if (IS_ERR(ops[i])) {
>>>>>>>>>> err = PTR_ERR(ops[i]);
>>>>>>>>>> ops[i] = NULL;
>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>>>>>> */
>>>>>>>>>> u8 tile_present;
>>>>>>>>>>
>>>>>>>>>> + /**
>>>>>>>>>> + * @pat_index: The pat index to use when encoding the PTEs
>>>> for
>>>>>>>>>> +this
>>>>>>>> vma.
>>>>>>>>>> + */
>>>>>>>>>> + u32 pat_index;
>>>>>>>>>> +
>>>>>>>>>> struct {
>>>>>>>>>> struct list_head rebind_link;
>>>>>>>>>> } notifier;
>>>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>>>>>> bool batch_invalidate_tlb;
>>>>>>>>>>
>>>>>>>>>> struct {
>>>>>>>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>> - enum xe_cache_level
>>>> cache);
>>>>>>>>>> - u64 (*pde_encode)(struct xe_device *xe, u64
>> pde_pat,
>>>>>>>>>> - enum xe_cache_level
>>>> cache);
>>>>>>>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> u32
>>>>>>>> pat_index);
>>>>>>>>>> + u64 (*pde_encode)(struct xe_device *xe, u64
>> pde_pat,
>>>> u32
>>>>>>>> pat_index);
>>>>>>>>>> } pat_encode;
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>>>>>> struct async_op_fence *fence;
>>>>>>>>>> /** @tile_mask: gt mask for this operation */
>>>>>>>>>> u8 tile_mask;
>>>>>>>>>> + /** @pat_index: The pat index to use for this operation. */
>>>>>>>>>> + u32 pat_index;
>>>>>>>>>> /** @flags: operation flags */
>>>>>>>>>> enum xe_vma_op_flags flags;
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
>>>>>>>>>> b/include/uapi/drm/xe_drm.h index 64bc66d4b550..0c15b6f32447
>>>>>>>>>> 100644
>>>>>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>>>>>> */
>>>>>>>>>> __u32 obj;
>>>>>>>>>>
>>>>>>>>>> - /** @pad: MBZ */
>>>>>>>>>> - __u32 pad;
>>>>>>>>>> + /**
>>>>>>>>>> + * @pat_index: The platform defined @pat_index to use for
>>>> this
>>>>>>>> mapping.
>>>>>>>>>> + * The index basically maps to some predefined memory
>>>> attributes,
>>>>>>>>>> + * including things like caching, coherency and likely other
>>>>>>>>>> +stuff
>>>> in
>>>>>>>>>> + * the future. The exact meaning of the pat_index is
>>>>>>>>>> +platform specific
>>>>>>>>>
>>>>>>>>> BTW, "other stuff in the future" already includes compression on
>>>>>>>>> Xe2, we just haven't landed the patches for the Xe2 table yet.
>>>>>>>>
>>>>>>>> Ok, good to know.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> + * and defined in the Bspec and PRMs. When the KMD sets
>> up
>>>> the
>>>>>>>> binding
>>>>>>>>>> + * the index here is encoded into the ppGTT PTE.
>>>>>>>>>> + *
>>>>>>>>>> + * For coherency the @pat_index needs to match the
>>>>>>>>>> + * drm_xe_gem_create.coh_mode, so either
>>>>>>>> XE_GEM_COHERENCY_NONE,
>>>>>>>>>> + * XE_GEM_COHERENCY_1WAY or
>>>> XE_GEM_COHERENCY_2WAY. The
>>>>>>>> KMD will extract
>>>>>>>>>> + * the coherency mode from the @pat_index and reject if
>>>> there is a
>>>>>>>>>> + * mismatch (see note below for pre-MTL platforms).
>>>>>>>>>> + *
>>>>>>>>>> + * Note: On pre-MTL platforms there is only a caching mode
>>>> and no
>>>>>>>>>> + * explicit coherency mode, but on such hardware there is
>>>> always a
>>>>>>>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are
>>>>>>>>>> +coherent
>>>>>> with
>>>>>>>>>> + * CPU caches even with the caching mode set as uncached.
>>>> It's
>>>>>>>>>> +only
>>>>>>>> the
>>>>>>>>>> + * display engine that is incoherent (on dgpu it must be in
>>>> VRAM which
>>>>>>>>>> + * is always mapped as WC on the CPU). However to keep the
>>>> uapi
>>>>>>>> somewhat
>>>>>>>>>> + * consistent with newer platforms the KMD groups the
>>>> different
>>>>>> cache
>>>>>>>>>> + * levels into the following coherency buckets on all pre-MTL
>>>>>> platforms:
>>>>>>>>>> + *
>>>>>>>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>>>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>>>>>
>>>>>>>>> As noted on the previous patch, it seems like 2-way is
>>>>>>>>> appropriate for LLC platforms, but 1-way might be a more
>>>>>>>>> accurate description of dGPU behavior.
>>>>>>>>>
>>>>>>>>>> + *
>>>>>>>>>> + * In practice UC/WC/WT should only ever used for scanout
>>>>>>>>>> +surfaces
>>>>>> on
>>>>>>>>>> + * such platforms since it is only the display engine that is
>>>> actually
>>>>>>>>>> + * incoherent. Everything else should typically use WB given
>>>>>>>>>> +that we
>>>>>>>>>
>>>>>>>>> What if we're sharing our buffers with some other (non-GPU)
>>>>>>>>> device? Are there cases where that other device wouldn't be
>>>>>>>>> coherent with the LLC, so we'd want to use one of these?
>>>>>>>>
>>>>>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>> + * have a shared-LLC. On MTL+ this completely changes (also
>>>>>>>> potentially
>>>>>>>>>> + * no shared-LLC) and the HW defines the coherency mode as
>>>> part
>>>>>>>>>> +of
>>>>>>>> the
>>>>>>>>>> + * @pat_index.
>>>>>>>>>> + *
>>>>>>>>>> + * Note: For userptr and externally imported dma-buf the
>>>> kernel
>>>>>>>> expects
>>>>>>>>>> + * either 1WAY or 2WAY for the @pat_index.
>>>>>>>>>> + */
>>>>>>>>>> + __u32 pat_index;
>>>>>>>>>>
>>>>>>>>>> union {
>>>>>>>>>> /**
>>>>>>>>>> --
>>>>>>>>>> 2.41.0
>>>>>>>>>>
>>>>>>>>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-05 9:46 ` Matthew Auld
@ 2023-09-05 13:50 ` Zhang, Carl
2023-09-05 14:07 ` Matthew Auld
0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Carl @ 2023-09-05 13:50 UTC (permalink / raw)
To: Auld, Matthew, Roper, Matthew D
Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
Yu, Effie, intel-xe@lists.freedesktop.org
> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: Tuesday, September 5, 2023 5:47 PM
>
> On 05/09/2023 10:12, Zhang, Carl wrote:
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: Monday, September 4, 2023 5:24 PM
> >>
> >> On 01/09/2023 10:34, Zhang, Carl wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>> Sent: Thursday, August 31, 2023 6:44 PM
> >>>>
> >>>> On 31/08/2023 09:24, Zhang, Carl wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>>>> Sent: Thursday, August 31, 2023 12:02 AM
> >>>>>>
> >>>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
> >>>>>>> Several questions:
> >>>>>>> 1. the pat_index from vm_bind will override the setting from
> >> bo_create?
> >>>>>>> How to keep the value from bo_create unchanged?
> >>>>>>
> >>>>>> You only set the pat_index during vm_bind. At gem_create you just
> >>>>>> need tell the kernel what CPU side caching mode to use for system
> >>>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
> >>>>>> pat_index. The current expectation is that the coherency mode of
> >>>>>> the pat_index should match the one at gem_create.
> >>>>>>
> >>>>> It means that we set the incomplete pat_index at gem_create
> >>>>> (coherency
> >>>>> only) And set another part at vm_bind (gpu cache) Why we separate
> >>>>> them? Why not set a full value at vm_bind?
> >>>>
> >>>> You always give the full pat_index at vm_bind. It's platform
> >>>> specific but the BSpec for the most part tells you what the
> >>>> coherency mode of the
> >> pat_index is.
> >>>>
> >>>> For example if you look at the MTL pat_index table it tells you the
> >>>> coh_mode:
> >>>>
> >>>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
> >>>> pat_index 3 = coh_mode "1way" = COH_1WAY
> >>>> pat_index 4 = coh_mode "2way" = COH_2WAY
> >>>>
> >>>> On future platforms the table is larger and also encodes stuff like
> >>>> compression, but KMD really only cares about the coh_mode, since
> >>>> that will restrict the allowed CPU smem_caching values that
> >>>> userspace can
> >> select.
> >>>>
> >>>> From KMD pov this matters since we always give userspace zeroed
> >>>> memory, which is useful for normal applications but is also needed
> >>>> for
> >> security reasons.
> >>>> However the actual clearing might result in the writes only being
> >>>> visible in the CPU cache (not flushed to main memory), and if the
> >>>> GPU doesn't snoop the CPU cache it can read directly from main
> >>>> memory, which effectively bypasses the clearing. To prevent that
> >>>> the KMD needs to know how userspace will map the memory via the GPU
> >>>> i.e what is the coh_mode for the pat_index. The gist is that if you
> >>>> select COH_NONE then it needs to clflush the pages before giving
> >>>> them to userspace, which is done by forcing you to use uc/wc for
> >>>> the smem_caching (the kernel will internally issue cflush on x86
> >>>> when marking the pages as uc/wc). And if you select wb there is no
> >>>> flushing, but
> >> here you need to use 1way or 2way.
> >>>>
> >>>
> >>> It sounds like a perf optimization, if no considering the perf,
> >>> you could always call Clflush to flush the cpu cache to main memory
> >>> before giving it to user space
> >>
> >> I think the other thing is that the KMD is no longer doing any manual
> >> flushing, but instead the core kernel will now do that for us when
> >> marking the pages as wc/uc on x86.
> >>
> >>>
> >>>
> >>>> AFAIK separating them is just to prevent userspace from mixing
> >>>> different coherency modes for the same object by having it
> >>>> immutable (you can still use different pat_index but the coherency
> >>>> must be
> >> compatible).
> >>>> Also at creation time we can determine if the smem_caching is
> >>>> compatible with the coh_mode.
> >>>>
> >>>> Note that it might be that we ditch the COH_2WAY and just have
> >>>> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be
> >>>> slightly
> >> different.
> >>>>
> >>>>>
> >>>>>>> 2. no UC/WB/WC definition (CPU cachable) in
> >>>>>>> drm_xe_gem_mmap_offset,
> >>>>>> will it be handled by KMD automatically?
> >>>>>>> For example: if set 1-way, it means GPU could snoop CPU cache,
> >>>>>>> we could use WB in mmap offset If it is COHERENCY_NONE, we
> >>>>>>> could only use
> >>>>>> UC, all these logic is handled by KMD automatically?
> >>>>>>
> >>>>>> Yes, the mmap will use the same CPU side caching mode that you
> >>>>>> set at gem_create with smem_caching (wb/wc/uc).
> >>>>>>
> >>>>> It means it could not be changed after gem_create. Sometimes,
> >>>>> UMD doesn’t know Whether the data need to be accessed by CPU at
> >>>>> gem_create . only application Knows the purpose of allocation &
> >>>>> the surface usage. So, may need change UMD api to let
> >>>>> applications set whether the
> >>>> resource is "accessible", or it will hurt the perf.
> >>>>
> >>>> Right, the smem_caching is immutable.
> >>>>
> >>>>>
> >>>>>>> 3. about " For imported dma-buf (from a different device) the
> >>>>>>> coherency mode is also implicit and must also be either 1WAY or
> >> 2WAY"
> >>>>>>> it means it must be 1way or 2way, and UMD need not to set it?
> >>>>>>
> >>>>>> If you are going to vm_bind it you need to supply the pat_index
> >>>>>> and in this case it needs to be either 1way or 2way. The object
> >>>>>> comes from a different device so we might not have
> >>>>>> smem_caching/coh_mode like we do for native objects.
> >>>>>>
> >>>>> How about the GPU cache setting of imported dma-buf?
> >>>>> Is there possible conflict between dma-buf producer and consumer?
> >>>>> For example:
> >>>>> In producer side , one bo is set GPU L3 cache, also CPU
> >>>>> cacheable, it should be 2Way In client side, we know the coherency
> >>>>> should be 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-
> way.
> >>>>> It may cause some problems?
> >>>>
> >>>> If the producer and consumer are the same device, then I think the
> >>>> original gem_create.coh_mode is considered, since it's effectively
> >>>> still a native object. In which case the normal rules apply.
> >>>
> >>> But the consumer does not know the coh_mode, so , it is possible
> >>> that it set will set Another value, then it will failed.
> >>> And because consumer does not know it is 1-way or 2-way, consumer
> >>> could not Set the gpu cache correctly. Because if it is 1-way, cpu
> >>> could not snoop gpu cache
> >>>
> >>>>
> >>>> If it comes from a different device then we don't know what
> >>>> gem_create.coh_mode is (or if that even exists) since we don't have
> >>>> the usual native object, but just some opaque dma-buf object. In
> >>>> this case it must be at least 1way or 2way.
> >>>>
> >>>> From KMD pov we only care about userspace not being able to
> >>>> bypass the clearing. There might still be other coherency issues though.
> >>>>
> >>>>>
> >>>>> Also , when a bo is created, UMD does not know whether it will be
> >> exported.
> >>>>> It maybe be set as COHERENCY_NONE, right?
> >>>>
> >>>> Yeah, you can use whatever you want. Just that your smem_caching
> >>>> will need to be wc/uc if using COH_NONE. On the importer side, if
> >>>> it comes from the same device then it's just a normal native object
> >>>> and the normal
> >> rules apply.
> >>>> Currently this would mean also using pat_index with COH_NONE. If it
> >>>> is imported from a different device then you need to use a
> >>>> pat_index with 1way or 2way, since KMD doesn't really know how the
> >>>> object is mapped on the CPU or if the CPU caches are dirty
> >>>> (potentially with
> >> clearing).
> >>>>
> >>>> Is this going to be problematic?
> >>>>
> >>> If imported surfaces is from same device, how consumer know it is
> >>> original coh_non Or 1-way or 2-way?
> >>>
> >>> And it also means producer and consumer could have different
> >>> pat_index for
> >> same bo.
> >>
> >> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external
> >> or from the same device? Would that help here? i.e you can always
> >> select 1way or 2way, even if it came from the same device and was
> coh_none?
> >>
> >> Other option is you could call the vm_bind ioctl to check if the
> >> coh_mode is supported by checking if it returns an error? But maybe that is
> too nasy.
> >>
> >
> > Still a bit confuse, I create a bo with coh_none, then export it to a PRIME fd.
> > In consumer side, it does not know it is coh_none, and will call vm_bind to
> set pat_index.
> >
> > You mean it will return a error when I set different coh value though
> vm_bind?
> > And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.
>
> Yeah, it is nasty.
>
> >
> > My concern is: if it is coh_non or 1-way, it means I could not use GPU cache if
> the bo also is CPU accessible.
> > Because CPU could not snoop GPU cache.
> > So, GPU cache specified by pat_index also will be rejected?
> >
> > Could we add some query interface to query the CPU cache and coherency
> setting of a bo?
>
> Say if the buffer comes from a different device + driver, what pat_index would
> you choose for that? There is no such thing as coh_mode/smem_caching for
> such an object. If you have to handle that case can't you just treat all imported
> dma-buf the same as that? The new proposal was to allow 1way or 2way for
> any dma-buf object, even if it comes from the same device and has different BO
> coh_mode. Do you need more than that? AFAICT the query would only really
> work for objects imported from the same device.
>
Looks I misunderstood the proposal, so, you mean if it shared in one device,
Consumer could choose 1-way ,2-way , any one should work , and no error returned,
no rejection from vm_bind, right?
> >
> >>
> >>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
> >>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
> >>>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
> >>>>>> <pallavi.mishra@intel.com>;
> >>>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
> >>>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
> >>>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
> >>>>>> Hazubski,
> >>>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
> >>>>>>>> <carl.zhang@intel.com>; Yu,
> >>>>>> Effie
> >>>>>>>> <effie.yu@intel.com>
> >>>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
> >>>>>>>> with
> >>>>>> vm_bind
> >>>>>>>>
> >>>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
> >>>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >>>>>>>>>> Allow userspace to directly control the pat_index for a given
> >>>>>>>>>> vm binding. This should allow directly controlling the
> >>>>>>>>>> coherency, caching and potentially other stuff in the future
> >>>>>>>>>> for the
> >> ppGTT binding.
> >>>>>>>>>>
> >>>>>>>>>> The exact meaning behind the pat_index is very platform
> >>>>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
> >>>>>>>>>> predefined memory attributes. From the KMD pov we only care
> >>>>>>>>>> about the coherency that is provided by the pat_index, which
> >>>>>>>>>> falls into either
> >>>> NONE, 1WAY or 2WAY.
> >>>>>>>>>> The vm_bind coherency mode for the given pat_index needs to
> >>>>>>>>>> match
> >>>>>> the
> >>>>>>>>>> given coh_mode that was set at object creation. For platforms
> >>>>>>>>>> that lack
> >>>>>>>>>
> >>>>>>>>> Is it actually important to match the coherency mode? I think
> >>>>>>>>> one of the main goals was to know up front if userspace might
> >>>>>>>>> be using a non-snooping PAT setting that would let it bypass
> >>>>>>>>> the CPU cache (and potentially read old, stale data from a
> >>>>>>>>> different process if the buffer's clear value is still sitting
> >>>>>>>>> in cache and hasn't landed in memory yet).
> >>>>>>>>>
> >>>>>>>>> If that's the only concern, then I think it should still be
> >>>>>>>>> fine to map with a non-matching PAT as long as it's more
> >>>>>>>>> coherent than the one specified at creation, right? E.g., if
> >>>>>>>>> the buffer was created with 1-way coherency, it would be fine
> >>>>>>>>> to map it with 2-way because userspace still can't use that to
> >>>>>>>>> observe the previous contents of the buffer. Or
> >>>>>>>>
> >>>>>>>> Yeah, I guess we could in theory do something that.
> >>>>>>>>
> >>>>>>>>> if the buffer was created with "non-coherent" then we've
> >>>>>>>>> already done the necessary clflushing in kernel before handing
> >>>>>>>>> to buffer over to userspace to ensure the clear value landed
> >>>>>>>>> in memory, so any valid PAT index should be safe (from a
> >>>>>>>>> security POV) after that, right? Any other problems that
> >>>>>>>>> arise from mismatched coherency would just be contained to the
> >>>>>>>>> app possibly shooting itself in the foot, which isn't really our
> concern.
> >>>>>>>>
> >>>>>>>> That is also my understanding, at least from the KMD security pov.
> >>>>>>>> If you allocate as wb then you must use at least 1way, since
> >>>>>>>> there is no flushing
> >>>>>> for
> >>>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever
> >>>>>>>> you
> >> want.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
> >> WB
> >>>>>>>>>> as
> >>>>>>>> 2WAY.
> >>>>>>>>>>
> >>>>>>>>>> For userptr mappings we lack a corresponding gem object, so
> >>>>>>>>>> the
> >>>>>> expected
> >>>>>>>>>> coherency mode is instead implicit and must fall into either
> >>>>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
> >>>>>>>>>> kernel. For imported dma-buf (from a different device) the
> >>>>>>>>>> coherency mode is also implicit and must also be either 1WAY
> >>>>>>>>>> or
> >> 2WAY.
> >>>>>>>>>>
> >>>>>>>>>> As part of adding pat_index support with vm_bind we also need
> >>>>>>>>>> stop
> >>>>>> using
> >>>>>>>>>> xe_cache_level and instead use the pat_index in various places.
> >>>>>>>>>> We still make use of xe_cache_level, but only as a
> >>>>>>>>>> convenience for kernel internal objectsi (internally it maps
> >>>>>>>>>> to some reasonable pat_index). For
> >>>>>>>>>
> >>>>>>>>> Maybe we should kill xe_cache_level completely and just assign
> >>>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be
> >>>>>>>>> used in appropriate places, similar to what we do with MOCS
> >>>>>>>>> (gt-
> >>>>>>> mocs.uc_index,
> >>>>>>>>> gt->mocs.wb_index)?
> >>>>>>>>
> >>>>>>>> OK, seems reasonable to me.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> now this is just a 1:1 conversion of the existing code,
> >>>>>>>>>> however for platforms like MTL+ we might need to give more
> >>>>>>>>>> control through
> >>>>>> bo_create
> >>>>>>>>>> or stop using WB on the CPU side if we need CPU access.
> >>>>>>>>>>
> >>>>>>>>>> Bspec: 45101, 44235 #xe
> >>>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
> >>>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> >>>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> >>>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> >>>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
> >>>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
> >>>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
> >>>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
> >>>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
> >>>>>>>>>> ---
> >>>>>>>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> >>>>>>>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> >>>>>>>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> >>>>>>>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> >>>>>>>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> >>>>>>>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> >>>>>>>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> >>>>>>>>>> drivers/gpu/drm/xe/xe_vm.c | 73
> >> +++++++++++++++++++++----
> >>>> -
> >>>>>> -
> >>>>>>>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> >>>>>>>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> >>>>>>>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >>>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>>>>>>>>> (*alloc_scratch_dma)(struct i915_address_space
> >> *vm,
> >>>> int
> >>>>>>>>>> sz);
> >>>>>>>>>>
> >>>>>>>>>> u64 (*pte_encode)(dma_addr_t addr,
> >>>>>>>>>> - unsigned int pat_index,
> >>>>>>>>>> + u32 pat_index,
> >>>>>>>>>> u32 flags); /* Create a valid PTE */
> >>>>>>>>>> #define PTE_READ_ONLY BIT(0)
> >>>>>>>>>> #define PTE_LM BIT(1)
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >>>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
> >>>>>> xe_migrate
> >>>>>>>> *m, struct kunit *test)
> >>>>>>>>>> /* First part of the test, are we updating our
> >>>>>>>>>> pagetable bo with a new
> >>>>>>>> entry? */
> >>>>>>>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
> >>>> (NUM_KERNEL_PDE -
> >>>>>> 1),
> >>>>>>>> u64,
> >>>>>>>>>> 0xdeaddeadbeefbeef);
> >>>>>>>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >>>>>>>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
> >>>>>>>> XE_CACHE_WB), 0);
> >>>>>>>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>>>>>>>>> expected |= XE_PTE_PS64;
> >>>>>>>>>> if (xe_bo_is_vram(pt)) diff --git
> >>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >>>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
> >> u64
> >>>>>>>> bo_offset)
> >>>>>>>>>> pte |= XE_GGTT_PTE_DM;
> >>>>>>>>>>
> >>>>>>>>>> if ((ggtt->pat_encode).pte_encode)
> >>>>>>>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>>>> XE_CACHE_WB_1_WAY);
> >>>>>>>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >>>>>>>>>> +
> >>>> xe_pat_get_index(xe,
> >>>>>>>> XE_CACHE_WB_1_WAY));
> >>>>>>>>>>
> >>>>>>>>>> return pte;
> >>>>>>>>>> }
> >>>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
> >> *ggtt)
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device
> >>>>>>>>>> *xe,
> >>>>>>>>>> u64
> >>>>>> pte_pat,
> >>>>>>>>>> - enum xe_cache_level
> >>>> cache)
> >>>>>>>>>> + u32 pat_index)
> >>>>>>>>>> {
> >>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>>>> -
> >>>>>>>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>>>>>>>>>
> >>>>>>>>>> if (pat_index & BIT(0)) diff --git
> >>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >>>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>>>>>>>>>
> >>>>>>>>>> struct {
> >>>>>>>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>>>> - enum xe_cache_level
> >>>> cache);
> >>>>>>>>>> + u32 pat_index);
> >>>>>>>>>> } pat_encode;
> >>>>>>>>>> };
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>>>> index a782ea282cb6..54585e98452a 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >>>>>>>>>> @@ -24,6 +24,7 @@
> >>>>>>>>>> #include "xe_lrc.h"
> >>>>>>>>>> #include "xe_map.h"
> >>>>>>>>>> #include "xe_mocs.h"
> >>>>>>>>>> +#include "xe_pat.h"
> >>>>>>>>>> #include "xe_pt.h"
> >>>>>>>>>> #include "xe_res_cursor.h"
> >>>>>>>>>> #include "xe_sched_job.h"
> >>>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm-
> >>>>> pt_root[id]-
> >>>>>>>>> level;
> >>>>>>>>>> u32 map_ofs, level, i;
> >>>>>>>>>> struct xe_bo *bo, *batch =
> >>>>>>>>>> tile->mem.kernel_bb_pool->bo;
> >>>>>>>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>>>>>>>>> u64 entry;
> >>>>>>>>>> int ret;
> >>>>>>>>>>
> >>>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
> >>>>>> xe_tile
> >>>>>>>> *tile, struct xe_migrate *m,
> >>>>>>>>>> return ret;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >>>>>> XE_CACHE_WB);
> >>>>>>>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
> >>>> pat_index);
> >>>>>>>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>>>>>>>>>
> >>>>>>>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>>>>>>>>>
> >>>>>>>>>> /* Map the entire BO in our level 0 pt */
> >>>>>>>>>> for (i = 0, level = 0; i < num_entries; level++) {
> >>>>>>>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >>>> XE_CACHE_WB,
> >>>>>>>> 0);
> >>>>>>>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
> >>>> pat_index, 0);
> >>>>>>>>>>
> >>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >>>> entry);
> >>>>>>>>>>
> >>>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>> for (i = 0; i < batch->size;
> >>>>>>>>>> i += vm->flags & XE_VM_FLAG_64K ?
> >>>> XE_64K_PAGE_SIZE :
> >>>>>>>>>> XE_PAGE_SIZE) {
> >>>>>>>>>> - entry = xe_pte_encode(batch, i,
> >>>> XE_CACHE_WB, 0);
> >>>>>>>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
> >>>>>>>>>>
> >>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level
> >> *
> >>>> 8, u64,
> >>>>>>>>>> entry);
> >>>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>> flags = XE_PDE_64K;
> >>>>>>>>>>
> >>>>>>>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >>>>>>>>>> - XE_PAGE_SIZE,
> >>>> XE_CACHE_WB);
> >>>>>>>>>> + XE_PAGE_SIZE, pat_index);
> >>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
> >> XE_PAGE_SIZE
> >>>> * level,
> >>>>>>>> u64,
> >>>>>>>>>> entry | flags);
> >>>>>>>>>> }
> >>>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
> >>>>>>>>>> xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_migrate *m,
> >>>>>>>>>> /* Write PDE's that point to our BO. */
> >>>>>>>>>> for (i = 0; i < num_entries - num_level; i++) {
> >>>>>>>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >>>>>>>>>> - XE_CACHE_WB);
> >>>>>>>>>> + pat_index);
> >>>>>>>>>>
> >>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
> >> XE_PAGE_SIZE
> >>>> +
> >>>>>>>>>> (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
> >>>> @@
> >>>>>>>>>> xe_migrate_update_pgtables(struct
> >>>>>> xe_migrate
> >>>>>>>> *m,
> >>>>>>>>>>
> >>>>>>>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
> >>>>>>>>>>
> >>>>>>>>>> - addr = xe_pte_encode(pt_bo, 0,
> >>>> XE_CACHE_WB, 0);
> >>>>>>>>>> + addr = xe_pte_encode(pt_bo, 0,
> >>>> xe_pat_get_index(xe,
> >>>>>>>> XE_CACHE_WB), 0);
> >>>>>>>>>> bb->cs[bb->len++] = lower_32_bits(addr);
> >>>>>>>>>> bb->cs[bb->len++] = upper_32_bits(addr);
> >>>>>>>>>> }
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >>>>>>>>>> @@ -10,6 +10,7 @@
> >>>>>>>>>> #include "xe_gt.h"
> >>>>>>>>>> #include "xe_gt_tlb_invalidation.h"
> >>>>>>>>>> #include "xe_migrate.h"
> >>>>>>>>>> +#include "xe_pat.h"
> >>>>>>>>>> #include "xe_pt_types.h"
> >>>>>>>>>> #include "xe_pt_walk.h"
> >>>>>>>>>> #include "xe_res_cursor.h"
> >>>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
> >>>>>>>>>> xe_pt_dir
> >>>>>>>> *pt_dir, unsigned int index)
> >>>>>>>>>> *
> >>>>>>>>>> * Return: An encoded page directory entry. No errors.
> >>>>>>>>>> */
> >>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>>>> - const enum xe_cache_level cache)
> >>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>>>> +pat_index)
> >>>>>>>>>> {
> >>>>>>>>>> u64 pde;
> >>>>>>>>>> struct xe_vm *vm = bo->vm;
> >>>>>>>>>> struct xe_device *xe = vm->xe;
> >>>>>>>>>>
> >>>>>>>>>> -
> >>>>>>>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>>>>>>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>>>>>>>>>
> >>>>>>>>>> if ((vm->pat_encode).pde_encode)
> >>>>>>>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >>>>>>>>>> + pde = (vm->pat_encode).pde_encode(xe, pde,
> >>>> pat_index);
> >>>>>>>>>>
> >>>>>>>>>> return pde;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>>>>>>>>> struct xe_vma *vma, u32 pt_level)
> >>>>>>>>>> {
> >>>>>>>>>> struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
> >>>> static
> >>>>>>>>>> u64 __pte_encode(u64 pte, enum
> >>>>>>>> xe_cache_level cache,
> >>>>>>>>>> pte |= XE_PTE_NULL;
> >>>>>>>>>>
> >>>>>>>>>> if ((vm->pat_encode).pte_encode)
> >>>>>>>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >>>>>>>>>> + pte = (vm->pat_encode).pte_encode(xe, pte,
> >>>> pat_index);
> >>>>>>>>>>
> >>>>>>>>>> if (pt_level == 1)
> >>>>>>>>>> pte |= XE_PDE_PS_2M;
> >>>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
> >>>>>>>> xe_cache_level cache,
> >>>>>>>>>> *
> >>>>>>>>>> * Return: An encoded page-table entry. No errors.
> >>>>>>>>>> */
> >>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>>>> xe_cache_level
> >>>>>>>> cache,
> >>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
> >>>>>>>>>> +pat_index,
> >>>>>>>>>> u32 pt_level)
> >>>>>>>>>> {
> >>>>>>>>>> u64 pte;
> >>>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
> >>>>>> offset,
> >>>>>>>> enum xe_cache_level cache,
> >>>>>>>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>>>>>>>>> pte |= XE_PPGTT_PTE_DM;
> >>>>>>>>>>
> >>>>>>>>>> - return __pte_encode(pte, cache, NULL, pt_level);
> >>>>>>>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile,
> >>>>>>>>>> struct xe_vm *vm, @@ -134,12 +133,12 @@ static u64
> >>>>>>>>>> __xe_pt_empty_pte(struct xe_tile
> >>>>>> *tile,
> >>>>>>>> struct xe_vm *vm,
> >>>>>>>>>>
> >>>>>>>>>> if (level == 0) {
> >>>>>>>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >>>>>>>>>> - XE_CACHE_WB, 0);
> >>>>>>>>>> + xe_pat_get_index(vm->xe,
> >>>>>>>> XE_CACHE_WB), 0);
> >>>>>>>>>>
> >>>>>>>>>> return empty;
> >>>>>>>>>> } else {
> >>>>>>>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]-
> >>>>> bo, 0,
> >>>>>>>>>> - XE_CACHE_WB);
> >>>>>>>>>> + xe_pat_get_index(vm->xe,
> >>>>>>>> XE_CACHE_WB));
> >>>>>>>>>> }
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>>>>>>>>> struct xe_vm *vm;
> >>>>>>>>>> /** @tile: The tile we're building for. */
> >>>>>>>>>> struct xe_tile *tile;
> >>>>>>>>>> - /** @cache: Desired cache level for the ptes */
> >>>>>>>>>> - enum xe_cache_level cache;
> >>>>>>>>>> /** @default_pte: PTE flag only template. No address
> >>>>>>>>>> is
> >>>> associated */
> >>>>>>>>>> u64 default_pte;
> >>>>>>>>>> /** @dma_offset: DMA offset to add to the PTE. */ @@
> >>>>>>>>>> -604,7
> >>>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
> >>>>>>>> pgoff_t offset,
> >>>>>>>>>>
> >>>>>>>>>> pte = __pte_encode(is_null ? 0 :
> >>>>>>>>>> xe_res_dma(curs) + xe_walk-
> >>>>> dma_offset,
> >>>>>>>>>> - xe_walk->cache, xe_walk->vma,
> >>>> level);
> >>>>>>>>>> + xe_walk->vma->pat_index,
> >> xe_walk-
> >>>>> vma,
> >>>>>>>> level);
> >>>>>>>>>> pte |= xe_walk->default_pte;
> >>>>>>>>>>
> >>>>>>>>>> /*
> >>>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
> >>>>>>>>>> *parent,
> >>>>>>>> pgoff_t offset,
> >>>>>>>>>> xe_child->is_compact = true;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
> >>> cache)
> >>>> |
> >>>>>> flags;
> >>>>>>>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
> >>>>>>>>> pat_index) | flags;
> >>>>>>>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
> >>>> xe_child,
> >>>>>>>>>> pte);
> >>>>>>>>>> }
> >>>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
> >>>>>>>>>> struct
> >>>>>> xe_vma
> >>>>>>>> *vma,
> >>>>>>>>>> if (vma && vma->gpuva.flags &
> >>>> XE_VMA_ATOMIC_PTE_BIT)
> >>>>>>>>>> xe_walk.default_pte |=
> >>>> XE_USM_PPGTT_PTE_AE;
> >>>>>>>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >>>>>>>>> ttm.resource);
> >>>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
> >>>>>>>>>> - } else {
> >>>>>>>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
> >>>>>>>> XE_BO_SCANOUT_BIT)
> >>>>>>>>>> - xe_walk.cache = XE_CACHE_WT;
> >>>>>>>>>> - else
> >>>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
> >>>>>>>>>> }
> >>>>>>>>>> +
> >>>>>>>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>>>>>>>>> xe_walk.dma_offset =
> >>>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index
> 01be7ab08f87..1d433a5a96b4
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >>>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
> >>>>>>>>>> struct
> >>>>>>>> xe_vma *vma, struct xe_exec_queu
> >>>>>>>>>>
> >>>>>>>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
> >>>>>>>>>> *vma);
> >>>>>>>>>>
> >>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >>>>>>>>>> - const enum xe_cache_level level);
> >>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
> >>>>>>>>>> +pat_index);
> >>>>>>>>>>
> >>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
> >>>>>>>>>> xe_cache_level
> >>>>>>>> cache,
> >>>>>>>>>> - u32 pt_level);
> >>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
> >>>>>>>>>> +pat_index,
> >>>>>>>>>> +u32
> >>>>>>>> pt_level);
> >>>>>>>>>>
> >>>>>>>>>> #endif
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
> >> 7eeeed0411f3..34603a7e84b0
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >>>>>>>>>> @@ -6,6 +6,7 @@
> >>>>>>>>>> #include "xe_vm.h"
> >>>>>>>>>>
> >>>>>>>>>> #include <linux/dma-fence-array.h>
> >>>>>>>>>> +#include <linux/nospec.h>
> >>>>>>>>>>
> >>>>>>>>>> #include <drm/drm_print.h>
> >>>>>>>>>> #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8
> >>>>>>>>>> @@ static struct xe_vma *xe_vma_create(struct
> >>>>>> xe_vm
> >>>>>>>> *vm,
> >>>>>>>>>> u64 start, u64 end,
> >>>>>>>>>> bool read_only,
> >>>>>>>>>> bool is_null,
> >>>>>>>>>> - u8 tile_mask)
> >>>>>>>>>> + u8 tile_mask,
> >>>>>>>>>> + u32 pat_index)
> >>>>>>>>>> {
> >>>>>>>>>> struct xe_vma *vma;
> >>>>>>>>>> struct xe_tile *tile;
> >>>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
> >> *xe_vma_create(struct
> >>>>>> xe_vm
> >>>>>>>> *vm,
> >>>>>>>>>> vma->tile_mask |= 0x1 << id;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> + vma->pat_index = pat_index;
> >>>>>>>>>> +
> >>>>>>>>>> if (vm->xe->info.platform == XE_PVC)
> >>>>>>>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>>>>>>>>>
> >>>>>>>>>> @@ -1194,10 +1198,8 @@ static void
> >> xe_vma_op_work_func(struct
> >>>>>>>> work_struct *w);
> >>>>>>>>>> static void vm_destroy_work_func(struct work_struct
> >>>>>>>>>> *w);
> >>>>>>>>>>
> >>>>>>>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device
> >>>>>>>>>> *xe,
> >>>>>>>>>> u64
> >>>>>> pde_pat,
> >>>>>>>>>> - enum xe_cache_level
> >>>> cache)
> >>>>>>>>>> + u32 pat_index)
> >>>>>>>>>> {
> >>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>>>> -
> >>>>>>>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>>>>>>>>>
> >>>>>>>>>> if (pat_index & BIT(0)) @@ -1213,10 +1215,8 @@ static
> >>>>>>>>>> u64
> >>>>>> xelp_ppgtt_pde_encode_pat(struct
> >>>>>>>> xe_device *xe, u64 pde_pat,
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device
> >>>>>>>>>> *xe,
> >>>>>>>>>> u64
> >>>>>> pte_pat,
> >>>>>>>>>> - enum xe_cache_level
> >>>> cache)
> >>>>>>>>>> + u32 pat_index)
> >>>>>>>>>> {
> >>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
> >>>>>>>>>> -
> >>>>>>>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>>>>>>>>>
> >>>>>>>>>> if (pat_index & BIT(0)) @@ -1622,7 +1622,7 @@ struct
> >>>>>>>>>> xe_vm *xe_vm_lookup(struct
> >> xe_file
> >>>>>> *xef,
> >>>>>>>> u32 id)
> >>>>>>>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct
> >>>>>>>>>> xe_tile
> >> *tile)
> >>>>>>>>>> {
> >>>>>>>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >>>>>>>>>> - XE_CACHE_WB);
> >>>>>>>>>> + xe_pat_get_index(vm->xe,
> >>>> XE_CACHE_WB));
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> static struct dma_fence * @@ -2311,7 +2311,7 @@ static
> >>>>>>>>>> void print_op(struct xe_device *xe,
> >>>>>> struct
> >>>>>>>> drm_gpuva_op *op)
> >>>>>>>>>> static struct drm_gpuva_ops *
> >>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo
> *bo,
> >>>>>>>>>> u64 bo_offset_or_userptr, u64 addr, u64
> >>>> range,
> >>>>>>>>>> - u32 operation, u8 tile_mask, u32 region)
> >>>>>>>>>> + u32 operation, u8 tile_mask, u32 region, u32
> >>>>>>>> pat_index)
> >>>>>>>>>> {
> >>>>>>>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>>>>>>>>> struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
> >>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
> >>>> *vm,
> >>>>>>>> struct xe_bo *bo,
> >>>>>>>>>> struct xe_vma_op *op =
> >>>> gpuva_op_to_vma_op(__op);
> >>>>>>>>>>
> >>>>>>>>>> op->tile_mask = tile_mask;
> >>>>>>>>>> + op->pat_index = pat_index;
> >>>>>>>>>> op->map.immediate =
> >>>>>>>>>> operation &
> >>>>>> XE_VM_BIND_FLAG_IMMEDIATE;
> >>>>>>>>>> op->map.read_only =
> >>>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >>>> *vm,
> >>>>>>>> struct xe_bo *bo,
> >>>>>>>>>> struct xe_vma_op *op =
> >>>> gpuva_op_to_vma_op(__op);
> >>>>>>>>>>
> >>>>>>>>>> op->tile_mask = tile_mask;
> >>>>>>>>>> + op->pat_index = pat_index;
> >>>>>>>>>> op->prefetch.region = region;
> >>>>>>>>>> }
> >>>>>>>>>> break;
> >>>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
> >>>> *vm,
> >>>>>>>> struct xe_bo *bo,
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
> >>>>>>>> drm_gpuva_op_map *op,
> >>>>>>>>>> - u8 tile_mask, bool read_only, bool is_null)
> >>>>>>>>>> + u8 tile_mask, bool read_only, bool is_null,
> >>>>>>>>>> + u32 pat_index)
> >>>>>>>>>> {
> >>>>>>>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
> >>>>> gem.obj) :
> >>>>>> NULL;
> >>>>>>>>>> struct xe_vma *vma;
> >>>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
> >>>>>>>>>> xe_vm
> >>>>>>>> *vm, struct drm_gpuva_op_map *op,
> >>>>>>>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
> >>>>>>>>>> op->va.addr, op->va.addr +
> >>>>>>>>>> op->va.range - 1, read_only, is_null,
> >>>>>>>>>> - tile_mask);
> >>>>>>>>>> + tile_mask, pat_index);
> >>>>>>>>>> if (bo)
> >>>>>>>>>> xe_bo_unlock(bo, &ww);
> >>>>>>>>>>
> >>>>>>>>>> @@ -2539,7 +2542,7 @@ static int
> >> vm_bind_ioctl_ops_parse(struct
> >>>>>> xe_vm
> >>>>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>>>
> >>>>>>>>>> vma = new_vma(vm, &op-
> >>> base.map,
> >>>>>>>>>> op->tile_mask, op-
> >>>>>>> map.read_only,
> >>>>>>>>>> - op->map.is_null);
> >>>>>>>>>> + op->map.is_null, op-
> >>>>> pat_index);
> >>>>>>>>>> if (IS_ERR(vma)) {
> >>>>>>>>>> err = PTR_ERR(vma);
> >>>>>>>>>> goto free_fence;
> >>>>>>>>>> @@ -2567,7 +2570,7 @@ static int
> >> vm_bind_ioctl_ops_parse(struct
> >>>>>> xe_vm
> >>>>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>>>
> >>>>>>>>>> vma = new_vma(vm, op-
> >>>>>>>>> base.remap.prev,
> >>>>>>>>>> op->tile_mask,
> >>>> read_only,
> >>>>>>>>>> - is_null);
> >>>>>>>>>> + is_null, op-
> >>>>> pat_index);
> >>>>>>>>>> if (IS_ERR(vma)) {
> >>>>>>>>>> err = PTR_ERR(vma);
> >>>>>>>>>> goto free_fence;
> >>>>>>>>>> @@ -2603,7 +2606,7 @@ static int
> >> vm_bind_ioctl_ops_parse(struct
> >>>>>> xe_vm
> >>>>>>>> *vm, struct xe_exec_queue *q,
> >>>>>>>>>>
> >>>>>>>>>> vma = new_vma(vm, op-
> >>>>>>>>> base.remap.next,
> >>>>>>>>>> op->tile_mask,
> >>>> read_only,
> >>>>>>>>>> - is_null);
> >>>>>>>>>> + is_null, op-
> >>>>> pat_index);
> >>>>>>>>>> if (IS_ERR(vma)) {
> >>>>>>>>>> err = PTR_ERR(vma);
> >>>>>>>>>> goto free_fence;
> >>>>>>>>>> @@ -3158,8 +3161,14 @@ static int
> >>>>>>>>>> vm_bind_ioctl_check_args(struct
> >>>>>>>> xe_device *xe,
> >>>>>>>>>> u32 obj = (*bind_ops)[i].obj;
> >>>>>>>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>>>>>>>>> u32 region = (*bind_ops)[i].region;
> >>>>>>>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
> >>>>>>>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>>>>>>>>>
> >>>>>>>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >>>>>>>>> info.pat_table_n_entries)) {
> >>>>>>>>>> + err = -EINVAL;
> >>>>>>>>>> + goto free_bind_ops;
> >>>>>>>>>> + }
> >>>>>>>>>> +
> >>>>>>>>>> if (i == 0) {
> >>>>>>>>>> *async = !!(op &
> >> XE_VM_BIND_FLAG_ASYNC);
> >>>>>>>>>> } else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
> >>>> +3355,25
> >>>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
> >>>>>> *dev,
> >>>>>>>> void *data, struct drm_file *file)
> >>>>>>>>>> struct drm_gem_object *gem_obj;
> >>>>>>>>>> u64 range = bind_ops[i].range;
> >>>>>>>>>> u64 addr = bind_ops[i].addr;
> >>>>>>>>>> + u32 op = bind_ops[i].op;
> >>>>>>>>>> u32 obj = bind_ops[i].obj;
> >>>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>>>> + u16 coh_mode;
> >>>>>>>>>> +
> >>>>>>>>>> + pat_index = array_index_nospec(pat_index,
> >>>>>>>>>> + xe-
> >>>>> info.pat_table_n_entries);
> >>>>>>>>>> + coh_mode = xe_pat_index_get_coh_mode(xe,
> >>>> pat_index);
> >>>>>>>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >>>>>>>>>
> >>>>>>>>> Assuming we drop the unusable entries from the TGL table, this
> >>>>>>>>> should be impossible, right? Any index that makes it past the
> >>>>>>>>> n_entries check at the top of the function should have a
> >>>>>>>>> valid, non-zero
> >>>> coh_mode value.
> >>>>>>>>> So this should probably be an assertion (to highlight a KMD
> >>>>>>>>> bug) rather than just a silent uapi failure return.
> >>>>>>>>
> >>>>>>>> Makes sense.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> + err = -EINVAL;
> >>>>>>>>>> + goto put_obj;
> >>>>>>>>>> + }
> >>>>>>>>>> +
> >>>>>>>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
> >>>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
> >>>>>>>>>> + coh_mode ==
> >>>> XE_GEM_COHERENCY_NONE))
> >>>>>> {
> >>>>>>>>>> + err = -EINVAL;
> >>>>>>>>>> + goto put_obj;
> >>>>>>>>>> + }
> >>>>>>>>>>
> >>>>>>>>>> if (!obj)
> >>>>>>>>>> continue;
> >>>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
> >> drm_device
> >>>>>> *dev,
> >>>>>>>> void *data, struct drm_file *file)
> >>>>>>>>>> goto put_obj;
> >>>>>>>>>> }
> >>>>>>>>>> }
> >>>>>>>>>> +
> >>>>>>>>>> + if (bos[i]->coh_mode) {
> >>>>>>>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
> >>>>>>>> coh_mode)) {
> >>>>>>>>>> + err = -EINVAL;
> >>>>>>>>>> + goto put_obj;
> >>>>>>>>>> + }
> >>>>>>>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
> >>>>>>>> XE_GEM_COHERENCY_NONE)) {
> >>>>>>>>>> + /*
> >>>>>>>>>> + * Imported dma-buf from a different device
> >>>> should
> >>>>>>>>>> + * require 1way or 2way coherency since we
> >>>> don't
> >>>>>>>> know
> >>>>>>>>>> + * how it was mapped on CPU. Just assume is
> >> it
> >>>>>>>>>> + * potentially cached on CPU side.
> >>>>>>>>>> + */
> >>>>>>>>>> + err = -EINVAL;
> >>>>>>>>>> + goto put_obj;
> >>>>>>>>>> + }
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> if (args->num_syncs) { @@ -3412,10 +3454,11 @@ int
> >>>>>>>>>> xe_vm_bind_ioctl(struct
> >> drm_device
> >>>>>> *dev,
> >>>>>>>> void *data, struct drm_file *file)
> >>>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
> >>>>>>>>>> u8 tile_mask = bind_ops[i].tile_mask;
> >>>>>>>>>> u32 region = bind_ops[i].region;
> >>>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
> >>>>>>>>>>
> >>>>>>>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
> >>>> obj_offset,
> >>>>>>>>>> addr, range, op,
> >>>> tile_mask,
> >>>>>>>>>> - region);
> >>>>>>>>>> + region, pat_index);
> >>>>>>>>>> if (IS_ERR(ops[i])) {
> >>>>>>>>>> err = PTR_ERR(ops[i]);
> >>>>>>>>>> ops[i] = NULL;
> >>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >>>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
> >>>>>>>>>> */
> >>>>>>>>>> u8 tile_present;
> >>>>>>>>>>
> >>>>>>>>>> + /**
> >>>>>>>>>> + * @pat_index: The pat index to use when encoding the PTEs
> >>>> for
> >>>>>>>>>> +this
> >>>>>>>> vma.
> >>>>>>>>>> + */
> >>>>>>>>>> + u32 pat_index;
> >>>>>>>>>> +
> >>>>>>>>>> struct {
> >>>>>>>>>> struct list_head rebind_link;
> >>>>>>>>>> } notifier;
> >>>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
> >>>>>>>>>> bool batch_invalidate_tlb;
> >>>>>>>>>>
> >>>>>>>>>> struct {
> >>>>>>>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>>>>>>>> - enum xe_cache_level
> >>>> cache);
> >>>>>>>>>> - u64 (*pde_encode)(struct xe_device *xe, u64
> >> pde_pat,
> >>>>>>>>>> - enum xe_cache_level
> >>>> cache);
> >>>>>>>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >>>> u32
> >>>>>>>> pat_index);
> >>>>>>>>>> + u64 (*pde_encode)(struct xe_device *xe, u64
> >> pde_pat,
> >>>> u32
> >>>>>>>> pat_index);
> >>>>>>>>>> } pat_encode;
> >>>>>>>>>> };
> >>>>>>>>>>
> >>>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>>>>>>>>> struct async_op_fence *fence;
> >>>>>>>>>> /** @tile_mask: gt mask for this operation */
> >>>>>>>>>> u8 tile_mask;
> >>>>>>>>>> + /** @pat_index: The pat index to use for this operation. */
> >>>>>>>>>> + u32 pat_index;
> >>>>>>>>>> /** @flags: operation flags */
> >>>>>>>>>> enum xe_vma_op_flags flags;
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
> >>>>>>>>>> b/include/uapi/drm/xe_drm.h index
> 64bc66d4b550..0c15b6f32447
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/include/uapi/drm/xe_drm.h
> >>>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
> >>>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>>>>>>>>> */
> >>>>>>>>>> __u32 obj;
> >>>>>>>>>>
> >>>>>>>>>> - /** @pad: MBZ */
> >>>>>>>>>> - __u32 pad;
> >>>>>>>>>> + /**
> >>>>>>>>>> + * @pat_index: The platform defined @pat_index to use for
> >>>> this
> >>>>>>>> mapping.
> >>>>>>>>>> + * The index basically maps to some predefined memory
> >>>> attributes,
> >>>>>>>>>> + * including things like caching, coherency and likely
> >>>>>>>>>> +other stuff
> >>>> in
> >>>>>>>>>> + * the future. The exact meaning of the pat_index is
> >>>>>>>>>> +platform specific
> >>>>>>>>>
> >>>>>>>>> BTW, "other stuff in the future" already includes compression
> >>>>>>>>> on Xe2, we just haven't landed the patches for the Xe2 table yet.
> >>>>>>>>
> >>>>>>>> Ok, good to know.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> + * and defined in the Bspec and PRMs. When the KMD sets
> >> up
> >>>> the
> >>>>>>>> binding
> >>>>>>>>>> + * the index here is encoded into the ppGTT PTE.
> >>>>>>>>>> + *
> >>>>>>>>>> + * For coherency the @pat_index needs to match the
> >>>>>>>>>> + * drm_xe_gem_create.coh_mode, so either
> >>>>>>>> XE_GEM_COHERENCY_NONE,
> >>>>>>>>>> + * XE_GEM_COHERENCY_1WAY or
> >>>> XE_GEM_COHERENCY_2WAY. The
> >>>>>>>> KMD will extract
> >>>>>>>>>> + * the coherency mode from the @pat_index and reject if
> >>>> there is a
> >>>>>>>>>> + * mismatch (see note below for pre-MTL platforms).
> >>>>>>>>>> + *
> >>>>>>>>>> + * Note: On pre-MTL platforms there is only a caching mode
> >>>> and no
> >>>>>>>>>> + * explicit coherency mode, but on such hardware there is
> >>>> always a
> >>>>>>>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are
> >>>>>>>>>> +coherent
> >>>>>> with
> >>>>>>>>>> + * CPU caches even with the caching mode set as uncached.
> >>>> It's
> >>>>>>>>>> +only
> >>>>>>>> the
> >>>>>>>>>> + * display engine that is incoherent (on dgpu it must be in
> >>>> VRAM which
> >>>>>>>>>> + * is always mapped as WC on the CPU). However to keep the
> >>>> uapi
> >>>>>>>> somewhat
> >>>>>>>>>> + * consistent with newer platforms the KMD groups the
> >>>> different
> >>>>>> cache
> >>>>>>>>>> + * levels into the following coherency buckets on all
> >>>>>>>>>> +pre-MTL
> >>>>>> platforms:
> >>>>>>>>>> + *
> >>>>>>>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> >>>>>>>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> >>>>>>>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> >>>>>>>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >>>>>>>>>
> >>>>>>>>> As noted on the previous patch, it seems like 2-way is
> >>>>>>>>> appropriate for LLC platforms, but 1-way might be a more
> >>>>>>>>> accurate description of dGPU behavior.
> >>>>>>>>>
> >>>>>>>>>> + *
> >>>>>>>>>> + * In practice UC/WC/WT should only ever used for scanout
> >>>>>>>>>> +surfaces
> >>>>>> on
> >>>>>>>>>> + * such platforms since it is only the display engine that
> >>>>>>>>>> +is
> >>>> actually
> >>>>>>>>>> + * incoherent. Everything else should typically use WB
> >>>>>>>>>> +given that we
> >>>>>>>>>
> >>>>>>>>> What if we're sharing our buffers with some other (non-GPU)
> >>>>>>>>> device? Are there cases where that other device wouldn't be
> >>>>>>>>> coherent with the LLC, so we'd want to use one of these?
> >>>>>>>>
> >>>>>>>> Yeah, I guess there might be cases like that. I'll reword.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Matt
> >>>>>>>>>
> >>>>>>>>>> + * have a shared-LLC. On MTL+ this completely changes
> >>>>>>>>>> +(also
> >>>>>>>> potentially
> >>>>>>>>>> + * no shared-LLC) and the HW defines the coherency mode as
> >>>> part
> >>>>>>>>>> +of
> >>>>>>>> the
> >>>>>>>>>> + * @pat_index.
> >>>>>>>>>> + *
> >>>>>>>>>> + * Note: For userptr and externally imported dma-buf the
> >>>> kernel
> >>>>>>>> expects
> >>>>>>>>>> + * either 1WAY or 2WAY for the @pat_index.
> >>>>>>>>>> + */
> >>>>>>>>>> + __u32 pat_index;
> >>>>>>>>>>
> >>>>>>>>>> union {
> >>>>>>>>>> /**
> >>>>>>>>>> --
> >>>>>>>>>> 2.41.0
> >>>>>>>>>>
> >>>>>>>>>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-05 13:50 ` Zhang, Carl
@ 2023-09-05 14:07 ` Matthew Auld
0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-09-05 14:07 UTC (permalink / raw)
To: Zhang, Carl, Roper, Matthew D
Cc: Gu, Lihao, Hazubski, Filip, Joonas Lahtinen, De Marchi, Lucas,
Yu, Effie, intel-xe@lists.freedesktop.org
On 05/09/2023 14:50, Zhang, Carl wrote:
>
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: Tuesday, September 5, 2023 5:47 PM
>>
>> On 05/09/2023 10:12, Zhang, Carl wrote:
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: Monday, September 4, 2023 5:24 PM
>>>>
>>>> On 01/09/2023 10:34, Zhang, Carl wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>> Sent: Thursday, August 31, 2023 6:44 PM
>>>>>>
>>>>>> On 31/08/2023 09:24, Zhang, Carl wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>>>> Sent: Thursday, August 31, 2023 12:02 AM
>>>>>>>>
>>>>>>>> On 30/08/2023 16:27, Zhang, Carl wrote:
>>>>>>>>> Several questions:
>>>>>>>>> 1. the pat_index from vm_bind will override the setting from
>>>> bo_create?
>>>>>>>>> How to keep the value from bo_create unchanged?
>>>>>>>>
>>>>>>>> You only set the pat_index during vm_bind. At gem_create you just
>>>>>>>> need tell the kernel what CPU side caching mode to use for system
>>>>>>>> memory (wb/wc/uc), plus the expected GPU coherency mode of the
>>>>>>>> pat_index. The current expectation is that the coherency mode of
>>>>>>>> the pat_index should match the one at gem_create.
>>>>>>>>
>>>>>>> It means that we set the incomplete pat_index at gem_create
>>>>>>> (coherency
>>>>>>> only) And set another part at vm_bind (gpu cache) Why we separate
>>>>>>> them? Why not set a full value at vm_bind?
>>>>>>
>>>>>> You always give the full pat_index at vm_bind. It's platform
>>>>>> specific but the BSpec for the most part tells you what the
>>>>>> coherency mode of the
>>>> pat_index is.
>>>>>>
>>>>>> For example if you look at the MTL pat_index table it tells you the
>>>>>> coh_mode:
>>>>>>
>>>>>> pat_index 0-2 = coh_mode "No snoop" = COH_NONE
>>>>>> pat_index 3 = coh_mode "1way" = COH_1WAY
>>>>>> pat_index 4 = coh_mode "2way" = COH_2WAY
>>>>>>
>>>>>> On future platforms the table is larger and also encodes stuff like
>>>>>> compression, but KMD really only cares about the coh_mode, since
>>>>>> that will restrict the allowed CPU smem_caching values that
>>>>>> userspace can
>>>> select.
>>>>>>
>>>>>> From KMD pov this matters since we always give userspace zeroed
>>>>>> memory, which is useful for normal applications but is also needed
>>>>>> for
>>>> security reasons.
>>>>>> However the actual clearing might result in the writes only being
>>>>>> visible in the CPU cache (not flushed to main memory), and if the
>>>>>> GPU doesn't snoop the CPU cache it can read directly from main
>>>>>> memory, which effectively bypasses the clearing. To prevent that
>>>>>> the KMD needs to know how userspace will map the memory via the GPU
>>>>>> i.e what is the coh_mode for the pat_index. The gist is that if you
>>>>>> select COH_NONE then it needs to clflush the pages before giving
>>>>>> them to userspace, which is done by forcing you to use uc/wc for
>>>>>> the smem_caching (the kernel will internally issue cflush on x86
>>>>>> when marking the pages as uc/wc). And if you select wb there is no
>>>>>> flushing, but
>>>> here you need to use 1way or 2way.
>>>>>>
>>>>>
>>>>> It sounds like a perf optimization, if no considering the perf,
>>>>> you could always call Clflush to flush the cpu cache to main memory
>>>>> before giving it to user space
>>>>
>>>> I think the other thing is that the KMD is no longer doing any manual
>>>> flushing, but instead the core kernel will now do that for us when
>>>> marking the pages as wc/uc on x86.
>>>>
>>>>>
>>>>>
>>>>>> AFAIK separating them is just to prevent userspace from mixing
>>>>>> different coherency modes for the same object by having it
>>>>>> immutable (you can still use different pat_index but the coherency
>>>>>> must be
>>>> compatible).
>>>>>> Also at creation time we can determine if the smem_caching is
>>>>>> compatible with the coh_mode.
>>>>>>
>>>>>> Note that it might be that we ditch the COH_2WAY and just have
>>>>>> COH_NONE and COH_AT_LEAST_1WAY, in which case this might be
>>>>>> slightly
>>>> different.
>>>>>>
>>>>>>>
>>>>>>>>> 2. no UC/WB/WC definition (CPU cachable) in
>>>>>>>>> drm_xe_gem_mmap_offset,
>>>>>>>> will it be handled by KMD automatically?
>>>>>>>>> For example: if set 1-way, it means GPU could snoop CPU cache,
>>>>>>>>> we could use WB in mmap offset If it is COHERENCY_NONE, we
>>>>>>>>> could only use
>>>>>>>> UC, all these logic is handled by KMD automatically?
>>>>>>>>
>>>>>>>> Yes, the mmap will use the same CPU side caching mode that you
>>>>>>>> set at gem_create with smem_caching (wb/wc/uc).
>>>>>>>>
>>>>>>> It means it could not be changed after gem_create. Sometimes,
>>>>>>> UMD doesn’t know Whether the data need to be accessed by CPU at
>>>>>>> gem_create . only application Knows the purpose of allocation &
>>>>>>> the surface usage. So, may need change UMD api to let
>>>>>>> applications set whether the
>>>>>> resource is "accessible", or it will hurt the perf.
>>>>>>
>>>>>> Right, the smem_caching is immutable.
>>>>>>
>>>>>>>
>>>>>>>>> 3. about " For imported dma-buf (from a different device) the
>>>>>>>>> coherency mode is also implicit and must also be either 1WAY or
>>>> 2WAY"
>>>>>>>>> it means it must be 1way or 2way, and UMD need not to set it?
>>>>>>>>
>>>>>>>> If you are going to vm_bind it you need to supply the pat_index
>>>>>>>> and in this case it needs to be either 1way or 2way. The object
>>>>>>>> comes from a different device so we might not have
>>>>>>>> smem_caching/coh_mode like we do for native objects.
>>>>>>>>
>>>>>>> How about the GPU cache setting of imported dma-buf?
>>>>>>> Is there possible conflict between dma-buf producer and consumer?
>>>>>>> For example:
>>>>>>> In producer side , one bo is set GPU L3 cache, also CPU
>>>>>>> cacheable, it should be 2Way In client side, we know the coherency
>>>>>>> should be 1-way or 2-way, if it suppose no GPU cache Used. It just set 1-
>> way.
>>>>>>> It may cause some problems?
>>>>>>
>>>>>> If the producer and consumer are the same device, then I think the
>>>>>> original gem_create.coh_mode is considered, since it's effectively
>>>>>> still a native object. In which case the normal rules apply.
>>>>>
>>>>> But the consumer does not know the coh_mode, so , it is possible
>>>>> that it set will set Another value, then it will failed.
>>>>> And because consumer does not know it is 1-way or 2-way, consumer
>>>>> could not Set the gpu cache correctly. Because if it is 1-way, cpu
>>>>> could not snoop gpu cache
>>>>>
>>>>>>
>>>>>> If it comes from a different device then we don't know what
>>>>>> gem_create.coh_mode is (or if that even exists) since we don't have
>>>>>> the usual native object, but just some opaque dma-buf object. In
>>>>>> this case it must be at least 1way or 2way.
>>>>>>
>>>>>> From KMD pov we only care about userspace not being able to
>>>>>> bypass the clearing. There might still be other coherency issues though.
>>>>>>
>>>>>>>
>>>>>>> Also , when a bo is created, UMD does not know whether it will be
>>>> exported.
>>>>>>> It maybe be set as COHERENCY_NONE, right?
>>>>>>
>>>>>> Yeah, you can use whatever you want. Just that your smem_caching
>>>>>> will need to be wc/uc if using COH_NONE. On the importer side, if
>>>>>> it comes from the same device then it's just a normal native object
>>>>>> and the normal
>>>> rules apply.
>>>>>> Currently this would mean also using pat_index with COH_NONE. If it
>>>>>> is imported from a different device then you need to use a
>>>>>> pat_index with 1way or 2way, since KMD doesn't really know how the
>>>>>> object is mapped on the CPU or if the CPU caches are dirty
>>>>>> (potentially with
>>>> clearing).
>>>>>>
>>>>>> Is this going to be problematic?
>>>>>>
>>>>> If imported surfaces is from same device, how consumer know it is
>>>>> original coh_non Or 1-way or 2-way?
>>>>>
>>>>> And it also means producer and consumer could have different
>>>>> pat_index for
>>>> same bo.
>>>>
>>>> Ok, if KMD instead allows 1way or 2way for dma-buf when it's external
>>>> or from the same device? Would that help here? i.e you can always
>>>> select 1way or 2way, even if it came from the same device and was
>> coh_none?
>>>>
>>>> Other option is you could call the vm_bind ioctl to check if the
>>>> coh_mode is supported by checking if it returns an error? But maybe that is
>> too nasy.
>>>>
>>>
>>> Still a bit confuse, I create a bo with coh_none, then export it to a PRIME fd.
>>> In consumer side, it does not know it is coh_none, and will call vm_bind to
>> set pat_index.
>>>
>>> You mean it will return a error when I set different coh value though
>> vm_bind?
>>> And try coh_non, 1-way, 2-way, then get the correct one? TBH, it is ugly.
>>
>> Yeah, it is nasty.
>>
>>>
>>> My concern is: if it is coh_non or 1-way, it means I could not use GPU cache if
>> the bo also is CPU accessible.
>>> Because CPU could not snoop GPU cache.
>>> So, GPU cache specified by pat_index also will be rejected?
>>>
>>> Could we add some query interface to query the CPU cache and coherency
>> setting of a bo?
>>
>> Say if the buffer comes from a different device + driver, what pat_index would
>> you choose for that? There is no such thing as coh_mode/smem_caching for
>> such an object. If you have to handle that case can't you just treat all imported
>> dma-buf the same as that? The new proposal was to allow 1way or 2way for
>> any dma-buf object, even if it comes from the same device and has different BO
>> coh_mode. Do you need more than that? AFAICT the query would only really
>> work for objects imported from the same device.
>>
> Looks I misunderstood the proposal, so, you mean if it shared in one device,
> Consumer could choose 1-way ,2-way , any one should work , and no error returned,
> no rejection from vm_bind, right?
Yeah, treat same-device and different-device the same in KMD, and allow
1way or 2way (anything but coh_none), since you highlighted the issue
that for importer from same device it might not even know the original
coh_mode.
>
>>>
>>>>
>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>>>>>>>> Sent: Wednesday, August 30, 2023 7:28 PM
>>>>>>>>>> To: Roper, Matthew D <matthew.d.roper@intel.com>
>>>>>>>>>> Cc: intel-xe@lists.freedesktop.org; Mishra, Pallavi
>>>>>>>> <pallavi.mishra@intel.com>;
>>>>>>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com>; Joonas
>>>>>>>>>> Lahtinen <joonas.lahtinen@linux.intel.com>; De Marchi, Lucas
>>>>>>>>>> <lucas.demarchi@intel.com>; Souza, Jose <jose.souza@intel.com>;
>>>>>>>> Hazubski,
>>>>>>>>>> Filip <filip.hazubski@intel.com>; Zhang, Carl
>>>>>>>>>> <carl.zhang@intel.com>; Yu,
>>>>>>>> Effie
>>>>>>>>>> <effie.yu@intel.com>
>>>>>>>>>> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection
>>>>>>>>>> with
>>>>>>>> vm_bind
>>>>>>>>>>
>>>>>>>>>> On 29/08/2023 22:36, Matt Roper wrote:
>>>>>>>>>>> On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
>>>>>>>>>>>> Allow userspace to directly control the pat_index for a given
>>>>>>>>>>>> vm binding. This should allow directly controlling the
>>>>>>>>>>>> coherency, caching and potentially other stuff in the future
>>>>>>>>>>>> for the
>>>> ppGTT binding.
>>>>>>>>>>>>
>>>>>>>>>>>> The exact meaning behind the pat_index is very platform
>>>>>>>>>>>> specific (see BSpec or PRMs) but effectively maps to some
>>>>>>>>>>>> predefined memory attributes. From the KMD pov we only care
>>>>>>>>>>>> about the coherency that is provided by the pat_index, which
>>>>>>>>>>>> falls into either
>>>>>> NONE, 1WAY or 2WAY.
>>>>>>>>>>>> The vm_bind coherency mode for the given pat_index needs to
>>>>>>>>>>>> match
>>>>>>>> the
>>>>>>>>>>>> given coh_mode that was set at object creation. For platforms
>>>>>>>>>>>> that lack
>>>>>>>>>>>
>>>>>>>>>>> Is it actually important to match the coherency mode? I think
>>>>>>>>>>> one of the main goals was to know up front if userspace might
>>>>>>>>>>> be using a non-snooping PAT setting that would let it bypass
>>>>>>>>>>> the CPU cache (and potentially read old, stale data from a
>>>>>>>>>>> different process if the buffer's clear value is still sitting
>>>>>>>>>>> in cache and hasn't landed in memory yet).
>>>>>>>>>>>
>>>>>>>>>>> If that's the only concern, then I think it should still be
>>>>>>>>>>> fine to map with a non-matching PAT as long as it's more
>>>>>>>>>>> coherent than the one specified at creation, right? E.g., if
>>>>>>>>>>> the buffer was created with 1-way coherency, it would be fine
>>>>>>>>>>> to map it with 2-way because userspace still can't use that to
>>>>>>>>>>> observe the previous contents of the buffer. Or
>>>>>>>>>>
>>>>>>>>>> Yeah, I guess we could in theory do something that.
>>>>>>>>>>
>>>>>>>>>>> if the buffer was created with "non-coherent" then we've
>>>>>>>>>>> already done the necessary clflushing in kernel before handing
>>>>>>>>>>> to buffer over to userspace to ensure the clear value landed
>>>>>>>>>>> in memory, so any valid PAT index should be safe (from a
>>>>>>>>>>> security POV) after that, right? Any other problems that
>>>>>>>>>>> arise from mismatched coherency would just be contained to the
>>>>>>>>>>> app possibly shooting itself in the foot, which isn't really our
>> concern.
>>>>>>>>>>
>>>>>>>>>> That is also my understanding, at least from the KMD security pov.
>>>>>>>>>> If you allocate as wb then you must use at least 1way, since
>>>>>>>>>> there is no flushing
>>>>>>>> for
>>>>>>>>>> clearing or swap-in. For uc/wc you could in theory use whatever
>>>>>>>>>> you
>>>> want.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and
>>>> WB
>>>>>>>>>>>> as
>>>>>>>>>> 2WAY.
>>>>>>>>>>>>
>>>>>>>>>>>> For userptr mappings we lack a corresponding gem object, so
>>>>>>>>>>>> the
>>>>>>>> expected
>>>>>>>>>>>> coherency mode is instead implicit and must fall into either
>>>>>>>>>>>> 1WAY or 2WAY. Trying to use NONE will be rejected by the
>>>>>>>>>>>> kernel. For imported dma-buf (from a different device) the
>>>>>>>>>>>> coherency mode is also implicit and must also be either 1WAY
>>>>>>>>>>>> or
>>>> 2WAY.
>>>>>>>>>>>>
>>>>>>>>>>>> As part of adding pat_index support with vm_bind we also need
>>>>>>>>>>>> stop
>>>>>>>> using
>>>>>>>>>>>> xe_cache_level and instead use the pat_index in various places.
>>>>>>>>>>>> We still make use of xe_cache_level, but only as a
>>>>>>>>>>>> convenience for kernel internal objectsi (internally it maps
>>>>>>>>>>>> to some reasonable pat_index). For
>>>>>>>>>>>
>>>>>>>>>>> Maybe we should kill xe_cache_level completely and just assign
>>>>>>>>>>> xe_gt->pat_cached / xe_gt->pat_uncached at init that can be
>>>>>>>>>>> used in appropriate places, similar to what we do with MOCS
>>>>>>>>>>> (gt-
>>>>>>>>> mocs.uc_index,
>>>>>>>>>>> gt->mocs.wb_index)?
>>>>>>>>>>
>>>>>>>>>> OK, seems reasonable to me.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> now this is just a 1:1 conversion of the existing code,
>>>>>>>>>>>> however for platforms like MTL+ we might need to give more
>>>>>>>>>>>> control through
>>>>>>>> bo_create
>>>>>>>>>>>> or stop using WB on the CPU side if we need CPU access.
>>>>>>>>>>>>
>>>>>>>>>>>> Bspec: 45101, 44235 #xe
>>>>>>>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>>>>>>>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>>>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>>>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>>>>>>>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>>>>>>>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>>>>>>>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>>>>>>>>>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_vm.c | 73
>>>> +++++++++++++++++++++----
>>>>>> -
>>>>>>>> -
>>>>>>>>>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>>>>>>>>>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>>>>>>>>>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>>>>>>>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>>>>>>>>>> (*alloc_scratch_dma)(struct i915_address_space
>>>> *vm,
>>>>>> int
>>>>>>>>>>>> sz);
>>>>>>>>>>>>
>>>>>>>>>>>> u64 (*pte_encode)(dma_addr_t addr,
>>>>>>>>>>>> - unsigned int pat_index,
>>>>>>>>>>>> + u32 pat_index,
>>>>>>>>>>>> u32 flags); /* Create a valid PTE */
>>>>>>>>>>>> #define PTE_READ_ONLY BIT(0)
>>>>>>>>>>>> #define PTE_LM BIT(1)
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>> b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>>>>>>>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct
>>>>>>>> xe_migrate
>>>>>>>>>> *m, struct kunit *test)
>>>>>>>>>>>> /* First part of the test, are we updating our
>>>>>>>>>>>> pagetable bo with a new
>>>>>>>>>> entry? */
>>>>>>>>>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE *
>>>>>> (NUM_KERNEL_PDE -
>>>>>>>> 1),
>>>>>>>>>> u64,
>>>>>>>>>>>> 0xdeaddeadbeefbeef);
>>>>>>>>>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>>>>>>>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe,
>>>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>>>>>>>>>> expected |= XE_PTE_PS64;
>>>>>>>>>>>> if (xe_bo_is_vram(pt)) diff --git
>>>>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>>>> index 209fa053d9fb..4134c26150a5 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>>>>>>>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo,
>>>> u64
>>>>>>>>>> bo_offset)
>>>>>>>>>>>> pte |= XE_GGTT_PTE_DM;
>>>>>>>>>>>>
>>>>>>>>>>>> if ((ggtt->pat_encode).pte_encode)
>>>>>>>>>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>>>> XE_CACHE_WB_1_WAY);
>>>>>>>>>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>>>>>>>>>> +
>>>>>> xe_pat_get_index(xe,
>>>>>>>>>> XE_CACHE_WB_1_WAY));
>>>>>>>>>>>>
>>>>>>>>>>>> return pte;
>>>>>>>>>>>> }
>>>>>>>>>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt
>>>> *ggtt)
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device
>>>>>>>>>>>> *xe,
>>>>>>>>>>>> u64
>>>>>>>> pte_pat,
>>>>>>>>>>>> - enum xe_cache_level
>>>>>> cache)
>>>>>>>>>>>> + u32 pat_index)
>>>>>>>>>>>> {
>>>>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>>>> -
>>>>>>>>>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>>>>>>>>>
>>>>>>>>>>>> if (pat_index & BIT(0)) diff --git
>>>>>>>>>>>> a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>>>>>>>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>>>>>>>>>
>>>>>>>>>>>> struct {
>>>>>>>>>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>>>> - enum xe_cache_level
>>>>>> cache);
>>>>>>>>>>>> + u32 pat_index);
>>>>>>>>>>>> } pat_encode;
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>>>> index a782ea282cb6..54585e98452a 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>>>>>>>> @@ -24,6 +24,7 @@
>>>>>>>>>>>> #include "xe_lrc.h"
>>>>>>>>>>>> #include "xe_map.h"
>>>>>>>>>>>> #include "xe_mocs.h"
>>>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>>>> #include "xe_pt.h"
>>>>>>>>>>>> #include "xe_res_cursor.h"
>>>>>>>>>>>> #include "xe_sched_job.h"
>>>>>>>>>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm-
>>>>>>> pt_root[id]-
>>>>>>>>>>> level;
>>>>>>>>>>>> u32 map_ofs, level, i;
>>>>>>>>>>>> struct xe_bo *bo, *batch =
>>>>>>>>>>>> tile->mem.kernel_bb_pool->bo;
>>>>>>>>>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>>>>>>>>>> u64 entry;
>>>>>>>>>>>> int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>> xe_tile
>>>>>>>>>> *tile, struct xe_migrate *m,
>>>>>>>>>>>> return ret;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>>>>>> XE_CACHE_WB);
>>>>>>>>>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE,
>>>>>> pat_index);
>>>>>>>>>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>>>>>>>>>
>>>>>>>>>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>>>>>>>>>
>>>>>>>>>>>> /* Map the entire BO in our level 0 pt */
>>>>>>>>>>>> for (i = 0, level = 0; i < num_entries; level++) {
>>>>>>>>>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>>>> XE_CACHE_WB,
>>>>>>>>>> 0);
>>>>>>>>>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE,
>>>>>> pat_index, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>>>> entry);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>> for (i = 0; i < batch->size;
>>>>>>>>>>>> i += vm->flags & XE_VM_FLAG_64K ?
>>>>>> XE_64K_PAGE_SIZE :
>>>>>>>>>>>> XE_PAGE_SIZE) {
>>>>>>>>>>>> - entry = xe_pte_encode(batch, i,
>>>>>> XE_CACHE_WB, 0);
>>>>>>>>>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level
>>>> *
>>>>>> 8, u64,
>>>>>>>>>>>> entry);
>>>>>>>>>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>> flags = XE_PDE_64K;
>>>>>>>>>>>>
>>>>>>>>>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>>>>>>>>>> - XE_PAGE_SIZE,
>>>>>> XE_CACHE_WB);
>>>>>>>>>>>> + XE_PAGE_SIZE, pat_index);
>>>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
>>>> XE_PAGE_SIZE
>>>>>> * level,
>>>>>>>>>> u64,
>>>>>>>>>>>> entry | flags);
>>>>>>>>>>>> }
>>>>>>>>>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct
>>>>>>>>>>>> xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_migrate *m,
>>>>>>>>>>>> /* Write PDE's that point to our BO. */
>>>>>>>>>>>> for (i = 0; i < num_entries - num_level; i++) {
>>>>>>>>>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>>>>>>>>>> - XE_CACHE_WB);
>>>>>>>>>>>> + pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>> xe_map_wr(xe, &bo->vmap, map_ofs +
>>>> XE_PAGE_SIZE
>>>>>> +
>>>>>>>>>>>> (i + 1) * 8, u64, entry); @@ -1256,7 +1258,7
>>>>>> @@
>>>>>>>>>>>> xe_migrate_update_pgtables(struct
>>>>>>>> xe_migrate
>>>>>>>>>> *m,
>>>>>>>>>>>>
>>>>>>>>>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>>>>>>>>>
>>>>>>>>>>>> - addr = xe_pte_encode(pt_bo, 0,
>>>>>> XE_CACHE_WB, 0);
>>>>>>>>>>>> + addr = xe_pte_encode(pt_bo, 0,
>>>>>> xe_pat_get_index(xe,
>>>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>>> bb->cs[bb->len++] = lower_32_bits(addr);
>>>>>>>>>>>> bb->cs[bb->len++] = upper_32_bits(addr);
>>>>>>>>>>>> }
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index 64713f400d94..019af2920078
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>>>>>>>> @@ -10,6 +10,7 @@
>>>>>>>>>>>> #include "xe_gt.h"
>>>>>>>>>>>> #include "xe_gt_tlb_invalidation.h"
>>>>>>>>>>>> #include "xe_migrate.h"
>>>>>>>>>>>> +#include "xe_pat.h"
>>>>>>>>>>>> #include "xe_pt_types.h"
>>>>>>>>>>>> #include "xe_pt_walk.h"
>>>>>>>>>>>> #include "xe_res_cursor.h"
>>>>>>>>>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct
>>>>>>>>>>>> xe_pt_dir
>>>>>>>>>> *pt_dir, unsigned int index)
>>>>>>>>>>>> *
>>>>>>>>>>>> * Return: An encoded page directory entry. No errors.
>>>>>>>>>>>> */
>>>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>>>> - const enum xe_cache_level cache)
>>>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>>>> +pat_index)
>>>>>>>>>>>> {
>>>>>>>>>>>> u64 pde;
>>>>>>>>>>>> struct xe_vm *vm = bo->vm;
>>>>>>>>>>>> struct xe_device *xe = vm->xe;
>>>>>>>>>>>>
>>>>>>>>>>>> -
>>>>>>>>>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>>>>>>>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>>>>>>>>>
>>>>>>>>>>>> if ((vm->pat_encode).pde_encode)
>>>>>>>>>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>>>>>>>>>> + pde = (vm->pat_encode).pde_encode(xe, pde,
>>>>>> pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>> return pde;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>>>>>>>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>>>>>>>>>> struct xe_vma *vma, u32 pt_level)
>>>>>>>>>>>> {
>>>>>>>>>>>> struct xe_vm *vm = xe_vma_vm(vma); @@ -89,7 +88,7 @@
>>>>>> static
>>>>>>>>>>>> u64 __pte_encode(u64 pte, enum
>>>>>>>>>> xe_cache_level cache,
>>>>>>>>>>>> pte |= XE_PTE_NULL;
>>>>>>>>>>>>
>>>>>>>>>>>> if ((vm->pat_encode).pte_encode)
>>>>>>>>>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>>>>>>>>>> + pte = (vm->pat_encode).pte_encode(xe, pte,
>>>>>> pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>> if (pt_level == 1)
>>>>>>>>>>>> pte |= XE_PDE_PS_2M;
>>>>>>>>>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum
>>>>>>>>>> xe_cache_level cache,
>>>>>>>>>>>> *
>>>>>>>>>>>> * Return: An encoded page-table entry. No errors.
>>>>>>>>>>>> */
>>>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>>>> xe_cache_level
>>>>>>>>>> cache,
>>>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
>>>>>>>>>>>> +pat_index,
>>>>>>>>>>>> u32 pt_level)
>>>>>>>>>>>> {
>>>>>>>>>>>> u64 pte;
>>>>>>>>>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64
>>>>>>>> offset,
>>>>>>>>>> enum xe_cache_level cache,
>>>>>>>>>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>>>>>>>>>> pte |= XE_PPGTT_PTE_DM;
>>>>>>>>>>>>
>>>>>>>>>>>> - return __pte_encode(pte, cache, NULL, pt_level);
>>>>>>>>>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile,
>>>>>>>>>>>> struct xe_vm *vm, @@ -134,12 +133,12 @@ static u64
>>>>>>>>>>>> __xe_pt_empty_pte(struct xe_tile
>>>>>>>> *tile,
>>>>>>>>>> struct xe_vm *vm,
>>>>>>>>>>>>
>>>>>>>>>>>> if (level == 0) {
>>>>>>>>>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>>>>>>>>>> - XE_CACHE_WB, 0);
>>>>>>>>>>>> + xe_pat_get_index(vm->xe,
>>>>>>>>>> XE_CACHE_WB), 0);
>>>>>>>>>>>>
>>>>>>>>>>>> return empty;
>>>>>>>>>>>> } else {
>>>>>>>>>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]-
>>>>>>> bo, 0,
>>>>>>>>>>>> - XE_CACHE_WB);
>>>>>>>>>>>> + xe_pat_get_index(vm->xe,
>>>>>>>>>> XE_CACHE_WB));
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>>>>>>> struct xe_vm *vm;
>>>>>>>>>>>> /** @tile: The tile we're building for. */
>>>>>>>>>>>> struct xe_tile *tile;
>>>>>>>>>>>> - /** @cache: Desired cache level for the ptes */
>>>>>>>>>>>> - enum xe_cache_level cache;
>>>>>>>>>>>> /** @default_pte: PTE flag only template. No address
>>>>>>>>>>>> is
>>>>>> associated */
>>>>>>>>>>>> u64 default_pte;
>>>>>>>>>>>> /** @dma_offset: DMA offset to add to the PTE. */ @@
>>>>>>>>>>>> -604,7
>>>>>>>>>>>> +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent,
>>>>>>>>>> pgoff_t offset,
>>>>>>>>>>>>
>>>>>>>>>>>> pte = __pte_encode(is_null ? 0 :
>>>>>>>>>>>> xe_res_dma(curs) + xe_walk-
>>>>>>> dma_offset,
>>>>>>>>>>>> - xe_walk->cache, xe_walk->vma,
>>>>>> level);
>>>>>>>>>>>> + xe_walk->vma->pat_index,
>>>> xe_walk-
>>>>>>> vma,
>>>>>>>>>> level);
>>>>>>>>>>>> pte |= xe_walk->default_pte;
>>>>>>>>>>>>
>>>>>>>>>>>> /*
>>>>>>>>>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw
>>>>>>>>>>>> *parent,
>>>>>>>>>> pgoff_t offset,
>>>>>>>>>>>> xe_child->is_compact = true;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk-
>>>>> cache)
>>>>>> |
>>>>>>>> flags;
>>>>>>>>>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma-
>>>>>>>>>>> pat_index) | flags;
>>>>>>>>>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset,
>>>>>> xe_child,
>>>>>>>>>>>> pte);
>>>>>>>>>>>> }
>>>>>>>>>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile,
>>>>>>>>>>>> struct
>>>>>>>> xe_vma
>>>>>>>>>> *vma,
>>>>>>>>>>>> if (vma && vma->gpuva.flags &
>>>>>> XE_VMA_ATOMIC_PTE_BIT)
>>>>>>>>>>>> xe_walk.default_pte |=
>>>>>> XE_USM_PPGTT_PTE_AE;
>>>>>>>>>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>>>>>>>> ttm.resource);
>>>>>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>>>> - } else {
>>>>>>>>>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>>>>>>>> XE_BO_SCANOUT_BIT)
>>>>>>>>>>>> - xe_walk.cache = XE_CACHE_WT;
>>>>>>>>>>>> - else
>>>>>>>>>>>> - xe_walk.cache = XE_CACHE_WB;
>>>>>>>>>>>> }
>>>>>>>>>>>> +
>>>>>>>>>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>>>>>>> xe_walk.dma_offset =
>>>>>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_pt.h index
>> 01be7ab08f87..1d433a5a96b4
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>>>>>>>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile,
>>>>>>>>>>>> struct
>>>>>>>>>> xe_vma *vma, struct xe_exec_queu
>>>>>>>>>>>>
>>>>>>>>>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
>>>>>>>>>>>> *vma);
>>>>>>>>>>>>
>>>>>>>>>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>>>>>>>>>> - const enum xe_cache_level level);
>>>>>>>>>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32
>>>>>>>>>>>> +pat_index);
>>>>>>>>>>>>
>>>>>>>>>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum
>>>>>>>>>>>> xe_cache_level
>>>>>>>>>> cache,
>>>>>>>>>>>> - u32 pt_level);
>>>>>>>>>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32
>>>>>>>>>>>> +pat_index,
>>>>>>>>>>>> +u32
>>>>>>>>>> pt_level);
>>>>>>>>>>>>
>>>>>>>>>>>> #endif
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
>>>> 7eeeed0411f3..34603a7e84b0
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>>>>>> #include "xe_vm.h"
>>>>>>>>>>>>
>>>>>>>>>>>> #include <linux/dma-fence-array.h>
>>>>>>>>>>>> +#include <linux/nospec.h>
>>>>>>>>>>>>
>>>>>>>>>>>> #include <drm/drm_print.h>
>>>>>>>>>>>> #include <drm/ttm/ttm_execbuf_util.h> @@ -874,7 +875,8
>>>>>>>>>>>> @@ static struct xe_vma *xe_vma_create(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm,
>>>>>>>>>>>> u64 start, u64 end,
>>>>>>>>>>>> bool read_only,
>>>>>>>>>>>> bool is_null,
>>>>>>>>>>>> - u8 tile_mask)
>>>>>>>>>>>> + u8 tile_mask,
>>>>>>>>>>>> + u32 pat_index)
>>>>>>>>>>>> {
>>>>>>>>>>>> struct xe_vma *vma;
>>>>>>>>>>>> struct xe_tile *tile;
>>>>>>>>>>>> @@ -913,6 +915,8 @@ static struct xe_vma
>>>> *xe_vma_create(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm,
>>>>>>>>>>>> vma->tile_mask |= 0x1 << id;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> + vma->pat_index = pat_index;
>>>>>>>>>>>> +
>>>>>>>>>>>> if (vm->xe->info.platform == XE_PVC)
>>>>>>>>>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -1194,10 +1198,8 @@ static void
>>>> xe_vma_op_work_func(struct
>>>>>>>>>> work_struct *w);
>>>>>>>>>>>> static void vm_destroy_work_func(struct work_struct
>>>>>>>>>>>> *w);
>>>>>>>>>>>>
>>>>>>>>>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device
>>>>>>>>>>>> *xe,
>>>>>>>>>>>> u64
>>>>>>>> pde_pat,
>>>>>>>>>>>> - enum xe_cache_level
>>>>>> cache)
>>>>>>>>>>>> + u32 pat_index)
>>>>>>>>>>>> {
>>>>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>>>> -
>>>>>>>>>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>>>>>>>>>
>>>>>>>>>>>> if (pat_index & BIT(0)) @@ -1213,10 +1215,8 @@ static
>>>>>>>>>>>> u64
>>>>>>>> xelp_ppgtt_pde_encode_pat(struct
>>>>>>>>>> xe_device *xe, u64 pde_pat,
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device
>>>>>>>>>>>> *xe,
>>>>>>>>>>>> u64
>>>>>>>> pte_pat,
>>>>>>>>>>>> - enum xe_cache_level
>>>>>> cache)
>>>>>>>>>>>> + u32 pat_index)
>>>>>>>>>>>> {
>>>>>>>>>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>>>>>>>>>> -
>>>>>>>>>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>>>>>>>>>
>>>>>>>>>>>> if (pat_index & BIT(0)) @@ -1622,7 +1622,7 @@ struct
>>>>>>>>>>>> xe_vm *xe_vm_lookup(struct
>>>> xe_file
>>>>>>>> *xef,
>>>>>>>>>> u32 id)
>>>>>>>>>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct
>>>>>>>>>>>> xe_tile
>>>> *tile)
>>>>>>>>>>>> {
>>>>>>>>>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>>>>>>>>>> - XE_CACHE_WB);
>>>>>>>>>>>> + xe_pat_get_index(vm->xe,
>>>>>> XE_CACHE_WB));
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> static struct dma_fence * @@ -2311,7 +2311,7 @@ static
>>>>>>>>>>>> void print_op(struct xe_device *xe,
>>>>>>>> struct
>>>>>>>>>> drm_gpuva_op *op)
>>>>>>>>>>>> static struct drm_gpuva_ops *
>>>>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo
>> *bo,
>>>>>>>>>>>> u64 bo_offset_or_userptr, u64 addr, u64
>>>>>> range,
>>>>>>>>>>>> - u32 operation, u8 tile_mask, u32 region)
>>>>>>>>>>>> + u32 operation, u8 tile_mask, u32 region, u32
>>>>>>>>>> pat_index)
>>>>>>>>>>>> {
>>>>>>>>>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>>>>>>> struct ww_acquire_ctx ww; @@ -2339,6 +2339,7 @@
>>>>>>>>>>>> vm_bind_ioctl_ops_create(struct xe_vm
>>>>>> *vm,
>>>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>>> struct xe_vma_op *op =
>>>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>>>
>>>>>>>>>>>> op->tile_mask = tile_mask;
>>>>>>>>>>>> + op->pat_index = pat_index;
>>>>>>>>>>>> op->map.immediate =
>>>>>>>>>>>> operation &
>>>>>>>> XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>>>>>>> op->map.read_only =
>>>>>>>>>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>>>> *vm,
>>>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>>> struct xe_vma_op *op =
>>>>>> gpuva_op_to_vma_op(__op);
>>>>>>>>>>>>
>>>>>>>>>>>> op->tile_mask = tile_mask;
>>>>>>>>>>>> + op->pat_index = pat_index;
>>>>>>>>>>>> op->prefetch.region = region;
>>>>>>>>>>>> }
>>>>>>>>>>>> break;
>>>>>>>>>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm
>>>>>> *vm,
>>>>>>>>>> struct xe_bo *bo,
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>>>>>>>> drm_gpuva_op_map *op,
>>>>>>>>>>>> - u8 tile_mask, bool read_only, bool is_null)
>>>>>>>>>>>> + u8 tile_mask, bool read_only, bool is_null,
>>>>>>>>>>>> + u32 pat_index)
>>>>>>>>>>>> {
>>>>>>>>>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op-
>>>>>>> gem.obj) :
>>>>>>>> NULL;
>>>>>>>>>>>> struct xe_vma *vma;
>>>>>>>>>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct
>>>>>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct drm_gpuva_op_map *op,
>>>>>>>>>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>>>>>>> op->va.addr, op->va.addr +
>>>>>>>>>>>> op->va.range - 1, read_only, is_null,
>>>>>>>>>>>> - tile_mask);
>>>>>>>>>>>> + tile_mask, pat_index);
>>>>>>>>>>>> if (bo)
>>>>>>>>>>>> xe_bo_unlock(bo, &ww);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -2539,7 +2542,7 @@ static int
>>>> vm_bind_ioctl_ops_parse(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>>>
>>>>>>>>>>>> vma = new_vma(vm, &op-
>>>>> base.map,
>>>>>>>>>>>> op->tile_mask, op-
>>>>>>>>> map.read_only,
>>>>>>>>>>>> - op->map.is_null);
>>>>>>>>>>>> + op->map.is_null, op-
>>>>>>> pat_index);
>>>>>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>>>>>> err = PTR_ERR(vma);
>>>>>>>>>>>> goto free_fence;
>>>>>>>>>>>> @@ -2567,7 +2570,7 @@ static int
>>>> vm_bind_ioctl_ops_parse(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>>>
>>>>>>>>>>>> vma = new_vma(vm, op-
>>>>>>>>>>> base.remap.prev,
>>>>>>>>>>>> op->tile_mask,
>>>>>> read_only,
>>>>>>>>>>>> - is_null);
>>>>>>>>>>>> + is_null, op-
>>>>>>> pat_index);
>>>>>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>>>>>> err = PTR_ERR(vma);
>>>>>>>>>>>> goto free_fence;
>>>>>>>>>>>> @@ -2603,7 +2606,7 @@ static int
>>>> vm_bind_ioctl_ops_parse(struct
>>>>>>>> xe_vm
>>>>>>>>>> *vm, struct xe_exec_queue *q,
>>>>>>>>>>>>
>>>>>>>>>>>> vma = new_vma(vm, op-
>>>>>>>>>>> base.remap.next,
>>>>>>>>>>>> op->tile_mask,
>>>>>> read_only,
>>>>>>>>>>>> - is_null);
>>>>>>>>>>>> + is_null, op-
>>>>>>> pat_index);
>>>>>>>>>>>> if (IS_ERR(vma)) {
>>>>>>>>>>>> err = PTR_ERR(vma);
>>>>>>>>>>>> goto free_fence;
>>>>>>>>>>>> @@ -3158,8 +3161,14 @@ static int
>>>>>>>>>>>> vm_bind_ioctl_check_args(struct
>>>>>>>>>> xe_device *xe,
>>>>>>>>>>>> u32 obj = (*bind_ops)[i].obj;
>>>>>>>>>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>>>>>>> u32 region = (*bind_ops)[i].region;
>>>>>>>>>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
>>>>>>>>>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>>>>>>>>>
>>>>>>>>>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe-
>>>>>>>>>>> info.pat_table_n_entries)) {
>>>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>>>> + goto free_bind_ops;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> if (i == 0) {
>>>>>>>>>>>> *async = !!(op &
>>>> XE_VM_BIND_FLAG_ASYNC);
>>>>>>>>>>>> } else if (XE_IOCTL_DBG(xe, !*async) || @@ -3346,8
>>>>>> +3355,25
>>>>>>>>>>>> @@ int xe_vm_bind_ioctl(struct drm_device
>>>>>>>> *dev,
>>>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>>> struct drm_gem_object *gem_obj;
>>>>>>>>>>>> u64 range = bind_ops[i].range;
>>>>>>>>>>>> u64 addr = bind_ops[i].addr;
>>>>>>>>>>>> + u32 op = bind_ops[i].op;
>>>>>>>>>>>> u32 obj = bind_ops[i].obj;
>>>>>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>>>> + u16 coh_mode;
>>>>>>>>>>>> +
>>>>>>>>>>>> + pat_index = array_index_nospec(pat_index,
>>>>>>>>>>>> + xe-
>>>>>>> info.pat_table_n_entries);
>>>>>>>>>>>> + coh_mode = xe_pat_index_get_coh_mode(xe,
>>>>>> pat_index);
>>>>>>>>>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>>>>>>>>>
>>>>>>>>>>> Assuming we drop the unusable entries from the TGL table, this
>>>>>>>>>>> should be impossible, right? Any index that makes it past the
>>>>>>>>>>> n_entries check at the top of the function should have a
>>>>>>>>>>> valid, non-zero
>>>>>> coh_mode value.
>>>>>>>>>>> So this should probably be an assertion (to highlight a KMD
>>>>>>>>>>> bug) rather than just a silent uapi failure return.
>>>>>>>>>>
>>>>>>>>>> Makes sense.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>>>> + goto put_obj;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) ==
>>>>>>>>>> XE_VM_BIND_OP_MAP_USERPTR &&
>>>>>>>>>>>> + coh_mode ==
>>>>>> XE_GEM_COHERENCY_NONE))
>>>>>>>> {
>>>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>>>> + goto put_obj;
>>>>>>>>>>>> + }
>>>>>>>>>>>>
>>>>>>>>>>>> if (!obj)
>>>>>>>>>>>> continue;
>>>>>>>>>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct
>>>> drm_device
>>>>>>>> *dev,
>>>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>>> goto put_obj;
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (bos[i]->coh_mode) {
>>>>>>>>>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode !=
>>>>>>>>>> coh_mode)) {
>>>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>>>> + goto put_obj;
>>>>>>>>>>>> + }
>>>>>>>>>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode ==
>>>>>>>>>> XE_GEM_COHERENCY_NONE)) {
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * Imported dma-buf from a different device
>>>>>> should
>>>>>>>>>>>> + * require 1way or 2way coherency since we
>>>>>> don't
>>>>>>>>>> know
>>>>>>>>>>>> + * how it was mapped on CPU. Just assume is
>>>> it
>>>>>>>>>>>> + * potentially cached on CPU side.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + err = -EINVAL;
>>>>>>>>>>>> + goto put_obj;
>>>>>>>>>>>> + }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> if (args->num_syncs) { @@ -3412,10 +3454,11 @@ int
>>>>>>>>>>>> xe_vm_bind_ioctl(struct
>>>> drm_device
>>>>>>>> *dev,
>>>>>>>>>> void *data, struct drm_file *file)
>>>>>>>>>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>>>>>>> u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>>>>>>> u32 region = bind_ops[i].region;
>>>>>>>>>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>>>>>>>>>
>>>>>>>>>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i],
>>>>>> obj_offset,
>>>>>>>>>>>> addr, range, op,
>>>>>> tile_mask,
>>>>>>>>>>>> - region);
>>>>>>>>>>>> + region, pat_index);
>>>>>>>>>>>> if (IS_ERR(ops[i])) {
>>>>>>>>>>>> err = PTR_ERR(ops[i]);
>>>>>>>>>>>> ops[i] = NULL;
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>>>> index 83a1f87b6537..508679ed3c74 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>>>>>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>>>>>>>>>> */
>>>>>>>>>>>> u8 tile_present;
>>>>>>>>>>>>
>>>>>>>>>>>> + /**
>>>>>>>>>>>> + * @pat_index: The pat index to use when encoding the PTEs
>>>>>> for
>>>>>>>>>>>> +this
>>>>>>>>>> vma.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + u32 pat_index;
>>>>>>>>>>>> +
>>>>>>>>>>>> struct {
>>>>>>>>>>>> struct list_head rebind_link;
>>>>>>>>>>>> } notifier;
>>>>>>>>>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>>>>>>>>>> bool batch_invalidate_tlb;
>>>>>>>>>>>>
>>>>>>>>>>>> struct {
>>>>>>>>>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>>>>>>>> - enum xe_cache_level
>>>>>> cache);
>>>>>>>>>>>> - u64 (*pde_encode)(struct xe_device *xe, u64
>>>> pde_pat,
>>>>>>>>>>>> - enum xe_cache_level
>>>>>> cache);
>>>>>>>>>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>>>> u32
>>>>>>>>>> pat_index);
>>>>>>>>>>>> + u64 (*pde_encode)(struct xe_device *xe, u64
>>>> pde_pat,
>>>>>> u32
>>>>>>>>>> pat_index);
>>>>>>>>>>>> } pat_encode;
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>>>>>>>>>> struct async_op_fence *fence;
>>>>>>>>>>>> /** @tile_mask: gt mask for this operation */
>>>>>>>>>>>> u8 tile_mask;
>>>>>>>>>>>> + /** @pat_index: The pat index to use for this operation. */
>>>>>>>>>>>> + u32 pat_index;
>>>>>>>>>>>> /** @flags: operation flags */
>>>>>>>>>>>> enum xe_vma_op_flags flags;
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/uapi/drm/xe_drm.h
>>>>>>>>>>>> b/include/uapi/drm/xe_drm.h index
>> 64bc66d4b550..0c15b6f32447
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>>>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>>>>>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>>>>>>>>>> */
>>>>>>>>>>>> __u32 obj;
>>>>>>>>>>>>
>>>>>>>>>>>> - /** @pad: MBZ */
>>>>>>>>>>>> - __u32 pad;
>>>>>>>>>>>> + /**
>>>>>>>>>>>> + * @pat_index: The platform defined @pat_index to use for
>>>>>> this
>>>>>>>>>> mapping.
>>>>>>>>>>>> + * The index basically maps to some predefined memory
>>>>>> attributes,
>>>>>>>>>>>> + * including things like caching, coherency and likely
>>>>>>>>>>>> +other stuff
>>>>>> in
>>>>>>>>>>>> + * the future. The exact meaning of the pat_index is
>>>>>>>>>>>> +platform specific
>>>>>>>>>>>
>>>>>>>>>>> BTW, "other stuff in the future" already includes compression
>>>>>>>>>>> on Xe2, we just haven't landed the patches for the Xe2 table yet.
>>>>>>>>>>
>>>>>>>>>> Ok, good to know.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> + * and defined in the Bspec and PRMs. When the KMD sets
>>>> up
>>>>>> the
>>>>>>>>>> binding
>>>>>>>>>>>> + * the index here is encoded into the ppGTT PTE.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * For coherency the @pat_index needs to match the
>>>>>>>>>>>> + * drm_xe_gem_create.coh_mode, so either
>>>>>>>>>> XE_GEM_COHERENCY_NONE,
>>>>>>>>>>>> + * XE_GEM_COHERENCY_1WAY or
>>>>>> XE_GEM_COHERENCY_2WAY. The
>>>>>>>>>> KMD will extract
>>>>>>>>>>>> + * the coherency mode from the @pat_index and reject if
>>>>>> there is a
>>>>>>>>>>>> + * mismatch (see note below for pre-MTL platforms).
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Note: On pre-MTL platforms there is only a caching mode
>>>>>> and no
>>>>>>>>>>>> + * explicit coherency mode, but on such hardware there is
>>>>>> always a
>>>>>>>>>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are
>>>>>>>>>>>> +coherent
>>>>>>>> with
>>>>>>>>>>>> + * CPU caches even with the caching mode set as uncached.
>>>>>> It's
>>>>>>>>>>>> +only
>>>>>>>>>> the
>>>>>>>>>>>> + * display engine that is incoherent (on dgpu it must be in
>>>>>> VRAM which
>>>>>>>>>>>> + * is always mapped as WC on the CPU). However to keep the
>>>>>> uapi
>>>>>>>>>> somewhat
>>>>>>>>>>>> + * consistent with newer platforms the KMD groups the
>>>>>> different
>>>>>>>> cache
>>>>>>>>>>>> + * levels into the following coherency buckets on all
>>>>>>>>>>>> +pre-MTL
>>>>>>>> platforms:
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>>>>>>>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>>>>>>>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>>>>>>>>>
>>>>>>>>>>> As noted on the previous patch, it seems like 2-way is
>>>>>>>>>>> appropriate for LLC platforms, but 1-way might be a more
>>>>>>>>>>> accurate description of dGPU behavior.
>>>>>>>>>>>
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * In practice UC/WC/WT should only ever used for scanout
>>>>>>>>>>>> +surfaces
>>>>>>>> on
>>>>>>>>>>>> + * such platforms since it is only the display engine that
>>>>>>>>>>>> +is
>>>>>> actually
>>>>>>>>>>>> + * incoherent. Everything else should typically use WB
>>>>>>>>>>>> +given that we
>>>>>>>>>>>
>>>>>>>>>>> What if we're sharing our buffers with some other (non-GPU)
>>>>>>>>>>> device? Are there cases where that other device wouldn't be
>>>>>>>>>>> coherent with the LLC, so we'd want to use one of these?
>>>>>>>>>>
>>>>>>>>>> Yeah, I guess there might be cases like that. I'll reword.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>> + * have a shared-LLC. On MTL+ this completely changes
>>>>>>>>>>>> +(also
>>>>>>>>>> potentially
>>>>>>>>>>>> + * no shared-LLC) and the HW defines the coherency mode as
>>>>>> part
>>>>>>>>>>>> +of
>>>>>>>>>> the
>>>>>>>>>>>> + * @pat_index.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Note: For userptr and externally imported dma-buf the
>>>>>> kernel
>>>>>>>>>> expects
>>>>>>>>>>>> + * either 1WAY or 2WAY for the @pat_index.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + __u32 pat_index;
>>>>>>>>>>>>
>>>>>>>>>>>> union {
>>>>>>>>>>>> /**
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.41.0
>>>>>>>>>>>>
>>>>>>>>>>>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
2023-08-29 21:36 ` Matt Roper
@ 2023-09-04 20:21 ` Souza, Jose
2023-09-05 9:08 ` Matthew Auld
2023-09-07 18:56 ` Souza, Jose
2 siblings, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-04 20:21 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org, Auld, Matthew
On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> Allow userspace to directly control the pat_index for a given vm
> binding. This should allow directly controlling the coherency, caching
> and potentially other stuff in the future for the ppGTT binding.
>
> The exact meaning behind the pat_index is very platform specific (see
> BSpec or PRMs) but effectively maps to some predefined memory
> attributes. From the KMD pov we only care about the coherency that is
> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> The vm_bind coherency mode for the given pat_index needs to match the
> given coh_mode that was set at object creation. For platforms that lack
> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>
> For userptr mappings we lack a corresponding gem object, so the expected
> coherency mode is instead implicit and must fall into either 1WAY or
> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY.
>
> As part of adding pat_index support with vm_bind we also need stop using
> xe_cache_level and instead use the pat_index in various places. We still
> make use of xe_cache_level, but only as a convenience for kernel
> internal objectsi (internally it maps to some reasonable pat_index). For
> now this is just a 1:1 conversion of the existing code, however for
> platforms like MTL+ we might need to give more control through bo_create
> or stop using WB on the CPU side if we need CPU access.
>
> Bspec: 45101, 44235 #xe
> Bspec: 70552, 71582, 59400 #xe2
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> 10 files changed, 134 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 4d6296cdbcfd..bb4c182048c3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -302,7 +302,7 @@ struct i915_address_space {
> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>
> u64 (*pte_encode)(dma_addr_t addr,
> - unsigned int pat_index,
> + u32 pat_index,
> u32 flags); /* Create a valid PTE */
> #define PTE_READ_ONLY BIT(0)
> #define PTE_LM BIT(1)
> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> index 5c8d5e78d9bc..7a128fd20a29 100644
> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> /* First part of the test, are we updating our pagetable bo with a new entry? */
> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> 0xdeaddeadbeefbeef);
> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> if (m->q->vm->flags & XE_VM_FLAG_64K)
> expected |= XE_PTE_PS64;
> if (xe_bo_is_vram(pt))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> index 209fa053d9fb..4134c26150a5 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> pte |= XE_GGTT_PTE_DM;
>
> if ((ggtt->pat_encode).pte_encode)
> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>
> return pte;
> }
> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> }
>
> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>
> if (pat_index & BIT(0))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> index 7e55fac1a8a9..0bc40cb072e3 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> @@ -31,7 +31,7 @@ struct xe_ggtt {
>
> struct {
> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache);
> + u32 pat_index);
> } pat_encode;
> };
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index a782ea282cb6..54585e98452a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -24,6 +24,7 @@
> #include "xe_lrc.h"
> #include "xe_map.h"
> #include "xe_mocs.h"
> +#include "xe_pat.h"
> #include "xe_pt.h"
> #include "xe_res_cursor.h"
> #include "xe_sched_job.h"
> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> u32 map_ofs, level, i;
> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> u64 entry;
> int ret;
>
> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> return ret;
> }
>
> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>
> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>
> /* Map the entire BO in our level 0 pt */
> for (i = 0, level = 0; i < num_entries; level++) {
> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>
> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> for (i = 0; i < batch->size;
> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> XE_PAGE_SIZE) {
> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> + entry = xe_pte_encode(batch, i, pat_index, 0);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> entry);
> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> flags = XE_PDE_64K;
>
> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> - XE_PAGE_SIZE, XE_CACHE_WB);
> + XE_PAGE_SIZE, pat_index);
> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> entry | flags);
> }
> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> /* Write PDE's that point to our BO. */
> for (i = 0; i < num_entries - num_level; i++) {
> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> - XE_CACHE_WB);
> + pat_index);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> (i + 1) * 8, u64, entry);
> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>
> XE_WARN_ON(pt_bo->size != SZ_4K);
>
> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> bb->cs[bb->len++] = lower_32_bits(addr);
> bb->cs[bb->len++] = upper_32_bits(addr);
> }
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 64713f400d94..019af2920078 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -10,6 +10,7 @@
> #include "xe_gt.h"
> #include "xe_gt_tlb_invalidation.h"
> #include "xe_migrate.h"
> +#include "xe_pat.h"
> #include "xe_pt_types.h"
> #include "xe_pt_walk.h"
> #include "xe_res_cursor.h"
> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> *
> * Return: An encoded page directory entry. No errors.
> */
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> - const enum xe_cache_level cache)
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> {
> u64 pde;
> struct xe_vm *vm = bo->vm;
> struct xe_device *xe = vm->xe;
>
> -
> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>
> if ((vm->pat_encode).pde_encode)
> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>
> return pde;
> }
>
> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> +static u64 __pte_encode(u64 pte, u32 pat_index,
> struct xe_vma *vma, u32 pt_level)
> {
> struct xe_vm *vm = xe_vma_vm(vma);
> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> pte |= XE_PTE_NULL;
>
> if ((vm->pat_encode).pte_encode)
> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>
> if (pt_level == 1)
> pte |= XE_PDE_PS_2M;
> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> *
> * Return: An encoded page-table entry. No errors.
> */
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> u32 pt_level)
> {
> u64 pte;
> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> pte |= XE_PPGTT_PTE_DM;
>
> - return __pte_encode(pte, cache, NULL, pt_level);
> + return __pte_encode(pte, pat_index, NULL, pt_level);
> }
>
> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>
> if (level == 0) {
> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> - XE_CACHE_WB, 0);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>
> return empty;
> } else {
> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> - XE_CACHE_WB);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> }
> }
>
> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> struct xe_vm *vm;
> /** @tile: The tile we're building for. */
> struct xe_tile *tile;
> - /** @cache: Desired cache level for the ptes */
> - enum xe_cache_level cache;
> /** @default_pte: PTE flag only template. No address is associated */
> u64 default_pte;
> /** @dma_offset: DMA offset to add to the PTE. */
> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>
> pte = __pte_encode(is_null ? 0 :
> xe_res_dma(curs) + xe_walk->dma_offset,
> - xe_walk->cache, xe_walk->vma, level);
> + xe_walk->vma->pat_index, xe_walk->vma, level);
> pte |= xe_walk->default_pte;
>
> /*
> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> xe_child->is_compact = true;
> }
>
> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> pte);
> }
> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> - xe_walk.cache = XE_CACHE_WB;
> - } else {
> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> - xe_walk.cache = XE_CACHE_WT;
> - else
> - xe_walk.cache = XE_CACHE_WB;
> }
> +
> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 01be7ab08f87..1d433a5a96b4 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>
> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> - const enum xe_cache_level level);
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> - u32 pt_level);
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 7eeeed0411f3..34603a7e84b0 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -6,6 +6,7 @@
> #include "xe_vm.h"
>
> #include <linux/dma-fence-array.h>
> +#include <linux/nospec.h>
>
> #include <drm/drm_print.h>
> #include <drm/ttm/ttm_execbuf_util.h>
> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> u64 start, u64 end,
> bool read_only,
> bool is_null,
> - u8 tile_mask)
> + u8 tile_mask,
> + u32 pat_index)
> {
> struct xe_vma *vma;
> struct xe_tile *tile;
> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> vma->tile_mask |= 0x1 << id;
> }
>
> + vma->pat_index = pat_index;
> +
> if (vm->xe->info.platform == XE_PVC)
> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>
> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> static void vm_destroy_work_func(struct work_struct *w);
>
> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pde_pat &= ~(XELP_PDE_PAT_MASK);
>
> if (pat_index & BIT(0))
> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> }
>
> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pte_pat &= ~(XELP_PTE_PAT_MASK);
>
> if (pat_index & BIT(0))
> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> {
> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> - XE_CACHE_WB);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> }
>
> static struct dma_fence *
> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> static struct drm_gpuva_ops *
> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> u64 bo_offset_or_userptr, u64 addr, u64 range,
> - u32 operation, u8 tile_mask, u32 region)
> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> {
> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> struct ww_acquire_ctx ww;
> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>
> op->tile_mask = tile_mask;
> + op->pat_index = pat_index;
> op->map.immediate =
> operation & XE_VM_BIND_FLAG_IMMEDIATE;
> op->map.read_only =
> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>
> op->tile_mask = tile_mask;
> + op->pat_index = pat_index;
> op->prefetch.region = region;
> }
> break;
> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> }
>
> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> - u8 tile_mask, bool read_only, bool is_null)
> + u8 tile_mask, bool read_only, bool is_null,
> + u32 pat_index)
> {
> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> struct xe_vma *vma;
> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> vma = xe_vma_create(vm, bo, op->gem.offset,
> op->va.addr, op->va.addr +
> op->va.range - 1, read_only, is_null,
> - tile_mask);
> + tile_mask, pat_index);
> if (bo)
> xe_bo_unlock(bo, &ww);
>
> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, &op->base.map,
> op->tile_mask, op->map.read_only,
> - op->map.is_null);
> + op->map.is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, op->base.remap.prev,
> op->tile_mask, read_only,
> - is_null);
> + is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, op->base.remap.next,
> op->tile_mask, read_only,
> - is_null);
> + is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> u32 obj = (*bind_ops)[i].obj;
> u64 obj_offset = (*bind_ops)[i].obj_offset;
> u32 region = (*bind_ops)[i].region;
> + u32 pat_index = (*bind_ops)[i].pat_index;
> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>
> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> + err = -EINVAL;
> + goto free_bind_ops;
> + }
> +
> if (i == 0) {
> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> } else if (XE_IOCTL_DBG(xe, !*async) ||
> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> struct drm_gem_object *gem_obj;
> u64 range = bind_ops[i].range;
> u64 addr = bind_ops[i].addr;
> + u32 op = bind_ops[i].op;
> u32 obj = bind_ops[i].obj;
> u64 obj_offset = bind_ops[i].obj_offset;
> + u32 pat_index = bind_ops[i].pat_index;
> + u16 coh_mode;
> +
> + pat_index = array_index_nospec(pat_index,
> + xe->info.pat_table_n_entries);
> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
In my opinion this check belongs to vm_bind_ioctl_check_args().
> +
> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> + coh_mode == XE_GEM_COHERENCY_NONE)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
Same for this one.
>
> if (!obj)
> continue;
> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> goto put_obj;
> }
> }
> +
> + if (bos[i]->coh_mode) {
> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> + /*
> + * Imported dma-buf from a different device should
> + * require 1way or 2way coherency since we don't know
> + * how it was mapped on CPU. Just assume is it
> + * potentially cached on CPU side.
> + */
> + err = -EINVAL;
> + goto put_obj;
> + }
> }
>
> if (args->num_syncs) {
> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> u64 obj_offset = bind_ops[i].obj_offset;
> u8 tile_mask = bind_ops[i].tile_mask;
> u32 region = bind_ops[i].region;
> + u32 pat_index = bind_ops[i].pat_index;
>
> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> addr, range, op, tile_mask,
> - region);
> + region, pat_index);
> if (IS_ERR(ops[i])) {
> err = PTR_ERR(ops[i]);
> ops[i] = NULL;
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 83a1f87b6537..508679ed3c74 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -111,6 +111,11 @@ struct xe_vma {
> */
> u8 tile_present;
>
> + /**
> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
> + */
> + u32 pat_index;
u32 is overkill for pat_index.
> +
> struct {
> struct list_head rebind_link;
> } notifier;
> @@ -338,10 +343,8 @@ struct xe_vm {
> bool batch_invalidate_tlb;
>
> struct {
> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache);
> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> - enum xe_cache_level cache);
> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> } pat_encode;
> };
>
> @@ -417,6 +420,8 @@ struct xe_vma_op {
> struct async_op_fence *fence;
> /** @tile_mask: gt mask for this operation */
> u8 tile_mask;
> + /** @pat_index: The pat index to use for this operation. */
> + u32 pat_index;
> /** @flags: operation flags */
> enum xe_vma_op_flags flags;
>
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 64bc66d4b550..0c15b6f32447 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> */
> __u32 obj;
>
> - /** @pad: MBZ */
> - __u32 pad;
> + /**
> + * @pat_index: The platform defined @pat_index to use for this mapping.
> + * The index basically maps to some predefined memory attributes,
> + * including things like caching, coherency and likely other stuff in
> + * the future. The exact meaning of the pat_index is platform specific
> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
> + * the index here is encoded into the ppGTT PTE.
> + *
> + * For coherency the @pat_index needs to match the
> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> + * the coherency mode from the @pat_index and reject if there is a
> + * mismatch (see note below for pre-MTL platforms).
> + *
> + * Note: On pre-MTL platforms there is only a caching mode and no
> + * explicit coherency mode, but on such hardware there is always a
> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> + * CPU caches even with the caching mode set as uncached. It's only the
> + * display engine that is incoherent (on dgpu it must be in VRAM which
> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
> + * consistent with newer platforms the KMD groups the different cache
> + * levels into the following coherency buckets on all pre-MTL platforms:
> + *
> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> + *
> + * In practice UC/WC/WT should only ever used for scanout surfaces on
> + * such platforms since it is only the display engine that is actually
> + * incoherent. Everything else should typically use WB given that we
> + * have a shared-LLC. On MTL+ this completely changes (also potentially
> + * no shared-LLC) and the HW defines the coherency mode as part of the
> + * @pat_index.
> + *
> + * Note: For userptr and externally imported dma-buf the kernel expects
> + * either 1WAY or 2WAY for the @pat_index.
> + */
> + __u32 pat_index;
>
> union {
> /**
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-04 20:21 ` Souza, Jose
@ 2023-09-05 9:08 ` Matthew Auld
0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-09-05 9:08 UTC (permalink / raw)
To: Souza, Jose, intel-xe@lists.freedesktop.org
On 04/09/2023 21:21, Souza, Jose wrote:
> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>
>> u64 (*pte_encode)(dma_addr_t addr,
>> - unsigned int pat_index,
>> + u32 pat_index,
>> u32 flags); /* Create a valid PTE */
>> #define PTE_READ_ONLY BIT(0)
>> #define PTE_LM BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>> /* First part of the test, are we updating our pagetable bo with a new entry? */
>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>> 0xdeaddeadbeefbeef);
>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>> expected |= XE_PTE_PS64;
>> if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>> pte |= XE_GGTT_PTE_DM;
>>
>> if ((ggtt->pat_encode).pte_encode)
>> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>
>> return pte;
>> }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>> }
>>
>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>
>> struct {
>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> + u32 pat_index);
>> } pat_encode;
>> };
>>
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>> #include "xe_lrc.h"
>> #include "xe_map.h"
>> #include "xe_mocs.h"
>> +#include "xe_pat.h"
>> #include "xe_pt.h"
>> #include "xe_res_cursor.h"
>> #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>> u32 map_ofs, level, i;
>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>> u64 entry;
>> int ret;
>>
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> return ret;
>> }
>>
>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>
>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>
>> /* Map the entire BO in our level 0 pt */
>> for (i = 0, level = 0; i < num_entries; level++) {
>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> for (i = 0; i < batch->size;
>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>> XE_PAGE_SIZE) {
>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>> entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> flags = XE_PDE_64K;
>>
>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> - XE_PAGE_SIZE, XE_CACHE_WB);
>> + XE_PAGE_SIZE, pat_index);
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>> entry | flags);
>> }
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> /* Write PDE's that point to our BO. */
>> for (i = 0; i < num_entries - num_level; i++) {
>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> - XE_CACHE_WB);
>> + pat_index);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>> (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>
>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>
>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> bb->cs[bb->len++] = lower_32_bits(addr);
>> bb->cs[bb->len++] = upper_32_bits(addr);
>> }
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>> #include "xe_gt.h"
>> #include "xe_gt_tlb_invalidation.h"
>> #include "xe_migrate.h"
>> +#include "xe_pat.h"
>> #include "xe_pt_types.h"
>> #include "xe_pt_walk.h"
>> #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>> *
>> * Return: An encoded page directory entry. No errors.
>> */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>> {
>> u64 pde;
>> struct xe_vm *vm = bo->vm;
>> struct xe_device *xe = vm->xe;
>>
>> -
>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>
>> if ((vm->pat_encode).pde_encode)
>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>
>> return pde;
>> }
>>
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>> struct xe_vma *vma, u32 pt_level)
>> {
>> struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> pte |= XE_PTE_NULL;
>>
>> if ((vm->pat_encode).pte_encode)
>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>
>> if (pt_level == 1)
>> pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> *
>> * Return: An encoded page-table entry. No errors.
>> */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>> u32 pt_level)
>> {
>> u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>> pte |= XE_PPGTT_PTE_DM;
>>
>> - return __pte_encode(pte, cache, NULL, pt_level);
>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>> }
>>
>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>
>> if (level == 0) {
>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> - XE_CACHE_WB, 0);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>
>> return empty;
>> } else {
>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>> }
>>
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>> struct xe_vm *vm;
>> /** @tile: The tile we're building for. */
>> struct xe_tile *tile;
>> - /** @cache: Desired cache level for the ptes */
>> - enum xe_cache_level cache;
>> /** @default_pte: PTE flag only template. No address is associated */
>> u64 default_pte;
>> /** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>
>> pte = __pte_encode(is_null ? 0 :
>> xe_res_dma(curs) + xe_walk->dma_offset,
>> - xe_walk->cache, xe_walk->vma, level);
>> + xe_walk->vma->pat_index, xe_walk->vma, level);
>> pte |= xe_walk->default_pte;
>>
>> /*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>> xe_child->is_compact = true;
>> }
>>
>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>> pte);
>> }
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> - xe_walk.cache = XE_CACHE_WB;
>> - } else {
>> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> - xe_walk.cache = XE_CACHE_WT;
>> - else
>> - xe_walk.cache = XE_CACHE_WB;
>> }
>> +
>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>
>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> - u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>> #include "xe_vm.h"
>>
>> #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>
>> #include <drm/drm_print.h>
>> #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> u64 start, u64 end,
>> bool read_only,
>> bool is_null,
>> - u8 tile_mask)
>> + u8 tile_mask,
>> + u32 pat_index)
>> {
>> struct xe_vma *vma;
>> struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> vma->tile_mask |= 0x1 << id;
>> }
>>
>> + vma->pat_index = pat_index;
>> +
>> if (vm->xe->info.platform == XE_PVC)
>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>> static void vm_destroy_work_func(struct work_struct *w);
>>
>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> }
>>
>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>> {
>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>>
>> static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>> static struct drm_gpuva_ops *
>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> u64 bo_offset_or_userptr, u64 addr, u64 range,
>> - u32 operation, u8 tile_mask, u32 region)
>> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>> {
>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>> struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->map.immediate =
>> operation & XE_VM_BIND_FLAG_IMMEDIATE;
>> op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->prefetch.region = region;
>> }
>> break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> }
>>
>> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> - u8 tile_mask, bool read_only, bool is_null)
>> + u8 tile_mask, bool read_only, bool is_null,
>> + u32 pat_index)
>> {
>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>> struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> vma = xe_vma_create(vm, bo, op->gem.offset,
>> op->va.addr, op->va.addr +
>> op->va.range - 1, read_only, is_null,
>> - tile_mask);
>> + tile_mask, pat_index);
>> if (bo)
>> xe_bo_unlock(bo, &ww);
>>
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, &op->base.map,
>> op->tile_mask, op->map.read_only,
>> - op->map.is_null);
>> + op->map.is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.prev,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.next,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>> u32 obj = (*bind_ops)[i].obj;
>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>> u32 region = (*bind_ops)[i].region;
>> + u32 pat_index = (*bind_ops)[i].pat_index;
>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>
>> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> + err = -EINVAL;
>> + goto free_bind_ops;
>> + }
>> +
>> if (i == 0) {
>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>> } else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> struct drm_gem_object *gem_obj;
>> u64 range = bind_ops[i].range;
>> u64 addr = bind_ops[i].addr;
>> + u32 op = bind_ops[i].op;
>> u32 obj = bind_ops[i].obj;
>> u64 obj_offset = bind_ops[i].obj_offset;
>> + u32 pat_index = bind_ops[i].pat_index;
>> + u16 coh_mode;
>> +
>> + pat_index = array_index_nospec(pat_index,
>> + xe->info.pat_table_n_entries);
>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>
> In my opinion this check belongs to vm_bind_ioctl_check_args().
>
>> +
>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> + coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>
> Same for this one.
OK, will take a look.
>
>>
>> if (!obj)
>> continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> goto put_obj;
>> }
>> }
>> +
>> + if (bos[i]->coh_mode) {
>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + /*
>> + * Imported dma-buf from a different device should
>> + * require 1way or 2way coherency since we don't know
>> + * how it was mapped on CPU. Just assume is it
>> + * potentially cached on CPU side.
>> + */
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> }
>>
>> if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> u64 obj_offset = bind_ops[i].obj_offset;
>> u8 tile_mask = bind_ops[i].tile_mask;
>> u32 region = bind_ops[i].region;
>> + u32 pat_index = bind_ops[i].pat_index;
>>
>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>> addr, range, op, tile_mask,
>> - region);
>> + region, pat_index);
>> if (IS_ERR(ops[i])) {
>> err = PTR_ERR(ops[i]);
>> ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>> */
>> u8 tile_present;
>>
>> + /**
>> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> + */
>> + u32 pat_index;
>
> u32 is overkill for pat_index.
What about u16 instead?
>
>> +
>> struct {
>> struct list_head rebind_link;
>> } notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>> bool batch_invalidate_tlb;
>>
>> struct {
>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache);
>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>> } pat_encode;
>> };
>>
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>> struct async_op_fence *fence;
>> /** @tile_mask: gt mask for this operation */
>> u8 tile_mask;
>> + /** @pat_index: The pat index to use for this operation. */
>> + u32 pat_index;
>> /** @flags: operation flags */
>> enum xe_vma_op_flags flags;
>>
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>> */
>> __u32 obj;
>>
>> - /** @pad: MBZ */
>> - __u32 pad;
>> + /**
>> + * @pat_index: The platform defined @pat_index to use for this mapping.
>> + * The index basically maps to some predefined memory attributes,
>> + * including things like caching, coherency and likely other stuff in
>> + * the future. The exact meaning of the pat_index is platform specific
>> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
>> + * the index here is encoded into the ppGTT PTE.
>> + *
>> + * For coherency the @pat_index needs to match the
>> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> + * the coherency mode from the @pat_index and reject if there is a
>> + * mismatch (see note below for pre-MTL platforms).
>> + *
>> + * Note: On pre-MTL platforms there is only a caching mode and no
>> + * explicit coherency mode, but on such hardware there is always a
>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> + * CPU caches even with the caching mode set as uncached. It's only the
>> + * display engine that is incoherent (on dgpu it must be in VRAM which
>> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> + * consistent with newer platforms the KMD groups the different cache
>> + * levels into the following coherency buckets on all pre-MTL platforms:
>> + *
>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>> + *
>> + * In practice UC/WC/WT should only ever used for scanout surfaces on
>> + * such platforms since it is only the display engine that is actually
>> + * incoherent. Everything else should typically use WB given that we
>> + * have a shared-LLC. On MTL+ this completely changes (also potentially
>> + * no shared-LLC) and the HW defines the coherency mode as part of the
>> + * @pat_index.
>> + *
>> + * Note: For userptr and externally imported dma-buf the kernel expects
>> + * either 1WAY or 2WAY for the @pat_index.
>> + */
>> + __u32 pat_index;
>>
>> union {
>> /**
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-08-29 16:28 ` [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind Matthew Auld
2023-08-29 21:36 ` Matt Roper
2023-09-04 20:21 ` Souza, Jose
@ 2023-09-07 18:56 ` Souza, Jose
2023-09-08 6:51 ` Matthew Auld
2 siblings, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-07 18:56 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org, Auld, Matthew
On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> Allow userspace to directly control the pat_index for a given vm
> binding. This should allow directly controlling the coherency, caching
> and potentially other stuff in the future for the ppGTT binding.
>
> The exact meaning behind the pat_index is very platform specific (see
> BSpec or PRMs) but effectively maps to some predefined memory
> attributes. From the KMD pov we only care about the coherency that is
> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> The vm_bind coherency mode for the given pat_index needs to match the
> given coh_mode that was set at object creation. For platforms that lack
> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>
> For userptr mappings we lack a corresponding gem object, so the expected
> coherency mode is instead implicit and must fall into either 1WAY or
> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> dma-buf (from a different device) the coherency mode is also implicit
> and must also be either 1WAY or 2WAY.
>
> As part of adding pat_index support with vm_bind we also need stop using
> xe_cache_level and instead use the pat_index in various places. We still
> make use of xe_cache_level, but only as a convenience for kernel
> internal objectsi (internally it maps to some reasonable pat_index). For
> now this is just a 1:1 conversion of the existing code, however for
> platforms like MTL+ we might need to give more control through bo_create
> or stop using WB on the CPU side if we need CPU access.
xe_ttm_tt_create() caching should not also take into consideration the PAT index?
>
> Bspec: 45101, 44235 #xe
> Bspec: 70552, 71582, 59400 #xe2
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: José Roberto de Souza <jose.souza@intel.com>
> Cc: Filip Hazubski <filip.hazubski@intel.com>
> Cc: Carl Zhang <carl.zhang@intel.com>
> Cc: Effie Yu <effie.yu@intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> drivers/gpu/drm/xe/xe_pt.h | 6 +--
> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> 10 files changed, 134 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 4d6296cdbcfd..bb4c182048c3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -302,7 +302,7 @@ struct i915_address_space {
> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>
> u64 (*pte_encode)(dma_addr_t addr,
> - unsigned int pat_index,
> + u32 pat_index,
> u32 flags); /* Create a valid PTE */
> #define PTE_READ_ONLY BIT(0)
> #define PTE_LM BIT(1)
> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> index 5c8d5e78d9bc..7a128fd20a29 100644
> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> /* First part of the test, are we updating our pagetable bo with a new entry? */
> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> 0xdeaddeadbeefbeef);
> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> if (m->q->vm->flags & XE_VM_FLAG_64K)
> expected |= XE_PTE_PS64;
> if (xe_bo_is_vram(pt))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> index 209fa053d9fb..4134c26150a5 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> pte |= XE_GGTT_PTE_DM;
>
> if ((ggtt->pat_encode).pte_encode)
> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>
> return pte;
> }
> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> }
>
> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>
> if (pat_index & BIT(0))
> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> index 7e55fac1a8a9..0bc40cb072e3 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> @@ -31,7 +31,7 @@ struct xe_ggtt {
>
> struct {
> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache);
> + u32 pat_index);
> } pat_encode;
> };
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index a782ea282cb6..54585e98452a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -24,6 +24,7 @@
> #include "xe_lrc.h"
> #include "xe_map.h"
> #include "xe_mocs.h"
> +#include "xe_pat.h"
> #include "xe_pt.h"
> #include "xe_res_cursor.h"
> #include "xe_sched_job.h"
> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> u32 map_ofs, level, i;
> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> u64 entry;
> int ret;
>
> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> return ret;
> }
>
> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>
> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>
> /* Map the entire BO in our level 0 pt */
> for (i = 0, level = 0; i < num_entries; level++) {
> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>
> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> for (i = 0; i < batch->size;
> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> XE_PAGE_SIZE) {
> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> + entry = xe_pte_encode(batch, i, pat_index, 0);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> entry);
> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> flags = XE_PDE_64K;
>
> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> - XE_PAGE_SIZE, XE_CACHE_WB);
> + XE_PAGE_SIZE, pat_index);
> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> entry | flags);
> }
> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> /* Write PDE's that point to our BO. */
> for (i = 0; i < num_entries - num_level; i++) {
> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> - XE_CACHE_WB);
> + pat_index);
>
> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> (i + 1) * 8, u64, entry);
> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>
> XE_WARN_ON(pt_bo->size != SZ_4K);
>
> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> bb->cs[bb->len++] = lower_32_bits(addr);
> bb->cs[bb->len++] = upper_32_bits(addr);
> }
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 64713f400d94..019af2920078 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -10,6 +10,7 @@
> #include "xe_gt.h"
> #include "xe_gt_tlb_invalidation.h"
> #include "xe_migrate.h"
> +#include "xe_pat.h"
> #include "xe_pt_types.h"
> #include "xe_pt_walk.h"
> #include "xe_res_cursor.h"
> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> *
> * Return: An encoded page directory entry. No errors.
> */
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> - const enum xe_cache_level cache)
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> {
> u64 pde;
> struct xe_vm *vm = bo->vm;
> struct xe_device *xe = vm->xe;
>
> -
> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>
> if ((vm->pat_encode).pde_encode)
> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>
> return pde;
> }
>
> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> +static u64 __pte_encode(u64 pte, u32 pat_index,
> struct xe_vma *vma, u32 pt_level)
> {
> struct xe_vm *vm = xe_vma_vm(vma);
> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> pte |= XE_PTE_NULL;
>
> if ((vm->pat_encode).pte_encode)
> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>
> if (pt_level == 1)
> pte |= XE_PDE_PS_2M;
> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> *
> * Return: An encoded page-table entry. No errors.
> */
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> u32 pt_level)
> {
> u64 pte;
> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> pte |= XE_PPGTT_PTE_DM;
>
> - return __pte_encode(pte, cache, NULL, pt_level);
> + return __pte_encode(pte, pat_index, NULL, pt_level);
> }
>
> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>
> if (level == 0) {
> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> - XE_CACHE_WB, 0);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>
> return empty;
> } else {
> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> - XE_CACHE_WB);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> }
> }
>
> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> struct xe_vm *vm;
> /** @tile: The tile we're building for. */
> struct xe_tile *tile;
> - /** @cache: Desired cache level for the ptes */
> - enum xe_cache_level cache;
> /** @default_pte: PTE flag only template. No address is associated */
> u64 default_pte;
> /** @dma_offset: DMA offset to add to the PTE. */
> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>
> pte = __pte_encode(is_null ? 0 :
> xe_res_dma(curs) + xe_walk->dma_offset,
> - xe_walk->cache, xe_walk->vma, level);
> + xe_walk->vma->pat_index, xe_walk->vma, level);
> pte |= xe_walk->default_pte;
>
> /*
> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> xe_child->is_compact = true;
> }
>
> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> pte);
> }
> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> - xe_walk.cache = XE_CACHE_WB;
> - } else {
> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> - xe_walk.cache = XE_CACHE_WT;
> - else
> - xe_walk.cache = XE_CACHE_WB;
> }
> +
> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 01be7ab08f87..1d433a5a96b4 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>
> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>
> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> - const enum xe_cache_level level);
> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>
> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> - u32 pt_level);
> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 7eeeed0411f3..34603a7e84b0 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -6,6 +6,7 @@
> #include "xe_vm.h"
>
> #include <linux/dma-fence-array.h>
> +#include <linux/nospec.h>
>
> #include <drm/drm_print.h>
> #include <drm/ttm/ttm_execbuf_util.h>
> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> u64 start, u64 end,
> bool read_only,
> bool is_null,
> - u8 tile_mask)
> + u8 tile_mask,
> + u32 pat_index)
> {
> struct xe_vma *vma;
> struct xe_tile *tile;
> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> vma->tile_mask |= 0x1 << id;
> }
>
> + vma->pat_index = pat_index;
> +
> if (vm->xe->info.platform == XE_PVC)
> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>
> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> static void vm_destroy_work_func(struct work_struct *w);
>
> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pde_pat &= ~(XELP_PDE_PAT_MASK);
>
> if (pat_index & BIT(0))
> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> }
>
> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache)
> + u32 pat_index)
> {
> - u32 pat_index = xe_pat_get_index(xe, cache);
> -
> pte_pat &= ~(XELP_PTE_PAT_MASK);
>
> if (pat_index & BIT(0))
> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> {
> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> - XE_CACHE_WB);
> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> }
>
> static struct dma_fence *
> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> static struct drm_gpuva_ops *
> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> u64 bo_offset_or_userptr, u64 addr, u64 range,
> - u32 operation, u8 tile_mask, u32 region)
> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> {
> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> struct ww_acquire_ctx ww;
> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>
> op->tile_mask = tile_mask;
> + op->pat_index = pat_index;
> op->map.immediate =
> operation & XE_VM_BIND_FLAG_IMMEDIATE;
> op->map.read_only =
> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>
> op->tile_mask = tile_mask;
> + op->pat_index = pat_index;
> op->prefetch.region = region;
> }
> break;
> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> }
>
> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> - u8 tile_mask, bool read_only, bool is_null)
> + u8 tile_mask, bool read_only, bool is_null,
> + u32 pat_index)
> {
> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> struct xe_vma *vma;
> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> vma = xe_vma_create(vm, bo, op->gem.offset,
> op->va.addr, op->va.addr +
> op->va.range - 1, read_only, is_null,
> - tile_mask);
> + tile_mask, pat_index);
> if (bo)
> xe_bo_unlock(bo, &ww);
>
> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, &op->base.map,
> op->tile_mask, op->map.read_only,
> - op->map.is_null);
> + op->map.is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, op->base.remap.prev,
> op->tile_mask, read_only,
> - is_null);
> + is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>
> vma = new_vma(vm, op->base.remap.next,
> op->tile_mask, read_only,
> - is_null);
> + is_null, op->pat_index);
> if (IS_ERR(vma)) {
> err = PTR_ERR(vma);
> goto free_fence;
> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> u32 obj = (*bind_ops)[i].obj;
> u64 obj_offset = (*bind_ops)[i].obj_offset;
> u32 region = (*bind_ops)[i].region;
> + u32 pat_index = (*bind_ops)[i].pat_index;
> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>
> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> + err = -EINVAL;
> + goto free_bind_ops;
> + }
> +
> if (i == 0) {
> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> } else if (XE_IOCTL_DBG(xe, !*async) ||
> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> struct drm_gem_object *gem_obj;
> u64 range = bind_ops[i].range;
> u64 addr = bind_ops[i].addr;
> + u32 op = bind_ops[i].op;
> u32 obj = bind_ops[i].obj;
> u64 obj_offset = bind_ops[i].obj_offset;
> + u32 pat_index = bind_ops[i].pat_index;
> + u16 coh_mode;
> +
> + pat_index = array_index_nospec(pat_index,
> + xe->info.pat_table_n_entries);
> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
> +
> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> + coh_mode == XE_GEM_COHERENCY_NONE)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
>
> if (!obj)
> continue;
> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> goto put_obj;
> }
> }
> +
> + if (bos[i]->coh_mode) {
> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> + err = -EINVAL;
> + goto put_obj;
> + }
> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> + /*
> + * Imported dma-buf from a different device should
> + * require 1way or 2way coherency since we don't know
> + * how it was mapped on CPU. Just assume is it
> + * potentially cached on CPU side.
> + */
> + err = -EINVAL;
> + goto put_obj;
> + }
> }
>
> if (args->num_syncs) {
> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> u64 obj_offset = bind_ops[i].obj_offset;
> u8 tile_mask = bind_ops[i].tile_mask;
> u32 region = bind_ops[i].region;
> + u32 pat_index = bind_ops[i].pat_index;
>
> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> addr, range, op, tile_mask,
> - region);
> + region, pat_index);
> if (IS_ERR(ops[i])) {
> err = PTR_ERR(ops[i]);
> ops[i] = NULL;
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index 83a1f87b6537..508679ed3c74 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -111,6 +111,11 @@ struct xe_vma {
> */
> u8 tile_present;
>
> + /**
> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
> + */
> + u32 pat_index;
> +
> struct {
> struct list_head rebind_link;
> } notifier;
> @@ -338,10 +343,8 @@ struct xe_vm {
> bool batch_invalidate_tlb;
>
> struct {
> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> - enum xe_cache_level cache);
> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> - enum xe_cache_level cache);
> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> } pat_encode;
> };
>
> @@ -417,6 +420,8 @@ struct xe_vma_op {
> struct async_op_fence *fence;
> /** @tile_mask: gt mask for this operation */
> u8 tile_mask;
> + /** @pat_index: The pat index to use for this operation. */
> + u32 pat_index;
> /** @flags: operation flags */
> enum xe_vma_op_flags flags;
>
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 64bc66d4b550..0c15b6f32447 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> */
> __u32 obj;
>
> - /** @pad: MBZ */
> - __u32 pad;
> + /**
> + * @pat_index: The platform defined @pat_index to use for this mapping.
> + * The index basically maps to some predefined memory attributes,
> + * including things like caching, coherency and likely other stuff in
> + * the future. The exact meaning of the pat_index is platform specific
> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
> + * the index here is encoded into the ppGTT PTE.
> + *
> + * For coherency the @pat_index needs to match the
> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> + * the coherency mode from the @pat_index and reject if there is a
> + * mismatch (see note below for pre-MTL platforms).
> + *
> + * Note: On pre-MTL platforms there is only a caching mode and no
> + * explicit coherency mode, but on such hardware there is always a
> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> + * CPU caches even with the caching mode set as uncached. It's only the
> + * display engine that is incoherent (on dgpu it must be in VRAM which
> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
> + * consistent with newer platforms the KMD groups the different cache
> + * levels into the following coherency buckets on all pre-MTL platforms:
> + *
> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> + *
> + * In practice UC/WC/WT should only ever used for scanout surfaces on
> + * such platforms since it is only the display engine that is actually
> + * incoherent. Everything else should typically use WB given that we
> + * have a shared-LLC. On MTL+ this completely changes (also potentially
> + * no shared-LLC) and the HW defines the coherency mode as part of the
> + * @pat_index.
> + *
> + * Note: For userptr and externally imported dma-buf the kernel expects
> + * either 1WAY or 2WAY for the @pat_index.
> + */
> + __u32 pat_index;
>
> union {
> /**
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-07 18:56 ` Souza, Jose
@ 2023-09-08 6:51 ` Matthew Auld
2023-09-13 15:35 ` Souza, Jose
0 siblings, 1 reply; 47+ messages in thread
From: Matthew Auld @ 2023-09-08 6:51 UTC (permalink / raw)
To: Souza, Jose, intel-xe@lists.freedesktop.org
On 07/09/2023 19:56, Souza, Jose wrote:
> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>> Allow userspace to directly control the pat_index for a given vm
>> binding. This should allow directly controlling the coherency, caching
>> and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see
>> BSpec or PRMs) but effectively maps to some predefined memory
>> attributes. From the KMD pov we only care about the coherency that is
>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to match the
>> given coh_mode that was set at object creation. For platforms that lack
>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or
>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>> dma-buf (from a different device) the coherency mode is also implicit
>> and must also be either 1WAY or 2WAY.
>>
>> As part of adding pat_index support with vm_bind we also need stop using
>> xe_cache_level and instead use the pat_index in various places. We still
>> make use of xe_cache_level, but only as a convenience for kernel
>> internal objectsi (internally it maps to some reasonable pat_index). For
>> now this is just a 1:1 conversion of the existing code, however for
>> platforms like MTL+ we might need to give more control through bo_create
>> or stop using WB on the CPU side if we need CPU access.
>
> xe_ttm_tt_create() caching should not also take into consideration the PAT index?
Which PAT index? At least for userspace objects we might not know the
pat_index at this stage, also the pat_index is more per vma, while the
CPU caching mode in xe_ttm_tt_create() is per object. At gem create we
only need to consider the overall coh_mode of every pat_index that will
map the object to determine if the CPU caching mode is "valid".
>
>>
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: José Roberto de Souza <jose.souza@intel.com>
>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>> Cc: Carl Zhang <carl.zhang@intel.com>
>> Cc: Effie Yu <effie.yu@intel.com>
>> ---
>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> index 4d6296cdbcfd..bb4c182048c3 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>> @@ -302,7 +302,7 @@ struct i915_address_space {
>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>
>> u64 (*pte_encode)(dma_addr_t addr,
>> - unsigned int pat_index,
>> + u32 pat_index,
>> u32 flags); /* Create a valid PTE */
>> #define PTE_READ_ONLY BIT(0)
>> #define PTE_LM BIT(1)
>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> index 5c8d5e78d9bc..7a128fd20a29 100644
>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>> /* First part of the test, are we updating our pagetable bo with a new entry? */
>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>> 0xdeaddeadbeefbeef);
>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>> expected |= XE_PTE_PS64;
>> if (xe_bo_is_vram(pt))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>> index 209fa053d9fb..4134c26150a5 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>> pte |= XE_GGTT_PTE_DM;
>>
>> if ((ggtt->pat_encode).pte_encode)
>> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>
>> return pte;
>> }
>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>> }
>>
>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> index 7e55fac1a8a9..0bc40cb072e3 100644
>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>
>> struct {
>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> + u32 pat_index);
>> } pat_encode;
>> };
>>
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index a782ea282cb6..54585e98452a 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -24,6 +24,7 @@
>> #include "xe_lrc.h"
>> #include "xe_map.h"
>> #include "xe_mocs.h"
>> +#include "xe_pat.h"
>> #include "xe_pt.h"
>> #include "xe_res_cursor.h"
>> #include "xe_sched_job.h"
>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>> u32 map_ofs, level, i;
>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>> u64 entry;
>> int ret;
>>
>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> return ret;
>> }
>>
>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>
>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>
>> /* Map the entire BO in our level 0 pt */
>> for (i = 0, level = 0; i < num_entries; level++) {
>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>
>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> for (i = 0; i < batch->size;
>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>> XE_PAGE_SIZE) {
>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>> entry);
>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> flags = XE_PDE_64K;
>>
>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>> - XE_PAGE_SIZE, XE_CACHE_WB);
>> + XE_PAGE_SIZE, pat_index);
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>> entry | flags);
>> }
>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>> /* Write PDE's that point to our BO. */
>> for (i = 0; i < num_entries - num_level; i++) {
>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>> - XE_CACHE_WB);
>> + pat_index);
>>
>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>> (i + 1) * 8, u64, entry);
>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>
>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>
>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>> bb->cs[bb->len++] = lower_32_bits(addr);
>> bb->cs[bb->len++] = upper_32_bits(addr);
>> }
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>> index 64713f400d94..019af2920078 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -10,6 +10,7 @@
>> #include "xe_gt.h"
>> #include "xe_gt_tlb_invalidation.h"
>> #include "xe_migrate.h"
>> +#include "xe_pat.h"
>> #include "xe_pt_types.h"
>> #include "xe_pt_walk.h"
>> #include "xe_res_cursor.h"
>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>> *
>> * Return: An encoded page directory entry. No errors.
>> */
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level cache)
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>> {
>> u64 pde;
>> struct xe_vm *vm = bo->vm;
>> struct xe_device *xe = vm->xe;
>>
>> -
>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>
>> if ((vm->pat_encode).pde_encode)
>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>
>> return pde;
>> }
>>
>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>> struct xe_vma *vma, u32 pt_level)
>> {
>> struct xe_vm *vm = xe_vma_vm(vma);
>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> pte |= XE_PTE_NULL;
>>
>> if ((vm->pat_encode).pte_encode)
>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>
>> if (pt_level == 1)
>> pte |= XE_PDE_PS_2M;
>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>> *
>> * Return: An encoded page-table entry. No errors.
>> */
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>> u32 pt_level)
>> {
>> u64 pte;
>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>> pte |= XE_PPGTT_PTE_DM;
>>
>> - return __pte_encode(pte, cache, NULL, pt_level);
>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>> }
>>
>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>
>> if (level == 0) {
>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>> - XE_CACHE_WB, 0);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>
>> return empty;
>> } else {
>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>> }
>>
>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>> struct xe_vm *vm;
>> /** @tile: The tile we're building for. */
>> struct xe_tile *tile;
>> - /** @cache: Desired cache level for the ptes */
>> - enum xe_cache_level cache;
>> /** @default_pte: PTE flag only template. No address is associated */
>> u64 default_pte;
>> /** @dma_offset: DMA offset to add to the PTE. */
>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>
>> pte = __pte_encode(is_null ? 0 :
>> xe_res_dma(curs) + xe_walk->dma_offset,
>> - xe_walk->cache, xe_walk->vma, level);
>> + xe_walk->vma->pat_index, xe_walk->vma, level);
>> pte |= xe_walk->default_pte;
>>
>> /*
>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>> xe_child->is_compact = true;
>> }
>>
>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>> pte);
>> }
>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>> - xe_walk.cache = XE_CACHE_WB;
>> - } else {
>> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>> - xe_walk.cache = XE_CACHE_WT;
>> - else
>> - xe_walk.cache = XE_CACHE_WB;
>> }
>> +
>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>
>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>> index 01be7ab08f87..1d433a5a96b4 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.h
>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>
>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>
>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>> - const enum xe_cache_level level);
>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>
>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>> - u32 pt_level);
>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> index 7eeeed0411f3..34603a7e84b0 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>> #include "xe_vm.h"
>>
>> #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>
>> #include <drm/drm_print.h>
>> #include <drm/ttm/ttm_execbuf_util.h>
>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> u64 start, u64 end,
>> bool read_only,
>> bool is_null,
>> - u8 tile_mask)
>> + u8 tile_mask,
>> + u32 pat_index)
>> {
>> struct xe_vma *vma;
>> struct xe_tile *tile;
>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>> vma->tile_mask |= 0x1 << id;
>> }
>>
>> + vma->pat_index = pat_index;
>> +
>> if (vm->xe->info.platform == XE_PVC)
>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>
>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>> static void vm_destroy_work_func(struct work_struct *w);
>>
>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>> }
>>
>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache)
>> + u32 pat_index)
>> {
>> - u32 pat_index = xe_pat_get_index(xe, cache);
>> -
>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>
>> if (pat_index & BIT(0))
>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>> {
>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>> - XE_CACHE_WB);
>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>> }
>>
>> static struct dma_fence *
>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>> static struct drm_gpuva_ops *
>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> u64 bo_offset_or_userptr, u64 addr, u64 range,
>> - u32 operation, u8 tile_mask, u32 region)
>> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>> {
>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>> struct ww_acquire_ctx ww;
>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->map.immediate =
>> operation & XE_VM_BIND_FLAG_IMMEDIATE;
>> op->map.read_only =
>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>> op->tile_mask = tile_mask;
>> + op->pat_index = pat_index;
>> op->prefetch.region = region;
>> }
>> break;
>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>> }
>>
>> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> - u8 tile_mask, bool read_only, bool is_null)
>> + u8 tile_mask, bool read_only, bool is_null,
>> + u32 pat_index)
>> {
>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>> struct xe_vma *vma;
>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>> vma = xe_vma_create(vm, bo, op->gem.offset,
>> op->va.addr, op->va.addr +
>> op->va.range - 1, read_only, is_null,
>> - tile_mask);
>> + tile_mask, pat_index);
>> if (bo)
>> xe_bo_unlock(bo, &ww);
>>
>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, &op->base.map,
>> op->tile_mask, op->map.read_only,
>> - op->map.is_null);
>> + op->map.is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.prev,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>
>> vma = new_vma(vm, op->base.remap.next,
>> op->tile_mask, read_only,
>> - is_null);
>> + is_null, op->pat_index);
>> if (IS_ERR(vma)) {
>> err = PTR_ERR(vma);
>> goto free_fence;
>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>> u32 obj = (*bind_ops)[i].obj;
>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>> u32 region = (*bind_ops)[i].region;
>> + u32 pat_index = (*bind_ops)[i].pat_index;
>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>
>> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>> + err = -EINVAL;
>> + goto free_bind_ops;
>> + }
>> +
>> if (i == 0) {
>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>> } else if (XE_IOCTL_DBG(xe, !*async) ||
>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> struct drm_gem_object *gem_obj;
>> u64 range = bind_ops[i].range;
>> u64 addr = bind_ops[i].addr;
>> + u32 op = bind_ops[i].op;
>> u32 obj = bind_ops[i].obj;
>> u64 obj_offset = bind_ops[i].obj_offset;
>> + u32 pat_index = bind_ops[i].pat_index;
>> + u16 coh_mode;
>> +
>> + pat_index = array_index_nospec(pat_index,
>> + xe->info.pat_table_n_entries);
>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> +
>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>> + coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>>
>> if (!obj)
>> continue;
>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> goto put_obj;
>> }
>> }
>> +
>> + if (bos[i]->coh_mode) {
>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>> + /*
>> + * Imported dma-buf from a different device should
>> + * require 1way or 2way coherency since we don't know
>> + * how it was mapped on CPU. Just assume is it
>> + * potentially cached on CPU side.
>> + */
>> + err = -EINVAL;
>> + goto put_obj;
>> + }
>> }
>>
>> if (args->num_syncs) {
>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>> u64 obj_offset = bind_ops[i].obj_offset;
>> u8 tile_mask = bind_ops[i].tile_mask;
>> u32 region = bind_ops[i].region;
>> + u32 pat_index = bind_ops[i].pat_index;
>>
>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>> addr, range, op, tile_mask,
>> - region);
>> + region, pat_index);
>> if (IS_ERR(ops[i])) {
>> err = PTR_ERR(ops[i]);
>> ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>> index 83a1f87b6537..508679ed3c74 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -111,6 +111,11 @@ struct xe_vma {
>> */
>> u8 tile_present;
>>
>> + /**
>> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
>> + */
>> + u32 pat_index;
>> +
>> struct {
>> struct list_head rebind_link;
>> } notifier;
>> @@ -338,10 +343,8 @@ struct xe_vm {
>> bool batch_invalidate_tlb;
>>
>> struct {
>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>> - enum xe_cache_level cache);
>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>> - enum xe_cache_level cache);
>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>> } pat_encode;
>> };
>>
>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>> struct async_op_fence *fence;
>> /** @tile_mask: gt mask for this operation */
>> u8 tile_mask;
>> + /** @pat_index: The pat index to use for this operation. */
>> + u32 pat_index;
>> /** @flags: operation flags */
>> enum xe_vma_op_flags flags;
>>
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 64bc66d4b550..0c15b6f32447 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>> */
>> __u32 obj;
>>
>> - /** @pad: MBZ */
>> - __u32 pad;
>> + /**
>> + * @pat_index: The platform defined @pat_index to use for this mapping.
>> + * The index basically maps to some predefined memory attributes,
>> + * including things like caching, coherency and likely other stuff in
>> + * the future. The exact meaning of the pat_index is platform specific
>> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
>> + * the index here is encoded into the ppGTT PTE.
>> + *
>> + * For coherency the @pat_index needs to match the
>> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>> + * the coherency mode from the @pat_index and reject if there is a
>> + * mismatch (see note below for pre-MTL platforms).
>> + *
>> + * Note: On pre-MTL platforms there is only a caching mode and no
>> + * explicit coherency mode, but on such hardware there is always a
>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> + * CPU caches even with the caching mode set as uncached. It's only the
>> + * display engine that is incoherent (on dgpu it must be in VRAM which
>> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
>> + * consistent with newer platforms the KMD groups the different cache
>> + * levels into the following coherency buckets on all pre-MTL platforms:
>> + *
>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>> + *
>> + * In practice UC/WC/WT should only ever used for scanout surfaces on
>> + * such platforms since it is only the display engine that is actually
>> + * incoherent. Everything else should typically use WB given that we
>> + * have a shared-LLC. On MTL+ this completely changes (also potentially
>> + * no shared-LLC) and the HW defines the coherency mode as part of the
>> + * @pat_index.
>> + *
>> + * Note: For userptr and externally imported dma-buf the kernel expects
>> + * either 1WAY or 2WAY for the @pat_index.
>> + */
>> + __u32 pat_index;
>>
>> union {
>> /**
>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-08 6:51 ` Matthew Auld
@ 2023-09-13 15:35 ` Souza, Jose
2023-09-13 15:50 ` Matthew Auld
0 siblings, 1 reply; 47+ messages in thread
From: Souza, Jose @ 2023-09-13 15:35 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org, Auld, Matthew
On Fri, 2023-09-08 at 07:51 +0100, Matthew Auld wrote:
> On 07/09/2023 19:56, Souza, Jose wrote:
> > On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
> > > Allow userspace to directly control the pat_index for a given vm
> > > binding. This should allow directly controlling the coherency, caching
> > > and potentially other stuff in the future for the ppGTT binding.
> > >
> > > The exact meaning behind the pat_index is very platform specific (see
> > > BSpec or PRMs) but effectively maps to some predefined memory
> > > attributes. From the KMD pov we only care about the coherency that is
> > > provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> > > The vm_bind coherency mode for the given pat_index needs to match the
> > > given coh_mode that was set at object creation. For platforms that lack
> > > the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> > >
> > > For userptr mappings we lack a corresponding gem object, so the expected
> > > coherency mode is instead implicit and must fall into either 1WAY or
> > > 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> > > dma-buf (from a different device) the coherency mode is also implicit
> > > and must also be either 1WAY or 2WAY.
> > >
> > > As part of adding pat_index support with vm_bind we also need stop using
> > > xe_cache_level and instead use the pat_index in various places. We still
> > > make use of xe_cache_level, but only as a convenience for kernel
> > > internal objectsi (internally it maps to some reasonable pat_index). For
> > > now this is just a 1:1 conversion of the existing code, however for
> > > platforms like MTL+ we might need to give more control through bo_create
> > > or stop using WB on the CPU side if we need CPU access.
> >
> > xe_ttm_tt_create() caching should not also take into consideration the PAT index?
>
> Which PAT index? At least for userspace objects we might not know the
> pat_index at this stage, also the pat_index is more per vma, while the
> CPU caching mode in xe_ttm_tt_create() is per object. At gem create we
> only need to consider the overall coh_mode of every pat_index that will
> map the object to determine if the CPU caching mode is "valid".
Yeah it is too late.
I should not consider smem_caching in drm_xe_gem_create? It that what will be used as caching mode for mmap?
>
> >
> > >
> > > Bspec: 45101, 44235 #xe
> > > Bspec: 70552, 71582, 59400 #xe2
> > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Pallavi Mishra <pallavi.mishra@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: José Roberto de Souza <jose.souza@intel.com>
> > > Cc: Filip Hazubski <filip.hazubski@intel.com>
> > > Cc: Carl Zhang <carl.zhang@intel.com>
> > > Cc: Effie Yu <effie.yu@intel.com>
> > > ---
> > > drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
> > > drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
> > > drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
> > > drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
> > > drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
> > > drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
> > > drivers/gpu/drm/xe/xe_pt.h | 6 +--
> > > drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
> > > drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
> > > include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
> > > 10 files changed, 134 insertions(+), 58 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > index 4d6296cdbcfd..bb4c182048c3 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> > > @@ -302,7 +302,7 @@ struct i915_address_space {
> > > (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> > >
> > > u64 (*pte_encode)(dma_addr_t addr,
> > > - unsigned int pat_index,
> > > + u32 pat_index,
> > > u32 flags); /* Create a valid PTE */
> > > #define PTE_READ_ONLY BIT(0)
> > > #define PTE_LM BIT(1)
> > > diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > index 5c8d5e78d9bc..7a128fd20a29 100644
> > > --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> > > @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> > > /* First part of the test, are we updating our pagetable bo with a new entry? */
> > > xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> > > 0xdeaddeadbeefbeef);
> > > - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> > > + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > > if (m->q->vm->flags & XE_VM_FLAG_64K)
> > > expected |= XE_PTE_PS64;
> > > if (xe_bo_is_vram(pt))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> > > index 209fa053d9fb..4134c26150a5 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt.c
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> > > @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> > > pte |= XE_GGTT_PTE_DM;
> > >
> > > if ((ggtt->pat_encode).pte_encode)
> > > - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> > > + pte = (ggtt->pat_encode).pte_encode(xe, pte,
> > > + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
> > >
> > > return pte;
> > > }
> > > @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> > > }
> > >
> > > static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache)
> > > + u32 pat_index)
> > > {
> > > - u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > > pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> > >
> > > if (pat_index & BIT(0))
> > > diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > index 7e55fac1a8a9..0bc40cb072e3 100644
> > > --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> > > @@ -31,7 +31,7 @@ struct xe_ggtt {
> > >
> > > struct {
> > > u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache);
> > > + u32 pat_index);
> > > } pat_encode;
> > > };
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > index a782ea282cb6..54585e98452a 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > @@ -24,6 +24,7 @@
> > > #include "xe_lrc.h"
> > > #include "xe_map.h"
> > > #include "xe_mocs.h"
> > > +#include "xe_pat.h"
> > > #include "xe_pt.h"
> > > #include "xe_res_cursor.h"
> > > #include "xe_sched_job.h"
> > > @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> > > u32 map_ofs, level, i;
> > > struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> > > + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> > > u64 entry;
> > > int ret;
> > >
> > > @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > return ret;
> > > }
> > >
> > > - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> > > + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> > > xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> > >
> > > map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> > >
> > > /* Map the entire BO in our level 0 pt */
> > > for (i = 0, level = 0; i < num_entries; level++) {
> > > - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> > > + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> > >
> > > xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> > >
> > > @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > for (i = 0; i < batch->size;
> > > i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> > > XE_PAGE_SIZE) {
> > > - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> > > + entry = xe_pte_encode(batch, i, pat_index, 0);
> > >
> > > xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> > > entry);
> > > @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > flags = XE_PDE_64K;
> > >
> > > entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> > > - XE_PAGE_SIZE, XE_CACHE_WB);
> > > + XE_PAGE_SIZE, pat_index);
> > > xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> > > entry | flags);
> > > }
> > > @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> > > /* Write PDE's that point to our BO. */
> > > for (i = 0; i < num_entries - num_level; i++) {
> > > entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> > > - XE_CACHE_WB);
> > > + pat_index);
> > >
> > > xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> > > (i + 1) * 8, u64, entry);
> > > @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
> > >
> > > XE_WARN_ON(pt_bo->size != SZ_4K);
> > >
> > > - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> > > + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> > > bb->cs[bb->len++] = lower_32_bits(addr);
> > > bb->cs[bb->len++] = upper_32_bits(addr);
> > > }
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > index 64713f400d94..019af2920078 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -10,6 +10,7 @@
> > > #include "xe_gt.h"
> > > #include "xe_gt_tlb_invalidation.h"
> > > #include "xe_migrate.h"
> > > +#include "xe_pat.h"
> > > #include "xe_pt_types.h"
> > > #include "xe_pt_walk.h"
> > > #include "xe_res_cursor.h"
> > > @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> > > *
> > > * Return: An encoded page directory entry. No errors.
> > > */
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > - const enum xe_cache_level cache)
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> > > {
> > > u64 pde;
> > > struct xe_vm *vm = bo->vm;
> > > struct xe_device *xe = vm->xe;
> > >
> > > -
> > > pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> > > pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> > >
> > > if ((vm->pat_encode).pde_encode)
> > > - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> > > + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> > >
> > > return pde;
> > > }
> > >
> > > -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > +static u64 __pte_encode(u64 pte, u32 pat_index,
> > > struct xe_vma *vma, u32 pt_level)
> > > {
> > > struct xe_vm *vm = xe_vma_vm(vma);
> > > @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > pte |= XE_PTE_NULL;
> > >
> > > if ((vm->pat_encode).pte_encode)
> > > - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> > > + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> > >
> > > if (pt_level == 1)
> > > pte |= XE_PDE_PS_2M;
> > > @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> > > *
> > > * Return: An encoded page-table entry. No errors.
> > > */
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> > > u32 pt_level)
> > > {
> > > u64 pte;
> > > @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> > > pte |= XE_PPGTT_PTE_DM;
> > >
> > > - return __pte_encode(pte, cache, NULL, pt_level);
> > > + return __pte_encode(pte, pat_index, NULL, pt_level);
> > > }
> > >
> > > static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > > @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> > >
> > > if (level == 0) {
> > > u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> > > - XE_CACHE_WB, 0);
> > > + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
> > >
> > > return empty;
> > > } else {
> > > return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> > > - XE_CACHE_WB);
> > > + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > > }
> > > }
> > >
> > > @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> > > struct xe_vm *vm;
> > > /** @tile: The tile we're building for. */
> > > struct xe_tile *tile;
> > > - /** @cache: Desired cache level for the ptes */
> > > - enum xe_cache_level cache;
> > > /** @default_pte: PTE flag only template. No address is associated */
> > > u64 default_pte;
> > > /** @dma_offset: DMA offset to add to the PTE. */
> > > @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > >
> > > pte = __pte_encode(is_null ? 0 :
> > > xe_res_dma(curs) + xe_walk->dma_offset,
> > > - xe_walk->cache, xe_walk->vma, level);
> > > + xe_walk->vma->pat_index, xe_walk->vma, level);
> > > pte |= xe_walk->default_pte;
> > >
> > > /*
> > > @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > > xe_child->is_compact = true;
> > > }
> > >
> > > - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> > > + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> > > ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> > > pte);
> > > }
> > > @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > > if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> > > xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> > > xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> > > - xe_walk.cache = XE_CACHE_WB;
> > > - } else {
> > > - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> > > - xe_walk.cache = XE_CACHE_WT;
> > > - else
> > > - xe_walk.cache = XE_CACHE_WB;
> > > }
> > > +
> > > if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> > > xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> > > index 01be7ab08f87..1d433a5a96b4 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
> > >
> > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > >
> > > -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> > > - const enum xe_cache_level level);
> > > +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> > >
> > > -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> > > - u32 pt_level);
> > > +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
> > >
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > index 7eeeed0411f3..34603a7e84b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -6,6 +6,7 @@
> > > #include "xe_vm.h"
> > >
> > > #include <linux/dma-fence-array.h>
> > > +#include <linux/nospec.h>
> > >
> > > #include <drm/drm_print.h>
> > > #include <drm/ttm/ttm_execbuf_util.h>
> > > @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > > u64 start, u64 end,
> > > bool read_only,
> > > bool is_null,
> > > - u8 tile_mask)
> > > + u8 tile_mask,
> > > + u32 pat_index)
> > > {
> > > struct xe_vma *vma;
> > > struct xe_tile *tile;
> > > @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > > vma->tile_mask |= 0x1 << id;
> > > }
> > >
> > > + vma->pat_index = pat_index;
> > > +
> > > if (vm->xe->info.platform == XE_PVC)
> > > vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> > >
> > > @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> > > static void vm_destroy_work_func(struct work_struct *w);
> > >
> > > static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > > - enum xe_cache_level cache)
> > > + u32 pat_index)
> > > {
> > > - u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > > pde_pat &= ~(XELP_PDE_PAT_MASK);
> > >
> > > if (pat_index & BIT(0))
> > > @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> > > }
> > >
> > > static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache)
> > > + u32 pat_index)
> > > {
> > > - u32 pat_index = xe_pat_get_index(xe, cache);
> > > -
> > > pte_pat &= ~(XELP_PTE_PAT_MASK);
> > >
> > > if (pat_index & BIT(0))
> > > @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> > > u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> > > {
> > > return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> > > - XE_CACHE_WB);
> > > + xe_pat_get_index(vm->xe, XE_CACHE_WB));
> > > }
> > >
> > > static struct dma_fence *
> > > @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> > > static struct drm_gpuva_ops *
> > > vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > u64 bo_offset_or_userptr, u64 addr, u64 range,
> > > - u32 operation, u8 tile_mask, u32 region)
> > > + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> > > {
> > > struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> > > struct ww_acquire_ctx ww;
> > > @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > >
> > > op->tile_mask = tile_mask;
> > > + op->pat_index = pat_index;
> > > op->map.immediate =
> > > operation & XE_VM_BIND_FLAG_IMMEDIATE;
> > > op->map.read_only =
> > > @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> > >
> > > op->tile_mask = tile_mask;
> > > + op->pat_index = pat_index;
> > > op->prefetch.region = region;
> > > }
> > > break;
> > > @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > }
> > >
> > > static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > > - u8 tile_mask, bool read_only, bool is_null)
> > > + u8 tile_mask, bool read_only, bool is_null,
> > > + u32 pat_index)
> > > {
> > > struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> > > struct xe_vma *vma;
> > > @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> > > vma = xe_vma_create(vm, bo, op->gem.offset,
> > > op->va.addr, op->va.addr +
> > > op->va.range - 1, read_only, is_null,
> > > - tile_mask);
> > > + tile_mask, pat_index);
> > > if (bo)
> > > xe_bo_unlock(bo, &ww);
> > >
> > > @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >
> > > vma = new_vma(vm, &op->base.map,
> > > op->tile_mask, op->map.read_only,
> > > - op->map.is_null);
> > > + op->map.is_null, op->pat_index);
> > > if (IS_ERR(vma)) {
> > > err = PTR_ERR(vma);
> > > goto free_fence;
> > > @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >
> > > vma = new_vma(vm, op->base.remap.prev,
> > > op->tile_mask, read_only,
> > > - is_null);
> > > + is_null, op->pat_index);
> > > if (IS_ERR(vma)) {
> > > err = PTR_ERR(vma);
> > > goto free_fence;
> > > @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> > >
> > > vma = new_vma(vm, op->base.remap.next,
> > > op->tile_mask, read_only,
> > > - is_null);
> > > + is_null, op->pat_index);
> > > if (IS_ERR(vma)) {
> > > err = PTR_ERR(vma);
> > > goto free_fence;
> > > @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> > > u32 obj = (*bind_ops)[i].obj;
> > > u64 obj_offset = (*bind_ops)[i].obj_offset;
> > > u32 region = (*bind_ops)[i].region;
> > > + u32 pat_index = (*bind_ops)[i].pat_index;
> > > bool is_null = op & XE_VM_BIND_FLAG_NULL;
> > >
> > > + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> > > + err = -EINVAL;
> > > + goto free_bind_ops;
> > > + }
> > > +
> > > if (i == 0) {
> > > *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> > > } else if (XE_IOCTL_DBG(xe, !*async) ||
> > > @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > > struct drm_gem_object *gem_obj;
> > > u64 range = bind_ops[i].range;
> > > u64 addr = bind_ops[i].addr;
> > > + u32 op = bind_ops[i].op;
> > > u32 obj = bind_ops[i].obj;
> > > u64 obj_offset = bind_ops[i].obj_offset;
> > > + u32 pat_index = bind_ops[i].pat_index;
> > > + u16 coh_mode;
> > > +
> > > + pat_index = array_index_nospec(pat_index,
> > > + xe->info.pat_table_n_entries);
> > > + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> > > + if (XE_IOCTL_DBG(xe, !coh_mode)) {
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > > +
> > > + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> > > + coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > >
> > > if (!obj)
> > > continue;
> > > @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > > goto put_obj;
> > > }
> > > }
> > > +
> > > + if (bos[i]->coh_mode) {
> > > + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > > + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> > > + /*
> > > + * Imported dma-buf from a different device should
> > > + * require 1way or 2way coherency since we don't know
> > > + * how it was mapped on CPU. Just assume is it
> > > + * potentially cached on CPU side.
> > > + */
> > > + err = -EINVAL;
> > > + goto put_obj;
> > > + }
> > > }
> > >
> > > if (args->num_syncs) {
> > > @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > > u64 obj_offset = bind_ops[i].obj_offset;
> > > u8 tile_mask = bind_ops[i].tile_mask;
> > > u32 region = bind_ops[i].region;
> > > + u32 pat_index = bind_ops[i].pat_index;
> > >
> > > ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> > > addr, range, op, tile_mask,
> > > - region);
> > > + region, pat_index);
> > > if (IS_ERR(ops[i])) {
> > > err = PTR_ERR(ops[i]);
> > > ops[i] = NULL;
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index 83a1f87b6537..508679ed3c74 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -111,6 +111,11 @@ struct xe_vma {
> > > */
> > > u8 tile_present;
> > >
> > > + /**
> > > + * @pat_index: The pat index to use when encoding the PTEs for this vma.
> > > + */
> > > + u32 pat_index;
> > > +
> > > struct {
> > > struct list_head rebind_link;
> > > } notifier;
> > > @@ -338,10 +343,8 @@ struct xe_vm {
> > > bool batch_invalidate_tlb;
> > >
> > > struct {
> > > - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> > > - enum xe_cache_level cache);
> > > - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> > > - enum xe_cache_level cache);
> > > + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> > > + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> > > } pat_encode;
> > > };
> > >
> > > @@ -417,6 +420,8 @@ struct xe_vma_op {
> > > struct async_op_fence *fence;
> > > /** @tile_mask: gt mask for this operation */
> > > u8 tile_mask;
> > > + /** @pat_index: The pat index to use for this operation. */
> > > + u32 pat_index;
> > > /** @flags: operation flags */
> > > enum xe_vma_op_flags flags;
> > >
> > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > index 64bc66d4b550..0c15b6f32447 100644
> > > --- a/include/uapi/drm/xe_drm.h
> > > +++ b/include/uapi/drm/xe_drm.h
> > > @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> > > */
> > > __u32 obj;
> > >
> > > - /** @pad: MBZ */
> > > - __u32 pad;
> > > + /**
> > > + * @pat_index: The platform defined @pat_index to use for this mapping.
> > > + * The index basically maps to some predefined memory attributes,
> > > + * including things like caching, coherency and likely other stuff in
> > > + * the future. The exact meaning of the pat_index is platform specific
> > > + * and defined in the Bspec and PRMs. When the KMD sets up the binding
> > > + * the index here is encoded into the ppGTT PTE.
> > > + *
> > > + * For coherency the @pat_index needs to match the
> > > + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> > > + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> > > + * the coherency mode from the @pat_index and reject if there is a
> > > + * mismatch (see note below for pre-MTL platforms).
> > > + *
> > > + * Note: On pre-MTL platforms there is only a caching mode and no
> > > + * explicit coherency mode, but on such hardware there is always a
> > > + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> > > + * CPU caches even with the caching mode set as uncached. It's only the
> > > + * display engine that is incoherent (on dgpu it must be in VRAM which
> > > + * is always mapped as WC on the CPU). However to keep the uapi somewhat
> > > + * consistent with newer platforms the KMD groups the different cache
> > > + * levels into the following coherency buckets on all pre-MTL platforms:
> > > + *
> > > + * ppGTT UC -> XE_GEM_COHERENCY_NONE
> > > + * ppGTT WC -> XE_GEM_COHERENCY_NONE
> > > + * ppGTT WT -> XE_GEM_COHERENCY_NONE
> > > + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
> > > + *
> > > + * In practice UC/WC/WT should only ever used for scanout surfaces on
> > > + * such platforms since it is only the display engine that is actually
> > > + * incoherent. Everything else should typically use WB given that we
> > > + * have a shared-LLC. On MTL+ this completely changes (also potentially
> > > + * no shared-LLC) and the HW defines the coherency mode as part of the
> > > + * @pat_index.
> > > + *
> > > + * Note: For userptr and externally imported dma-buf the kernel expects
> > > + * either 1WAY or 2WAY for the @pat_index.
> > > + */
> > > + __u32 pat_index;
> > >
> > > union {
> > > /**
> >
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
2023-09-13 15:35 ` Souza, Jose
@ 2023-09-13 15:50 ` Matthew Auld
0 siblings, 0 replies; 47+ messages in thread
From: Matthew Auld @ 2023-09-13 15:50 UTC (permalink / raw)
To: Souza, Jose, intel-xe@lists.freedesktop.org
On 13/09/2023 16:35, Souza, Jose wrote:
> On Fri, 2023-09-08 at 07:51 +0100, Matthew Auld wrote:
>> On 07/09/2023 19:56, Souza, Jose wrote:
>>> On Tue, 2023-08-29 at 17:28 +0100, Matthew Auld wrote:
>>>> Allow userspace to directly control the pat_index for a given vm
>>>> binding. This should allow directly controlling the coherency, caching
>>>> and potentially other stuff in the future for the ppGTT binding.
>>>>
>>>> The exact meaning behind the pat_index is very platform specific (see
>>>> BSpec or PRMs) but effectively maps to some predefined memory
>>>> attributes. From the KMD pov we only care about the coherency that is
>>>> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>>>> The vm_bind coherency mode for the given pat_index needs to match the
>>>> given coh_mode that was set at object creation. For platforms that lack
>>>> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
>>>>
>>>> For userptr mappings we lack a corresponding gem object, so the expected
>>>> coherency mode is instead implicit and must fall into either 1WAY or
>>>> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
>>>> dma-buf (from a different device) the coherency mode is also implicit
>>>> and must also be either 1WAY or 2WAY.
>>>>
>>>> As part of adding pat_index support with vm_bind we also need stop using
>>>> xe_cache_level and instead use the pat_index in various places. We still
>>>> make use of xe_cache_level, but only as a convenience for kernel
>>>> internal objectsi (internally it maps to some reasonable pat_index). For
>>>> now this is just a 1:1 conversion of the existing code, however for
>>>> platforms like MTL+ we might need to give more control through bo_create
>>>> or stop using WB on the CPU side if we need CPU access.
>>>
>>> xe_ttm_tt_create() caching should not also take into consideration the PAT index?
>>
>> Which PAT index? At least for userspace objects we might not know the
>> pat_index at this stage, also the pat_index is more per vma, while the
>> CPU caching mode in xe_ttm_tt_create() is per object. At gem create we
>> only need to consider the overall coh_mode of every pat_index that will
>> map the object to determine if the CPU caching mode is "valid".
>
> Yeah it is too late.
> I should not consider smem_caching in drm_xe_gem_create? It that what will be used as caching mode for mmap?
You will need pick something for smem_caching, unless it's vram-only in
which case just leave as zero. And yes, the mmap caching mode will be
whatever you picked for smem_caching.
>
>>
>>>
>>>>
>>>> Bspec: 45101, 44235 #xe
>>>> Bspec: 70552, 71582, 59400 #xe2
>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Pallavi Mishra <pallavi.mishra@intel.com>
>>>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>> Cc: José Roberto de Souza <jose.souza@intel.com>
>>>> Cc: Filip Hazubski <filip.hazubski@intel.com>
>>>> Cc: Carl Zhang <carl.zhang@intel.com>
>>>> Cc: Effie Yu <effie.yu@intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 +-
>>>> drivers/gpu/drm/xe/tests/xe_migrate.c | 2 +-
>>>> drivers/gpu/drm/xe/xe_ggtt.c | 7 ++-
>>>> drivers/gpu/drm/xe/xe_ggtt_types.h | 2 +-
>>>> drivers/gpu/drm/xe/xe_migrate.c | 14 ++---
>>>> drivers/gpu/drm/xe/xe_pt.c | 32 +++++-------
>>>> drivers/gpu/drm/xe/xe_pt.h | 6 +--
>>>> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++------
>>>> drivers/gpu/drm/xe/xe_vm_types.h | 13 +++--
>>>> include/uapi/drm/xe_drm.h | 41 ++++++++++++++-
>>>> 10 files changed, 134 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> index 4d6296cdbcfd..bb4c182048c3 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
>>>> @@ -302,7 +302,7 @@ struct i915_address_space {
>>>> (*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
>>>>
>>>> u64 (*pte_encode)(dma_addr_t addr,
>>>> - unsigned int pat_index,
>>>> + u32 pat_index,
>>>> u32 flags); /* Create a valid PTE */
>>>> #define PTE_READ_ONLY BIT(0)
>>>> #define PTE_LM BIT(1)
>>>> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> index 5c8d5e78d9bc..7a128fd20a29 100644
>>>> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
>>>> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
>>>> /* First part of the test, are we updating our pagetable bo with a new entry? */
>>>> xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
>>>> 0xdeaddeadbeefbeef);
>>>> - expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
>>>> + expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>>> if (m->q->vm->flags & XE_VM_FLAG_64K)
>>>> expected |= XE_PTE_PS64;
>>>> if (xe_bo_is_vram(pt))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> index 209fa053d9fb..4134c26150a5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
>>>> pte |= XE_GGTT_PTE_DM;
>>>>
>>>> if ((ggtt->pat_encode).pte_encode)
>>>> - pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
>>>> + pte = (ggtt->pat_encode).pte_encode(xe, pte,
>>>> + xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
>>>>
>>>> return pte;
>>>> }
>>>> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
>>>> }
>>>>
>>>> static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache)
>>>> + u32 pat_index)
>>>> {
>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>> pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
>>>>
>>>> if (pat_index & BIT(0))
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> index 7e55fac1a8a9..0bc40cb072e3 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
>>>> @@ -31,7 +31,7 @@ struct xe_ggtt {
>>>>
>>>> struct {
>>>> u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache);
>>>> + u32 pat_index);
>>>> } pat_encode;
>>>> };
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>>>> index a782ea282cb6..54585e98452a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>> @@ -24,6 +24,7 @@
>>>> #include "xe_lrc.h"
>>>> #include "xe_map.h"
>>>> #include "xe_mocs.h"
>>>> +#include "xe_pat.h"
>>>> #include "xe_pt.h"
>>>> #include "xe_res_cursor.h"
>>>> #include "xe_sched_job.h"
>>>> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>> u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
>>>> u32 map_ofs, level, i;
>>>> struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
>>>> + u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
>>>> u64 entry;
>>>> int ret;
>>>>
>>>> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>> return ret;
>>>> }
>>>>
>>>> - entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
>>>> + entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
>>>> xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
>>>>
>>>> map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
>>>>
>>>> /* Map the entire BO in our level 0 pt */
>>>> for (i = 0, level = 0; i < num_entries; level++) {
>>>> - entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
>>>> + entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
>>>>
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
>>>>
>>>> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>> for (i = 0; i < batch->size;
>>>> i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
>>>> XE_PAGE_SIZE) {
>>>> - entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
>>>> + entry = xe_pte_encode(batch, i, pat_index, 0);
>>>>
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
>>>> entry);
>>>> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>> flags = XE_PDE_64K;
>>>>
>>>> entry = xe_pde_encode(bo, map_ofs + (level - 1) *
>>>> - XE_PAGE_SIZE, XE_CACHE_WB);
>>>> + XE_PAGE_SIZE, pat_index);
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
>>>> entry | flags);
>>>> }
>>>> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>>>> /* Write PDE's that point to our BO. */
>>>> for (i = 0; i < num_entries - num_level; i++) {
>>>> entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
>>>> - XE_CACHE_WB);
>>>> + pat_index);
>>>>
>>>> xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
>>>> (i + 1) * 8, u64, entry);
>>>> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
>>>>
>>>> XE_WARN_ON(pt_bo->size != SZ_4K);
>>>>
>>>> - addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
>>>> + addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
>>>> bb->cs[bb->len++] = lower_32_bits(addr);
>>>> bb->cs[bb->len++] = upper_32_bits(addr);
>>>> }
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
>>>> index 64713f400d94..019af2920078 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>> @@ -10,6 +10,7 @@
>>>> #include "xe_gt.h"
>>>> #include "xe_gt_tlb_invalidation.h"
>>>> #include "xe_migrate.h"
>>>> +#include "xe_pat.h"
>>>> #include "xe_pt_types.h"
>>>> #include "xe_pt_walk.h"
>>>> #include "xe_res_cursor.h"
>>>> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
>>>> *
>>>> * Return: An encoded page directory entry. No errors.
>>>> */
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> - const enum xe_cache_level cache)
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
>>>> {
>>>> u64 pde;
>>>> struct xe_vm *vm = bo->vm;
>>>> struct xe_device *xe = vm->xe;
>>>>
>>>> -
>>>> pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
>>>> pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
>>>>
>>>> if ((vm->pat_encode).pde_encode)
>>>> - pde = (vm->pat_encode).pde_encode(xe, pde, cache);
>>>> + pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
>>>>
>>>> return pde;
>>>> }
>>>>
>>>> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>> +static u64 __pte_encode(u64 pte, u32 pat_index,
>>>> struct xe_vma *vma, u32 pt_level)
>>>> {
>>>> struct xe_vm *vm = xe_vma_vm(vma);
>>>> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>> pte |= XE_PTE_NULL;
>>>>
>>>> if ((vm->pat_encode).pte_encode)
>>>> - pte = (vm->pat_encode).pte_encode(xe, pte, cache);
>>>> + pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
>>>>
>>>> if (pt_level == 1)
>>>> pte |= XE_PDE_PS_2M;
>>>> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
>>>> *
>>>> * Return: An encoded page-table entry. No errors.
>>>> */
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
>>>> u32 pt_level)
>>>> {
>>>> u64 pte;
>>>> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>>> if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
>>>> pte |= XE_PPGTT_PTE_DM;
>>>>
>>>> - return __pte_encode(pte, cache, NULL, pt_level);
>>>> + return __pte_encode(pte, pat_index, NULL, pt_level);
>>>> }
>>>>
>>>> static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
>>>>
>>>> if (level == 0) {
>>>> u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
>>>> - XE_CACHE_WB, 0);
>>>> + xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
>>>>
>>>> return empty;
>>>> } else {
>>>> return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
>>>> - XE_CACHE_WB);
>>>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>> }
>>>> }
>>>>
>>>> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
>>>> struct xe_vm *vm;
>>>> /** @tile: The tile we're building for. */
>>>> struct xe_tile *tile;
>>>> - /** @cache: Desired cache level for the ptes */
>>>> - enum xe_cache_level cache;
>>>> /** @default_pte: PTE flag only template. No address is associated */
>>>> u64 default_pte;
>>>> /** @dma_offset: DMA offset to add to the PTE. */
>>>> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>>>
>>>> pte = __pte_encode(is_null ? 0 :
>>>> xe_res_dma(curs) + xe_walk->dma_offset,
>>>> - xe_walk->cache, xe_walk->vma, level);
>>>> + xe_walk->vma->pat_index, xe_walk->vma, level);
>>>> pte |= xe_walk->default_pte;
>>>>
>>>> /*
>>>> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
>>>> xe_child->is_compact = true;
>>>> }
>>>>
>>>> - pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
>>>> + pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
>>>> ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
>>>> pte);
>>>> }
>>>> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>>>> if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
>>>> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
>>>> xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
>>>> - xe_walk.cache = XE_CACHE_WB;
>>>> - } else {
>>>> - if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
>>>> - xe_walk.cache = XE_CACHE_WT;
>>>> - else
>>>> - xe_walk.cache = XE_CACHE_WB;
>>>> }
>>>> +
>>>> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>> xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
>>>> index 01be7ab08f87..1d433a5a96b4 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pt.h
>>>> +++ b/drivers/gpu/drm/xe/xe_pt.h
>>>> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
>>>>
>>>> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>>>>
>>>> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
>>>> - const enum xe_cache_level level);
>>>> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
>>>>
>>>> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
>>>> - u32 pt_level);
>>>> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
>>>>
>>>> #endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>> index 7eeeed0411f3..34603a7e84b0 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>> @@ -6,6 +6,7 @@
>>>> #include "xe_vm.h"
>>>>
>>>> #include <linux/dma-fence-array.h>
>>>> +#include <linux/nospec.h>
>>>>
>>>> #include <drm/drm_print.h>
>>>> #include <drm/ttm/ttm_execbuf_util.h>
>>>> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>>> u64 start, u64 end,
>>>> bool read_only,
>>>> bool is_null,
>>>> - u8 tile_mask)
>>>> + u8 tile_mask,
>>>> + u32 pat_index)
>>>> {
>>>> struct xe_vma *vma;
>>>> struct xe_tile *tile;
>>>> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>>> vma->tile_mask |= 0x1 << id;
>>>> }
>>>>
>>>> + vma->pat_index = pat_index;
>>>> +
>>>> if (vm->xe->info.platform == XE_PVC)
>>>> vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>
>>>> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
>>>> static void vm_destroy_work_func(struct work_struct *w);
>>>>
>>>> static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>>> - enum xe_cache_level cache)
>>>> + u32 pat_index)
>>>> {
>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>> pde_pat &= ~(XELP_PDE_PAT_MASK);
>>>>
>>>> if (pat_index & BIT(0))
>>>> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
>>>> }
>>>>
>>>> static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache)
>>>> + u32 pat_index)
>>>> {
>>>> - u32 pat_index = xe_pat_get_index(xe, cache);
>>>> -
>>>> pte_pat &= ~(XELP_PTE_PAT_MASK);
>>>>
>>>> if (pat_index & BIT(0))
>>>> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
>>>> u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
>>>> {
>>>> return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
>>>> - XE_CACHE_WB);
>>>> + xe_pat_get_index(vm->xe, XE_CACHE_WB));
>>>> }
>>>>
>>>> static struct dma_fence *
>>>> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
>>>> static struct drm_gpuva_ops *
>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>> u64 bo_offset_or_userptr, u64 addr, u64 range,
>>>> - u32 operation, u8 tile_mask, u32 region)
>>>> + u32 operation, u8 tile_mask, u32 region, u32 pat_index)
>>>> {
>>>> struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>> struct ww_acquire_ctx ww;
>>>> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>
>>>> op->tile_mask = tile_mask;
>>>> + op->pat_index = pat_index;
>>>> op->map.immediate =
>>>> operation & XE_VM_BIND_FLAG_IMMEDIATE;
>>>> op->map.read_only =
>>>> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>> struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>
>>>> op->tile_mask = tile_mask;
>>>> + op->pat_index = pat_index;
>>>> op->prefetch.region = region;
>>>> }
>>>> break;
>>>> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>> }
>>>>
>>>> static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>>> - u8 tile_mask, bool read_only, bool is_null)
>>>> + u8 tile_mask, bool read_only, bool is_null,
>>>> + u32 pat_index)
>>>> {
>>>> struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>>> struct xe_vma *vma;
>>>> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
>>>> vma = xe_vma_create(vm, bo, op->gem.offset,
>>>> op->va.addr, op->va.addr +
>>>> op->va.range - 1, read_only, is_null,
>>>> - tile_mask);
>>>> + tile_mask, pat_index);
>>>> if (bo)
>>>> xe_bo_unlock(bo, &ww);
>>>>
>>>> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>>>
>>>> vma = new_vma(vm, &op->base.map,
>>>> op->tile_mask, op->map.read_only,
>>>> - op->map.is_null);
>>>> + op->map.is_null, op->pat_index);
>>>> if (IS_ERR(vma)) {
>>>> err = PTR_ERR(vma);
>>>> goto free_fence;
>>>> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>>>
>>>> vma = new_vma(vm, op->base.remap.prev,
>>>> op->tile_mask, read_only,
>>>> - is_null);
>>>> + is_null, op->pat_index);
>>>> if (IS_ERR(vma)) {
>>>> err = PTR_ERR(vma);
>>>> goto free_fence;
>>>> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
>>>>
>>>> vma = new_vma(vm, op->base.remap.next,
>>>> op->tile_mask, read_only,
>>>> - is_null);
>>>> + is_null, op->pat_index);
>>>> if (IS_ERR(vma)) {
>>>> err = PTR_ERR(vma);
>>>> goto free_fence;
>>>> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>>>> u32 obj = (*bind_ops)[i].obj;
>>>> u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>> u32 region = (*bind_ops)[i].region;
>>>> + u32 pat_index = (*bind_ops)[i].pat_index;
>>>> bool is_null = op & XE_VM_BIND_FLAG_NULL;
>>>>
>>>> + if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
>>>> + err = -EINVAL;
>>>> + goto free_bind_ops;
>>>> + }
>>>> +
>>>> if (i == 0) {
>>>> *async = !!(op & XE_VM_BIND_FLAG_ASYNC);
>>>> } else if (XE_IOCTL_DBG(xe, !*async) ||
>>>> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>>> struct drm_gem_object *gem_obj;
>>>> u64 range = bind_ops[i].range;
>>>> u64 addr = bind_ops[i].addr;
>>>> + u32 op = bind_ops[i].op;
>>>> u32 obj = bind_ops[i].obj;
>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>> + u16 coh_mode;
>>>> +
>>>> + pat_index = array_index_nospec(pat_index,
>>>> + xe->info.pat_table_n_entries);
>>>> + coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>> + if (XE_IOCTL_DBG(xe, !coh_mode)) {
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>> +
>>>> + if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
>>>> + coh_mode == XE_GEM_COHERENCY_NONE)) {
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>>
>>>> if (!obj)
>>>> continue;
>>>> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>>> goto put_obj;
>>>> }
>>>> }
>>>> +
>>>> + if (bos[i]->coh_mode) {
>>>> + if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>> + } else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
>>>> + /*
>>>> + * Imported dma-buf from a different device should
>>>> + * require 1way or 2way coherency since we don't know
>>>> + * how it was mapped on CPU. Just assume is it
>>>> + * potentially cached on CPU side.
>>>> + */
>>>> + err = -EINVAL;
>>>> + goto put_obj;
>>>> + }
>>>> }
>>>>
>>>> if (args->num_syncs) {
>>>> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>>> u64 obj_offset = bind_ops[i].obj_offset;
>>>> u8 tile_mask = bind_ops[i].tile_mask;
>>>> u32 region = bind_ops[i].region;
>>>> + u32 pat_index = bind_ops[i].pat_index;
>>>>
>>>> ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>>> addr, range, op, tile_mask,
>>>> - region);
>>>> + region, pat_index);
>>>> if (IS_ERR(ops[i])) {
>>>> err = PTR_ERR(ops[i]);
>>>> ops[i] = NULL;
>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> index 83a1f87b6537..508679ed3c74 100644
>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>> @@ -111,6 +111,11 @@ struct xe_vma {
>>>> */
>>>> u8 tile_present;
>>>>
>>>> + /**
>>>> + * @pat_index: The pat index to use when encoding the PTEs for this vma.
>>>> + */
>>>> + u32 pat_index;
>>>> +
>>>> struct {
>>>> struct list_head rebind_link;
>>>> } notifier;
>>>> @@ -338,10 +343,8 @@ struct xe_vm {
>>>> bool batch_invalidate_tlb;
>>>>
>>>> struct {
>>>> - u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
>>>> - enum xe_cache_level cache);
>>>> - u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
>>>> - enum xe_cache_level cache);
>>>> + u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
>>>> + u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
>>>> } pat_encode;
>>>> };
>>>>
>>>> @@ -417,6 +420,8 @@ struct xe_vma_op {
>>>> struct async_op_fence *fence;
>>>> /** @tile_mask: gt mask for this operation */
>>>> u8 tile_mask;
>>>> + /** @pat_index: The pat index to use for this operation. */
>>>> + u32 pat_index;
>>>> /** @flags: operation flags */
>>>> enum xe_vma_op_flags flags;
>>>>
>>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>> index 64bc66d4b550..0c15b6f32447 100644
>>>> --- a/include/uapi/drm/xe_drm.h
>>>> +++ b/include/uapi/drm/xe_drm.h
>>>> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
>>>> */
>>>> __u32 obj;
>>>>
>>>> - /** @pad: MBZ */
>>>> - __u32 pad;
>>>> + /**
>>>> + * @pat_index: The platform defined @pat_index to use for this mapping.
>>>> + * The index basically maps to some predefined memory attributes,
>>>> + * including things like caching, coherency and likely other stuff in
>>>> + * the future. The exact meaning of the pat_index is platform specific
>>>> + * and defined in the Bspec and PRMs. When the KMD sets up the binding
>>>> + * the index here is encoded into the ppGTT PTE.
>>>> + *
>>>> + * For coherency the @pat_index needs to match the
>>>> + * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
>>>> + * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
>>>> + * the coherency mode from the @pat_index and reject if there is a
>>>> + * mismatch (see note below for pre-MTL platforms).
>>>> + *
>>>> + * Note: On pre-MTL platforms there is only a caching mode and no
>>>> + * explicit coherency mode, but on such hardware there is always a
>>>> + * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>>>> + * CPU caches even with the caching mode set as uncached. It's only the
>>>> + * display engine that is incoherent (on dgpu it must be in VRAM which
>>>> + * is always mapped as WC on the CPU). However to keep the uapi somewhat
>>>> + * consistent with newer platforms the KMD groups the different cache
>>>> + * levels into the following coherency buckets on all pre-MTL platforms:
>>>> + *
>>>> + * ppGTT UC -> XE_GEM_COHERENCY_NONE
>>>> + * ppGTT WC -> XE_GEM_COHERENCY_NONE
>>>> + * ppGTT WT -> XE_GEM_COHERENCY_NONE
>>>> + * ppGTT WB -> XE_GEM_COHERENCY_2WAY
>>>> + *
>>>> + * In practice UC/WC/WT should only ever used for scanout surfaces on
>>>> + * such platforms since it is only the display engine that is actually
>>>> + * incoherent. Everything else should typically use WB given that we
>>>> + * have a shared-LLC. On MTL+ this completely changes (also potentially
>>>> + * no shared-LLC) and the HW defines the coherency mode as part of the
>>>> + * @pat_index.
>>>> + *
>>>> + * Note: For userptr and externally imported dma-buf the kernel expects
>>>> + * either 1WAY or 2WAY for the @pat_index.
>>>> + */
>>>> + __u32 pat_index;
>>>>
>>>> union {
>>>> /**
>>>
>
^ permalink raw reply [flat|nested] 47+ messages in thread