* [PATCH v7 00/24] AuxCCS handling and render compression modifiers
@ 2025-06-27 13:33 Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 01/24] drm/xe: Consolidate LRC offset calculations Tvrtko Ursulin
` (23 more replies)
0 siblings, 24 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
A series to fix and add xe support for AuxCCS framebuffers via DPT.
Currently the auxiliary buffer data isn't mapped into the page tables at all so
cf48bddd31de ("drm/i915/display: Disable AuxCCS framebuffers if built for Xe")
had to disable the support.
On top of that there are missing flushes, invalidations and similar.
Tested with KDE Wayland, on Lenovo Carbon X1 ADL-P:
[PLANE:32:plane 1A]: type=PRI
uapi: [FB:242] AR30 little-endian (0x30335241),0x100000000000008,2880x1800, visible=visible, src=2880.000000x1800.000000+0.000000+0.000000, dst=2880x1800+0+0, rotation=0 (0x00000001)
hw: [FB:242] AR30 little-endian (0x30335241),0x100000000000008,2880x1800, visible=yes, src=2880.000000x1800.000000+0.000000+0.000000, dst=2880x1800+0+0, rotation=0 (0x00000001)
Display working fine - no artefacts, no DMAR/PIPE faults.
*** NOTE ***
The first 8 patches are not really the AuxCCS work. They are a different series
which adds plumbing for the indirect context workarounds, on which the AuxCCS
work depends.
For strictly AuxCCS work you can start reading from patch 9.
*** NOTE ***
v2:
* More patches added to fix kms_flip_tiling.
v3:
* Rebased after some cleanup patches from v2 were merged.
* Added people to Cc as suggested by Rodrigo.
* Adjusted last patch title. (Rodrigo)
* Apply GGTT flushing only to iomapped system memory buffers.
v4:
* Added patch for potentially misplaced Wa_14016712196.
* Fixed (hopefully) MAX_JOB_SIZE_DW on Meteorlake.
v5:
* Split out ring emission changes to smaller patches.
* Fixed MAX_JOB_SIZE_DW even more.
* Don't emit MI_FLUSH_DW_CCS on !BCS. This should fix Meteorlake.
v6:
* Added AuxCCS invalidation to indirect context workarounds.
* Also added the indirect context handling and some other workarounds. They are
unrelated but the series depends on them.
* Dropped DPT pin alignment reduction since BMG does not appear to like it for
some reason.
v7:
* Rebased on top of recent xe_fb_pin.c refactoring and also the indirect
context workarounds series.
Tvrtko Ursulin (24):
drm/xe: Consolidate LRC offset calculations
drm/xe: Generalize wa bb emission code
drm/xe: Rename utilisation workaround emission function
drm/xe: Return number of written dwords from workaround batch buffer
emission
drm/xe: Allow specifying number of extra dwords at the end of wa bb
emission
drm/xe: Add plumbing for indirect context workarounds
drm/xe/xelp: Implement Wa_16010904313
drm/xe/xelp: Add Wa_18022495364
drm/xe: Use emit_flush_imm_ggtt helper instead of open coding
drm/xe/xelpg: Flush CCS when flushing caches
drm/xe: Flush L3 when flushing render cache
drm/xe/xelp: Quiesce memory traffic before invalidating auxccs
drm/xe/xelp: Support auxccs invalidation on blitter
drm/xe/xelp: Use MI_FLUSH_DW_CCS on auxccs platforms
drm/xe/xelp: Wait for AuxCCS invalidation to complete
drm/xe/xelp: Add AuxCCS invalidation to the buffer migration path
drm/xe: Export xe_emit_aux_table_inv
drm/xe/xelp: Add AuxCCS invalidation to the indirect context
workarounds
drm/xe: Use fb cached min alignment
drm/xe: Flush GGTT writes after populating DPT
drm/xe: Handle DPT in system memory
drm/xe: Force flush system memory AuxCCS framebuffers before scan out
drm/xe/display: Add support for AuxCCS
drm/i915/display: Expose AuxCCS frame buffer modifiers for Xe
.../drm/i915/display/skl_universal_plane.c | 6 -
drivers/gpu/drm/xe/display/xe_fb_pin.c | 177 ++++++++--
.../gpu/drm/xe/instructions/xe_gpu_commands.h | 2 +
.../gpu/drm/xe/instructions/xe_mi_commands.h | 7 +
drivers/gpu/drm/xe/regs/xe_engine_regs.h | 3 +
drivers/gpu/drm/xe/regs/xe_gt_regs.h | 1 +
drivers/gpu/drm/xe/regs/xe_lrc_layout.h | 4 +
drivers/gpu/drm/xe/xe_bo_types.h | 14 +-
drivers/gpu/drm/xe/xe_lrc.c | 317 +++++++++++++++---
drivers/gpu/drm/xe/xe_lrc_types.h | 5 +-
drivers/gpu/drm/xe/xe_ring_ops.c | 210 +++++++-----
drivers/gpu/drm/xe/xe_ring_ops.h | 3 +
drivers/gpu/drm/xe/xe_ring_ops_types.h | 2 +-
drivers/gpu/drm/xe/xe_wa_oob.rules | 2 +
14 files changed, 571 insertions(+), 182 deletions(-)
--
2.48.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v7 01/24] drm/xe: Consolidate LRC offset calculations
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 02/24] drm/xe: Generalize wa bb emission code Tvrtko Ursulin
` (22 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe
Cc: kernel-dev, Tvrtko Ursulin, Matthew Brost, Matt Roper,
Lucas De Marchi
Attempt to consolidate the LRC offset calculations by aligning the
recently added wa_bb_offset with the naming scheme in the file and
also change the size stored in struct xe_lrc to not include the ring
buffer.
The former makes it somewhat easier to visually follow the layout of the
various logical blocks stored in the LRC bo, while the latter reduces the
number of calculations sprinkled around.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 41 ++++++++++++++-----------------
drivers/gpu/drm/xe/xe_lrc_types.h | 2 +-
2 files changed, 19 insertions(+), 24 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 37598588a54f..f0c33de4eb6c 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -717,8 +717,12 @@ static u32 __xe_lrc_ctx_timestamp_udw_offset(struct xe_lrc *lrc)
static inline u32 __xe_lrc_indirect_ring_offset(struct xe_lrc *lrc)
{
- /* Indirect ring state page is at the very end of LRC */
- return lrc->size - LRC_INDIRECT_RING_STATE_SIZE;
+ return lrc->bo->size - LRC_WA_BB_SIZE - LRC_INDIRECT_RING_STATE_SIZE;
+}
+
+static inline u32 __xe_lrc_wa_bb_offset(struct xe_lrc *lrc)
+{
+ return lrc->bo->size - LRC_WA_BB_SIZE;
}
#define DECL_MAP_ADDR_HELPERS(elem) \
@@ -973,11 +977,6 @@ struct wa_bb_setup {
u32 *batch, size_t max_size);
};
-static size_t wa_bb_offset(struct xe_lrc *lrc)
-{
- return lrc->bo->size - LRC_WA_BB_SIZE;
-}
-
static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
{
const size_t max_size = LRC_WA_BB_SIZE;
@@ -993,7 +992,7 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
return -ENOMEM;
cmd = buf;
} else {
- cmd = lrc->bo->vmap.vaddr + wa_bb_offset(lrc);
+ cmd = lrc->bo->vmap.vaddr + __xe_lrc_wa_bb_offset(lrc);
}
remain = max_size / sizeof(*cmd);
@@ -1017,13 +1016,13 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
if (buf) {
xe_map_memcpy_to(gt_to_xe(lrc->gt), &lrc->bo->vmap,
- wa_bb_offset(lrc), buf,
+ __xe_lrc_wa_bb_offset(lrc), buf,
(cmd - buf) * sizeof(*cmd));
kfree(buf);
}
xe_lrc_write_ctx_reg(lrc, CTX_BB_PER_CTX_PTR, xe_bo_ggtt_addr(lrc->bo) +
- wa_bb_offset(lrc) + 1);
+ __xe_lrc_wa_bb_offset(lrc) + 1);
return 0;
@@ -1040,19 +1039,22 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
u32 init_flags)
{
struct xe_gt *gt = hwe->gt;
+ const u32 lrc_size = xe_gt_lrc_size(gt, hwe->class);
+ const u32 bo_size = ring_size + lrc_size + LRC_WA_BB_SIZE;
struct xe_tile *tile = gt_to_tile(gt);
struct xe_device *xe = gt_to_xe(gt);
struct iosys_map map;
void *init_data = NULL;
u32 arb_enable;
- u32 lrc_size;
u32 bo_flags;
int err;
kref_init(&lrc->refcount);
lrc->gt = gt;
+ lrc->size = lrc_size;
lrc->flags = 0;
- lrc_size = ring_size + xe_gt_lrc_size(gt, hwe->class);
+ lrc->ring.size = ring_size;
+ lrc->ring.tail = 0;
if (xe_gt_has_indirect_ring_state(gt))
lrc->flags |= XE_LRC_FLAG_INDIRECT_RING_STATE;
@@ -1065,17 +1067,12 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
* FIXME: Perma-pinning LRC as we don't yet support moving GGTT address
* via VM bind calls.
*/
- lrc->bo = xe_bo_create_pin_map(xe, tile, NULL,
- lrc_size + LRC_WA_BB_SIZE,
+ lrc->bo = xe_bo_create_pin_map(xe, tile, NULL, bo_size,
ttm_bo_type_kernel,
bo_flags);
if (IS_ERR(lrc->bo))
return PTR_ERR(lrc->bo);
- lrc->size = lrc_size;
- lrc->ring.size = ring_size;
- lrc->ring.tail = 0;
-
xe_hw_fence_ctx_init(&lrc->fence_ctx, hwe->gt,
hwe->fence_irq, hwe->name);
@@ -1096,10 +1093,9 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
xe_map_memset(xe, &map, 0, 0, LRC_PPHWSP_SIZE); /* PPHWSP */
xe_map_memcpy_to(xe, &map, LRC_PPHWSP_SIZE,
gt->default_lrc[hwe->class] + LRC_PPHWSP_SIZE,
- xe_gt_lrc_size(gt, hwe->class) - LRC_PPHWSP_SIZE);
+ lrc_size - LRC_PPHWSP_SIZE);
} else {
- xe_map_memcpy_to(xe, &map, 0, init_data,
- xe_gt_lrc_size(gt, hwe->class));
+ xe_map_memcpy_to(xe, &map, 0, init_data, lrc_size);
kfree(init_data);
}
@@ -1859,8 +1855,7 @@ struct xe_lrc_snapshot *xe_lrc_snapshot_capture(struct xe_lrc *lrc)
snapshot->seqno = xe_lrc_seqno(lrc);
snapshot->lrc_bo = xe_bo_get(lrc->bo);
snapshot->lrc_offset = xe_lrc_pphwsp_offset(lrc);
- snapshot->lrc_size = lrc->bo->size - snapshot->lrc_offset -
- LRC_WA_BB_SIZE;
+ snapshot->lrc_size = lrc->size;
snapshot->lrc_snapshot = NULL;
snapshot->ctx_timestamp = lower_32_bits(xe_lrc_ctx_timestamp(lrc));
snapshot->ctx_job_timestamp = xe_lrc_ctx_job_timestamp(lrc);
diff --git a/drivers/gpu/drm/xe/xe_lrc_types.h b/drivers/gpu/drm/xe/xe_lrc_types.h
index 883e550a9423..2c7c81079801 100644
--- a/drivers/gpu/drm/xe/xe_lrc_types.h
+++ b/drivers/gpu/drm/xe/xe_lrc_types.h
@@ -22,7 +22,7 @@ struct xe_lrc {
*/
struct xe_bo *bo;
- /** @size: size of lrc including any indirect ring state page */
+ /** @size: size of the lrc and optional indirect ring state */
u32 size;
/** @gt: gt which this LRC belongs to */
--
2.48.0
* [PATCH v7 02/24] drm/xe: Generalize wa bb emission code
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 01/24] drm/xe: Consolidate LRC offset calculations Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 03/24] drm/xe: Rename utilisation workaround emission function Tvrtko Ursulin
` (21 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Lucas De Marchi
Generalize the wa bb emission by splitting it into three phases - setup,
emit and finish - and by extracting the setup and finish steps into helpers.
This will enable using the same infrastructure for emitting the indirect
context workarounds.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 77 +++++++++++++++++++++++++------------
1 file changed, 53 insertions(+), 24 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index f0c33de4eb6c..58c676afc60f 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -972,32 +972,37 @@ static ssize_t wa_bb_setup_utilization(struct xe_lrc *lrc, struct xe_hw_engine *
return cmd - batch;
}
-struct wa_bb_setup {
+struct wa_bo_setup {
ssize_t (*setup)(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
u32 *batch, size_t max_size);
};
-static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
+static u32 *
+setup_wa_bo(struct xe_lrc *lrc,
+ struct xe_hw_engine *hwe,
+ const size_t max_size,
+ unsigned int offset,
+ const struct wa_bo_setup *funcs,
+ unsigned int num_funcs,
+ u32 **free)
{
- const size_t max_size = LRC_WA_BB_SIZE;
- static const struct wa_bb_setup funcs[] = {
- { .setup = wa_bb_setup_utilization },
- };
+ u32 *cmd, *buf = NULL;
ssize_t remain;
- u32 *cmd, *buf = NULL;
if (lrc->bo->vmap.is_iomem) {
buf = kmalloc(max_size, GFP_KERNEL);
if (!buf)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);
cmd = buf;
+ *free = buf;
} else {
- cmd = lrc->bo->vmap.vaddr + __xe_lrc_wa_bb_offset(lrc);
+ cmd = lrc->bo->vmap.vaddr + offset;
+ *free = NULL;
}
remain = max_size / sizeof(*cmd);
- for (size_t i = 0; i < ARRAY_SIZE(funcs); i++) {
+ for (size_t i = 0; i < num_funcs; i++) {
ssize_t len = funcs[i].setup(lrc, hwe, cmd, remain);
remain -= len;
@@ -1012,23 +1017,47 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
cmd += len;
}
- *cmd++ = MI_BATCH_BUFFER_END;
-
- if (buf) {
- xe_map_memcpy_to(gt_to_xe(lrc->gt), &lrc->bo->vmap,
- __xe_lrc_wa_bb_offset(lrc), buf,
- (cmd - buf) * sizeof(*cmd));
- kfree(buf);
- }
-
- xe_lrc_write_ctx_reg(lrc, CTX_BB_PER_CTX_PTR, xe_bo_ggtt_addr(lrc->bo) +
- __xe_lrc_wa_bb_offset(lrc) + 1);
-
- return 0;
+ return cmd;
fail:
kfree(buf);
- return -ENOSPC;
+ return ERR_PTR(-ENOSPC);
+}
+
+static void finish_wa_bo(struct xe_lrc *lrc,
+ unsigned int offset,
+ u32 *cmd,
+ u32 *free)
+{
+ if (!free)
+ return;
+
+ xe_map_memcpy_to(gt_to_xe(lrc->gt), &lrc->bo->vmap, offset, free,
+ (cmd - free) * sizeof(*cmd));
+ kfree(free);
+}
+
+static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
+{
+ static const struct wa_bo_setup funcs[] = {
+ { .setup = wa_bb_setup_utilization },
+ };
+ unsigned int offset = __xe_lrc_wa_bb_offset(lrc);
+ u32 *cmd, *buf = NULL;
+
+ cmd = setup_wa_bo(lrc, hwe, LRC_WA_BB_SIZE, offset, funcs,
+ ARRAY_SIZE(funcs), &buf);
+ if (IS_ERR(cmd))
+ return PTR_ERR(cmd);
+
+ *cmd++ = MI_BATCH_BUFFER_END;
+
+ finish_wa_bo(lrc, offset, cmd, buf);
+
+ xe_lrc_write_ctx_reg(lrc, CTX_BB_PER_CTX_PTR,
+ xe_bo_ggtt_addr(lrc->bo) + offset + 1);
+
+ return 0;
}
#define PVC_CTX_ASID (0x2e + 1)
--
2.48.0
* [PATCH v7 03/24] drm/xe: Rename utilisation workaround emission function
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 01/24] drm/xe: Consolidate LRC offset calculations Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 02/24] drm/xe: Generalize wa bb emission code Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 04/24] drm/xe: Return number of written dwords from workaround batch buffer emission Tvrtko Ursulin
` (20 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Lucas De Marchi, Matt Roper
Lucas suggested consolidating on a slightly different naming scheme which
will align better with the upcoming additions.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Suggested-by: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 58c676afc60f..a26ec8b6a3ad 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -944,8 +944,9 @@ static void xe_lrc_finish(struct xe_lrc *lrc)
* store it in the PPHSWP.
*/
#define CONTEXT_ACTIVE 1ULL
-static ssize_t wa_bb_setup_utilization(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
- u32 *batch, size_t max_len)
+static ssize_t
+setup_utilization_wa(struct xe_lrc *lrc, struct xe_hw_engine *hwe, u32 *batch,
+ size_t max_len)
{
u32 *cmd = batch;
@@ -1040,7 +1041,7 @@ static void finish_wa_bo(struct xe_lrc *lrc,
static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
{
static const struct wa_bo_setup funcs[] = {
- { .setup = wa_bb_setup_utilization },
+ { .setup = setup_utilization_wa },
};
unsigned int offset = __xe_lrc_wa_bb_offset(lrc);
u32 *cmd, *buf = NULL;
--
2.48.0
* [PATCH v7 04/24] drm/xe: Return number of written dwords from workaround batch buffer emission
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (2 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 03/24] drm/xe: Rename utilisation workaround emission function Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 05/24] drm/xe: Allow specifying number of extra dwords at the end of wa bb emission Tvrtko Ursulin
` (19 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Lucas De Marchi
Indirect context setup will need access to the number of written dwords.
Let's add it as an output parameter so it can be accessed from the caller
regardless of whether the code is writing directly or via a shadow buffer.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index a26ec8b6a3ad..5a2978139e4f 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -985,7 +985,8 @@ setup_wa_bo(struct xe_lrc *lrc,
unsigned int offset,
const struct wa_bo_setup *funcs,
unsigned int num_funcs,
- u32 **free)
+ u32 **free,
+ unsigned int *written)
{
u32 *cmd, *buf = NULL;
ssize_t remain;
@@ -1016,6 +1017,8 @@ setup_wa_bo(struct xe_lrc *lrc,
goto fail;
cmd += len;
+ if (written)
+ *written += len;
}
return cmd;
@@ -1027,14 +1030,14 @@ setup_wa_bo(struct xe_lrc *lrc,
static void finish_wa_bo(struct xe_lrc *lrc,
unsigned int offset,
- u32 *cmd,
+ unsigned int written,
u32 *free)
{
if (!free)
return;
xe_map_memcpy_to(gt_to_xe(lrc->gt), &lrc->bo->vmap, offset, free,
- (cmd - free) * sizeof(*cmd));
+ written * sizeof(u32));
kfree(free);
}
@@ -1044,16 +1047,18 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
{ .setup = setup_utilization_wa },
};
unsigned int offset = __xe_lrc_wa_bb_offset(lrc);
+ unsigned int written = 0;
u32 *cmd, *buf = NULL;
cmd = setup_wa_bo(lrc, hwe, LRC_WA_BB_SIZE, offset, funcs,
- ARRAY_SIZE(funcs), &buf);
+ ARRAY_SIZE(funcs), &buf, &written);
if (IS_ERR(cmd))
return PTR_ERR(cmd);
*cmd++ = MI_BATCH_BUFFER_END;
+ written++;
- finish_wa_bo(lrc, offset, cmd, buf);
+ finish_wa_bo(lrc, offset, written, buf);
xe_lrc_write_ctx_reg(lrc, CTX_BB_PER_CTX_PTR,
xe_bo_ggtt_addr(lrc->bo) + offset + 1);
--
2.48.0
* [PATCH v7 05/24] drm/xe: Allow specifying number of extra dwords at the end of wa bb emission
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (3 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 04/24] drm/xe: Return number of written dwords from workaround batch buffer emission Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 06/24] drm/xe: Add plumbing for indirect context workarounds Tvrtko Ursulin
` (18 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Lucas De Marchi
Indirect context setup will need to reserve more than one dword.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 5a2978139e4f..ea1e60e23120 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -982,6 +982,7 @@ static u32 *
setup_wa_bo(struct xe_lrc *lrc,
struct xe_hw_engine *hwe,
const size_t max_size,
+ unsigned int reserve_dw,
unsigned int offset,
const struct wa_bo_setup *funcs,
unsigned int num_funcs,
@@ -1010,10 +1011,9 @@ setup_wa_bo(struct xe_lrc *lrc,
remain -= len;
/*
- * There should always be at least 1 additional dword for
- * the end marker
+ * Caller has asked for at least reserve_dw to remain unused.
*/
- if (len < 0 || xe_gt_WARN_ON(lrc->gt, remain < 1))
+ if (len < 0 || xe_gt_WARN_ON(lrc->gt, remain < reserve_dw))
goto fail;
cmd += len;
@@ -1050,7 +1050,7 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
unsigned int written = 0;
u32 *cmd, *buf = NULL;
- cmd = setup_wa_bo(lrc, hwe, LRC_WA_BB_SIZE, offset, funcs,
+ cmd = setup_wa_bo(lrc, hwe, LRC_WA_BB_SIZE, 1, offset, funcs,
ARRAY_SIZE(funcs), &buf, &written);
if (IS_ERR(cmd))
return PTR_ERR(cmd);
--
2.48.0
* [PATCH v7 06/24] drm/xe: Add plumbing for indirect context workarounds
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (4 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 05/24] drm/xe: Allow specifying number of extra dwords at the end of wa bb emission Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 07/24] drm/xe/xelp: Implement Wa_16010904313 Tvrtko Ursulin
` (17 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Lucas De Marchi, Matt Roper
Some upcoming workarounds need to be emitted from the indirect context
workaround buffer so let's add some plumbing where they will be able to
easily slot in.
No functional changes for now since everything is still deactivated.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
drivers/gpu/drm/xe/regs/xe_lrc_layout.h | 4 ++
drivers/gpu/drm/xe/xe_lrc.c | 80 ++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_lrc_types.h | 3 +-
3 files changed, 84 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/xe/regs/xe_lrc_layout.h b/drivers/gpu/drm/xe/regs/xe_lrc_layout.h
index 994af591a2e8..06c3a24ac381 100644
--- a/drivers/gpu/drm/xe/regs/xe_lrc_layout.h
+++ b/drivers/gpu/drm/xe/regs/xe_lrc_layout.h
@@ -12,6 +12,8 @@
#define CTX_RING_START (0x08 + 1)
#define CTX_RING_CTL (0x0a + 1)
#define CTX_BB_PER_CTX_PTR (0x12 + 1)
+#define CTX_CS_INDIRECT_CTX (0x14 + 1)
+#define CTX_CS_INDIRECT_CTX_OFFSET (0x16 + 1)
#define CTX_TIMESTAMP (0x22 + 1)
#define CTX_TIMESTAMP_UDW (0x24 + 1)
#define CTX_INDIRECT_RING_STATE (0x26 + 1)
@@ -36,4 +38,6 @@
#define INDIRECT_CTX_RING_START_UDW (0x08 + 1)
#define INDIRECT_CTX_RING_CTL (0x0a + 1)
+#define XELP_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT 0xd
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index ea1e60e23120..40c9fabeda09 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -39,6 +39,7 @@
#define LRC_ENGINE_INSTANCE GENMASK_ULL(53, 48)
#define LRC_PPHWSP_SIZE SZ_4K
+#define LRC_INDIRECT_CTX_SIZE SZ_4K
#define LRC_INDIRECT_RING_STATE_SIZE SZ_4K
#define LRC_WA_BB_SIZE SZ_4K
@@ -48,6 +49,12 @@ lrc_to_xe(struct xe_lrc *lrc)
return gt_to_xe(lrc->fence_ctx.gt);
}
+static bool
+gt_engine_needs_indirect_ctx(struct xe_gt *gt, enum xe_engine_class class)
+{
+ return false;
+}
+
size_t xe_gt_lrc_size(struct xe_gt *gt, enum xe_engine_class class)
{
struct xe_device *xe = gt_to_xe(gt);
@@ -717,7 +724,18 @@ static u32 __xe_lrc_ctx_timestamp_udw_offset(struct xe_lrc *lrc)
static inline u32 __xe_lrc_indirect_ring_offset(struct xe_lrc *lrc)
{
- return lrc->bo->size - LRC_WA_BB_SIZE - LRC_INDIRECT_RING_STATE_SIZE;
+ u32 offset = lrc->bo->size - LRC_WA_BB_SIZE -
+ LRC_INDIRECT_RING_STATE_SIZE;
+
+ if (lrc->flags & XE_LRC_FLAG_INDIRECT_CTX)
+ offset -= LRC_INDIRECT_CTX_SIZE;
+
+ return offset;
+}
+
+static inline u32 __xe_lrc_indirect_ctx_offset(struct xe_lrc *lrc)
+{
+ return lrc->bo->size - LRC_WA_BB_SIZE - LRC_INDIRECT_CTX_SIZE;
}
static inline u32 __xe_lrc_wa_bb_offset(struct xe_lrc *lrc)
@@ -1066,6 +1084,54 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
return 0;
}
+static int
+setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
+{
+ static struct wa_bo_setup rcs_funcs[] = {
+ };
+ unsigned int offset, num_funcs, written = 0;
+ struct wa_bo_setup *funcs = NULL;
+ u32 *cmd, *buf = NULL;
+
+ if (!(lrc->flags & XE_LRC_FLAG_INDIRECT_CTX))
+ return 0;
+
+ if (hwe->class == XE_ENGINE_CLASS_RENDER ||
+ hwe->class == XE_ENGINE_CLASS_COMPUTE) {
+ funcs = rcs_funcs;
+ num_funcs = ARRAY_SIZE(rcs_funcs);
+ }
+
+ if (xe_gt_WARN_ON(lrc->gt, !funcs))
+ return 0;
+
+ offset = __xe_lrc_indirect_ctx_offset(lrc);
+
+ cmd = setup_wa_bo(lrc, hwe, LRC_INDIRECT_CTX_SIZE, 15, offset, funcs,
+ num_funcs, &buf, &written);
+ if (IS_ERR(cmd))
+ return PTR_ERR(cmd);
+
+ /* Align to 64B cacheline. */
+ while ((unsigned long)cmd & 0x3f) {
+ *cmd++ = MI_NOOP;
+ written++;
+ }
+
+ finish_wa_bo(lrc, offset, written, buf);
+
+ xe_lrc_write_ctx_reg(lrc,
+ CTX_CS_INDIRECT_CTX,
+ (xe_bo_ggtt_addr(lrc->bo) + offset) |
+ /* Size in CLs. */
+ (written * sizeof(u32) / 64));
+ xe_lrc_write_ctx_reg(lrc,
+ CTX_CS_INDIRECT_CTX_OFFSET,
+ XELP_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT << 6);
+
+ return 0;
+}
+
#define PVC_CTX_ASID (0x2e + 1)
#define PVC_CTX_ACC_CTR_THOLD (0x2a + 1)
@@ -1075,7 +1141,7 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
{
struct xe_gt *gt = hwe->gt;
const u32 lrc_size = xe_gt_lrc_size(gt, hwe->class);
- const u32 bo_size = ring_size + lrc_size + LRC_WA_BB_SIZE;
+ u32 bo_size = ring_size + lrc_size + LRC_WA_BB_SIZE;
struct xe_tile *tile = gt_to_tile(gt);
struct xe_device *xe = gt_to_xe(gt);
struct iosys_map map;
@@ -1090,6 +1156,12 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
lrc->flags = 0;
lrc->ring.size = ring_size;
lrc->ring.tail = 0;
+
+ if (gt_engine_needs_indirect_ctx(gt, hwe->class)) {
+ lrc->flags |= XE_LRC_FLAG_INDIRECT_CTX;
+ bo_size += LRC_INDIRECT_CTX_SIZE;
+ }
+
if (xe_gt_has_indirect_ring_state(gt))
lrc->flags |= XE_LRC_FLAG_INDIRECT_RING_STATE;
@@ -1214,6 +1286,10 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
if (err)
goto err_lrc_finish;
+ err = setup_indirect_ctx(lrc, hwe);
+ if (err)
+ goto err_lrc_finish;
+
return 0;
err_lrc_finish:
diff --git a/drivers/gpu/drm/xe/xe_lrc_types.h b/drivers/gpu/drm/xe/xe_lrc_types.h
index 2c7c81079801..e9883706e004 100644
--- a/drivers/gpu/drm/xe/xe_lrc_types.h
+++ b/drivers/gpu/drm/xe/xe_lrc_types.h
@@ -29,7 +29,8 @@ struct xe_lrc {
struct xe_gt *gt;
/** @flags: LRC flags */
-#define XE_LRC_FLAG_INDIRECT_RING_STATE 0x1
+#define XE_LRC_FLAG_INDIRECT_CTX 0x1
+#define XE_LRC_FLAG_INDIRECT_RING_STATE 0x2
u32 flags;
/** @refcount: ref count of this lrc */
--
2.48.0
* [PATCH v7 07/24] drm/xe/xelp: Implement Wa_16010904313
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (5 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 06/24] drm/xe: Add plumbing for indirect context workarounds Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 08/24] drm/xe/xelp: Add Wa_18022495364 Tvrtko Ursulin
` (16 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Lucas De Marchi, Matt Roper
Add XeLP workaround 16010904313.
The description calls for it to be emitted as the indirect context buffer
workaround for render and compute, and from the workaround batch buffer
for the other engines. Therefore we plug into the previously added
respective top level emission functions.
The actual command streamer programming sequence differs from what is
described in the PRM, in that it assumes the listed LRCA offset was
actually supposed to refer to the location of the CTX_TIMESTAMP register
instead of LRCA + 0x180c (which is in GPR space). This interpretation
appears to make more sense under the assumption that the multiple writes
are helping with restoring the CTX_TIMESTAMP register content from the
saved context state.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
.../gpu/drm/xe/instructions/xe_mi_commands.h | 1 +
drivers/gpu/drm/xe/xe_lrc.c | 45 +++++++++++++++++++
drivers/gpu/drm/xe/xe_wa_oob.rules | 1 +
3 files changed, 47 insertions(+)
diff --git a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
index e3f5e8bb3ebc..c47b290e0e9f 100644
--- a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
+++ b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
@@ -65,6 +65,7 @@
#define MI_LOAD_REGISTER_MEM (__MI_INSTR(0x29) | XE_INSTR_NUM_DW(4))
#define MI_LRM_USE_GGTT REG_BIT(22)
+#define MI_LRM_ASYNC REG_BIT(21)
#define MI_LOAD_REGISTER_REG (__MI_INSTR(0x2a) | XE_INSTR_NUM_DW(3))
#define MI_LRR_DST_CS_MMIO REG_BIT(19)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 40c9fabeda09..f58d659bb13a 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -52,6 +52,11 @@ lrc_to_xe(struct xe_lrc *lrc)
static bool
gt_engine_needs_indirect_ctx(struct xe_gt *gt, enum xe_engine_class class)
{
+ if (XE_WA(gt, 16010904313) &&
+ (class == XE_ENGINE_CLASS_RENDER ||
+ class == XE_ENGINE_CLASS_COMPUTE))
+ return true;
+
return false;
}
@@ -991,6 +996,44 @@ setup_utilization_wa(struct xe_lrc *lrc, struct xe_hw_engine *hwe, u32 *batch,
return cmd - batch;
}
+static ssize_t
+setup_timestamp_wa(struct xe_lrc *lrc, struct xe_hw_engine *hwe, u32 *batch,
+ size_t max_len)
+{
+ const u32 ts_addr = __xe_lrc_ctx_timestamp_ggtt_addr(lrc);
+ u32 *cmd = batch;
+
+ if (!XE_WA(lrc->gt, 16010904313) ||
+ !(hwe->class == XE_ENGINE_CLASS_RENDER ||
+ hwe->class == XE_ENGINE_CLASS_COMPUTE ||
+ hwe->class == XE_ENGINE_CLASS_COPY ||
+ hwe->class == XE_ENGINE_CLASS_VIDEO_DECODE ||
+ hwe->class == XE_ENGINE_CLASS_VIDEO_ENHANCE))
+ return 0;
+
+ if (xe_gt_WARN_ON(lrc->gt, max_len < 12))
+ return -ENOSPC;
+
+ *cmd++ = MI_LOAD_REGISTER_MEM | MI_LRM_USE_GGTT | MI_LRI_LRM_CS_MMIO |
+ MI_LRM_ASYNC;
+ *cmd++ = RING_CTX_TIMESTAMP(0).addr;
+ *cmd++ = ts_addr;
+ *cmd++ = 0;
+
+ *cmd++ = MI_LOAD_REGISTER_MEM | MI_LRM_USE_GGTT | MI_LRI_LRM_CS_MMIO |
+ MI_LRM_ASYNC;
+ *cmd++ = RING_CTX_TIMESTAMP(0).addr;
+ *cmd++ = ts_addr;
+ *cmd++ = 0;
+
+ *cmd++ = MI_LOAD_REGISTER_MEM | MI_LRM_USE_GGTT | MI_LRI_LRM_CS_MMIO;
+ *cmd++ = RING_CTX_TIMESTAMP(0).addr;
+ *cmd++ = ts_addr;
+ *cmd++ = 0;
+
+ return cmd - batch;
+}
+
struct wa_bo_setup {
ssize_t (*setup)(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
u32 *batch, size_t max_size);
@@ -1062,6 +1105,7 @@ static void finish_wa_bo(struct xe_lrc *lrc,
static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
{
static const struct wa_bo_setup funcs[] = {
+ { .setup = setup_timestamp_wa },
{ .setup = setup_utilization_wa },
};
unsigned int offset = __xe_lrc_wa_bb_offset(lrc);
@@ -1088,6 +1132,7 @@ static int
setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
{
static struct wa_bo_setup rcs_funcs[] = {
+ { .setup = setup_timestamp_wa },
};
unsigned int offset, num_funcs, written = 0;
struct wa_bo_setup *funcs = NULL;
diff --git a/drivers/gpu/drm/xe/xe_wa_oob.rules b/drivers/gpu/drm/xe/xe_wa_oob.rules
index 96cc33da0fb5..bf798f3b8f93 100644
--- a/drivers/gpu/drm/xe/xe_wa_oob.rules
+++ b/drivers/gpu/drm/xe/xe_wa_oob.rules
@@ -1,4 +1,5 @@
1607983814 GRAPHICS_VERSION_RANGE(1200, 1210)
+16010904313 GRAPHICS_VERSION_RANGE(1200, 1210)
22012773006 GRAPHICS_VERSION_RANGE(1200, 1250)
14014475959 GRAPHICS_VERSION_RANGE(1270, 1271), GRAPHICS_STEP(A0, B0)
PLATFORM(DG2)
--
2.48.0
* [PATCH v7 08/24] drm/xe/xelp: Add Wa_18022495364
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (6 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 07/24] drm/xe/xelp: Implement Wa_16010904313 Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 09/24] drm/xe: Use emit_flush_imm_ggtt helper instead of open coding Tvrtko Ursulin
` (15 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Lucas De Marchi, Matt Roper
Add Wa_18022495364 as a context workaround batch buffer workaround.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
drivers/gpu/drm/xe/regs/xe_engine_regs.h | 3 +++
drivers/gpu/drm/xe/xe_lrc.c | 21 +++++++++++++++++++++
drivers/gpu/drm/xe/xe_wa_oob.rules | 1 +
3 files changed, 25 insertions(+)
diff --git a/drivers/gpu/drm/xe/regs/xe_engine_regs.h b/drivers/gpu/drm/xe/regs/xe_engine_regs.h
index 7ade41e2b7b3..f4c3e1187a00 100644
--- a/drivers/gpu/drm/xe/regs/xe_engine_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_engine_regs.h
@@ -111,6 +111,9 @@
#define PPHWSP_CSB_AND_TIMESTAMP_REPORT_DIS REG_BIT(14)
#define CS_PRIORITY_MEM_READ REG_BIT(7)
+#define CS_DEBUG_MODE2(base) XE_REG((base) + 0xd8, XE_REG_OPTION_MASKED)
+#define INSTRUCTION_STATE_CACHE_INVALIDATE REG_BIT(6)
+
#define FF_SLICE_CS_CHICKEN1(base) XE_REG((base) + 0xe0, XE_REG_OPTION_MASKED)
#define FFSC_PERCTX_PREEMPT_CTRL REG_BIT(14)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index f58d659bb13a..a5c03ad4b8b2 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1034,6 +1034,26 @@ setup_timestamp_wa(struct xe_lrc *lrc, struct xe_hw_engine *hwe, u32 *batch,
return cmd - batch;
}
+static ssize_t
+setup_invalidate_state_cache_wa(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
+ u32 *batch, size_t max_len)
+{
+ u32 *cmd = batch;
+
+ if (!XE_WA(lrc->gt, 18022495364) ||
+ hwe->class != XE_ENGINE_CLASS_RENDER)
+ return 0;
+
+ if (xe_gt_WARN_ON(lrc->gt, max_len < 3))
+ return -ENOSPC;
+
+ *cmd++ = MI_LOAD_REGISTER_IMM | MI_LRI_NUM_REGS(1);
+ *cmd++ = CS_DEBUG_MODE1(0).addr;
+ *cmd++ = _MASKED_BIT_ENABLE(INSTRUCTION_STATE_CACHE_INVALIDATE);
+
+ return cmd - batch;
+}
+
struct wa_bo_setup {
ssize_t (*setup)(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
u32 *batch, size_t max_size);
@@ -1106,6 +1126,7 @@ static int setup_wa_bb(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
{
static const struct wa_bo_setup funcs[] = {
{ .setup = setup_timestamp_wa },
+ { .setup = setup_invalidate_state_cache_wa },
{ .setup = setup_utilization_wa },
};
unsigned int offset = __xe_lrc_wa_bb_offset(lrc);
diff --git a/drivers/gpu/drm/xe/xe_wa_oob.rules b/drivers/gpu/drm/xe/xe_wa_oob.rules
index bf798f3b8f93..8ee1c71499fc 100644
--- a/drivers/gpu/drm/xe/xe_wa_oob.rules
+++ b/drivers/gpu/drm/xe/xe_wa_oob.rules
@@ -1,5 +1,6 @@
1607983814 GRAPHICS_VERSION_RANGE(1200, 1210)
16010904313 GRAPHICS_VERSION_RANGE(1200, 1210)
+18022495364 GRAPHICS_VERSION_RANGE(1200, 1210)
22012773006 GRAPHICS_VERSION_RANGE(1200, 1250)
14014475959 GRAPHICS_VERSION_RANGE(1270, 1271), GRAPHICS_STEP(A0, B0)
PLATFORM(DG2)
--
2.48.0
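CS_DEBUG_MODE2 is a masked register (`XE_REG_OPTION_MASKED`), which is why the patch writes `_MASKED_BIT_ENABLE(...)` via LRI. A rough model of that convention, with helper names that are ours rather than the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the masked-register convention behind _MASKED_BIT_ENABLE()
 * used with CS_DEBUG_MODE2 above: the upper 16 bits of the written value
 * select which bits the write affects, so a single LRI can set or clear
 * individual bits without a read-modify-write cycle.
 */
#define MASKED_BIT(n)          (1u << (n))
#define MASKED_BIT_ENABLE(b)   (((b) << 16) | (b))
#define MASKED_BIT_DISABLE(b)  ((b) << 16)

#define INSTRUCTION_STATE_CACHE_INVALIDATE MASKED_BIT(6)

/* Model how the HW applies a masked write to a register value. */
static uint32_t apply_masked_write(uint32_t reg, uint32_t masked)
{
	uint32_t mask = masked >> 16;

	return (reg & ~mask) | (masked & mask);
}
```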
* [PATCH v7 09/24] drm/xe: Use emit_flush_imm_ggtt helper instead of open coding
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (7 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 08/24] drm/xe/xelp: Add Wa_18022495364 Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 21:57 ` Matthew Brost
2025-06-27 13:33 ` [PATCH v7 10/24] drm/xe/xelpg: Flush CCS when flushing caches Tvrtko Ursulin
` (14 subsequent siblings)
23 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
The helper is already there, so let's just use it.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/xe_ring_ops.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index bc1689db4cd7..b356134aca88 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -417,11 +417,9 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
i = emit_bb_start(job->ptrs[1].batch_addr, BIT(8), dw, i);
- dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | job->migrate_flush_flags |
- MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_IMM_DW;
- dw[i++] = xe_lrc_seqno_ggtt_addr(lrc) | MI_FLUSH_DW_USE_GTT;
- dw[i++] = 0;
- dw[i++] = seqno; /* value */
+ i = emit_flush_imm_ggtt(xe_lrc_seqno_ggtt_addr(lrc), seqno,
+ MI_INVALIDATE_TLB | job->migrate_flush_flags,
+ dw, i);
i = emit_user_interrupt(dw, i);
--
2.48.0
* [PATCH v7 10/24] drm/xe/xelpg: Flush CCS when flushing caches
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (8 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 09/24] drm/xe: Use emit_flush_imm_ggtt helper instead of open coding Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache Tvrtko Ursulin
` (13 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
According to i915, PIPE_CONTROL0_CCS_FLUSH needs to be set when flushing
render caches on gfx IP 12.70+.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/instructions/xe_gpu_commands.h | 1 +
drivers/gpu/drm/xe/xe_ring_ops.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
index 8cfcd3360896..78c0e87dbd37 100644
--- a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
+++ b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
@@ -43,6 +43,7 @@
#define PIPE_CONTROL0_L3_READ_ONLY_CACHE_INVALIDATE BIT(10) /* gen12 */
#define PIPE_CONTROL0_HDC_PIPELINE_FLUSH BIT(9) /* gen12 */
+#define PIPE_CONTROL0_CCS_FLUSH BIT(13) /* MTL+ */
#define PIPE_CONTROL_COMMAND_CACHE_INVALIDATE (1<<29)
#define PIPE_CONTROL_TILE_CACHE_FLUSH (1<<28)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index b356134aca88..a1289f086191 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -175,13 +175,18 @@ static int emit_store_imm_ppgtt_posted(u64 addr, u64 value,
static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
{
struct xe_gt *gt = job->q->gt;
+ struct xe_device *xe = gt_to_xe(gt);
bool lacks_render = !(gt->info.engine_mask & XE_HW_ENGINE_RCS_MASK);
+ u32 bit_group_0 = PIPE_CONTROL0_HDC_PIPELINE_FLUSH;
u32 flags;
if (XE_WA(gt, 14016712196))
i = emit_pipe_control(dw, i, 0, PIPE_CONTROL_DEPTH_CACHE_FLUSH,
LRC_PPHWSP_FLUSH_INVAL_SCRATCH_ADDR, 0);
+ if (GRAPHICS_VERx100(xe) >= 1270)
+ bit_group_0 |= PIPE_CONTROL0_CCS_FLUSH;
+
flags = (PIPE_CONTROL_CS_STALL |
PIPE_CONTROL_TILE_CACHE_FLUSH |
PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH |
@@ -197,7 +202,7 @@ static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
else if (job->q->class == XE_ENGINE_CLASS_COMPUTE)
flags &= ~PIPE_CONTROL_3D_ENGINE_FLAGS;
- return emit_pipe_control(dw, i, PIPE_CONTROL0_HDC_PIPELINE_FLUSH, flags, 0, 0);
+ return emit_pipe_control(dw, i, bit_group_0, flags, 0, 0);
}
static int emit_pipe_control_to_ring_end(struct xe_hw_engine *hwe, u32 *dw, int i)
--
2.48.0
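The version-gated flag selection in `emit_render_cache_flush` can be sketched standalone. A hedged illustration only; the two flag values match the defines in the patch, but the helper is ours:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the bit_group_0 selection in emit_render_cache_flush after
 * this patch: the HDC pipeline flush is always requested, and the CCS
 * flush bit (bit 13) is added only on gfx IP 12.70+ where it is defined.
 */
#define PIPE_CONTROL0_HDC_PIPELINE_FLUSH (1u << 9)
#define PIPE_CONTROL0_CCS_FLUSH          (1u << 13)

static uint32_t render_flush_bit_group_0(unsigned int verx100)
{
	uint32_t bg0 = PIPE_CONTROL0_HDC_PIPELINE_FLUSH;

	if (verx100 >= 1270)
		bg0 |= PIPE_CONTROL0_CCS_FLUSH;

	return bg0;
}
```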
* [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (9 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 10/24] drm/xe/xelpg: Flush CCS when flushing caches Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 18:23 ` Souza, Jose
2025-06-27 13:33 ` [PATCH v7 12/24] drm/xe/xelp: Quiesce memory traffic before invalidating auxccs Tvrtko Ursulin
` (12 subsequent siblings)
23 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
i915 sets PIPE_CONTROL_FLUSH_L3 (bit 27) when flushing render caches,
but interestingly the Tigerlake PRM lists that bit as reserved.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
Is xe missing this? Or has this been wrong for so long in i915? Or is this
an undocumented bit?
---
drivers/gpu/drm/xe/instructions/xe_gpu_commands.h | 1 +
drivers/gpu/drm/xe/xe_ring_ops.c | 10 ++++++++++
2 files changed, 11 insertions(+)
diff --git a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
index 78c0e87dbd37..27892984403c 100644
--- a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
+++ b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
@@ -47,6 +47,7 @@
#define PIPE_CONTROL_COMMAND_CACHE_INVALIDATE (1<<29)
#define PIPE_CONTROL_TILE_CACHE_FLUSH (1<<28)
+#define PIPE_CONTROL_FLUSH_L3 (1<<27)
#define PIPE_CONTROL_AMFS_FLUSH (1<<25)
#define PIPE_CONTROL_GLOBAL_GTT_IVB (1<<24)
#define PIPE_CONTROL_LRI_POST_SYNC BIT(23)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index a1289f086191..8f655b6fe913 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -197,6 +197,16 @@ static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
if (XE_WA(gt, 1409600907))
flags |= PIPE_CONTROL_DEPTH_STALL;
+ /*
+ * L3 fabric flush is needed for AUX CCS invalidation
+ * which happens as part of pipe-control so we can
+ * ignore PIPE_CONTROL_FLUSH_L3. Also PIPE_CONTROL_FLUSH_L3
+ * deals with Protected Memory which is not needed for
+ * AUX CCS invalidation and lead to unwanted side effects.
+ */
+ if (GRAPHICS_VERx100(xe) < 1270)
+ flags |= PIPE_CONTROL_FLUSH_L3;
+
if (lacks_render)
flags &= ~PIPE_CONTROL_3D_ARCH_FLAGS;
else if (job->q->class == XE_ENGINE_CLASS_COMPUTE)
--
2.48.0
* [PATCH v7 12/24] drm/xe/xelp: Quiesce memory traffic before invalidating auxccs
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (10 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 13/24] drm/xe/xelp: Support auxccs invalidation on blitter Tvrtko Ursulin
` (11 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
According to i915, before invalidating AuxCCS we must quiesce memory
traffic with an extra flush.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/xe_ring_ops.c | 14 ++++++++++----
drivers/gpu/drm/xe/xe_ring_ops_types.h | 2 +-
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index 8f655b6fe913..66bdfa94fe64 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -172,7 +172,8 @@ static int emit_store_imm_ppgtt_posted(u64 addr, u64 value,
return i;
}
-static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
+static int emit_render_cache_flush(struct xe_sched_job *job, bool flush_l3,
+ u32 *dw, int i)
{
struct xe_gt *gt = job->q->gt;
struct xe_device *xe = gt_to_xe(gt);
@@ -204,7 +205,7 @@ static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
* deals with Protected Memory which is not needed for
* AUX CCS invalidation and lead to unwanted side effects.
*/
- if (GRAPHICS_VERx100(xe) < 1270)
+ if (flush_l3 && GRAPHICS_VERx100(xe) < 1270)
flags |= PIPE_CONTROL_FLUSH_L3;
if (lacks_render)
@@ -367,10 +368,15 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
struct xe_gt *gt = job->q->gt;
struct xe_device *xe = gt_to_xe(gt);
bool lacks_render = !(gt->info.engine_mask & XE_HW_ENGINE_RCS_MASK);
+ const bool aux_ccs = has_aux_ccs(xe);
u32 mask_flags = 0;
i = emit_copy_timestamp(lrc, dw, i);
+ /* hsdes: 1809175790 */
+ if (aux_ccs)
+ i = emit_render_cache_flush(job, false, dw, i);
+
dw[i++] = preparser_disable(true);
if (lacks_render)
mask_flags = PIPE_CONTROL_3D_ARCH_FLAGS;
@@ -381,7 +387,7 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
i = emit_pipe_invalidate(mask_flags, job->ring_ops_flush_tlb, dw, i);
/* hsdes: 1809175790 */
- if (has_aux_ccs(xe))
+ if (aux_ccs)
i = emit_aux_table_inv(gt, CCS_AUX_INV, dw, i);
dw[i++] = preparser_disable(false);
@@ -391,7 +397,7 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
i = emit_bb_start(batch_addr, ppgtt_flag, dw, i);
- i = emit_render_cache_flush(job, dw, i);
+ i = emit_render_cache_flush(job, true, dw, i);
if (job->user_fence.used)
i = emit_store_imm_ppgtt_posted(job->user_fence.addr,
diff --git a/drivers/gpu/drm/xe/xe_ring_ops_types.h b/drivers/gpu/drm/xe/xe_ring_ops_types.h
index d7e3e150a9a5..477dc7defd72 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops_types.h
+++ b/drivers/gpu/drm/xe/xe_ring_ops_types.h
@@ -8,7 +8,7 @@
struct xe_sched_job;
-#define MAX_JOB_SIZE_DW 58
+#define MAX_JOB_SIZE_DW 70
#define MAX_JOB_SIZE_BYTES (MAX_JOB_SIZE_DW * 4)
/**
--
2.48.0
* [PATCH v7 13/24] drm/xe/xelp: Support auxccs invalidation on blitter
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (11 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 12/24] drm/xe/xelp: Quiesce memory traffic before invalidating auxccs Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 14/24] drm/xe/xelp: Use MI_FLUSH_DW_CCS on auxccs platforms Tvrtko Ursulin
` (10 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
AuxCCS platforms need to be able to invalidate AuxCCS on the blitter
engine.
Add the relevant MMIO register and enable this by refactoring the ring
emission a bit to consolidate all non-render engines.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/regs/xe_gt_regs.h | 1 +
drivers/gpu/drm/xe/xe_ring_ops.c | 104 +++++++++++----------------
2 files changed, 41 insertions(+), 64 deletions(-)
diff --git a/drivers/gpu/drm/xe/regs/xe_gt_regs.h b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
index 5cd5ab8529c5..db15801445b0 100644
--- a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
@@ -88,6 +88,7 @@
#define CCS_AUX_INV XE_REG(0x4208)
#define VD0_AUX_INV XE_REG(0x4218)
+#define BCS_AUX_INV XE_REG(0x4248)
#define VE0_AUX_INV XE_REG(0x4238)
#define VE1_AUX_INV XE_REG(0x42b8)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index 66bdfa94fe64..0b4d1c284a9d 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -258,44 +258,6 @@ static int emit_copy_timestamp(struct xe_lrc *lrc, u32 *dw, int i)
return i;
}
-/* for engines that don't require any special HW handling (no EUs, no aux inval, etc) */
-static void __emit_job_gen12_simple(struct xe_sched_job *job, struct xe_lrc *lrc,
- u64 batch_addr, u32 seqno)
-{
- u32 dw[MAX_JOB_SIZE_DW], i = 0;
- u32 ppgtt_flag = get_ppgtt_flag(job);
- struct xe_gt *gt = job->q->gt;
-
- i = emit_copy_timestamp(lrc, dw, i);
-
- if (job->ring_ops_flush_tlb) {
- dw[i++] = preparser_disable(true);
- i = emit_flush_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
- seqno, MI_INVALIDATE_TLB, dw, i);
- dw[i++] = preparser_disable(false);
- } else {
- i = emit_store_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
- seqno, dw, i);
- }
-
- i = emit_bb_start(batch_addr, ppgtt_flag, dw, i);
-
- if (job->user_fence.used) {
- i = emit_flush_dw(dw, i);
- i = emit_store_imm_ppgtt_posted(job->user_fence.addr,
- job->user_fence.value,
- dw, i);
- }
-
- i = emit_flush_imm_ggtt(xe_lrc_seqno_ggtt_addr(lrc), seqno, 0, dw, i);
-
- i = emit_user_interrupt(dw, i);
-
- xe_gt_assert(gt, i <= MAX_JOB_SIZE_DW);
-
- xe_lrc_write_ring(lrc, dw, i * sizeof(*dw));
-}
-
static bool has_aux_ccs(struct xe_device *xe)
{
/*
@@ -310,36 +272,50 @@ static bool has_aux_ccs(struct xe_device *xe)
return !xe->info.has_flat_ccs;
}
-static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
- u64 batch_addr, u32 seqno)
+static void __emit_job_gen12_xcs(struct xe_sched_job *job, struct xe_lrc *lrc,
+ u64 batch_addr, u32 seqno)
{
u32 dw[MAX_JOB_SIZE_DW], i = 0;
u32 ppgtt_flag = get_ppgtt_flag(job);
struct xe_gt *gt = job->q->gt;
struct xe_device *xe = gt_to_xe(gt);
- bool decode = job->q->class == XE_ENGINE_CLASS_VIDEO_DECODE;
+ const unsigned int class = job->q->class;
+ const bool aux_ccs = has_aux_ccs(xe) &&
+ (class == XE_ENGINE_CLASS_COPY ||
+ class == XE_ENGINE_CLASS_VIDEO_DECODE ||
+ class == XE_ENGINE_CLASS_VIDEO_ENHANCE);
+ const bool invalidate_tlb = aux_ccs || job->ring_ops_flush_tlb;
i = emit_copy_timestamp(lrc, dw, i);
- dw[i++] = preparser_disable(true);
-
- /* hsdes: 1809175790 */
- if (has_aux_ccs(xe)) {
- if (decode)
- i = emit_aux_table_inv(gt, VD0_AUX_INV, dw, i);
- else
- i = emit_aux_table_inv(gt, VE0_AUX_INV, dw, i);
- }
-
- if (job->ring_ops_flush_tlb)
+ if (invalidate_tlb) {
+ dw[i++] = preparser_disable(true);
i = emit_flush_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
- seqno, MI_INVALIDATE_TLB, dw, i);
+ seqno,
+ MI_INVALIDATE_TLB,
+ dw, i);
+ /* hsdes: 1809175790 */
+ if (aux_ccs) {
+ struct xe_reg reg;
- dw[i++] = preparser_disable(false);
+ switch (job->q->class) {
+ case XE_ENGINE_CLASS_COPY:
+ reg = BCS_AUX_INV;
+ break;
+ case XE_ENGINE_CLASS_VIDEO_DECODE:
+ reg = VD0_AUX_INV;
+ break;
+ default:
+ reg = VE0_AUX_INV;
+ };
- if (!job->ring_ops_flush_tlb)
+ i = emit_aux_table_inv(gt, reg, dw, i);
+ }
+ dw[i++] = preparser_disable(false);
+ } else {
i = emit_store_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
seqno, dw, i);
+ }
i = emit_bb_start(batch_addr, ppgtt_flag, dw, i);
@@ -455,9 +431,9 @@ static void emit_job_gen12_gsc(struct xe_sched_job *job)
xe_gt_assert(gt, job->q->width <= 1); /* no parallel submission for GSCCS */
- __emit_job_gen12_simple(job, job->q->lrc[0],
- job->ptrs[0].batch_addr,
- xe_sched_job_lrc_seqno(job));
+ __emit_job_gen12_xcs(job, job->q->lrc[0],
+ job->ptrs[0].batch_addr,
+ xe_sched_job_lrc_seqno(job));
}
static void emit_job_gen12_copy(struct xe_sched_job *job)
@@ -471,9 +447,9 @@ static void emit_job_gen12_copy(struct xe_sched_job *job)
}
for (i = 0; i < job->q->width; ++i)
- __emit_job_gen12_simple(job, job->q->lrc[i],
- job->ptrs[i].batch_addr,
- xe_sched_job_lrc_seqno(job));
+ __emit_job_gen12_xcs(job, job->q->lrc[i],
+ job->ptrs[i].batch_addr,
+ xe_sched_job_lrc_seqno(job));
}
static void emit_job_gen12_video(struct xe_sched_job *job)
@@ -482,9 +458,9 @@ static void emit_job_gen12_video(struct xe_sched_job *job)
/* FIXME: Not doing parallel handshake for now */
for (i = 0; i < job->q->width; ++i)
- __emit_job_gen12_video(job, job->q->lrc[i],
- job->ptrs[i].batch_addr,
- xe_sched_job_lrc_seqno(job));
+ __emit_job_gen12_xcs(job, job->q->lrc[i],
+ job->ptrs[i].batch_addr,
+ xe_sched_job_lrc_seqno(job));
}
static void emit_job_gen12_render_compute(struct xe_sched_job *job)
--
2.48.0
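The per-class register selection consolidated above can be sketched in isolation. The register offsets match the defines in the patch; the enum and helper names are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets as defined in xe_gt_regs.h by this patch. */
#define VD0_AUX_INV 0x4218u
#define VE0_AUX_INV 0x4238u
#define BCS_AUX_INV 0x4248u

enum engine_class {
	CLASS_COPY,
	CLASS_VIDEO_DECODE,
	CLASS_VIDEO_ENHANCE,
};

/* Mirror of the per-class selection in __emit_job_gen12_xcs: copy and
 * video-decode get dedicated registers, video-enhance is the default
 * case.
 */
static uint32_t aux_inv_reg(enum engine_class ec)
{
	switch (ec) {
	case CLASS_COPY:
		return BCS_AUX_INV;
	case CLASS_VIDEO_DECODE:
		return VD0_AUX_INV;
	default:
		return VE0_AUX_INV;
	}
}
```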
* [PATCH v7 14/24] drm/xe/xelp: Use MI_FLUSH_DW_CCS on auxccs platforms
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (12 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 13/24] drm/xe/xelp: Support auxccs invalidation on blitter Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 15/24] drm/xe/xelp: Wait for AuxCCS invalidation to complete Tvrtko Ursulin
` (9 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
Emit MI_FLUSH_DW_CCS when invalidating on AuxCCS platforms.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/xe_ring_ops.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index 0b4d1c284a9d..2834c0193f87 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -285,15 +285,16 @@ static void __emit_job_gen12_xcs(struct xe_sched_job *job, struct xe_lrc *lrc,
class == XE_ENGINE_CLASS_VIDEO_DECODE ||
class == XE_ENGINE_CLASS_VIDEO_ENHANCE);
const bool invalidate_tlb = aux_ccs || job->ring_ops_flush_tlb;
+ const u32 flags = aux_ccs && class == XE_ENGINE_CLASS_COPY ?
+ MI_FLUSH_DW_CCS : 0;
i = emit_copy_timestamp(lrc, dw, i);
if (invalidate_tlb) {
dw[i++] = preparser_disable(true);
i = emit_flush_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
- seqno,
- MI_INVALIDATE_TLB,
- dw, i);
+ seqno, MI_INVALIDATE_TLB | flags, dw,
+ i);
/* hsdes: 1809175790 */
if (aux_ccs) {
struct xe_reg reg;
--
2.48.0
* [PATCH v7 15/24] drm/xe/xelp: Wait for AuxCCS invalidation to complete
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (13 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 14/24] drm/xe/xelp: Use MI_FLUSH_DW_CCS on auxccs platforms Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 16/24] drm/xe/xelp: Add AuxCCS invalidation to the buffer migration path Tvrtko Ursulin
` (8 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
On AuxCCS platforms we need to wait for AuxCCS invalidations to complete.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/instructions/xe_mi_commands.h | 6 ++++++
drivers/gpu/drm/xe/xe_ring_ops.c | 9 ++++++++-
drivers/gpu/drm/xe/xe_ring_ops_types.h | 2 +-
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
index c47b290e0e9f..ef4b033570cf 100644
--- a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
+++ b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
@@ -81,4 +81,10 @@
#define MI_SET_APPID_SESSION_ID_MASK REG_GENMASK(6, 0)
#define MI_SET_APPID_SESSION_ID(x) REG_FIELD_PREP(MI_SET_APPID_SESSION_ID_MASK, x)
+#define MI_SEMAPHORE_WAIT_TOKEN (__MI_INSTR(0x1c) | XE_INSTR_NUM_DW(3)) /* XeLP+ */
+#define MI_SEMAPHORE_REGISTER_POLL REG_BIT(16)
+#define MI_SEMAPHORE_POLL REG_BIT(15)
+#define MI_SEMAPHORE_CMP_OP_MASK REG_GENMASK(14, 12)
+#define MI_SEMAPHORE_SAD_EQ_SDD REG_FIELD_PREP(MI_SEMAPHORE_CMP_OP_MASK, 4)
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index 2834c0193f87..f207c6217ce1 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -56,7 +56,14 @@ static int emit_aux_table_inv(struct xe_gt *gt, struct xe_reg reg,
dw[i++] = MI_LOAD_REGISTER_IMM | MI_LRI_NUM_REGS(1) | MI_LRI_MMIO_REMAP_EN;
dw[i++] = reg.addr + gt->mmio.adj_offset;
dw[i++] = AUX_INV;
- dw[i++] = MI_NOOP;
+ dw[i++] = MI_SEMAPHORE_WAIT_TOKEN |
+ MI_SEMAPHORE_REGISTER_POLL |
+ MI_SEMAPHORE_POLL |
+ MI_SEMAPHORE_SAD_EQ_SDD;
+ dw[i++] = 0;
+ dw[i++] = reg.addr + gt->mmio.adj_offset;
+ dw[i++] = 0;
+ dw[i++] = 0;
return i;
}
diff --git a/drivers/gpu/drm/xe/xe_ring_ops_types.h b/drivers/gpu/drm/xe/xe_ring_ops_types.h
index 477dc7defd72..1197fc0bf2af 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops_types.h
+++ b/drivers/gpu/drm/xe/xe_ring_ops_types.h
@@ -8,7 +8,7 @@
struct xe_sched_job;
-#define MAX_JOB_SIZE_DW 70
+#define MAX_JOB_SIZE_DW 74
#define MAX_JOB_SIZE_BYTES (MAX_JOB_SIZE_DW * 4)
/**
--
2.48.0
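The dword footprint of the extended invalidation sequence can be sketched as follows. A hedged illustration: opcode values are placeholders, not the real MI encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the stream emitted by emit_aux_table_inv after this patch:
 * a 3-dword LRI write of AUX_INV followed by a 5-dword register-poll
 * semaphore wait until the invalidation bit reads back as zero.
 */
#define FAKE_MI_LRI       0x22000000u
#define FAKE_MI_SEM_WAIT  0x1c000000u
#define AUX_INV           (1u << 0)

static int sketch_aux_table_inv(uint32_t reg_addr, uint32_t *dw, int i)
{
	dw[i++] = FAKE_MI_LRI;
	dw[i++] = reg_addr;
	dw[i++] = AUX_INV;		/* trigger the invalidation */
	dw[i++] = FAKE_MI_SEM_WAIT;	/* poll the same register ... */
	dw[i++] = 0;			/* ... until it equals this value */
	dw[i++] = reg_addr;
	dw[i++] = 0;
	dw[i++] = 0;

	return i;
}

/* 4 dwords more than the old LRI + MI_NOOP form, matching the
 * MAX_JOB_SIZE_DW bump from 70 to 74.
 */
static int aux_inv_dw_count(void)
{
	uint32_t dw[8];

	return sketch_aux_table_inv(0x4248, dw, 0);
}
```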
* [PATCH v7 16/24] drm/xe/xelp: Add AuxCCS invalidation to the buffer migration path
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (14 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 15/24] drm/xe/xelp: Wait for AuxCCS invalidation to complete Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 17/24] drm/xe: Export xe_emit_aux_table_inv Tvrtko Ursulin
` (7 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
The buffer migration path has to handle AuxCCS invalidation too.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/xe_ring_ops.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index f207c6217ce1..4da12f41ccb6 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -402,12 +402,30 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
static void emit_migration_job_gen12(struct xe_sched_job *job,
struct xe_lrc *lrc, u32 seqno)
{
+ struct xe_gt *gt = job->q->gt;
+ struct xe_device *xe = gt_to_xe(gt);
+ const bool aux_ccs = has_aux_ccs(xe);
+ const bool invalidate_tlb = aux_ccs || job->ring_ops_flush_tlb;
u32 dw[MAX_JOB_SIZE_DW], i = 0;
i = emit_copy_timestamp(lrc, dw, i);
- i = emit_store_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
- seqno, dw, i);
+ if (invalidate_tlb) {
+ dw[i++] = preparser_disable(true);
+ i = emit_flush_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
+ seqno,
+ MI_INVALIDATE_TLB |
+ (aux_ccs ? MI_FLUSH_DW_CCS : 0) |
+ job->migrate_flush_flags,
+ dw, i);
+ /* hsdes: 1809175790 */
+ if (aux_ccs)
+ i = emit_aux_table_inv(gt, BCS_AUX_INV, dw, i);
+ dw[i++] = preparser_disable(false);
+ } else {
+ i = emit_store_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
+ seqno, dw, i);
+ }
dw[i++] = MI_ARB_ON_OFF | MI_ARB_DISABLE; /* Enabled again below */
@@ -417,13 +435,17 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
/* XXX: Do we need this? Leaving for now. */
dw[i++] = preparser_disable(true);
i = emit_flush_invalidate(dw, i);
+ if (aux_ccs)
+ i = emit_aux_table_inv(gt, BCS_AUX_INV, dw, i);
dw[i++] = preparser_disable(false);
}
i = emit_bb_start(job->ptrs[1].batch_addr, BIT(8), dw, i);
i = emit_flush_imm_ggtt(xe_lrc_seqno_ggtt_addr(lrc), seqno,
- MI_INVALIDATE_TLB | job->migrate_flush_flags,
+ MI_INVALIDATE_TLB |
+ (aux_ccs ? MI_FLUSH_DW_CCS : 0) |
+ job->migrate_flush_flags,
dw, i);
i = emit_user_interrupt(dw, i);
--
2.48.0
* [PATCH v7 17/24] drm/xe: Export xe_emit_aux_table_inv
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (15 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 16/24] drm/xe/xelp: Add AuxCCS invalidation to the buffer migration path Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 18/24] drm/xe/xelp: Add AuxCCS invalidation to the indirect context workarounds Tvrtko Ursulin
` (6 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
Export the existing AuxCCS invalidation ring buffer programming helper,
which we will need in order to set up the indirect context workaround in
the next patch.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/xe_ring_ops.c | 81 +++++++++++++++++++-------------
drivers/gpu/drm/xe/xe_ring_ops.h | 3 ++
2 files changed, 51 insertions(+), 33 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index 4da12f41ccb6..d6e2fcc593a7 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -50,22 +50,51 @@ static u32 preparser_disable(bool state)
return MI_ARB_CHECK | BIT(8) | state;
}
-static int emit_aux_table_inv(struct xe_gt *gt, struct xe_reg reg,
- u32 *dw, int i)
+u32 *xe_emit_aux_table_inv(struct xe_hw_engine *hwe, u32 *cmd)
{
- dw[i++] = MI_LOAD_REGISTER_IMM | MI_LRI_NUM_REGS(1) | MI_LRI_MMIO_REMAP_EN;
- dw[i++] = reg.addr + gt->mmio.adj_offset;
- dw[i++] = AUX_INV;
- dw[i++] = MI_SEMAPHORE_WAIT_TOKEN |
- MI_SEMAPHORE_REGISTER_POLL |
- MI_SEMAPHORE_POLL |
- MI_SEMAPHORE_SAD_EQ_SDD;
- dw[i++] = 0;
- dw[i++] = reg.addr + gt->mmio.adj_offset;
- dw[i++] = 0;
- dw[i++] = 0;
+ struct xe_gt *gt = hwe->gt;
+ struct xe_reg reg;
- return i;
+ switch (hwe->class) {
+ case XE_ENGINE_CLASS_RENDER:
+ case XE_ENGINE_CLASS_COMPUTE:
+ reg = CCS_AUX_INV;
+ break;
+ case XE_ENGINE_CLASS_COPY:
+ reg = BCS_AUX_INV;
+ break;
+ case XE_ENGINE_CLASS_VIDEO_DECODE:
+ reg = VD0_AUX_INV;
+ break;
+ case XE_ENGINE_CLASS_VIDEO_ENHANCE:
+ reg = VE0_AUX_INV;
+ break;
+ default:
+ return cmd;
+ }
+
+ *cmd++ = MI_LOAD_REGISTER_IMM | MI_LRI_NUM_REGS(1) |
+ MI_LRI_MMIO_REMAP_EN;
+ *cmd++ = reg.addr + gt->mmio.adj_offset;
+ *cmd++ = AUX_INV;
+ *cmd++ = MI_SEMAPHORE_WAIT_TOKEN | MI_SEMAPHORE_REGISTER_POLL |
+ MI_SEMAPHORE_POLL | MI_SEMAPHORE_SAD_EQ_SDD;
+ *cmd++ = 0;
+ *cmd++ = reg.addr + gt->mmio.adj_offset;
+ *cmd++ = 0;
+ *cmd++ = 0;
+
+ return cmd;
+}
+
+static int emit_aux_table_inv(struct xe_hw_engine *hwe, u32 *dw, int i)
+{
+ u32 *start, *end;
+
+ start = dw + i;
+ end = xe_emit_aux_table_inv(hwe, start);
+
+ return i + (end - start);
}
static int emit_user_interrupt(u32 *dw, int i)
@@ -303,22 +332,8 @@ static void __emit_job_gen12_xcs(struct xe_sched_job *job, struct xe_lrc *lrc,
seqno, MI_INVALIDATE_TLB | flags, dw,
i);
/* hsdes: 1809175790 */
- if (aux_ccs) {
- struct xe_reg reg;
-
- switch (job->q->class) {
- case XE_ENGINE_CLASS_COPY:
- reg = BCS_AUX_INV;
- break;
- case XE_ENGINE_CLASS_VIDEO_DECODE:
- reg = VD0_AUX_INV;
- break;
- default:
- reg = VE0_AUX_INV;
- };
-
- i = emit_aux_table_inv(gt, reg, dw, i);
- }
+ if (aux_ccs)
+ i = emit_aux_table_inv(job->q->hwe, dw, i);
dw[i++] = preparser_disable(false);
} else {
i = emit_store_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
@@ -372,7 +387,7 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
/* hsdes: 1809175790 */
if (aux_ccs)
- i = emit_aux_table_inv(gt, CCS_AUX_INV, dw, i);
+ i = emit_aux_table_inv(job->q->hwe, dw, i);
dw[i++] = preparser_disable(false);
@@ -420,7 +435,7 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
dw, i);
/* hsdes: 1809175790 */
if (aux_ccs)
- i = emit_aux_table_inv(gt, BCS_AUX_INV, dw, i);
+ i = emit_aux_table_inv(job->q->hwe, dw, i);
dw[i++] = preparser_disable(false);
} else {
i = emit_store_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
@@ -436,7 +451,7 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
dw[i++] = preparser_disable(true);
i = emit_flush_invalidate(dw, i);
if (aux_ccs)
- i = emit_aux_table_inv(gt, BCS_AUX_INV, dw, i);
+ i = emit_aux_table_inv(job->q->hwe, dw, i);
dw[i++] = preparser_disable(false);
}
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.h b/drivers/gpu/drm/xe/xe_ring_ops.h
index e942735d76a6..5a2d32f9bb25 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.h
+++ b/drivers/gpu/drm/xe/xe_ring_ops.h
@@ -10,8 +10,11 @@
#include "xe_ring_ops_types.h"
struct xe_gt;
+struct xe_hw_engine;
const struct xe_ring_ops *
xe_ring_ops_get(struct xe_gt *gt, enum xe_engine_class class);
+u32 *xe_emit_aux_table_inv(struct xe_hw_engine *hwe, u32 *cmd);
+
#endif
--
2.48.0
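The pointer-vs-index bridging pattern introduced here can be sketched with a stand-in emitter. Function names below are ours; `emit_three_nops()` stands in for the real `xe_emit_aux_table_inv()`:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the bridging in this patch: the exported helper advances a
 * u32 pointer (convenient for the indirect context batch code), while
 * the ring emission code tracks an integer index into dw[]. A thin
 * wrapper converts between the two conventions.
 */
static uint32_t *emit_three_nops(uint32_t *cmd)
{
	*cmd++ = 0;
	*cmd++ = 0;
	*cmd++ = 0;

	return cmd;
}

static int emit_three_nops_indexed(uint32_t *dw, int i)
{
	uint32_t *start = dw + i;
	uint32_t *end = emit_three_nops(start);

	return i + (int)(end - start);
}

static int indexed_count(void)
{
	uint32_t dw[8] = { 0 };

	/* Start at index 2; three dwords emitted -> returns 5. */
	return emit_three_nops_indexed(dw, 2);
}
```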
* [PATCH v7 18/24] drm/xe/xelp: Add AuxCCS invalidation to the indirect context workarounds
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (16 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 17/24] drm/xe: Export xe_emit_aux_table_inv Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 19/24] drm/xe: Use fb cached min alignment Tvrtko Ursulin
` (5 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
Following the i915 reference implementation, we add the AuxCCS
invalidation to the indirect context workarounds page.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 47 +++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index a5c03ad4b8b2..32466ebcaedb 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -25,6 +25,7 @@
#include "xe_map.h"
#include "xe_memirq.h"
#include "xe_mmio.h"
+#include "xe_ring_ops.h"
#include "xe_sriov.h"
#include "xe_trace_lrc.h"
#include "xe_vm.h"
@@ -52,11 +53,23 @@ lrc_to_xe(struct xe_lrc *lrc)
static bool
gt_engine_needs_indirect_ctx(struct xe_gt *gt, enum xe_engine_class class)
{
+ struct xe_device *xe = gt_to_xe(gt);
+
if (XE_WA(gt, 16010904313) &&
(class == XE_ENGINE_CLASS_RENDER ||
class == XE_ENGINE_CLASS_COMPUTE))
return true;
+ /* AuxCCS invalidation */
+ if (GRAPHICS_VERx100(xe) >= 1200 &&
+ GRAPHICS_VERx100(xe) <= 1210 &&
+ (class == XE_ENGINE_CLASS_RENDER ||
+ class == XE_ENGINE_CLASS_COMPUTE ||
+ class == XE_ENGINE_CLASS_COPY ||
+ class == XE_ENGINE_CLASS_VIDEO_DECODE ||
+ class == XE_ENGINE_CLASS_VIDEO_ENHANCE))
+ return true;
+
return false;
}
@@ -1054,6 +1067,31 @@ setup_invalidate_state_cache_wa(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
return cmd - batch;
}
+static ssize_t
+setup_invalidate_auxccs_wa(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
+ u32 *batch, size_t max_len)
+{
+ struct xe_gt *gt = lrc->gt;
+ struct xe_device *xe = gt_to_xe(gt);
+ const unsigned int class = hwe->class;
+ u32 *cmd;
+
+ if (GRAPHICS_VERx100(xe) < 1200 || GRAPHICS_VERx100(xe) > 1210 ||
+ !(class == XE_ENGINE_CLASS_RENDER ||
+ class == XE_ENGINE_CLASS_COMPUTE ||
+ class == XE_ENGINE_CLASS_COPY ||
+ class == XE_ENGINE_CLASS_VIDEO_DECODE ||
+ class == XE_ENGINE_CLASS_VIDEO_ENHANCE))
+ return 0;
+
+ if (xe_gt_WARN_ON(gt, max_len < 8))
+ return -ENOSPC;
+
+ cmd = xe_emit_aux_table_inv(hwe, batch);
+
+ return cmd - batch;
+}
+
struct wa_bo_setup {
ssize_t (*setup)(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
u32 *batch, size_t max_size);
@@ -1154,6 +1192,10 @@ setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
{
static struct wa_bo_setup rcs_funcs[] = {
{ .setup = setup_timestamp_wa },
+ { .setup = setup_invalidate_auxccs_wa },
+ };
+ static struct wa_bo_setup xcs_funcs[] = {
+ { .setup = setup_invalidate_auxccs_wa },
};
unsigned int offset, num_funcs, written = 0;
struct wa_bo_setup *funcs = NULL;
@@ -1166,6 +1208,11 @@ setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
hwe->class == XE_ENGINE_CLASS_COMPUTE) {
funcs = rcs_funcs;
num_funcs = ARRAY_SIZE(rcs_funcs);
+ } else if (hwe->class == XE_ENGINE_CLASS_COPY ||
+ hwe->class == XE_ENGINE_CLASS_VIDEO_DECODE ||
+ hwe->class == XE_ENGINE_CLASS_VIDEO_ENHANCE) {
+ funcs = xcs_funcs;
+ num_funcs = ARRAY_SIZE(xcs_funcs);
}
if (xe_gt_WARN_ON(lrc->gt, !funcs))
--
2.48.0
* [PATCH v7 19/24] drm/xe: Use fb cached min alignment
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (17 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 18/24] drm/xe/xelp: Add AuxCCS invalidation to the indirect context workarounds Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 20/24] drm/xe: Flush GGTT writes after populating DPT Tvrtko Ursulin
` (4 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Maarten Lankhorst
Instead of just looking at the first plane, use the fb's cached overall
minimum alignment.
This aligns with how the i915 version of intel_plane_pin_fb works.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
drivers/gpu/drm/xe/display/xe_fb_pin.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/drivers/gpu/drm/xe/display/xe_fb_pin.c
index 5e846f0bec21..a76423e1d59e 100644
--- a/drivers/gpu/drm/xe/display/xe_fb_pin.c
+++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c
@@ -412,12 +412,9 @@ int intel_plane_pin_fb(struct intel_plane_state *new_plane_state,
const struct intel_plane_state *old_plane_state)
{
struct drm_framebuffer *fb = new_plane_state->hw.fb;
- struct drm_gem_object *obj = intel_fb_bo(fb);
- struct xe_bo *bo = gem_to_xe_bo(obj);
- struct i915_vma *vma;
struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
- struct intel_plane *plane = to_intel_plane(new_plane_state->uapi.plane);
- unsigned int alignment = plane->min_alignment(plane, fb, 0);
+ struct xe_bo *bo = gem_to_xe_bo(intel_fb_bo(fb));
+ struct i915_vma *vma;
if (reuse_vma(new_plane_state, old_plane_state))
return 0;
@@ -425,8 +422,8 @@ int intel_plane_pin_fb(struct intel_plane_state *new_plane_state,
/* We reject creating !SCANOUT fb's, so this is weird.. */
drm_WARN_ON(bo->ttm.base.dev, !(bo->flags & XE_BO_FLAG_SCANOUT));
- vma = __xe_pin_fb_vma(intel_fb, &new_plane_state->view.gtt, alignment);
-
+ vma = __xe_pin_fb_vma(intel_fb, &new_plane_state->view.gtt,
+ intel_fb->min_alignment);
if (IS_ERR(vma))
return PTR_ERR(vma);
--
2.48.0
* [PATCH v7 20/24] drm/xe: Flush GGTT writes after populating DPT
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (18 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 19/24] drm/xe: Use fb cached min alignment Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 21/24] drm/xe: Handle DPT in system memory Tvrtko Ursulin
` (3 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Ville Syrjälä
When the DPT is placed in stolen memory it is populated using ioremap_wc()
via the GGTT.
I915 has established that on modern platforms a small flush and delay are
required for those writes to reliably land, so let's add the same logic
(simplified by removing impossible platforms) to xe as well.
v2:
* Do it only for system memory buffers.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
drivers/gpu/drm/xe/display/xe_fb_pin.c | 45 ++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/drivers/gpu/drm/xe/display/xe_fb_pin.c
index a76423e1d59e..d2fda8e2e324 100644
--- a/drivers/gpu/drm/xe/display/xe_fb_pin.c
+++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c
@@ -12,9 +12,11 @@
#include "intel_fb.h"
#include "intel_fb_pin.h"
#include "intel_fbdev.h"
+#include "regs/xe_engine_regs.h"
#include "xe_bo.h"
#include "xe_device.h"
#include "xe_ggtt.h"
+#include "xe_mmio.h"
#include "xe_pm.h"
static void
@@ -78,6 +80,46 @@ write_dpt_remapped(struct xe_bo *bo, struct iosys_map *map, u32 *dpt_ofs,
*dpt_ofs = ALIGN(*dpt_ofs, 4096);
}
+static void gt_flush_ggtt_writes(struct xe_gt *gt)
+{
+ if (!gt)
+ return;
+
+ xe_mmio_read32(>->mmio, RING_TAIL(RENDER_RING_BASE));
+}
+
+static void ggtt_flush_writes(struct xe_ggtt *ggtt)
+{
+ struct xe_device *xe = tile_to_xe(ggtt->tile);
+
+ /*
+ * No actual flushing is required for the GTT write domain for reads
+ * from the GTT domain. Writes to it "immediately" go to main memory
+ * as far as we know, so there's no chipset flush. It also doesn't
+ * land in the GPU render cache.
+ *
+ * However, we do have to enforce the order so that all writes through
+ * the GTT land before any writes to the device, such as updates to
+ * the GATT itself.
+ *
+ * We also have to wait a bit for the writes to land from the GTT.
+ * An uncached read (i.e. mmio) seems to be ideal for the round-trip
+ * timing. This issue has only been observed when switching quickly
+ * between GTT writes and CPU reads from inside the kernel on recent hw,
+ * and it appears to only affect discrete GTT blocks (i.e. on LLC
+ * system agents we cannot reproduce this behaviour, until Cannonlake
+ * that was!).
+ */
+
+ wmb();
+
+ if (xe_pm_runtime_get_if_active(xe)) {
+ gt_flush_ggtt_writes(ggtt->tile->primary_gt);
+ gt_flush_ggtt_writes(ggtt->tile->media_gt);
+ xe_pm_runtime_put(xe);
+ }
+}
+
static int __xe_pin_fb_vma_dpt(const struct intel_framebuffer *fb,
const struct i915_gtt_view *view,
struct i915_vma *vma,
@@ -161,6 +203,9 @@ static int __xe_pin_fb_vma_dpt(const struct intel_framebuffer *fb,
rot_info->plane[i].dst_stride);
}
+ if (dpt->vmap.is_iomem && !xe_bo_is_vram(bo))
+ ggtt_flush_writes(tile0->mem.ggtt);
+
vma->dpt = dpt;
vma->node = dpt->ggtt_node[tile0->id];
--
2.48.0
* [PATCH v7 21/24] drm/xe: Handle DPT in system memory
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (19 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 20/24] drm/xe: Flush GGTT writes after populating DPT Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 22/24] drm/xe: Force flush system memory AuxCCS framebuffers before scan out Tvrtko Ursulin
` (2 subsequent siblings)
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
If the DPT is allocated from system memory it will be created in the default
write-back cached mode. This means we need to flush it after populating it,
otherwise nothing works.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/display/xe_fb_pin.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/drivers/gpu/drm/xe/display/xe_fb_pin.c
index d2fda8e2e324..27e64836113b 100644
--- a/drivers/gpu/drm/xe/display/xe_fb_pin.c
+++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c
@@ -3,6 +3,7 @@
* Copyright © 2021 Intel Corporation
*/
+#include <drm/drm_cache.h>
#include <drm/ttm/ttm_bo.h>
#include "i915_vma.h"
@@ -205,6 +206,8 @@ static int __xe_pin_fb_vma_dpt(const struct intel_framebuffer *fb,
if (dpt->vmap.is_iomem && !xe_bo_is_vram(bo))
ggtt_flush_writes(tile0->mem.ggtt);
+ else if (!xe_bo_is_vram(dpt) && !xe_bo_is_stolen(dpt))
+ drm_clflush_virt_range(dpt->vmap.vaddr, dpt_size);
vma->dpt = dpt;
vma->node = dpt->ggtt_node[tile0->id];
--
2.48.0
* [PATCH v7 22/24] drm/xe: Force flush system memory AuxCCS framebuffers before scan out
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (20 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 21/24] drm/xe: Handle DPT in system memory Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 23/24] drm/xe/display: Add support for AuxCCS Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 24/24] drm/i915/display: Expose AuxCCS frame buffer modifiers for Xe Tvrtko Ursulin
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin
Even though frame buffer objects are created as write-combined, in
practice, on top of all the ring buffer flushing, an additional clflush
seems to be needed before the display engine can coherently scan out the
AuxCCS compressed data without transient artifacts.

If for comparison we look at how i915 handles things (where AuxCCS works
fine), it has this same clflush before a frame buffer is pinned for
display for the first time, courtesy of its dynamic tracking of the
buffer cache mode and of setting the latter to uncached before handing
the buffer to the display.

Since xe considers the buffer object caching mode static, we can
implement the same approach by adding a flag recording whether the buffer
was ever pinned for display, and flushing on the first pin. Subsequent
re-pins will not repeat the clflush, but so far I have not observed any
glitching after the first pin.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/xe/display/xe_fb_pin.c | 12 ++++++++++++
drivers/gpu/drm/xe/xe_bo_types.h | 14 +++++++++-----
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/drivers/gpu/drm/xe/display/xe_fb_pin.c
index 27e64836113b..70c8f1e727f6 100644
--- a/drivers/gpu/drm/xe/display/xe_fb_pin.c
+++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c
@@ -328,6 +328,7 @@ static struct i915_vma *__xe_pin_fb_vma(const struct intel_framebuffer *fb,
struct i915_vma *vma = kzalloc(sizeof(*vma), GFP_KERNEL);
struct drm_gem_object *obj = intel_fb_bo(&fb->base);
struct xe_bo *bo = gem_to_xe_bo(obj);
+ bool first_pin;
int ret;
if (!vma)
@@ -359,6 +360,9 @@ static struct i915_vma *__xe_pin_fb_vma(const struct intel_framebuffer *fb,
if (ret)
goto err;
+ first_pin = !bo->display_pin;
+ bo->display_pin = true;
+
if (IS_DGFX(xe))
ret = xe_bo_migrate(bo, XE_PL_VRAM0);
else
@@ -377,6 +381,14 @@ static struct i915_vma *__xe_pin_fb_vma(const struct intel_framebuffer *fb,
if (ret)
goto err_unpin;
+ /*
+ * Force flush frame buffer data for non-coherent display access when
+ * AuxCCS formats are used.
+ */
+ if (first_pin && !xe_bo_is_vram(bo) && !xe_bo_is_stolen(bo) &&
+ intel_fb_is_ccs_modifier(fb->base.modifier))
+ drm_clflush_sg(xe_bo_sg(bo));
+
return vma;
err_unpin:
diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
index e0efaf23d051..83973925f663 100644
--- a/drivers/gpu/drm/xe/xe_bo_types.h
+++ b/drivers/gpu/drm/xe/xe_bo_types.h
@@ -72,11 +72,6 @@ struct xe_bo {
struct llist_node freed;
/** @update_index: Update index if PT BO */
int update_index;
- /** @created: Whether the bo has passed initial creation */
- bool created;
-
- /** @ccs_cleared */
- bool ccs_cleared;
/**
* @cpu_caching: CPU caching mode. Currently only used for userspace
@@ -88,6 +83,15 @@ struct xe_bo {
/** @devmem_allocation: SVM device memory allocation */
struct drm_pagemap_devmem devmem_allocation;
+ /** @created: Whether the bo has passed initial creation */
+ bool created : 1;
+
+ /** @ccs_cleared */
+ bool ccs_cleared : 1;
+
+ /** @display_pin: Was it ever pinned to display */
+ bool display_pin : 1;
+
/** @vram_userfault_link: Link into @mem_access.vram_userfault.list */
struct list_head vram_userfault_link;
--
2.48.0
* [PATCH v7 23/24] drm/xe/display: Add support for AuxCCS
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (21 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 22/24] drm/xe: Force flush system memory AuxCCS framebuffers before scan out Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 24/24] drm/i915/display: Expose AuxCCS frame buffer modifiers for Xe Tvrtko Ursulin
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe; +Cc: kernel-dev, Tvrtko Ursulin, Juha-Pekka Heikkila, Michael J. Ruhl
Add support for mapping the auxiliary CCS buffer into the DPT page tables.
This will allow for more power efficiency by enabling the render
compression frame buffer modifiers such as
I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS in a following patch.
We do this by refactoring the code a bit so that handling for the linear
auxiliary frame buffer can be added in a tidy way. Also replace some
hardcoded constants and tighten the loops a bit.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Juha-Pekka Heikkila <juhapekka.heikkila@gmail.com>
Cc: Michael J. Ruhl <michael.j.ruhl@intel.com>
---
drivers/gpu/drm/xe/display/xe_fb_pin.c | 106 ++++++++++++++++++-------
1 file changed, 79 insertions(+), 27 deletions(-)
diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/drivers/gpu/drm/xe/display/xe_fb_pin.c
index 70c8f1e727f6..d1293147db19 100644
--- a/drivers/gpu/drm/xe/display/xe_fb_pin.c
+++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c
@@ -52,33 +52,95 @@ write_dpt_rotated(struct xe_bo *bo, struct iosys_map *map, u32 *dpt_ofs, u32 bo_
*dpt_ofs = ALIGN(*dpt_ofs, 4096);
}
-static void
-write_dpt_remapped(struct xe_bo *bo, struct iosys_map *map, u32 *dpt_ofs,
- u32 bo_ofs, u32 width, u32 height, u32 src_stride,
- u32 dst_stride)
+static unsigned int
+write_dpt_padding(struct iosys_map *map, unsigned int dest, unsigned int pad)
+{
+ while (pad--) {
+ iosys_map_wr(map, dest, u64, 0);
+ dest += sizeof(u64);
+ }
+
+ return dest;
+}
+
+static unsigned int
+write_dpt_remapped_linear(struct xe_bo *bo, struct iosys_map *map,
+ unsigned int dest,
+ const struct intel_remapped_plane_info *plane)
{
struct xe_device *xe = xe_bo_device(bo);
struct xe_ggtt *ggtt = xe_device_get_root_tile(xe)->mem.ggtt;
- u32 column, row;
- u64 pte = xe_ggtt_encode_pte_flags(ggtt, bo, xe->pat.idx[XE_CACHE_NONE]);
+ const u64 pte = xe_ggtt_encode_pte_flags(ggtt, bo,
+ xe->pat.idx[XE_CACHE_NONE]);
+ u64 src = xe_bo_addr(bo, plane->offset * XE_PAGE_SIZE, XE_PAGE_SIZE);
+ unsigned int size = plane->size;
+
+ while (size--) {
+ iosys_map_wr(map, dest, u64, src | pte);
+ dest += sizeof(u64);
+ src += XE_PAGE_SIZE;
+ }
- for (row = 0; row < height; row++) {
- u32 src_idx = src_stride * row + bo_ofs;
+ return dest;
+}
- for (column = 0; column < width; column++) {
- u64 addr = xe_bo_addr(bo, src_idx * XE_PAGE_SIZE, XE_PAGE_SIZE);
- iosys_map_wr(map, *dpt_ofs, u64, pte | addr);
+static unsigned int
+write_dpt_remapped_tiled(struct xe_bo *bo, struct iosys_map *map,
+ unsigned int dest,
+ const struct intel_remapped_plane_info *plane)
+{
+ struct xe_device *xe = xe_bo_device(bo);
+ struct xe_ggtt *ggtt = xe_device_get_root_tile(xe)->mem.ggtt;
+ const u64 pte = xe_ggtt_encode_pte_flags(ggtt, bo,
+ xe->pat.idx[XE_CACHE_NONE]);
+ u64 src = xe_bo_addr(bo, plane->offset * XE_PAGE_SIZE, XE_PAGE_SIZE);
+ const unsigned int next_row = (plane->src_stride - plane->width) *
+ XE_PAGE_SIZE;
+ unsigned int column, row;
- *dpt_ofs += 8;
- src_idx++;
+ for (row = 0; row < plane->height; row++, src += next_row) {
+ for (column = 0; column < plane->width; column++) {
+ iosys_map_wr(map, dest, u64, src | pte);
+ dest += sizeof(u64);
+ src += XE_PAGE_SIZE;
}
/* The DE ignores the PTEs for the padding tiles */
- *dpt_ofs += (dst_stride - width) * 8;
+ dest = write_dpt_padding(map, dest,
+ plane->dst_stride - plane->width);
}
- /* Align to next page */
- *dpt_ofs = ALIGN(*dpt_ofs, 4096);
+ return dest;
+}
+
+static void
+write_dpt_remapped(struct xe_bo *bo,
+ const struct intel_remapped_info *remap_info,
+ struct iosys_map *map)
+{
+ unsigned int i, dest = 0;
+
+ for (i = 0; i < ARRAY_SIZE(remap_info->plane); i++) {
+ const struct intel_remapped_plane_info *plane =
+ &remap_info->plane[i];
+
+ if (!plane->width && !plane->height && !plane->linear)
+ continue;
+
+ if (remap_info->plane_alignment) {
+ const unsigned int index = dest / sizeof(u64);
+ const unsigned int pad =
+ ALIGN(index, remap_info->plane_alignment) -
+ index;
+
+ dest = write_dpt_padding(map, dest, pad);
+ }
+
+ if (plane->linear)
+ dest = write_dpt_remapped_linear(bo, map, dest, plane);
+ else
+ dest = write_dpt_remapped_tiled(bo, map, dest, plane);
+ }
}
static void gt_flush_ggtt_writes(struct xe_gt *gt)
@@ -180,17 +242,7 @@ static int __xe_pin_fb_vma_dpt(const struct intel_framebuffer *fb,
iosys_map_wr(&dpt->vmap, x * 8, u64, pte | addr);
}
} else if (view->type == I915_GTT_VIEW_REMAPPED) {
- const struct intel_remapped_info *remap_info = &view->remapped;
- u32 i, dpt_ofs = 0;
-
- for (i = 0; i < ARRAY_SIZE(remap_info->plane); i++)
- write_dpt_remapped(bo, &dpt->vmap, &dpt_ofs,
- remap_info->plane[i].offset,
- remap_info->plane[i].width,
- remap_info->plane[i].height,
- remap_info->plane[i].src_stride,
- remap_info->plane[i].dst_stride);
-
+ write_dpt_remapped(bo, &view->remapped, &dpt->vmap);
} else {
const struct intel_rotation_info *rot_info = &view->rotated;
u32 i, dpt_ofs = 0;
--
2.48.0
* [PATCH v7 24/24] drm/i915/display: Expose AuxCCS frame buffer modifiers for Xe
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
` (22 preceding siblings ...)
2025-06-27 13:33 ` [PATCH v7 23/24] drm/xe/display: Add support for AuxCCS Tvrtko Ursulin
@ 2025-06-27 13:33 ` Tvrtko Ursulin
23 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-27 13:33 UTC (permalink / raw)
To: intel-xe
Cc: kernel-dev, Tvrtko Ursulin, José Roberto de Souza,
Juha-Pekka Heikkila, Rodrigo Vivi
Now that we have fixed the DPT handling, we can undo the nerf which was
done in cf48bddd31de ("drm/i915/display: Disable AuxCCS framebuffers if
built for Xe").
Tested with KDE Wayland, on Lenovo Carbon X1 ADL-P:
[PLANE:32:plane 1A]: type=PRI
uapi: [FB:242] AR30 little-endian (0x30335241),0x100000000000008,2880x1800, visible=visible, src=2880.000000x1800.000000+0.000000+0.000000, dst=2880x1800+0+0, rotation=0 (0x00000001)
hw: [FB:242] AR30 little-endian (0x30335241),0x100000000000008,2880x1800, visible=yes, src=2880.000000x1800.000000+0.000000+0.000000, dst=2880x1800+0+0, rotation=0 (0x00000001)
Display working fine - no artefacts, no DMAR/PIPE faults.
v2:
* Adjust patch title. (Rodrigo)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
References: cf48bddd31de ("drm/i915/display: Disable AuxCCS framebuffers if built for Xe")
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Juha-Pekka Heikkila <juhapekka.heikkila@gmail.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
drivers/gpu/drm/i915/display/skl_universal_plane.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/drivers/gpu/drm/i915/display/skl_universal_plane.c b/drivers/gpu/drm/i915/display/skl_universal_plane.c
index 68f18f18bacd..3ff7a7947519 100644
--- a/drivers/gpu/drm/i915/display/skl_universal_plane.c
+++ b/drivers/gpu/drm/i915/display/skl_universal_plane.c
@@ -2909,12 +2909,6 @@ skl_universal_plane_create(struct intel_display *display,
else
caps = skl_plane_caps(display, pipe, plane_id);
- /* FIXME: xe has problems with AUX */
- if (!IS_ENABLED(I915) && !HAS_FLAT_CCS(to_i915(display->drm)))
- caps &= ~(INTEL_PLANE_CAP_CCS_RC |
- INTEL_PLANE_CAP_CCS_RC_CC |
- INTEL_PLANE_CAP_CCS_MC);
-
modifiers = intel_fb_plane_get_modifiers(display, caps);
ret = drm_universal_plane_init(display->drm, &plane->base,
--
2.48.0
* Re: [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache
2025-06-27 13:33 ` [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache Tvrtko Ursulin
@ 2025-06-27 18:23 ` Souza, Jose
2025-06-27 18:57 ` Souza, Jose
0 siblings, 1 reply; 29+ messages in thread
From: Souza, Jose @ 2025-06-27 18:23 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org, tvrtko.ursulin@igalia.com
Cc: kernel-dev@igalia.com
On Fri, 2025-06-27 at 14:33 +0100, Tvrtko Ursulin wrote:
> I915 sets PIPE_CONTROL_FLUSH_L3 (bit 27) when flushing render caches but
> interesting thing is Tigerlake PRM lists that bit as reserved.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> ---
> Is xe missing this? Or has this been wrong for so long in i915? Or is this
> an undocumented bit?
> ---
> drivers/gpu/drm/xe/instructions/xe_gpu_commands.h | 1 +
> drivers/gpu/drm/xe/xe_ring_ops.c | 10 ++++++++++
> 2 files changed, 11 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> index 78c0e87dbd37..27892984403c 100644
> --- a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> +++ b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> @@ -47,6 +47,7 @@
>
> #define PIPE_CONTROL_COMMAND_CACHE_INVALIDATE (1<<29)
> #define PIPE_CONTROL_TILE_CACHE_FLUSH (1<<28)
> +#define PIPE_CONTROL_FLUSH_L3 (1<<27)
On spec this bit is Protected Memory Disable. I think what you want is bit 30.
> #define PIPE_CONTROL_AMFS_FLUSH (1<<25)
> #define PIPE_CONTROL_GLOBAL_GTT_IVB (1<<24)
> #define PIPE_CONTROL_LRI_POST_SYNC BIT(23)
> diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
> index a1289f086191..8f655b6fe913 100644
> --- a/drivers/gpu/drm/xe/xe_ring_ops.c
> +++ b/drivers/gpu/drm/xe/xe_ring_ops.c
> @@ -197,6 +197,16 @@ static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
> if (XE_WA(gt, 1409600907))
> flags |= PIPE_CONTROL_DEPTH_STALL;
>
> + /*
> + * L3 fabric flush is needed for AUX CCS invalidation
> + * which happens as part of pipe-control so we can
> + * ignore PIPE_CONTROL_FLUSH_L3. Also PIPE_CONTROL_FLUSH_L3
> + * deals with Protected Memory which is not needed for
> + * AUX CCS invalidation and lead to unwanted side effects.
> + */
> + if (GRAPHICS_VERx100(xe) < 1270)
> + flags |= PIPE_CONTROL_FLUSH_L3;
> +
> if (lacks_render)
> flags &= ~PIPE_CONTROL_3D_ARCH_FLAGS;
> else if (job->q->class == XE_ENGINE_CLASS_COMPUTE)
* Re: [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache
2025-06-27 18:23 ` Souza, Jose
@ 2025-06-27 18:57 ` Souza, Jose
2025-06-30 12:43 ` Tvrtko Ursulin
0 siblings, 1 reply; 29+ messages in thread
From: Souza, Jose @ 2025-06-27 18:57 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org, tvrtko.ursulin@igalia.com
Cc: kernel-dev@igalia.com
On Fri, 2025-06-27 at 11:23 -0700, José Roberto de Souza wrote:
> On Fri, 2025-06-27 at 14:33 +0100, Tvrtko Ursulin wrote:
> > I915 sets PIPE_CONTROL_FLUSH_L3 (bit 27) when flushing render caches but
> > interesting thing is Tigerlake PRM lists that bit as reserved.
> >
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > ---
> > Is xe missing this? Or has this been wrong for so long in i915? Or is this
> > an undocumented bit?
> > ---
> > drivers/gpu/drm/xe/instructions/xe_gpu_commands.h | 1 +
> > drivers/gpu/drm/xe/xe_ring_ops.c | 10 ++++++++++
> > 2 files changed, 11 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> > index 78c0e87dbd37..27892984403c 100644
> > --- a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> > +++ b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> > @@ -47,6 +47,7 @@
> >
> > #define PIPE_CONTROL_COMMAND_CACHE_INVALIDATE (1<<29)
> > #define PIPE_CONTROL_TILE_CACHE_FLUSH (1<<28)
> > +#define PIPE_CONTROL_FLUSH_L3 (1<<27)
>
> On spec this bit is Protected Memory Disable. I think what you want is bit 30.
That is also wrong on i915 oO.
Will give some testing here and submit a i915 patch.
>
> > #define PIPE_CONTROL_AMFS_FLUSH (1<<25)
> > #define PIPE_CONTROL_GLOBAL_GTT_IVB (1<<24)
> > #define PIPE_CONTROL_LRI_POST_SYNC BIT(23)
> > diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
> > index a1289f086191..8f655b6fe913 100644
> > --- a/drivers/gpu/drm/xe/xe_ring_ops.c
> > +++ b/drivers/gpu/drm/xe/xe_ring_ops.c
> > @@ -197,6 +197,16 @@ static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
> > if (XE_WA(gt, 1409600907))
> > flags |= PIPE_CONTROL_DEPTH_STALL;
> >
> > + /*
> > + * L3 fabric flush is needed for AUX CCS invalidation
> > + * which happens as part of pipe-control so we can
> > + * ignore PIPE_CONTROL_FLUSH_L3. Also PIPE_CONTROL_FLUSH_L3
> > + * deals with Protected Memory which is not needed for
> > + * AUX CCS invalidation and lead to unwanted side effects.
> > + */
> > + if (GRAPHICS_VERx100(xe) < 1270)
> > + flags |= PIPE_CONTROL_FLUSH_L3;
> > +
> > if (lacks_render)
> > flags &= ~PIPE_CONTROL_3D_ARCH_FLAGS;
> > else if (job->q->class == XE_ENGINE_CLASS_COMPUTE)
* Re: [PATCH v7 09/24] drm/xe: Use emit_flush_imm_ggtt helper instead of open coding
2025-06-27 13:33 ` [PATCH v7 09/24] drm/xe: Use emit_flush_imm_ggtt helper instead of open coding Tvrtko Ursulin
@ 2025-06-27 21:57 ` Matthew Brost
0 siblings, 0 replies; 29+ messages in thread
From: Matthew Brost @ 2025-06-27 21:57 UTC (permalink / raw)
To: Tvrtko Ursulin; +Cc: intel-xe, kernel-dev
On Fri, Jun 27, 2025 at 02:33:22PM +0100, Tvrtko Ursulin wrote:
> Helper is already there so lets just use it.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_ring_ops.c | 8 +++-----
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
> index bc1689db4cd7..b356134aca88 100644
> --- a/drivers/gpu/drm/xe/xe_ring_ops.c
> +++ b/drivers/gpu/drm/xe/xe_ring_ops.c
> @@ -417,11 +417,9 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
>
> i = emit_bb_start(job->ptrs[1].batch_addr, BIT(8), dw, i);
>
> - dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | job->migrate_flush_flags |
> - MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_IMM_DW;
> - dw[i++] = xe_lrc_seqno_ggtt_addr(lrc) | MI_FLUSH_DW_USE_GTT;
> - dw[i++] = 0;
> - dw[i++] = seqno; /* value */
> + i = emit_flush_imm_ggtt(xe_lrc_seqno_ggtt_addr(lrc), seqno,
> + MI_INVALIDATE_TLB | job->migrate_flush_flags,
> + dw, i);
>
> i = emit_user_interrupt(dw, i);
>
> --
> 2.48.0
>
* Re: [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache
2025-06-27 18:57 ` Souza, Jose
@ 2025-06-30 12:43 ` Tvrtko Ursulin
0 siblings, 0 replies; 29+ messages in thread
From: Tvrtko Ursulin @ 2025-06-30 12:43 UTC (permalink / raw)
To: Souza, Jose, intel-xe@lists.freedesktop.org; +Cc: kernel-dev@igalia.com
On 27/06/2025 19:57, Souza, Jose wrote:
> On Fri, 2025-06-27 at 11:23 -0700, José Roberto de Souza wrote:
>> On Fri, 2025-06-27 at 14:33 +0100, Tvrtko Ursulin wrote:
>>> I915 sets PIPE_CONTROL_FLUSH_L3 (bit 27) when flushing render caches but
>>> interesting thing is Tigerlake PRM lists that bit as reserved.
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>> ---
>>> Is xe missing this? Or has this been wrong for so long in i915? Or is this
>>> an undocumented bit?
>>> ---
>>> drivers/gpu/drm/xe/instructions/xe_gpu_commands.h | 1 +
>>> drivers/gpu/drm/xe/xe_ring_ops.c | 10 ++++++++++
>>> 2 files changed, 11 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
>>> index 78c0e87dbd37..27892984403c 100644
>>> --- a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
>>> +++ b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
>>> @@ -47,6 +47,7 @@
>>>
>>> #define PIPE_CONTROL_COMMAND_CACHE_INVALIDATE (1<<29)
>>> #define PIPE_CONTROL_TILE_CACHE_FLUSH (1<<28)
>>> +#define PIPE_CONTROL_FLUSH_L3 (1<<27)
>>
>> On spec this bit is Protected Memory Disable. I think what you want is bit 30.
>
> That is also wrong on i915 oO.
Yeah, most of the series is me copying the "reference" implementation from i915.
But you are right, bit 27 appears wrong, and the history of the whole
area is a bit confusing.
The existing PIPE_CONTROL_FLUSH_L3 definition was added back in 2015 in:
0160f055393f ("drm/i915/gen8: Add WaClearSlmSpaceAtContextSwitch
workaround")
But if I look at the public PRMs for BDW and CHV, bit 27 is reserved MBZ.
Bit 30 is the same. Was it something undocumented? I don't know.
Furthermore, the public PRM has no mention of
WaClearSlmSpaceAtContextSwitch. It has
WaFlushCoherentL3CacheLinesAtContextSwitch, which is not quite the same thing:
"""
Coherent L3 cache lines are not getting flushed during context switch
which is causing
issues like corruption. Need to set bit 21 of MMIO b118, then send PC
with DC flush and
then reset bit 21 of MMIO b118. This programming sequence needs to be
part of the indirect
context WA BB.
"""
This WA was later extended to KBL:
066d46288851 ("drm/i915/kbl: Add WaClearSlmSpaceAtContextSwitch")
But the KBL public PRM is the same - I don't see any WA which mentions SLM
or bit 30. And bits 27 and 30 are still reserved.
There is a new bit 26 (Flush LLC) on SKL and KBL but we neither use nor
define it. That also feels odd.
And finally L3 flushing was added to the plain ring programming in:
0c7c0c8e6f09 ("drm/i915/gen12: Flush L3")
As you discovered, this is most likely wrong, since the bspec says bit 30
on TGL is L3 Fabric Flush.
So more questions than answers, and it looks like access to internal docs
will be required to get to the bottom of it all.
> Will give some testing here and submit a i915 patch.
I at least tried changing it to bit 30 on ADL but that on its own
appeared to make no positive improvement to the transient cache dirt in
some AuxCCS IGTs. It still requires the large hammer of "drm/xe: Force
flush system memory AuxCCS framebuffers before scan out".
I am curious what your testing will show with the use cases you have
in mind.
Regards,
Tvrtko
>>
>>> #define PIPE_CONTROL_AMFS_FLUSH (1<<25)
>>> #define PIPE_CONTROL_GLOBAL_GTT_IVB (1<<24)
>>> #define PIPE_CONTROL_LRI_POST_SYNC BIT(23)
>>> diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
>>> index a1289f086191..8f655b6fe913 100644
>>> --- a/drivers/gpu/drm/xe/xe_ring_ops.c
>>> +++ b/drivers/gpu/drm/xe/xe_ring_ops.c
>>> @@ -197,6 +197,16 @@ static int emit_render_cache_flush(struct xe_sched_job *job, u32 *dw, int i)
>>> if (XE_WA(gt, 1409600907))
>>> flags |= PIPE_CONTROL_DEPTH_STALL;
>>>
>>> + /*
>>> + * L3 fabric flush is needed for AUX CCS invalidation
>>> + * which happens as part of pipe-control so we can
>>> + * ignore PIPE_CONTROL_FLUSH_L3. Also PIPE_CONTROL_FLUSH_L3
>>> + * deals with Protected Memory which is not needed for
>>> + * AUX CCS invalidation and lead to unwanted side effects.
>>> + */
>>> + if (GRAPHICS_VERx100(xe) < 1270)
>>> + flags |= PIPE_CONTROL_FLUSH_L3;
>>> +
>>> if (lacks_render)
>>> flags &= ~PIPE_CONTROL_3D_ARCH_FLAGS;
>>> else if (job->q->class == XE_ENGINE_CLASS_COMPUTE)
Thread overview: 29+ messages
2025-06-27 13:33 [PATCH v7 00/24] AuxCCS handling and render compression modifiers Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 01/24] drm/xe: Consolidate LRC offset calculations Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 02/24] drm/xe: Generalize wa bb emission code Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 03/24] drm/xe: Rename utilisation workaround emission function Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 04/24] drm/xe: Return number of written dwords from workaround batch buffer emission Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 05/24] drm/xe: Allow specifying number of extra dwords at the end of wa bb emission Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 06/24] drm/xe: Add plumbing for indirect context workarounds Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 07/24] drm/xe/xelp: Implement Wa_16010904313 Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 08/24] drm/xe/xelp: Add Wa_18022495364 Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 09/24] drm/xe: Use emit_flush_imm_ggtt helper instead of open coding Tvrtko Ursulin
2025-06-27 21:57 ` Matthew Brost
2025-06-27 13:33 ` [PATCH v7 10/24] drm/xe/xelpg: Flush CCS when flushing caches Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 11/24] drm/xe: Flush L3 when flushing render cache Tvrtko Ursulin
2025-06-27 18:23 ` Souza, Jose
2025-06-27 18:57 ` Souza, Jose
2025-06-30 12:43 ` Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 12/24] drm/xe/xelp: Quiesce memory traffic before invalidating auxccs Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 13/24] drm/xe/xelp: Support auxccs invalidation on blitter Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 14/24] drm/xe/xelp: Use MI_FLUSH_DW_CCS on auxccs platforms Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 15/24] drm/xe/xelp: Wait for AuxCCS invalidation to complete Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 16/24] drm/xe/xelp: Add AuxCCS invalidation to the buffer migration path Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 17/24] drm/xe: Export xe_emit_aux_table_inv Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 18/24] drm/xe/xelp: Add AuxCCS invalidation to the indirect context workarounds Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 19/24] drm/xe: Use fb cached min alignment Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 20/24] drm/xe: Flush GGTT writes after populating DPT Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 21/24] drm/xe: Handle DPT in system memory Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 22/24] drm/xe: Force flush system memory AuxCCS framebuffers before scan out Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 23/24] drm/xe/display: Add support for AuxCCS Tvrtko Ursulin
2025-06-27 13:33 ` [PATCH v7 24/24] drm/i915/display: Expose AuxCCS frame buffer modifiers for Xe Tvrtko Ursulin