All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v4 00/12] Ring padding micro-optimisation etc
@ 2024-12-27 11:19 Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding Tvrtko Ursulin
                   ` (11 more replies)
  0 siblings, 12 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

[Re-send after fdo mailman got recovered out of the -ENOSPC state.]

There is a few ideas in this series and not all might stick.

Trivial stuff aside, the two main things to higlight are:

1) Ther departure from the existing state of "duplicate everything" by
consolidating some SDMA insert nop vfuncs.

2) Conversion of amdgpu_ring_write() to variadic to allow for more compact
compiled code.

For the latter I have only included VCE, GFX v10.0 and SDMA v5.2 as examples.
(But note the code shrink is already noticable with even only those three.)

But it is churny and looks different so people may not like it. TBD.

Other than those two, the remaining general idea of the series is to consolidate
the padding approach to memset32, especially on rings with 64 or 256 dword
alignment.

Binary size comparison:

    text    data     bss     dec     hex filename
 10439777   542501  188232 11170510   aa72ce amdgpu.ko.before
 10412793   542609  188232 11143634   aa09d2 amdgpu.ko.after

Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>

Tvrtko Ursulin (12):
  drm/amdgpu: Use memset32 for IB padding
  drm/amdgpu: Use memset32 for ring clearing
  drm/amdgpu: Cache SDMA instance and index in the ring
  drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs
  drm/amdgpu: Use memset32 for sdma insert nops
  drm/amdgpu: Use amdgpu_ring_fill() for VPE padding
  drm/amdgpu: Convert JPEG, VCE and UVD to more efficient ring padding
  drm/amdgpu: Cache some values in ring emission helpers
  drm/amdgpu: Optimise amdgpu_ring_write()
  drm/amdgpu: Convert VCE to variadic amdgpu_ring_write()
  drm/amdgpu: Convert GFX v10.0 to variadic amdgpu_ring_write()
  drm/amdgpu: Convert SDMA v5.2 to variadic amdgpu_ring_write()

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c |  32 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 321 +++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c |  43 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c  |  22 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c  |  13 +-
 drivers/gpu/drm/amd/amdgpu/cik_sdma.c    |   4 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 399 ++++++++++++-----------
 drivers/gpu/drm/amd/amdgpu/jpeg_v1_0.c   |   8 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c   |   8 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c |   8 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c   |  24 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c   |  24 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   |  31 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c |  31 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   |  28 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 182 +++++------
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   |  31 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c   |  31 +-
 drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c    |   9 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c    |   8 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c    |   7 +-
 26 files changed, 745 insertions(+), 551 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2025-01-02 13:45   ` Christian König
  2024-12-27 11:19 ` [PATCH 02/12] drm/amdgpu: Use memset32 for ring clearing Tvrtko Ursulin
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Use memset32 instead of open coding it, just because it is
that bit nicer.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index a6e28fe3f8d6..a27e32f48f99 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -136,8 +136,16 @@ void amdgpu_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
  */
 void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	while (ib->length_dw & ring->funcs->align_mask)
-		ib->ptr[ib->length_dw++] = ring->funcs->nop;
+	u32 align_mask = ring->funcs->align_mask;
+	u32 count = ib->length_dw & align_mask;
+
+	if (count) {
+		count = align_mask + 1 - count;
+
+		memset32(&ib->ptr[ib->length_dw], ring->funcs->nop, count);
+
+		ib->length_dw += count;
+	}
 }
 
 /**
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 02/12] drm/amdgpu: Use memset32 for ring clearing
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 03/12] drm/amdgpu: Cache SDMA instance and index in the ring Tvrtko Ursulin
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Use memset32 instead of open coding it, just because it is
a tiny bit nicer.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index dee5a1b4e572..96bfc0c23413 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -369,10 +369,7 @@ static inline void amdgpu_ring_set_preempt_cond_exec(struct amdgpu_ring *ring,
 
 static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
 {
-	int i = 0;
-	while (i <= ring->buf_mask)
-		ring->ring[i++] = ring->funcs->nop;
-
+	memset32(ring->ring, ring->funcs->nop, ring->buf_mask + 1);
 }
 
 static inline void amdgpu_ring_write(struct amdgpu_ring *ring, uint32_t v)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 03/12] drm/amdgpu: Cache SDMA instance and index in the ring
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 02/12] drm/amdgpu: Use memset32 for ring clearing Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 04/12] drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs Tvrtko Ursulin
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

A bunch of SDMA ring vfuncs seem to have a little bit of an identity
crisis not knowing to what SDMA instance they belong and have to
walk the list of instances on every invocation. We can improve that
by simply storing the relevant into in the ring itself.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  6 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 39 ++----------------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |  3 --
 drivers/gpu/drm/amd/amdgpu/cik_sdma.c    |  4 +--
 drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c   |  7 +++--
 drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c   |  7 +++--
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 10 ++++--
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 10 ++++--
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   | 11 ++++---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 14 +++++----
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   | 14 +++++----
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c   | 14 +++++----
 13 files changed, 70 insertions(+), 72 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index a27e32f48f99..8e4e9ec68262 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -386,6 +386,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
 			&ring->sched;
 	}
 
+	/* SDMA "sub-class" will override */
+	ring->sdma.index = -ENODEV;
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 96bfc0c23413..429c77db920f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -299,6 +299,12 @@ struct amdgpu_ring {
 	unsigned 		num_hw_submission;
 	atomic_t		*sched_score;
 
+	/* used for sdma */
+	struct {
+		struct amdgpu_sdma_instance *instance;
+		int			     index;
+	} sdma;
+
 	/* used for mes */
 	bool			is_mes_queue;
 	uint32_t		hw_queue_id;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
index 632295bf3875..d43dfec82624 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
@@ -34,42 +34,11 @@
  * GPU SDMA IP block helpers function.
  */
 
-struct amdgpu_sdma_instance *amdgpu_sdma_get_instance_from_ring(struct amdgpu_ring *ring)
-{
-	struct amdgpu_device *adev = ring->adev;
-	int i;
-
-	for (i = 0; i < adev->sdma.num_instances; i++)
-		if (ring == &adev->sdma.instance[i].ring ||
-		    ring == &adev->sdma.instance[i].page)
-			return &adev->sdma.instance[i];
-
-	return NULL;
-}
-
-int amdgpu_sdma_get_index_from_ring(struct amdgpu_ring *ring, uint32_t *index)
-{
-	struct amdgpu_device *adev = ring->adev;
-	int i;
-
-	for (i = 0; i < adev->sdma.num_instances; i++) {
-		if (ring == &adev->sdma.instance[i].ring ||
-			ring == &adev->sdma.instance[i].page) {
-			*index = i;
-			return 0;
-		}
-	}
-
-	return -EINVAL;
-}
-
 uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring,
 				     unsigned int vmid)
 {
 	struct amdgpu_device *adev = ring->adev;
 	uint64_t csa_mc_addr;
-	uint32_t index = 0;
-	int r;
 
 	/* don't enable OS preemption on SDMA under SRIOV */
 	if (amdgpu_sriov_vf(adev) || vmid == 0 || !adev->gfx.mcbp)
@@ -82,14 +51,12 @@ uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring,
 				  sdma[ring->idx].sdma_meta_data);
 		csa_mc_addr = amdgpu_mes_ctx_get_offs_gpu_addr(ring, offset);
 	} else {
-		r = amdgpu_sdma_get_index_from_ring(ring, &index);
-
-		if (r || index > 31)
+		if (ring->sdma.index < 0 || ring->sdma.index > 31)
 			csa_mc_addr = 0;
 		else
 			csa_mc_addr = amdgpu_csa_vaddr(adev) +
-				AMDGPU_CSA_SDMA_OFFSET +
-				index * AMDGPU_CSA_SDMA_SIZE;
+				      AMDGPU_CSA_SDMA_OFFSET +
+				      ring->sdma.index * AMDGPU_CSA_SDMA_SIZE;
 	}
 
 	return csa_mc_addr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index 2db58b5812a8..7debf3ed0b46 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -159,9 +159,6 @@ struct amdgpu_buffer_funcs {
 #define amdgpu_emit_copy_buffer(adev, ib, s, d, b, t) (adev)->mman.buffer_funcs->emit_copy_buffer((ib),  (s), (d), (b), (t))
 #define amdgpu_emit_fill_buffer(adev, ib, s, d, b) (adev)->mman.buffer_funcs->emit_fill_buffer((ib), (s), (d), (b))
 
-struct amdgpu_sdma_instance *
-amdgpu_sdma_get_instance_from_ring(struct amdgpu_ring *ring);
-int amdgpu_sdma_get_index_from_ring(struct amdgpu_ring *ring, uint32_t *index);
 uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring, unsigned vmid);
 int amdgpu_sdma_ras_late_init(struct amdgpu_device *adev,
 			      struct ras_common_if *ras_block);
diff --git a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
index d9bd8f3f17e2..b8d9a32b07f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
@@ -199,7 +199,7 @@ static void cik_sdma_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void cik_sdma_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -798,7 +798,7 @@ static void cik_sdma_vm_set_pte_pde(struct amdgpu_ib *ib, uint64_t pe,
  */
 static void cik_sdma_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
index 135c5099bfb8..2e844dba4ad5 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
@@ -223,7 +223,7 @@ static void sdma_v2_4_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v2_4_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -733,7 +733,7 @@ static void sdma_v2_4_vm_set_pte_pde(struct amdgpu_ib *ib, uint64_t pe,
  */
 static void sdma_v2_4_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -863,6 +863,9 @@ static int sdma_v2_4_sw_init(struct amdgpu_ip_block *ip_block)
 				     AMDGPU_RING_PRIO_DEFAULT, NULL);
 		if (r)
 			return r;
+
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
 	}
 
 	return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
index c611328671ed..104fd1214c4c 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
@@ -399,7 +399,7 @@ static void sdma_v3_0_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v3_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -1006,7 +1006,7 @@ static void sdma_v3_0_vm_set_pte_pde(struct amdgpu_ib *ib, uint64_t pe,
  */
 static void sdma_v3_0_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -1149,6 +1149,9 @@ static int sdma_v3_0_sw_init(struct amdgpu_ip_block *ip_block)
 				     AMDGPU_RING_PRIO_DEFAULT, NULL);
 		if (r)
 			return r;
+
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
 	}
 
 	return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index b48d9c0b2e1c..c91d05a4593e 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -783,7 +783,7 @@ static void sdma_v4_0_page_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v4_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -1667,7 +1667,7 @@ static void sdma_v4_0_vm_set_pte_pde(struct amdgpu_ib *ib,
  */
 static void sdma_v4_0_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -1874,6 +1874,9 @@ static int sdma_v4_0_sw_init(struct amdgpu_ip_block *ip_block)
 		if (r)
 			return r;
 
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
+
 		if (adev->sdma.has_page_queue) {
 			ring = &adev->sdma.instance[i].page;
 			ring->ring_obj = NULL;
@@ -1911,6 +1914,9 @@ static int sdma_v4_0_sw_init(struct amdgpu_ip_block *ip_block)
 					     AMDGPU_RING_PRIO_DEFAULT, NULL);
 			if (r)
 				return r;
+
+			ring->sdma.instance = &adev->sdma.instance[i];
+			ring->sdma.index = i;
 		}
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 56507ae919b0..d1d21a3951f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -343,7 +343,7 @@ static void sdma_v4_4_2_page_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v4_4_2_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -1234,7 +1234,7 @@ static void sdma_v4_4_2_vm_set_pte_pde(struct amdgpu_ib *ib,
  */
 static void sdma_v4_4_2_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -1429,6 +1429,9 @@ static int sdma_v4_4_2_sw_init(struct amdgpu_ip_block *ip_block)
 		if (r)
 			return r;
 
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
+
 		if (adev->sdma.has_page_queue) {
 			ring = &adev->sdma.instance[i].page;
 			ring->ring_obj = NULL;
@@ -1449,6 +1452,9 @@ static int sdma_v4_4_2_sw_init(struct amdgpu_ip_block *ip_block)
 					     AMDGPU_RING_PRIO_DEFAULT, NULL);
 			if (r)
 				return r;
+
+			ring->sdma.instance = &adev->sdma.instance[i];
+			ring->sdma.index = i;
 		}
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
index b764550834a0..97536f82dfcd 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
@@ -435,7 +435,7 @@ static void sdma_v5_0_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v5_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -1298,7 +1298,7 @@ static void sdma_v5_0_vm_set_pte_pde(struct amdgpu_ib *ib,
  */
 static void sdma_v5_0_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -1450,6 +1450,9 @@ static int sdma_v5_0_sw_init(struct amdgpu_ip_block *ip_block)
 				     AMDGPU_RING_PRIO_DEFAULT, NULL);
 		if (r)
 			return r;
+
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
 	}
 
 	adev->sdma.supported_reset =
@@ -1660,11 +1663,9 @@ static int sdma_v5_0_ring_preempt_ib(struct amdgpu_ring *ring)
 {
 	int i, r = 0;
 	struct amdgpu_device *adev = ring->adev;
-	u32 index = 0;
 	u64 sdma_gfx_preempt;
 
-	amdgpu_sdma_get_index_from_ring(ring, &index);
-	if (index == 0)
+	if (ring->sdma.index == 0)
 		sdma_gfx_preempt = mmSDMA0_GFX_PREEMPT;
 	else
 		sdma_gfx_preempt = mmSDMA1_GFX_PREEMPT;
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index b1818e87889a..8eaddee1d97d 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -252,7 +252,7 @@ static void sdma_v5_2_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v5_2_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -1155,7 +1155,7 @@ static void sdma_v5_2_vm_set_pte_pde(struct amdgpu_ib *ib,
  */
 static void sdma_v5_2_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -1355,6 +1355,9 @@ static int sdma_v5_2_sw_init(struct amdgpu_ip_block *ip_block)
 				     AMDGPU_RING_PRIO_DEFAULT, NULL);
 		if (r)
 			return r;
+
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
 	}
 
 	adev->sdma.supported_reset =
@@ -1563,12 +1566,11 @@ static int sdma_v5_2_ring_preempt_ib(struct amdgpu_ring *ring)
 {
 	int i, r = 0;
 	struct amdgpu_device *adev = ring->adev;
-	u32 index = 0;
 	u64 sdma_gfx_preempt;
 
-	amdgpu_sdma_get_index_from_ring(ring, &index);
-	sdma_gfx_preempt =
-		sdma_v5_2_get_reg_offset(adev, index, mmSDMA0_GFX_PREEMPT);
+	sdma_gfx_preempt = sdma_v5_2_get_reg_offset(adev,
+						    ring->sdma.index,
+						    mmSDMA0_GFX_PREEMPT);
 
 	/* assert preemption condition */
 	amdgpu_ring_set_preempt_cond_exec(ring, false);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index 1a023b45f0be..3ead269eccdc 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -237,7 +237,7 @@ static void sdma_v6_0_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v6_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -1167,7 +1167,7 @@ static void sdma_v6_0_vm_set_pte_pde(struct amdgpu_ib *ib,
  */
 static void sdma_v6_0_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -1348,6 +1348,9 @@ static int sdma_v6_0_sw_init(struct amdgpu_ip_block *ip_block)
 				     AMDGPU_RING_PRIO_DEFAULT, NULL);
 		if (r)
 			return r;
+
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
 	}
 
 	adev->sdma.supported_reset =
@@ -1464,12 +1467,11 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
 {
 	int i, r = 0;
 	struct amdgpu_device *adev = ring->adev;
-	u32 index = 0;
 	u64 sdma_gfx_preempt;
 
-	amdgpu_sdma_get_index_from_ring(ring, &index);
-	sdma_gfx_preempt =
-		sdma_v6_0_get_reg_offset(adev, index, regSDMA0_QUEUE0_PREEMPT);
+	sdma_gfx_preempt = sdma_v6_0_get_reg_offset(adev,
+						    ring->sdma.index,
+						    regSDMA0_QUEUE0_PREEMPT);
 
 	/* assert preemption condition */
 	amdgpu_ring_set_preempt_cond_exec(ring, false);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index 9c17df2cf37b..5fadaf35a03a 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -269,7 +269,7 @@ static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring)
 
 static void sdma_v7_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	int i;
 
 	for (i = 0; i < count; i++)
@@ -1216,7 +1216,7 @@ static void sdma_v7_0_vm_set_pte_pde(struct amdgpu_ib *ib,
  */
 static void sdma_v7_0_ring_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 {
-	struct amdgpu_sdma_instance *sdma = amdgpu_sdma_get_instance_from_ring(ring);
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	u32 pad_count;
 	int i;
 
@@ -1362,6 +1362,9 @@ static int sdma_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
 				     AMDGPU_RING_PRIO_DEFAULT, NULL);
 		if (r)
 			return r;
+
+		ring->sdma.instance = &adev->sdma.instance[i];
+		ring->sdma.index = i;
 	}
 
 	adev->sdma.supported_reset =
@@ -1466,12 +1469,11 @@ static int sdma_v7_0_ring_preempt_ib(struct amdgpu_ring *ring)
 {
 	int i, r = 0;
 	struct amdgpu_device *adev = ring->adev;
-	u32 index = 0;
 	u64 sdma_gfx_preempt;
 
-	amdgpu_sdma_get_index_from_ring(ring, &index);
-	sdma_gfx_preempt =
-		sdma_v7_0_get_reg_offset(adev, index, regSDMA0_QUEUE0_PREEMPT);
+	sdma_gfx_preempt = sdma_v7_0_get_reg_offset(adev,
+						    ring->sdma.index,
+						    regSDMA0_QUEUE0_PREEMPT);
 
 	/* assert preemption condition */
 	amdgpu_ring_set_preempt_cond_exec(ring, false);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 04/12] drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (2 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 03/12] drm/amdgpu: Cache SDMA instance and index in the ring Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2025-01-02 13:49   ` Christian König
  2024-12-27 11:19 ` [PATCH 05/12] drm/amdgpu: Use memset32 for sdma insert nops Tvrtko Ursulin
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

A lot of the hardware generations apparently uses the same nop insertion
logic, just with different masks and shifts.

We can consolidate if we store those shifts and mask in the ring and
shrink both the source and binary.

Though it appears a bit of a departure from the existing "duplicate
everything" state so we will see how this is received.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 18 ++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |  1 +
 drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c   | 19 ++++---------------
 drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c   | 19 ++++---------------
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 23 +++++++----------------
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 23 +++++++----------------
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   | 19 ++++---------------
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 19 ++++---------------
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   | 19 ++++---------------
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c   | 19 ++++---------------
 11 files changed, 62 insertions(+), 122 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 429c77db920f..4c0861ebc77a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -303,6 +303,11 @@ struct amdgpu_ring {
 	struct {
 		struct amdgpu_sdma_instance *instance;
 		int			     index;
+
+		struct {
+			u32		mask;
+			unsigned int	shift;
+		} nop_pkt;
 	} sdma;
 
 	/* used for mes */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
index d43dfec82624..148413f01875 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
@@ -34,6 +34,24 @@
  * GPU SDMA IP block helpers function.
  */
 
+void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count)
+{
+	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
+	const u32 nop = ring->funcs->nop;
+	u32 i;
+
+	if (!count)
+		return;
+
+	if (sdma && sdma->burst_nop)
+		amdgpu_ring_write(ring, nop |
+					(--count & ring->sdma.nop_pkt.mask) <<
+					ring->sdma.nop_pkt.shift);
+
+	for (i = 0; i < count; i++)
+		amdgpu_ring_write(ring, nop);
+}
+
 uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring,
 				     unsigned int vmid)
 {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index 7debf3ed0b46..d2642a9113ae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -159,6 +159,7 @@ struct amdgpu_buffer_funcs {
 #define amdgpu_emit_copy_buffer(adev, ib, s, d, b, t) (adev)->mman.buffer_funcs->emit_copy_buffer((ib),  (s), (d), (b), (t))
 #define amdgpu_emit_fill_buffer(adev, ib, s, d, b) (adev)->mman.buffer_funcs->emit_fill_buffer((ib), (s), (d), (b))
 
+void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count);
 uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring, unsigned vmid);
 int amdgpu_sdma_ras_late_init(struct amdgpu_device *adev,
 			      struct ras_common_if *ras_block);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
index 2e844dba4ad5..e0ed65eca431 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
@@ -221,19 +221,6 @@ static void sdma_v2_4_ring_set_wptr(struct amdgpu_ring *ring)
 	WREG32(mmSDMA0_GFX_RB_WPTR + sdma_offsets[ring->me], ring->wptr << 2);
 }
 
-static void sdma_v2_4_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /**
  * sdma_v2_4_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -252,7 +239,7 @@ static void sdma_v2_4_ring_emit_ib(struct amdgpu_ring *ring,
 	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
 
 	/* IB packet must end on a 8 DW boundary */
-	sdma_v2_4_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -866,6 +853,8 @@ static int sdma_v2_4_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 	}
 
 	return r;
@@ -1137,7 +1126,7 @@ static const struct amdgpu_ring_funcs sdma_v2_4_ring_funcs = {
 	.emit_hdp_flush = sdma_v2_4_ring_emit_hdp_flush,
 	.test_ring = sdma_v2_4_ring_test_ring,
 	.test_ib = sdma_v2_4_ring_test_ib,
-	.insert_nop = sdma_v2_4_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v2_4_ring_pad_ib,
 	.emit_wreg = sdma_v2_4_ring_emit_wreg,
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
index 104fd1214c4c..8a644ea28589 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
@@ -397,19 +397,6 @@ static void sdma_v3_0_ring_set_wptr(struct amdgpu_ring *ring)
 	}
 }
 
-static void sdma_v3_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /**
  * sdma_v3_0_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -428,7 +415,7 @@ static void sdma_v3_0_ring_emit_ib(struct amdgpu_ring *ring,
 	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
 
 	/* IB packet must end on a 8 DW boundary */
-	sdma_v3_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -1152,6 +1139,8 @@ static int sdma_v3_0_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 	}
 
 	return r;
@@ -1579,7 +1568,7 @@ static const struct amdgpu_ring_funcs sdma_v3_0_ring_funcs = {
 	.emit_hdp_flush = sdma_v3_0_ring_emit_hdp_flush,
 	.test_ring = sdma_v3_0_ring_test_ring,
 	.test_ib = sdma_v3_0_ring_test_ib,
-	.insert_nop = sdma_v3_0_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v3_0_ring_pad_ib,
 	.emit_wreg = sdma_v3_0_ring_emit_wreg,
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index c91d05a4593e..0f0f05a03cd4 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -781,19 +781,6 @@ static void sdma_v4_0_page_ring_set_wptr(struct amdgpu_ring *ring)
 	}
 }
 
-static void sdma_v4_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /**
  * sdma_v4_0_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -812,7 +799,7 @@ static void sdma_v4_0_ring_emit_ib(struct amdgpu_ring *ring,
 	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
 
 	/* IB packet must end on a 8 DW boundary */
-	sdma_v4_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -1876,6 +1863,8 @@ static int sdma_v4_0_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 
 		if (adev->sdma.has_page_queue) {
 			ring = &adev->sdma.instance[i].page;
@@ -1917,6 +1906,8 @@ static int sdma_v4_0_sw_init(struct amdgpu_ip_block *ip_block)
 
 			ring->sdma.instance = &adev->sdma.instance[i];
 			ring->sdma.index = i;
+			ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+			ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 		}
 	}
 
@@ -2443,7 +2434,7 @@ static const struct amdgpu_ring_funcs sdma_v4_0_ring_funcs = {
 	.emit_hdp_flush = sdma_v4_0_ring_emit_hdp_flush,
 	.test_ring = sdma_v4_0_ring_test_ring,
 	.test_ib = sdma_v4_0_ring_test_ib,
-	.insert_nop = sdma_v4_0_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v4_0_ring_pad_ib,
 	.emit_wreg = sdma_v4_0_ring_emit_wreg,
 	.emit_reg_wait = sdma_v4_0_ring_emit_reg_wait,
@@ -2475,7 +2466,7 @@ static const struct amdgpu_ring_funcs sdma_v4_0_page_ring_funcs = {
 	.emit_hdp_flush = sdma_v4_0_ring_emit_hdp_flush,
 	.test_ring = sdma_v4_0_ring_test_ring,
 	.test_ib = sdma_v4_0_ring_test_ib,
-	.insert_nop = sdma_v4_0_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v4_0_ring_pad_ib,
 	.emit_wreg = sdma_v4_0_ring_emit_wreg,
 	.emit_reg_wait = sdma_v4_0_ring_emit_reg_wait,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index d1d21a3951f8..f68b7f2e0a40 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -341,19 +341,6 @@ static void sdma_v4_4_2_page_ring_set_wptr(struct amdgpu_ring *ring)
 	}
 }
 
-static void sdma_v4_4_2_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /**
  * sdma_v4_4_2_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -372,7 +359,7 @@ static void sdma_v4_4_2_ring_emit_ib(struct amdgpu_ring *ring,
 	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
 
 	/* IB packet must end on a 8 DW boundary */
-	sdma_v4_4_2_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -1431,6 +1418,8 @@ static int sdma_v4_4_2_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 
 		if (adev->sdma.has_page_queue) {
 			ring = &adev->sdma.instance[i].page;
@@ -1455,6 +1444,8 @@ static int sdma_v4_4_2_sw_init(struct amdgpu_ip_block *ip_block)
 
 			ring->sdma.instance = &adev->sdma.instance[i];
 			ring->sdma.index = i;
+			ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+			ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 		}
 	}
 
@@ -2013,7 +2004,7 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = {
 	.emit_hdp_flush = sdma_v4_4_2_ring_emit_hdp_flush,
 	.test_ring = sdma_v4_4_2_ring_test_ring,
 	.test_ib = sdma_v4_4_2_ring_test_ib,
-	.insert_nop = sdma_v4_4_2_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v4_4_2_ring_pad_ib,
 	.emit_wreg = sdma_v4_4_2_ring_emit_wreg,
 	.emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
@@ -2045,7 +2036,7 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_page_ring_funcs = {
 	.emit_hdp_flush = sdma_v4_4_2_ring_emit_hdp_flush,
 	.test_ring = sdma_v4_4_2_ring_test_ring,
 	.test_ib = sdma_v4_4_2_ring_test_ib,
-	.insert_nop = sdma_v4_4_2_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v4_4_2_ring_pad_ib,
 	.emit_wreg = sdma_v4_4_2_ring_emit_wreg,
 	.emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
index 97536f82dfcd..bbf60bfa4b1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
@@ -433,19 +433,6 @@ static void sdma_v5_0_ring_set_wptr(struct amdgpu_ring *ring)
 	}
 }
 
-static void sdma_v5_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /**
  * sdma_v5_0_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -472,7 +459,7 @@ static void sdma_v5_0_ring_emit_ib(struct amdgpu_ring *ring,
 	 * (wptr + 6 + x) % 8 = 0.
 	 * The expression below, is a solution of x.
 	 */
-	sdma_v5_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -1453,6 +1440,8 @@ static int sdma_v5_0_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 	}
 
 	adev->sdma.supported_reset =
@@ -1991,7 +1980,7 @@ static const struct amdgpu_ring_funcs sdma_v5_0_ring_funcs = {
 	.emit_hdp_flush = sdma_v5_0_ring_emit_hdp_flush,
 	.test_ring = sdma_v5_0_ring_test_ring,
 	.test_ib = sdma_v5_0_ring_test_ib,
-	.insert_nop = sdma_v5_0_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v5_0_ring_pad_ib,
 	.emit_wreg = sdma_v5_0_ring_emit_wreg,
 	.emit_reg_wait = sdma_v5_0_ring_emit_reg_wait,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 8eaddee1d97d..69b88db32117 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -250,19 +250,6 @@ static void sdma_v5_2_ring_set_wptr(struct amdgpu_ring *ring)
 	}
 }
 
-static void sdma_v5_2_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /**
  * sdma_v5_2_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -289,7 +276,7 @@ static void sdma_v5_2_ring_emit_ib(struct amdgpu_ring *ring,
 	 * (wptr + 6 + x) % 8 = 0.
 	 * The expression below, is a solution of x.
 	 */
-	sdma_v5_2_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -1358,6 +1345,8 @@ static int sdma_v5_2_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 	}
 
 	adev->sdma.supported_reset =
@@ -1986,7 +1975,7 @@ static const struct amdgpu_ring_funcs sdma_v5_2_ring_funcs = {
 	.emit_hdp_flush = sdma_v5_2_ring_emit_hdp_flush,
 	.test_ring = sdma_v5_2_ring_test_ring,
 	.test_ib = sdma_v5_2_ring_test_ib,
-	.insert_nop = sdma_v5_2_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v5_2_ring_pad_ib,
 	.begin_use = sdma_v5_2_ring_begin_use,
 	.end_use = sdma_v5_2_ring_end_use,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index 3ead269eccdc..aa3992fd5313 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -235,19 +235,6 @@ static void sdma_v6_0_ring_set_wptr(struct amdgpu_ring *ring)
 	}
 }
 
-static void sdma_v6_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /*
  * sdma_v6_0_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -274,7 +261,7 @@ static void sdma_v6_0_ring_emit_ib(struct amdgpu_ring *ring,
 	 * (wptr + 6 + x) % 8 = 0.
 	 * The expression below, is a solution of x.
 	 */
-	sdma_v6_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_COPY_LINEAR_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -1351,6 +1338,8 @@ static int sdma_v6_0_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 	}
 
 	adev->sdma.supported_reset =
@@ -1707,7 +1696,7 @@ static const struct amdgpu_ring_funcs sdma_v6_0_ring_funcs = {
 	.emit_hdp_flush = sdma_v6_0_ring_emit_hdp_flush,
 	.test_ring = sdma_v6_0_ring_test_ring,
 	.test_ib = sdma_v6_0_ring_test_ib,
-	.insert_nop = sdma_v6_0_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v6_0_ring_pad_ib,
 	.emit_wreg = sdma_v6_0_ring_emit_wreg,
 	.emit_reg_wait = sdma_v6_0_ring_emit_reg_wait,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index 5fadaf35a03a..4ccc00248a09 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -267,19 +267,6 @@ static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring)
 	}
 }
 
-static void sdma_v7_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
-{
-	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
-	int i;
-
-	for (i = 0; i < count; i++)
-		if (sdma && sdma->burst_nop && (i == 0))
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
-}
-
 /**
  * sdma_v7_0_ring_emit_ib - Schedule an IB on the DMA engine
  *
@@ -306,7 +293,7 @@ static void sdma_v7_0_ring_emit_ib(struct amdgpu_ring *ring,
 	 * (wptr + 6 + x) % 8 = 0.
 	 * The expression below, is a solution of x.
 	 */
-	sdma_v7_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
+	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
 	amdgpu_ring_write(ring, SDMA_PKT_COPY_LINEAR_HEADER_OP(SDMA_OP_INDIRECT) |
 			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
@@ -1365,6 +1352,8 @@ static int sdma_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
 
 		ring->sdma.instance = &adev->sdma.instance[i];
 		ring->sdma.index = i;
+		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
+		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
 	}
 
 	adev->sdma.supported_reset =
@@ -1687,7 +1676,7 @@ static const struct amdgpu_ring_funcs sdma_v7_0_ring_funcs = {
 	.emit_hdp_flush = sdma_v7_0_ring_emit_hdp_flush,
 	.test_ring = sdma_v7_0_ring_test_ring,
 	.test_ib = sdma_v7_0_ring_test_ib,
-	.insert_nop = sdma_v7_0_ring_insert_nop,
+	.insert_nop = amdgpu_sdma_ring_insert_nop,
 	.pad_ib = sdma_v7_0_ring_pad_ib,
 	.emit_wreg = sdma_v7_0_ring_emit_wreg,
 	.emit_reg_wait = sdma_v7_0_ring_emit_reg_wait,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 05/12] drm/amdgpu: Use memset32 for sdma insert nops
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (3 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 04/12] drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 06/12] drm/amdgpu: Use amdgpu_ring_fill() for VPE padding Tvrtko Ursulin
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Instead of open coding it via the inefficient amdgpu_ring_write() helper
which the compiler is not able to optimise much, we can use the new
amdgpu_ring_fill() helper which pads using memset32.

With SDMA this should have much less benefit than what was measured with
GFX (only SDMA v4.0 uses the 256 byte ring padding while the rest use 16),
but on the other hand it should not harm and is at least more consistent.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 17 +---------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 26 ++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c |  4 +---
 3 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 8e4e9ec68262..a4ae09fed5c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -108,22 +108,7 @@ int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned int ndw)
  */
 void amdgpu_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	uint32_t occupied, chunk1, chunk2;
-
-	occupied = ring->wptr & ring->buf_mask;
-	chunk1 = ring->buf_mask + 1 - occupied;
-	chunk1 = (chunk1 >= count) ? count : chunk1;
-	chunk2 = count - chunk1;
-
-	if (chunk1)
-		memset32(&ring->ring[occupied], ring->funcs->nop, chunk1);
-
-	if (chunk2)
-		memset32(ring->ring, ring->funcs->nop, chunk2);
-
-	ring->wptr += count;
-	ring->wptr &= ring->ptr_mask;
-	ring->count_dw -= count;
+	amdgpu_ring_fill(ring, ring->funcs->nop, count);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 4c0861ebc77a..0a59738fa1d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -415,6 +415,32 @@ static inline void amdgpu_ring_write_multiple(struct amdgpu_ring *ring,
 	ring->count_dw -= count_dw;
 }
 
+static inline void amdgpu_ring_fill(struct amdgpu_ring *ring,
+				    u32 val, u32 count)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u32 occupied, chunk1, chunk2;
+	u64 wptr = ring->wptr;
+
+	if (count == 0)
+		return;
+
+	occupied = wptr & buf_mask;
+	chunk1 = buf_mask + 1 - occupied;
+	chunk1 = (chunk1 >= count) ? count : chunk1;
+	chunk2 = count - chunk1;
+
+	if (chunk1)
+		memset32(&ring->ring[occupied], val, chunk1);
+
+	if (chunk2)
+		memset32(ring->ring, val, chunk2);
+
+	wptr += count;
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= count;
+}
+
 /**
  * amdgpu_ring_patch_cond_exec - patch dw count of conditional execute
  * @ring: amdgpu_ring structure
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
index 148413f01875..2d07fcbd21b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
@@ -38,7 +38,6 @@ void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count)
 {
 	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
 	const u32 nop = ring->funcs->nop;
-	u32 i;
 
 	if (!count)
 		return;
@@ -48,8 +47,7 @@ void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count)
 					(--count & ring->sdma.nop_pkt.mask) <<
 					ring->sdma.nop_pkt.shift);
 
-	for (i = 0; i < count; i++)
-		amdgpu_ring_write(ring, nop);
+	amdgpu_ring_fill(ring, nop, count);
 }
 
 uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 06/12] drm/amdgpu: Use amdgpu_ring_fill() for VPE padding
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (4 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 05/12] drm/amdgpu: Use memset32 for sdma insert nops Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 07/12] drm/amdgpu: Convert JPEG, VCE and UVD to more efficient ring padding Tvrtko Ursulin
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Equally as with SDMA, the padding requirements for VPE are not great (16)
so this is only done for consistency.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
index 121ee17b522b..6c20ea754618 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
@@ -459,14 +459,13 @@ static int vpe_resume(struct amdgpu_ip_block *ip_block)
 
 static void vpe_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
+	const u32 nop = ring->funcs->nop;
 
-	for (i = 0; i < count; i++)
-		if (i == 0)
-			amdgpu_ring_write(ring, ring->funcs->nop |
-				VPE_CMD_NOP_HEADER_COUNT(count - 1));
-		else
-			amdgpu_ring_write(ring, ring->funcs->nop);
+	if (count == 0)
+		return;
+
+	amdgpu_ring_write(ring, nop | VPE_CMD_NOP_HEADER_COUNT(--count));
+	amdgpu_ring_fill(ring, nop, count);
 }
 
 static uint64_t vpe_get_csa_mc_addr(struct amdgpu_ring *ring, uint32_t vmid)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 07/12] drm/amdgpu: Convert JPEG, VCE and UVD to more efficient ring padding
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (5 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 06/12] drm/amdgpu: Use amdgpu_ring_fill() for VPE padding Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 08/12] drm/amdgpu: Cache some values in ring emission helpers Tvrtko Ursulin
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Similarly as in the previous patch, but now we add a new
amdgpu_ring_fill2x32() helper which can write out the nops more
efficiently using memset64().

This should have a lesser effect than the previous patch, given how the
affected rings have at most 64 dword alignment restriction (in contrast to
many with 256 with GFX). Nevertheless, it sounds plausible to utilise the
identical approach and still achieve that little bit more efficiency.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 41 ++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/jpeg_v1_0.c   |  8 ++---
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c   |  8 ++---
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c |  8 ++---
 drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c    |  7 +---
 drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c    |  7 +---
 drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c    |  7 +---
 drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c    |  7 +---
 drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c    |  9 +++---
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c    |  8 ++---
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c    |  7 ++--
 11 files changed, 60 insertions(+), 57 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 0a59738fa1d4..158238f8c06a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -441,6 +441,47 @@ static inline void amdgpu_ring_fill(struct amdgpu_ring *ring,
 	ring->count_dw -= count;
 }
 
+static inline void amdgpu_ring_fill2x32(struct amdgpu_ring *ring,
+					u32 v1, u32 v2, u32 count)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u32 occupied, chunk1, chunk2;
+	u64 wptr = ring->wptr;
+	u64 v;
+
+	if (count == 0)
+		return;
+
+	occupied = wptr & buf_mask;
+	chunk1 = buf_mask + 1 - occupied;
+	count *= 2; /* To dwords */
+	chunk1 = (chunk1 >= count) ? count : chunk1;
+	chunk2 = count - chunk1;
+
+	if (WARN_ON_ONCE((chunk1 | chunk2) & 1))
+		return;
+
+	/* To qwords */
+	chunk1 >>= 1;
+	chunk2 >>= 1;
+
+#ifdef __BIG_ENDIAN
+	v = (u64)v1 << 32 | v2;
+#else
+	v = (u64)v2 << 32 | v1;
+#endif
+
+	if (chunk1)
+		memset64((u64 *)&ring->ring[occupied], v, chunk1);
+
+	if (chunk2)
+		memset64((u64 *)&ring->ring[0], v, chunk2);
+
+	wptr += count;
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= count;
+}
+
 /**
  * amdgpu_ring_patch_cond_exec - patch dw count of conditional execute
  * @ring: amdgpu_ring structure
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v1_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v1_0.c
index 03b8b7cd5229..1c4ec0330391 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v1_0.c
@@ -418,14 +418,10 @@ static void jpeg_v1_0_decode_ring_emit_wreg(struct amdgpu_ring *ring,
 
 static void jpeg_v1_0_decode_ring_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
-
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6), 0,
+			     count / 2);
 }
 
 static int jpeg_v1_0_set_interrupt_state(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
index 7c9251c03815..4b39898aacc9 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
@@ -645,14 +645,10 @@ void jpeg_v2_0_dec_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg, uint32
 
 void jpeg_v2_0_dec_ring_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
-
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6), 0,
+			     count / 2);
 }
 
 static bool jpeg_v2_0_is_idle(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
index 88f9771c1686..5e503a5c1926 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
@@ -897,14 +897,10 @@ void jpeg_v4_0_3_dec_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg, uint
 
 void jpeg_v4_0_3_dec_ring_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
-
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6), 0,
+			     count / 2);
 }
 
 static bool jpeg_v4_0_3_is_idle(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c b/drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c
index 5830e799c0a3..4faf8ca2773d 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c
@@ -166,14 +166,9 @@ static int uvd_v3_1_ring_test_ring(struct amdgpu_ring *ring)
 
 static void uvd_v3_1_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
-
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKET0(mmUVD_NO_OP, 0));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKET0(mmUVD_NO_OP, 0), 0, count / 2);
 }
 
 static const struct amdgpu_ring_funcs uvd_v3_1_ring_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c b/drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c
index f93079e09215..05ac019e00b4 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c
@@ -553,14 +553,9 @@ static void uvd_v4_2_ring_emit_ib(struct amdgpu_ring *ring,
 
 static void uvd_v4_2_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
-
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKET0(mmUVD_NO_OP, 0));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKET0(mmUVD_NO_OP, 0), 0, count / 2);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c
index 050a0f309390..dde0d7d508ce 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c
@@ -570,14 +570,9 @@ static void uvd_v5_0_ring_emit_ib(struct amdgpu_ring *ring,
 
 static void uvd_v5_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
-
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKET0(mmUVD_NO_OP, 0));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKET0(mmUVD_NO_OP, 0), 0, count / 2);
 }
 
 static bool uvd_v5_0_is_idle(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
index d9d036ee51fb..095e71d509dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c
@@ -1106,14 +1106,9 @@ static void uvd_v6_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 
 static void uvd_v6_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
-	int i;
-
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKET0(mmUVD_NO_OP, 0));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKET0(mmUVD_NO_OP, 0), 0, count / 2);
 }
 
 static void uvd_v6_0_enc_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
index 9d237b5937fb..31005dbde8a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
@@ -1416,14 +1416,13 @@ static void uvd_v7_0_ring_emit_vm_flush(struct amdgpu_ring *ring,
 static void uvd_v7_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
 	struct amdgpu_device *adev = ring->adev;
-	int i;
 
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKET0(SOC15_REG_OFFSET(UVD, ring->me, mmUVD_NO_OP), 0));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring,
+			   PACKET0(SOC15_REG_OFFSET(UVD, ring->me, mmUVD_NO_OP), 0),
+			   0,
+			   count / 2);
 }
 
 static void uvd_v7_0_enc_ring_insert_end(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
index 5ea96c983517..2f848f302069 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -1790,14 +1790,12 @@ static int vcn_v1_0_process_interrupt(struct amdgpu_device *adev,
 static void vcn_v1_0_dec_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
 	struct amdgpu_device *adev = ring->adev;
-	int i;
 
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_NO_OP), 0));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring,
+			   PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_NO_OP), 0), 0,
+			   count / 2);
 }
 
 static int vcn_v1_0_set_powergating_state(struct amdgpu_ip_block *ip_block,
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
index e42cfc731ad8..9d24ba173f09 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
@@ -1453,14 +1453,11 @@ void vcn_v2_0_dec_ring_insert_end(struct amdgpu_ring *ring)
 void vcn_v2_0_dec_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
 {
 	struct amdgpu_device *adev = ring->adev;
-	int i;
 
 	WARN_ON(ring->wptr % 2 || count % 2);
 
-	for (i = 0; i < count / 2; i++) {
-		amdgpu_ring_write(ring, PACKET0(adev->vcn.internal.nop, 0));
-		amdgpu_ring_write(ring, 0);
-	}
+	amdgpu_ring_fill2x32(ring, PACKET0(adev->vcn.internal.nop, 0), 0,
+			     count / 2);
 }
 
 /**
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 08/12] drm/amdgpu: Cache some values in ring emission helpers
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (6 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 07/12] drm/amdgpu: Convert JPEG, VCE and UVD to more efficient ring padding Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 09/12] drm/amdgpu: Optimise amdgpu_ring_write() Tvrtko Ursulin
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

     text	   data	    bss	     dec	    hex	filename
 10437711	 542597	 188232	11168540	 aa6b1c	amdgpu.ko.before
 10418181	 542597	 188232	11149010	 aa1ed2	amdgpu.ko.after

Main reason seems to be amdgpu_ring_write() can avoid re-loading
ring->wptr when called multiple times in sequence.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 158238f8c06a..b57951d8c997 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -385,8 +385,10 @@ static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
 
 static inline void amdgpu_ring_write(struct amdgpu_ring *ring, uint32_t v)
 {
-	ring->ring[ring->wptr++ & ring->buf_mask] = v;
-	ring->wptr &= ring->ptr_mask;
+	u64 wptr = ring->wptr;
+
+	ring->ring[wptr++ & ring->buf_mask] = v;
+	ring->wptr = wptr & ring->ptr_mask;
 	ring->count_dw--;
 }
 
@@ -394,9 +396,11 @@ static inline void amdgpu_ring_write_multiple(struct amdgpu_ring *ring,
 					      void *src, int count_dw)
 {
 	unsigned occupied, chunk1, chunk2;
+	u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
 
-	occupied = ring->wptr & ring->buf_mask;
-	chunk1 = ring->buf_mask + 1 - occupied;
+	occupied = wptr & buf_mask;
+	chunk1 = buf_mask + 1 - occupied;
 	chunk1 = (chunk1 >= count_dw) ? count_dw : chunk1;
 	chunk2 = count_dw - chunk1;
 	chunk1 <<= 2;
@@ -410,8 +414,8 @@ static inline void amdgpu_ring_write_multiple(struct amdgpu_ring *ring,
 		memcpy(ring->ring, src, chunk2);
 	}
 
-	ring->wptr += count_dw;
-	ring->wptr &= ring->ptr_mask;
+	wptr += count_dw;
+	ring->wptr = wptr & ring->ptr_mask;
 	ring->count_dw -= count_dw;
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 09/12] drm/amdgpu: Optimise amdgpu_ring_write()
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (7 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 08/12] drm/amdgpu: Cache some values in ring emission helpers Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2025-01-02 13:55   ` Christian König
  2024-12-27 11:19 ` [PATCH 10/12] drm/amdgpu: Convert VCE to variadic amdgpu_ring_write() Tvrtko Ursulin
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

There are more than 2000 calls to amdgpu_ring_write() in the driver and
the majority is multiple sequential calls which the compiler cannot
optimise much.

Lets make this helper variadic via some pre-processor magic which allows
merging those sequential calls and in turn enables the compiler to emit
much less code.

If we then would convert some call sites to use this macro, lets see on
the example of amdgpu_vce_ring_emit_ib(), what results that would bring.
Before (but after the wptr local caching, before it is even worse):

  173c39:       48 89 f8                mov    %rdi,%rax
  173c3c:       48 89 d1                mov    %rdx,%rcx
  173c3f:       48 8b 97 c8 01 00 00    mov    0x1c8(%rdi),%rdx
  173c46:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
  173c4d:       89 d7                   mov    %edx,%edi
  173c4f:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
  173c55:       48 83 c2 01             add    $0x1,%rdx
  173c59:       c7 04 be 02 00 00 00    movl   $0x2,(%rsi,%rdi,4)
  173c60:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
  173c67:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
  173c6e:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
  173c75:       89 d7                   mov    %edx,%edi
  173c77:       48 83 c2 01             add    $0x1,%rdx
  173c7b:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
  173c82:       4c 8b 41 10             mov    0x10(%rcx),%r8
  173c86:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
  173c8c:       44 89 04 be             mov    %r8d,(%rsi,%rdi,4)
  173c90:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
  173c97:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
  173c9e:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
  173ca5:       89 d7                   mov    %edx,%edi
  173ca7:       48 83 c2 01             add    $0x1,%rdx
  173cab:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
  173cb2:       44 8b 41 14             mov    0x14(%rcx),%r8d
  173cb6:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
  173cbc:       44 89 04 be             mov    %r8d,(%rsi,%rdi,4)
  173cc0:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
  173cc7:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
  173cce:       89 d6                   mov    %edx,%esi
  173cd0:       23 b0 f8 01 00 00       and    0x1f8(%rax),%esi
  173cd6:       48 83 c2 01             add    $0x1,%rdx
  173cda:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
  173ce1:       8b 79 08                mov    0x8(%rcx),%edi
  173ce4:       48 8b 88 a8 01 00 00    mov    0x1a8(%rax),%rcx
  173ceb:       89 3c b1                mov    %edi,(%rcx,%rsi,4)
  173cee:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
  173cf5:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
  173cfc:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
  173d03:       c3                      ret

And after:

    1579:       48 89 f8                mov    %rdi,%rax
    157c:       44 8b 4a 08             mov    0x8(%rdx),%r9d
    1580:       48 8b 7a 10             mov    0x10(%rdx),%rdi
    1584:       48 8b 90 c8 01 00 00    mov    0x1c8(%rax),%rdx
    158b:       8b b0 f8 01 00 00       mov    0x1f8(%rax),%esi
    1591:       48 8b 88 a8 01 00 00    mov    0x1a8(%rax),%rcx
    1598:       49 89 d0                mov    %rdx,%r8
    159b:       49 21 f0                and    %rsi,%r8
    159e:       42 c7 04 81 02 00 00    movl   $0x2,(%rcx,%r8,4)
    15a5:       00
    15a6:       4c 8d 42 01             lea    0x1(%rdx),%r8
    15aa:       49 21 f0                and    %rsi,%r8
    15ad:       42 89 3c 81             mov    %edi,(%rcx,%r8,4)
    15b1:       4c 8d 42 02             lea    0x2(%rdx),%r8
    15b5:       48 c1 ef 20             shr    $0x20,%rdi
    15b9:       49 21 f0                and    %rsi,%r8
    15bc:       42 89 3c 81             mov    %edi,(%rcx,%r8,4)
    15c0:       48 8d 7a 03             lea    0x3(%rdx),%rdi
    15c4:       48 83 c2 04             add    $0x4,%rdx
    15c8:       48 21 fe                and    %rdi,%rsi
    15cb:       44 89 0c b1             mov    %r9d,(%rcx,%rsi,4)
    15cf:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
    15d6:       83 a8 e0 01 00 00 04    subl   $0x4,0x1e0(%rax)
    15dd:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
    15e4:       c3                      ret

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 222 ++++++++++++++++++++++-
 1 file changed, 220 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index b57951d8c997..4f467864ed09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -383,15 +383,233 @@ static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
 	memset32(ring->ring, ring->funcs->nop, ring->buf_mask + 1);
 }
 
-static inline void amdgpu_ring_write(struct amdgpu_ring *ring, uint32_t v)
+static inline void
+amdgpu_ring_write1(struct amdgpu_ring *ring,
+		   u32 v1)
 {
+	const u32 buf_mask = ring->buf_mask;
 	u64 wptr = ring->wptr;
 
-	ring->ring[wptr++ & ring->buf_mask] = v;
+	ring->ring[wptr++ & buf_mask] = v1;
 	ring->wptr = wptr & ring->ptr_mask;
 	ring->count_dw--;
 }
 
+static inline void
+amdgpu_ring_write2(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 2;
+}
+
+static inline void
+amdgpu_ring_write3(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 3;
+}
+
+static inline void
+amdgpu_ring_write4(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 4;
+}
+
+static inline void
+amdgpu_ring_write5(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+	r[wptr++ & buf_mask] = v5;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 5;
+}
+
+static inline void
+amdgpu_ring_write6(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+	r[wptr++ & buf_mask] = v5;
+	r[wptr++ & buf_mask] = v6;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 6;
+}
+
+static inline void
+amdgpu_ring_write7(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+	r[wptr++ & buf_mask] = v5;
+	r[wptr++ & buf_mask] = v6;
+	r[wptr++ & buf_mask] = v7;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 7;
+}
+
+static inline void
+amdgpu_ring_write8(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
+		   u32 v8)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+	r[wptr++ & buf_mask] = v5;
+	r[wptr++ & buf_mask] = v6;
+	r[wptr++ & buf_mask] = v7;
+	r[wptr++ & buf_mask] = v8;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 8;
+}
+
+static inline void
+amdgpu_ring_write9(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
+		   u32 v8, u32 v9)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+	r[wptr++ & buf_mask] = v5;
+	r[wptr++ & buf_mask] = v6;
+	r[wptr++ & buf_mask] = v7;
+	r[wptr++ & buf_mask] = v8;
+	r[wptr++ & buf_mask] = v9;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 9;
+}
+
+static inline void
+amdgpu_ring_write10(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
+		   u32 v8, u32 v9, u32 v10)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+	r[wptr++ & buf_mask] = v5;
+	r[wptr++ & buf_mask] = v6;
+	r[wptr++ & buf_mask] = v7;
+	r[wptr++ & buf_mask] = v8;
+	r[wptr++ & buf_mask] = v9;
+	r[wptr++ & buf_mask] = v10;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 10;
+}
+
+static inline void
+amdgpu_ring_write11(struct amdgpu_ring *ring,
+		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
+		   u32 v8, u32 v9, u32 v10, u32 v11)
+{
+	const u32 buf_mask = ring->buf_mask;
+	u64 wptr = ring->wptr;
+	u32 *r = ring->ring;
+
+	r[wptr++ & buf_mask] = v1;
+	r[wptr++ & buf_mask] = v2;
+	r[wptr++ & buf_mask] = v3;
+	r[wptr++ & buf_mask] = v4;
+	r[wptr++ & buf_mask] = v5;
+	r[wptr++ & buf_mask] = v6;
+	r[wptr++ & buf_mask] = v7;
+	r[wptr++ & buf_mask] = v8;
+	r[wptr++ & buf_mask] = v9;
+	r[wptr++ & buf_mask] = v10;
+	r[wptr++ & buf_mask] = v11;
+
+	ring->wptr = wptr & ring->ptr_mask;
+	ring->count_dw -= 11;
+}
+
+#define AMDGPU_RING_WRITE(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, NAME, ...) NAME
+#define amdgpu_ring_write(...) \
+	AMDGPU_RING_WRITE(__VA_ARGS__, \
+			  amdgpu_ring_write11, \
+			  amdgpu_ring_write10, \
+			  amdgpu_ring_write9, \
+			  amdgpu_ring_write8, \
+			  amdgpu_ring_write7, \
+			  amdgpu_ring_write6, \
+			  amdgpu_ring_write5, \
+			  amdgpu_ring_write4, \
+			  amdgpu_ring_write3, \
+			  amdgpu_ring_write2, \
+			  amdgpu_ring_write1, \
+			  NULL)(__VA_ARGS__)
+
 static inline void amdgpu_ring_write_multiple(struct amdgpu_ring *ring,
 					      void *src, int count_dw)
 {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 10/12] drm/amdgpu: Convert VCE to variadic amdgpu_ring_write()
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (8 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 09/12] drm/amdgpu: Optimise amdgpu_ring_write() Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 11/12] drm/amdgpu: Convert GFX v10.0 " Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 12/12] drm/amdgpu: Convert SDMA v5.2 " Tvrtko Ursulin
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Just a small example how the conversion works.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index b9060bcd4806..3fa0b0824836 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -1069,10 +1069,11 @@ void amdgpu_vce_ring_emit_ib(struct amdgpu_ring *ring,
 				struct amdgpu_ib *ib,
 				uint32_t flags)
 {
-	amdgpu_ring_write(ring, VCE_CMD_IB);
-	amdgpu_ring_write(ring, lower_32_bits(ib->gpu_addr));
-	amdgpu_ring_write(ring, upper_32_bits(ib->gpu_addr));
-	amdgpu_ring_write(ring, ib->length_dw);
+	amdgpu_ring_write(ring,
+			  VCE_CMD_IB,
+			  lower_32_bits(ib->gpu_addr),
+			  upper_32_bits(ib->gpu_addr),
+			  ib->length_dw);
 }
 
 /**
@@ -1089,12 +1090,13 @@ void amdgpu_vce_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 seq,
 {
 	WARN_ON(flags & AMDGPU_FENCE_FLAG_64BIT);
 
-	amdgpu_ring_write(ring, VCE_CMD_FENCE);
-	amdgpu_ring_write(ring, addr);
-	amdgpu_ring_write(ring, upper_32_bits(addr));
-	amdgpu_ring_write(ring, seq);
-	amdgpu_ring_write(ring, VCE_CMD_TRAP);
-	amdgpu_ring_write(ring, VCE_CMD_END);
+	amdgpu_ring_write(ring,
+			  VCE_CMD_FENCE,
+			  addr,
+			  upper_32_bits(addr),
+			  seq,
+			  VCE_CMD_TRAP,
+			  VCE_CMD_END);
 }
 
 /**
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 11/12] drm/amdgpu: Convert GFX v10.0 to variadic amdgpu_ring_write()
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (9 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 10/12] drm/amdgpu: Convert VCE to variadic amdgpu_ring_write() Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  2024-12-27 11:19 ` [PATCH 12/12] drm/amdgpu: Convert SDMA v5.2 " Tvrtko Ursulin
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Converting the GFX v10.0 ring helpers to use the variadic
amdgpu_ring_write().

Also small cleanups in gfx_v10_0_cp_gfx_start(),
gfx_v10_0_ring_emit_ce_meta() and
gfx_v10_0_ring_emit_de_meta.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |   2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 399 ++++++++++++-----------
 2 files changed, 204 insertions(+), 197 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 4f467864ed09..1b428dda706a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -611,7 +611,7 @@ amdgpu_ring_write11(struct amdgpu_ring *ring,
 			  NULL)(__VA_ARGS__)
 
 static inline void amdgpu_ring_write_multiple(struct amdgpu_ring *ring,
-					      void *src, int count_dw)
+					      const void *src, int count_dw)
 {
 	unsigned occupied, chunk1, chunk2;
 	u32 buf_mask = ring->buf_mask;
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 003522c2d902..63fc94c5d989 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3683,15 +3683,16 @@ static void gfx10_kiq_set_resources(struct amdgpu_ring *kiq_ring, uint64_t queue
 	/* Cleaner shader MC address */
 	shader_mc_addr = adev->gfx.cleaner_shader_gpu_addr >> 8;
 
-	amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_SET_RESOURCES, 6));
-	amdgpu_ring_write(kiq_ring, PACKET3_SET_RESOURCES_VMID_MASK(0) |
-			  PACKET3_SET_RESOURCES_QUEUE_TYPE(0));	/* vmid_mask:0 queue_type:0 (KIQ) */
-	amdgpu_ring_write(kiq_ring, lower_32_bits(queue_mask));	/* queue mask lo */
-	amdgpu_ring_write(kiq_ring, upper_32_bits(queue_mask));	/* queue mask hi */
-	amdgpu_ring_write(kiq_ring, lower_32_bits(shader_mc_addr)); /* cleaner shader addr lo */
-	amdgpu_ring_write(kiq_ring, upper_32_bits(shader_mc_addr)); /* cleaner shader addr hi */
-	amdgpu_ring_write(kiq_ring, 0);	/* oac mask */
-	amdgpu_ring_write(kiq_ring, 0);	/* gds heap base:0, gds heap size:0 */
+	amdgpu_ring_write(kiq_ring,
+			  PACKET3(PACKET3_SET_RESOURCES, 6),
+			  PACKET3_SET_RESOURCES_VMID_MASK(0) |
+			  PACKET3_SET_RESOURCES_QUEUE_TYPE(0),	/* vmid_mask:0 queue_type:0 (KIQ) */
+			  lower_32_bits(queue_mask),	/* queue mask lo */
+			  upper_32_bits(queue_mask),	/* queue mask hi */
+			  lower_32_bits(shader_mc_addr), /* cleaner shader addr lo */
+			  upper_32_bits(shader_mc_addr), /* cleaner shader addr hi */
+			  0,	/* oac mask */
+			  0);	/* gds heap base:0, gds heap size:0 */
 }
 
 static void gfx10_kiq_map_queues(struct amdgpu_ring *kiq_ring,
@@ -3715,10 +3716,9 @@ static void gfx10_kiq_map_queues(struct amdgpu_ring *kiq_ring,
 		WARN_ON(1);
 	}
 
-	amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_MAP_QUEUES, 5));
-	/* Q_sel:0, vmid:0, vidmem: 1, engine:0, num_Q:1*/
-	amdgpu_ring_write(kiq_ring, /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */
-			  PACKET3_MAP_QUEUES_QUEUE_SEL(0) | /* Queue_Sel */
+	amdgpu_ring_write(kiq_ring,
+			  PACKET3(PACKET3_MAP_QUEUES, 5),
+			  PACKET3_MAP_QUEUES_QUEUE_SEL(0) | /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */
 			  PACKET3_MAP_QUEUES_VMID(0) | /* VMID */
 			  PACKET3_MAP_QUEUES_QUEUE(ring->queue) |
 			  PACKET3_MAP_QUEUES_PIPE(ring->pipe) |
@@ -3726,12 +3726,12 @@ static void gfx10_kiq_map_queues(struct amdgpu_ring *kiq_ring,
 			  PACKET3_MAP_QUEUES_QUEUE_TYPE(0) | /*queue_type: normal compute queue */
 			  PACKET3_MAP_QUEUES_ALLOC_FORMAT(0) | /* alloc format: all_on_one_pipe */
 			  PACKET3_MAP_QUEUES_ENGINE_SEL(eng_sel) |
-			  PACKET3_MAP_QUEUES_NUM_QUEUES(1)); /* num_queues: must be 1 */
-	amdgpu_ring_write(kiq_ring, PACKET3_MAP_QUEUES_DOORBELL_OFFSET(ring->doorbell_index));
-	amdgpu_ring_write(kiq_ring, lower_32_bits(mqd_addr));
-	amdgpu_ring_write(kiq_ring, upper_32_bits(mqd_addr));
-	amdgpu_ring_write(kiq_ring, lower_32_bits(wptr_addr));
-	amdgpu_ring_write(kiq_ring, upper_32_bits(wptr_addr));
+			  PACKET3_MAP_QUEUES_NUM_QUEUES(1), /* num_queues: must be 1 */
+			  PACKET3_MAP_QUEUES_DOORBELL_OFFSET(ring->doorbell_index),
+			  lower_32_bits(mqd_addr),
+			  upper_32_bits(mqd_addr),
+			  lower_32_bits(wptr_addr),
+			  upper_32_bits(wptr_addr));
 }
 
 static void gfx10_kiq_unmap_queues(struct amdgpu_ring *kiq_ring,
@@ -3741,23 +3741,21 @@ static void gfx10_kiq_unmap_queues(struct amdgpu_ring *kiq_ring,
 {
 	uint32_t eng_sel = ring->funcs->type == AMDGPU_RING_TYPE_GFX ? 4 : 0;
 
-	amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_UNMAP_QUEUES, 4));
-	amdgpu_ring_write(kiq_ring, /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */
+	amdgpu_ring_write(kiq_ring,
+			  PACKET3(PACKET3_UNMAP_QUEUES, 4),
 			  PACKET3_UNMAP_QUEUES_ACTION(action) |
 			  PACKET3_UNMAP_QUEUES_QUEUE_SEL(0) |
 			  PACKET3_UNMAP_QUEUES_ENGINE_SEL(eng_sel) |
-			  PACKET3_UNMAP_QUEUES_NUM_QUEUES(1));
-	amdgpu_ring_write(kiq_ring,
-		  PACKET3_UNMAP_QUEUES_DOORBELL_OFFSET0(ring->doorbell_index));
+			  PACKET3_UNMAP_QUEUES_NUM_QUEUES(1) /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */,
+			  PACKET3_UNMAP_QUEUES_DOORBELL_OFFSET0(ring->doorbell_index));
 
 	if (action == PREEMPT_QUEUES_NO_UNMAP) {
-		amdgpu_ring_write(kiq_ring, lower_32_bits(gpu_addr));
-		amdgpu_ring_write(kiq_ring, upper_32_bits(gpu_addr));
-		amdgpu_ring_write(kiq_ring, seq);
+		amdgpu_ring_write(kiq_ring,
+				  lower_32_bits(gpu_addr),
+				  upper_32_bits(gpu_addr),
+				  seq);
 	} else {
-		amdgpu_ring_write(kiq_ring, 0);
-		amdgpu_ring_write(kiq_ring, 0);
-		amdgpu_ring_write(kiq_ring, 0);
+		amdgpu_ring_fill(kiq_ring, 0, 3);
 	}
 }
 
@@ -3768,18 +3766,17 @@ static void gfx10_kiq_query_status(struct amdgpu_ring *kiq_ring,
 {
 	uint32_t eng_sel = ring->funcs->type == AMDGPU_RING_TYPE_GFX ? 4 : 0;
 
-	amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_QUERY_STATUS, 5));
 	amdgpu_ring_write(kiq_ring,
+			  PACKET3(PACKET3_QUERY_STATUS, 5),
 			  PACKET3_QUERY_STATUS_CONTEXT_ID(0) |
 			  PACKET3_QUERY_STATUS_INTERRUPT_SEL(0) |
-			  PACKET3_QUERY_STATUS_COMMAND(2));
-	amdgpu_ring_write(kiq_ring, /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */
+			  PACKET3_QUERY_STATUS_COMMAND(2), /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */
 			  PACKET3_QUERY_STATUS_DOORBELL_OFFSET(ring->doorbell_index) |
-			  PACKET3_QUERY_STATUS_ENG_SEL(eng_sel));
-	amdgpu_ring_write(kiq_ring, lower_32_bits(addr));
-	amdgpu_ring_write(kiq_ring, upper_32_bits(addr));
-	amdgpu_ring_write(kiq_ring, lower_32_bits(seq));
-	amdgpu_ring_write(kiq_ring, upper_32_bits(seq));
+			  PACKET3_QUERY_STATUS_ENG_SEL(eng_sel),
+			  lower_32_bits(addr),
+			  upper_32_bits(addr),
+			  lower_32_bits(seq),
+			  upper_32_bits(seq));
 }
 
 static void gfx10_kiq_invalidate_tlbs(struct amdgpu_ring *kiq_ring,
@@ -3918,12 +3915,13 @@ static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 static void gfx_v10_0_write_data_to_reg(struct amdgpu_ring *ring, int eng_sel,
 				       bool wc, uint32_t reg, uint32_t val)
 {
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
-	amdgpu_ring_write(ring, WRITE_DATA_ENGINE_SEL(eng_sel) |
-			  WRITE_DATA_DST_SEL(0) | (wc ? WR_CONFIRM : 0));
-	amdgpu_ring_write(ring, reg);
-	amdgpu_ring_write(ring, 0);
-	amdgpu_ring_write(ring, val);
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_WRITE_DATA, 3),
+			  WRITE_DATA_ENGINE_SEL(eng_sel) |
+			  WRITE_DATA_DST_SEL(0) | (wc ? WR_CONFIRM : 0),
+			  reg,
+			  0,
+			  val);
 }
 
 static void gfx_v10_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
@@ -3931,21 +3929,21 @@ static void gfx_v10_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
 				  uint32_t addr1, uint32_t ref, uint32_t mask,
 				  uint32_t inv)
 {
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
+	if (mem_space)
+		BUG_ON(addr0 & 0x3); /* Dword align */
+
 	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_WAIT_REG_MEM, 5),
 			  /* memory (1) or register (0) */
 			  (WAIT_REG_MEM_MEM_SPACE(mem_space) |
 			   WAIT_REG_MEM_OPERATION(opt) | /* wait */
 			   WAIT_REG_MEM_FUNCTION(3) |  /* equal */
-			   WAIT_REG_MEM_ENGINE(eng_sel)));
-
-	if (mem_space)
-		BUG_ON(addr0 & 0x3); /* Dword align */
-	amdgpu_ring_write(ring, addr0);
-	amdgpu_ring_write(ring, addr1);
-	amdgpu_ring_write(ring, ref);
-	amdgpu_ring_write(ring, mask);
-	amdgpu_ring_write(ring, inv); /* poll interval */
+			   WAIT_REG_MEM_ENGINE(eng_sel)),
+			  addr0,
+			  addr1,
+			  ref,
+			  mask,
+			  inv); /* poll interval */
 }
 
 static int gfx_v10_0_ring_test_ring(struct amdgpu_ring *ring)
@@ -3964,10 +3962,11 @@ static int gfx_v10_0_ring_test_ring(struct amdgpu_ring *ring)
 		return r;
 	}
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_SET_UCONFIG_REG, 1));
-	amdgpu_ring_write(ring, scratch -
-			  PACKET3_SET_UCONFIG_REG_START);
-	amdgpu_ring_write(ring, 0xDEADBEEF);
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_SET_UCONFIG_REG, 1),
+			  scratch - PACKET3_SET_UCONFIG_REG_START,
+			  0xDEADBEEF);
+
 	amdgpu_ring_commit(ring);
 
 	for (i = 0; i < adev->usec_timeout; i++) {
@@ -6239,8 +6238,8 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
 	struct amdgpu_ring *ring;
 	const struct cs_section_def *sect = NULL;
 	const struct cs_extent_def *ext = NULL;
-	int r, i;
 	int ctx_reg_offset;
+	int r;
 
 	/* init the CP */
 	WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT,
@@ -6256,43 +6255,46 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
 		return r;
 	}
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_PREAMBLE_CNTL, 0));
-	amdgpu_ring_write(ring, PACKET3_PREAMBLE_BEGIN_CLEAR_STATE);
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_PREAMBLE_CNTL, 0),
+			  PACKET3_PREAMBLE_BEGIN_CLEAR_STATE,
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_CONTEXT_CONTROL, 1));
-	amdgpu_ring_write(ring, 0x80000000);
-	amdgpu_ring_write(ring, 0x80000000);
+			  PACKET3(PACKET3_CONTEXT_CONTROL, 1),
+			  0x80000000,
+			  0x80000000);
 
 	for (sect = gfx10_cs_data; sect->section != NULL; ++sect) {
 		for (ext = sect->section; ext->extent != NULL; ++ext) {
 			if (sect->id == SECT_CONTEXT) {
 				amdgpu_ring_write(ring,
 						  PACKET3(PACKET3_SET_CONTEXT_REG,
-							  ext->reg_count));
-				amdgpu_ring_write(ring, ext->reg_index -
+							  ext->reg_count),
+						  ext->reg_index -
 						  PACKET3_SET_CONTEXT_REG_START);
-				for (i = 0; i < ext->reg_count; i++)
-					amdgpu_ring_write(ring, ext->extent[i]);
+				amdgpu_ring_write_multiple(ring, ext->extent,
+							   ext->reg_count);
 			}
 		}
 	}
 
 	ctx_reg_offset =
 		SOC15_REG_OFFSET(GC, 0, mmPA_SC_TILE_STEERING_OVERRIDE) - PACKET3_SET_CONTEXT_REG_START;
-	amdgpu_ring_write(ring, PACKET3(PACKET3_SET_CONTEXT_REG, 1));
-	amdgpu_ring_write(ring, ctx_reg_offset);
-	amdgpu_ring_write(ring, adev->gfx.config.pa_sc_tile_steering_override);
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_PREAMBLE_CNTL, 0));
-	amdgpu_ring_write(ring, PACKET3_PREAMBLE_END_CLEAR_STATE);
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_SET_CONTEXT_REG, 1),
+			   ctx_reg_offset,
+			   adev->gfx.config.pa_sc_tile_steering_override,
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_CLEAR_STATE, 0));
-	amdgpu_ring_write(ring, 0);
+			   PACKET3(PACKET3_PREAMBLE_CNTL, 0),
+			   PACKET3_PREAMBLE_END_CLEAR_STATE,
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_SET_BASE, 2));
-	amdgpu_ring_write(ring, PACKET3_BASE_INDEX(CE_PARTITION_BASE));
-	amdgpu_ring_write(ring, 0x8000);
-	amdgpu_ring_write(ring, 0x8000);
+			   PACKET3(PACKET3_CLEAR_STATE, 0),
+			   0,
+
+			   PACKET3(PACKET3_SET_BASE, 2),
+			   PACKET3_BASE_INDEX(CE_PARTITION_BASE),
+			   0x8000,
+			   0x8000);
 
 	amdgpu_ring_commit(ring);
 
@@ -6306,9 +6308,7 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
 			return r;
 		}
 
-		amdgpu_ring_write(ring, PACKET3(PACKET3_CLEAR_STATE, 0));
-		amdgpu_ring_write(ring, 0);
-
+		amdgpu_ring_write(ring, PACKET3(PACKET3_CLEAR_STATE, 0), 0);
 		amdgpu_ring_commit(ring);
 	}
 	return 0;
@@ -8564,6 +8564,8 @@ static void gfx_v10_0_ring_emit_ib_gfx(struct amdgpu_ring *ring,
 	unsigned int vmid = AMDGPU_JOB_GET_VMID(job);
 	u32 header, control = 0;
 
+	BUG_ON(ib->gpu_addr & 0x3); /* Dword align */
+
 	if (ib->flags & AMDGPU_IB_FLAG_CE)
 		header = PACKET3(PACKET3_INDIRECT_BUFFER_CNST, 2);
 	else
@@ -8582,15 +8584,14 @@ static void gfx_v10_0_ring_emit_ib_gfx(struct amdgpu_ring *ring,
 				    (!amdgpu_sriov_vf(ring->adev) && flags & AMDGPU_IB_PREEMPTED) ? true : false);
 	}
 
-	amdgpu_ring_write(ring, header);
-	BUG_ON(ib->gpu_addr & 0x3); /* Dword align */
 	amdgpu_ring_write(ring,
+			  header,
 #ifdef __BIG_ENDIAN
-		(2 << 0) |
+			  (2 << 0) |
 #endif
-		lower_32_bits(ib->gpu_addr));
-	amdgpu_ring_write(ring, upper_32_bits(ib->gpu_addr));
-	amdgpu_ring_write(ring, control);
+			  lower_32_bits(ib->gpu_addr),
+			  upper_32_bits(ib->gpu_addr),
+			  control);
 }
 
 static void gfx_v10_0_ring_emit_ib_compute(struct amdgpu_ring *ring,
@@ -8601,6 +8602,8 @@ static void gfx_v10_0_ring_emit_ib_compute(struct amdgpu_ring *ring,
 	unsigned int vmid = AMDGPU_JOB_GET_VMID(job);
 	u32 control = INDIRECT_BUFFER_VALID | ib->length_dw | (vmid << 24);
 
+	BUG_ON(ib->gpu_addr & 0x3); /* Dword align */
+
 	/* Currently, there is a high possibility to get wave ID mismatch
 	 * between ME and GDS, leading to a hw deadlock, because ME generates
 	 * different wave IDs than the GDS expects. This situation happens
@@ -8612,20 +8615,20 @@ static void gfx_v10_0_ring_emit_ib_compute(struct amdgpu_ring *ring,
 	 * GDS to 0 for this ring (me/pipe).
 	 */
 	if (ib->flags & AMDGPU_IB_FLAG_RESET_GDS_MAX_WAVE_ID) {
-		amdgpu_ring_write(ring, PACKET3(PACKET3_SET_CONFIG_REG, 1));
-		amdgpu_ring_write(ring, mmGDS_COMPUTE_MAX_WAVE_ID);
-		amdgpu_ring_write(ring, ring->adev->gds.gds_compute_max_wave_id);
+		amdgpu_ring_write(ring,
+				  PACKET3(PACKET3_SET_CONFIG_REG, 1),
+				  mmGDS_COMPUTE_MAX_WAVE_ID,
+				  ring->adev->gds.gds_compute_max_wave_id);
 	}
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_INDIRECT_BUFFER, 2));
-	BUG_ON(ib->gpu_addr & 0x3); /* Dword align */
 	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_INDIRECT_BUFFER, 2),
 #ifdef __BIG_ENDIAN
-				(2 << 0) |
+			  (2 << 0) |
 #endif
-				lower_32_bits(ib->gpu_addr));
-	amdgpu_ring_write(ring, upper_32_bits(ib->gpu_addr));
-	amdgpu_ring_write(ring, control);
+			  lower_32_bits(ib->gpu_addr),
+			  upper_32_bits(ib->gpu_addr),
+			  control);
 }
 
 static void gfx_v10_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr,
@@ -8634,18 +8637,6 @@ static void gfx_v10_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr,
 	bool write64bit = flags & AMDGPU_FENCE_FLAG_64BIT;
 	bool int_sel = flags & AMDGPU_FENCE_FLAG_INT;
 
-	/* RELEASE_MEM - flush caches, send int */
-	amdgpu_ring_write(ring, PACKET3(PACKET3_RELEASE_MEM, 6));
-	amdgpu_ring_write(ring, (PACKET3_RELEASE_MEM_GCR_SEQ |
-				 PACKET3_RELEASE_MEM_GCR_GL2_WB |
-				 PACKET3_RELEASE_MEM_GCR_GLM_INV | /* must be set with GLM_WB */
-				 PACKET3_RELEASE_MEM_GCR_GLM_WB |
-				 PACKET3_RELEASE_MEM_CACHE_POLICY(3) |
-				 PACKET3_RELEASE_MEM_EVENT_TYPE(CACHE_FLUSH_AND_INV_TS_EVENT) |
-				 PACKET3_RELEASE_MEM_EVENT_INDEX(5)));
-	amdgpu_ring_write(ring, (PACKET3_RELEASE_MEM_DATA_SEL(write64bit ? 2 : 1) |
-				 PACKET3_RELEASE_MEM_INT_SEL(int_sel ? 2 : 0)));
-
 	/*
 	 * the address should be Qword aligned if 64bit write, Dword
 	 * aligned if only send 32bit data low (discard data high)
@@ -8654,11 +8645,24 @@ static void gfx_v10_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr,
 		BUG_ON(addr & 0x7);
 	else
 		BUG_ON(addr & 0x3);
-	amdgpu_ring_write(ring, lower_32_bits(addr));
-	amdgpu_ring_write(ring, upper_32_bits(addr));
-	amdgpu_ring_write(ring, lower_32_bits(seq));
-	amdgpu_ring_write(ring, upper_32_bits(seq));
-	amdgpu_ring_write(ring, 0);
+
+	amdgpu_ring_write(ring,
+			  /* RELEASE_MEM - flush caches, send int */
+			  PACKET3(PACKET3_RELEASE_MEM, 6),
+			  (PACKET3_RELEASE_MEM_GCR_SEQ |
+			   PACKET3_RELEASE_MEM_GCR_GL2_WB |
+			   PACKET3_RELEASE_MEM_GCR_GLM_INV | /* must be set with GLM_WB */
+			   PACKET3_RELEASE_MEM_GCR_GLM_WB |
+			   PACKET3_RELEASE_MEM_CACHE_POLICY(3) |
+			   PACKET3_RELEASE_MEM_EVENT_TYPE(CACHE_FLUSH_AND_INV_TS_EVENT) |
+			   PACKET3_RELEASE_MEM_EVENT_INDEX(5)),
+			  (PACKET3_RELEASE_MEM_DATA_SEL(write64bit ? 2 : 1) |
+			   PACKET3_RELEASE_MEM_INT_SEL(int_sel ? 2 : 0)),
+			  lower_32_bits(addr),
+			  upper_32_bits(addr),
+			  lower_32_bits(seq),
+			  upper_32_bits(seq),
+			  0);
 }
 
 static void gfx_v10_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
@@ -8675,8 +8679,8 @@ static void gfx_v10_0_ring_invalidate_tlbs(struct amdgpu_ring *ring,
 				   uint16_t pasid, uint32_t flush_type,
 				   bool all_hub, uint8_t dst_sel)
 {
-	amdgpu_ring_write(ring, PACKET3(PACKET3_INVALIDATE_TLBS, 0));
 	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_INVALIDATE_TLBS, 0),
 			  PACKET3_INVALIDATE_TLBS_DST_SEL(dst_sel) |
 			  PACKET3_INVALIDATE_TLBS_ALL_HUB(all_hub) |
 			  PACKET3_INVALIDATE_TLBS_PASID(pasid) |
@@ -8691,8 +8695,7 @@ static void gfx_v10_0_ring_emit_vm_flush(struct amdgpu_ring *ring,
 	/* compute doesn't have PFP */
 	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
 		/* sync PFP to ME, otherwise we might get invalid PFP reads */
-		amdgpu_ring_write(ring, PACKET3(PACKET3_PFP_SYNC_ME, 0));
-		amdgpu_ring_write(ring, 0x0);
+		amdgpu_ring_write(ring, PACKET3(PACKET3_PFP_SYNC_ME, 0), 0);
 	}
 }
 
@@ -8705,40 +8708,42 @@ static void gfx_v10_0_ring_emit_fence_kiq(struct amdgpu_ring *ring, u64 addr,
 	BUG_ON(flags & AMDGPU_FENCE_FLAG_64BIT);
 
 	/* write fence seq to the "addr" */
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
-	amdgpu_ring_write(ring, (WRITE_DATA_ENGINE_SEL(0) |
-				 WRITE_DATA_DST_SEL(5) | WR_CONFIRM));
-	amdgpu_ring_write(ring, lower_32_bits(addr));
-	amdgpu_ring_write(ring, upper_32_bits(addr));
-	amdgpu_ring_write(ring, lower_32_bits(seq));
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_WRITE_DATA, 3),
+			  (WRITE_DATA_ENGINE_SEL(0) | WRITE_DATA_DST_SEL(5) |
+			   WR_CONFIRM),
+			  lower_32_bits(addr),
+			  upper_32_bits(addr),
+			  lower_32_bits(seq));
 
 	if (flags & AMDGPU_FENCE_FLAG_INT) {
 		/* set register to trigger INT */
-		amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
-		amdgpu_ring_write(ring, (WRITE_DATA_ENGINE_SEL(0) |
-					 WRITE_DATA_DST_SEL(0) | WR_CONFIRM));
-		amdgpu_ring_write(ring, SOC15_REG_OFFSET(GC, 0, mmCPC_INT_STATUS));
-		amdgpu_ring_write(ring, 0);
-		amdgpu_ring_write(ring, 0x20000000); /* src_id is 178 */
+		amdgpu_ring_write(ring,
+				  PACKET3(PACKET3_WRITE_DATA, 3),
+				  (WRITE_DATA_ENGINE_SEL(0) |
+				   WRITE_DATA_DST_SEL(0) |
+				   WR_CONFIRM),
+				  SOC15_REG_OFFSET(GC, 0, mmCPC_INT_STATUS),
+				  0,
+				  0x20000000); /* src_id is 178 */
 	}
 }
 
 static void gfx_v10_0_ring_emit_sb(struct amdgpu_ring *ring)
 {
-	amdgpu_ring_write(ring, PACKET3(PACKET3_SWITCH_BUFFER, 0));
-	amdgpu_ring_write(ring, 0);
+	amdgpu_ring_write(ring, PACKET3(PACKET3_SWITCH_BUFFER, 0), 0);
 }
 
 static void gfx_v10_0_ring_emit_cntxcntl(struct amdgpu_ring *ring,
 					 uint32_t flags)
 {
-	uint32_t dw2 = 0;
+	uint32_t dw2;
 
 	if (ring->adev->gfx.mcbp)
 		gfx_v10_0_ring_emit_ce_meta(ring,
 				    (!amdgpu_sriov_vf(ring->adev) && flags & AMDGPU_IB_PREEMPTED) ? true : false);
 
-	dw2 |= 0x80000000; /* set load_enable otherwise this package is just NOPs */
+	dw2 = 0x80000000; /* set load_enable otherwise this package is just NOPs */
 	if (flags & AMDGPU_HAVE_CTX_SWITCH) {
 		/* set load_global_config & load_global_uconfig */
 		dw2 |= 0x8001;
@@ -8758,9 +8763,7 @@ static void gfx_v10_0_ring_emit_cntxcntl(struct amdgpu_ring *ring,
 			dw2 |= 0x10000000;
 	}
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_CONTEXT_CONTROL, 1));
-	amdgpu_ring_write(ring, dw2);
-	amdgpu_ring_write(ring, 0);
+	amdgpu_ring_write(ring, PACKET3(PACKET3_CONTEXT_CONTROL, 1), dw2, 0);
 }
 
 static unsigned int gfx_v10_0_ring_emit_init_cond_exec(struct amdgpu_ring *ring,
@@ -8768,11 +8771,12 @@ static unsigned int gfx_v10_0_ring_emit_init_cond_exec(struct amdgpu_ring *ring,
 {
 	unsigned int ret;
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_COND_EXEC, 3));
-	amdgpu_ring_write(ring, lower_32_bits(addr));
-	amdgpu_ring_write(ring, upper_32_bits(addr));
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_COND_EXEC, 3),
+			  lower_32_bits(addr),
+			  upper_32_bits(addr),
 	/* discard following DWs if *cond_exec_gpu_addr==0 */
-	amdgpu_ring_write(ring, 0);
+			  0);
 	ret = ring->wptr & ring->buf_mask;
 	/* patch dummy value later */
 	amdgpu_ring_write(ring, 0);
@@ -8839,22 +8843,21 @@ static void gfx_v10_0_ring_emit_ce_meta(struct amdgpu_ring *ring, bool resume)
 
 	offset = offsetof(struct v10_gfx_meta_data, ce_payload);
 	ce_payload_gpu_addr = amdgpu_csa_vaddr(ring->adev) + offset;
-	ce_payload_cpu_addr = adev->virt.csa_cpu_addr + offset;
-
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, cnt));
-	amdgpu_ring_write(ring, (WRITE_DATA_ENGINE_SEL(2) |
-				 WRITE_DATA_DST_SEL(8) |
-				 WR_CONFIRM) |
-				 WRITE_DATA_CACHE_POLICY(0));
-	amdgpu_ring_write(ring, lower_32_bits(ce_payload_gpu_addr));
-	amdgpu_ring_write(ring, upper_32_bits(ce_payload_gpu_addr));
 
 	if (resume)
-		amdgpu_ring_write_multiple(ring, ce_payload_cpu_addr,
-					   sizeof(ce_payload) >> 2);
+		ce_payload_cpu_addr = adev->virt.csa_cpu_addr + offset;
 	else
-		amdgpu_ring_write_multiple(ring, (void *)&ce_payload,
-					   sizeof(ce_payload) >> 2);
+		ce_payload_cpu_addr = (void *)&ce_payload;
+
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_WRITE_DATA, cnt),
+			  (WRITE_DATA_ENGINE_SEL(2) | WRITE_DATA_DST_SEL(8) |
+			   WR_CONFIRM) | WRITE_DATA_CACHE_POLICY(0),
+			  lower_32_bits(ce_payload_gpu_addr),
+			  upper_32_bits(ce_payload_gpu_addr));
+
+	amdgpu_ring_write_multiple(ring, ce_payload_cpu_addr,
+				   sizeof(ce_payload) >> 2);
 }
 
 static void gfx_v10_0_ring_emit_de_meta(struct amdgpu_ring *ring, bool resume)
@@ -8867,7 +8870,10 @@ static void gfx_v10_0_ring_emit_de_meta(struct amdgpu_ring *ring, bool resume)
 
 	offset = offsetof(struct v10_gfx_meta_data, de_payload);
 	de_payload_gpu_addr = amdgpu_csa_vaddr(ring->adev) + offset;
-	de_payload_cpu_addr = adev->virt.csa_cpu_addr + offset;
+	if (resume)
+		de_payload_cpu_addr = adev->virt.csa_cpu_addr + offset;
+	else
+		de_payload_cpu_addr = (void *)&de_payload;
 
 	gds_addr = ALIGN(amdgpu_csa_vaddr(ring->adev) +
 			 AMDGPU_CSA_SIZE - adev->gds.gds_size,
@@ -8877,20 +8883,15 @@ static void gfx_v10_0_ring_emit_de_meta(struct amdgpu_ring *ring, bool resume)
 	de_payload.gds_backup_addrhi = upper_32_bits(gds_addr);
 
 	cnt = (sizeof(de_payload) >> 2) + 4 - 2;
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, cnt));
-	amdgpu_ring_write(ring, (WRITE_DATA_ENGINE_SEL(1) |
-				 WRITE_DATA_DST_SEL(8) |
-				 WR_CONFIRM) |
-				 WRITE_DATA_CACHE_POLICY(0));
-	amdgpu_ring_write(ring, lower_32_bits(de_payload_gpu_addr));
-	amdgpu_ring_write(ring, upper_32_bits(de_payload_gpu_addr));
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_WRITE_DATA, cnt),
+			  (WRITE_DATA_ENGINE_SEL(1) | WRITE_DATA_DST_SEL(8) |
+			   WR_CONFIRM) | WRITE_DATA_CACHE_POLICY(0),
+			  lower_32_bits(de_payload_gpu_addr),
+			  upper_32_bits(de_payload_gpu_addr));
 
-	if (resume)
-		amdgpu_ring_write_multiple(ring, de_payload_cpu_addr,
-					   sizeof(de_payload) >> 2);
-	else
-		amdgpu_ring_write_multiple(ring, (void *)&de_payload,
-					   sizeof(de_payload) >> 2);
+	amdgpu_ring_write_multiple(ring, de_payload_cpu_addr,
+				   sizeof(de_payload) >> 2);
 }
 
 static void gfx_v10_0_ring_emit_frame_cntl(struct amdgpu_ring *ring, bool start,
@@ -8898,31 +8899,32 @@ static void gfx_v10_0_ring_emit_frame_cntl(struct amdgpu_ring *ring, bool start,
 {
 	uint32_t v = secure ? FRAME_TMZ : 0;
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_FRAME_CONTROL, 0));
-	amdgpu_ring_write(ring, v | FRAME_CMD(start ? 0 : 1));
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_FRAME_CONTROL, 0),
+			  v | FRAME_CMD(start ? 0 : 1));
 }
 
 static void gfx_v10_0_ring_emit_rreg(struct amdgpu_ring *ring, uint32_t reg,
 				     uint32_t reg_val_offs)
 {
 	struct amdgpu_device *adev = ring->adev;
+	u64 gpu_addr = adev->wb.gpu_addr + reg_val_offs * 4;
 
-	amdgpu_ring_write(ring, PACKET3(PACKET3_COPY_DATA, 4));
-	amdgpu_ring_write(ring, 0 |	/* src: register*/
-				(5 << 8) |	/* dst: memory */
-				(1 << 20));	/* write confirm */
-	amdgpu_ring_write(ring, reg);
-	amdgpu_ring_write(ring, 0);
-	amdgpu_ring_write(ring, lower_32_bits(adev->wb.gpu_addr +
-				reg_val_offs * 4));
-	amdgpu_ring_write(ring, upper_32_bits(adev->wb.gpu_addr +
-				reg_val_offs * 4));
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_COPY_DATA, 4),
+			  0 |		/* src: register*/
+			  (5 << 8) |	/* dst: memory */
+			  (1 << 20),	/* write confirm */
+			  reg,
+			  0,
+			  lower_32_bits(gpu_addr),
+			  upper_32_bits(gpu_addr));
 }
 
 static void gfx_v10_0_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg,
 				   uint32_t val)
 {
-	uint32_t cmd = 0;
+	uint32_t cmd;
 
 	switch (ring->funcs->type) {
 	case AMDGPU_RING_TYPE_GFX:
@@ -8935,11 +8937,13 @@ static void gfx_v10_0_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg,
 		cmd = WR_CONFIRM;
 		break;
 	}
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
-	amdgpu_ring_write(ring, cmd);
-	amdgpu_ring_write(ring, reg);
-	amdgpu_ring_write(ring, 0);
-	amdgpu_ring_write(ring, val);
+
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_WRITE_DATA, 3),
+			  cmd,
+			  reg,
+			  0,
+			  val);
 }
 
 static void gfx_v10_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
@@ -9416,15 +9420,17 @@ static void gfx_v10_0_emit_mem_sync(struct amdgpu_ring *ring)
 			PACKET3_ACQUIRE_MEM_GCR_CNTL_GLK_INV(1) |
 			PACKET3_ACQUIRE_MEM_GCR_CNTL_GLI_INV(1);
 
-	/* ACQUIRE_MEM - make one or more surfaces valid for use by the subsequent operations */
-	amdgpu_ring_write(ring, PACKET3(PACKET3_ACQUIRE_MEM, 6));
-	amdgpu_ring_write(ring, 0); /* CP_COHER_CNTL */
-	amdgpu_ring_write(ring, 0xffffffff);  /* CP_COHER_SIZE */
-	amdgpu_ring_write(ring, 0xffffff);  /* CP_COHER_SIZE_HI */
-	amdgpu_ring_write(ring, 0); /* CP_COHER_BASE */
-	amdgpu_ring_write(ring, 0);  /* CP_COHER_BASE_HI */
-	amdgpu_ring_write(ring, 0x0000000A); /* POLL_INTERVAL */
-	amdgpu_ring_write(ring, gcr_cntl); /* GCR_CNTL */
+
+	amdgpu_ring_write(ring,
+			  /* ACQUIRE_MEM - make one or more surfaces valid for use by the subsequent operations */
+			  PACKET3(PACKET3_ACQUIRE_MEM, 6),
+			  0, /* CP_COHER_CNTL */
+			  0xffffffff,  /* CP_COHER_SIZE */
+			  0xffffff,  /* CP_COHER_SIZE_HI */
+			  0, /* CP_COHER_BASE */
+			  0,  /* CP_COHER_BASE_HI */
+			  0x0000000A, /* POLL_INTERVAL */
+			  gcr_cntl); /* GCR_CNTL */
 }
 
 static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
@@ -9726,8 +9732,9 @@ static void gfx_v10_ip_dump(struct amdgpu_ip_block *ip_block)
 static void gfx_v10_0_ring_emit_cleaner_shader(struct amdgpu_ring *ring)
 {
 	/* Emit the cleaner shader */
-	amdgpu_ring_write(ring, PACKET3(PACKET3_RUN_CLEANER_SHADER, 0));
-	amdgpu_ring_write(ring, 0);  /* RESERVED field, programmed to zero */
+	amdgpu_ring_write(ring,
+			  PACKET3(PACKET3_RUN_CLEANER_SHADER, 0),
+			  0);  /* RESERVED field, programmed to zero */
 }
 
 static const struct amd_ip_funcs gfx_v10_0_ip_funcs = {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 12/12] drm/amdgpu: Convert SDMA v5.2 to variadic amdgpu_ring_write()
  2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
                   ` (10 preceding siblings ...)
  2024-12-27 11:19 ` [PATCH 11/12] drm/amdgpu: Convert GFX v10.0 " Tvrtko Ursulin
@ 2024-12-27 11:19 ` Tvrtko Ursulin
  11 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-12-27 11:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Converting the SDMA v5.2 ring helpers to use the variadic
amdgpu_ring_write().

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 151 +++++++++++++------------
 1 file changed, 80 insertions(+), 71 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 69b88db32117..55c1cc0e03b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -143,10 +143,11 @@ static unsigned sdma_v5_2_ring_init_cond_exec(struct amdgpu_ring *ring,
 {
 	unsigned ret;
 
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_COND_EXE));
-	amdgpu_ring_write(ring, lower_32_bits(addr));
-	amdgpu_ring_write(ring, upper_32_bits(addr));
-	amdgpu_ring_write(ring, 1);
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_COND_EXE),
+			  lower_32_bits(addr),
+			  upper_32_bits(addr),
+			  1);
 	/* this is the offset we need patch later */
 	ret = ring->wptr & ring->buf_mask;
 	/* insert dummy here and patch it later */
@@ -278,14 +279,15 @@ static void sdma_v5_2_ring_emit_ib(struct amdgpu_ring *ring,
 	 */
 	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
 
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
-			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
-	/* base must be 32 byte aligned */
-	amdgpu_ring_write(ring, lower_32_bits(ib->gpu_addr) & 0xffffffe0);
-	amdgpu_ring_write(ring, upper_32_bits(ib->gpu_addr));
-	amdgpu_ring_write(ring, ib->length_dw);
-	amdgpu_ring_write(ring, lower_32_bits(csa_mc_addr));
-	amdgpu_ring_write(ring, upper_32_bits(csa_mc_addr));
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
+			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf),
+			  /* base must be 32 byte aligned */
+			  lower_32_bits(ib->gpu_addr) & 0xffffffe0,
+			  upper_32_bits(ib->gpu_addr),
+			  ib->length_dw,
+			  lower_32_bits(csa_mc_addr),
+			  upper_32_bits(csa_mc_addr));
 }
 
 /**
@@ -303,14 +305,15 @@ static void sdma_v5_2_ring_emit_mem_sync(struct amdgpu_ring *ring)
 			    SDMA_GCR_GLI_INV(1);
 
 	/* flush entire cache L0/L1/L2, this can be optimized by performance requirement */
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_GCR_REQ));
-	amdgpu_ring_write(ring, SDMA_PKT_GCR_REQ_PAYLOAD1_BASE_VA_31_7(0));
-	amdgpu_ring_write(ring, SDMA_PKT_GCR_REQ_PAYLOAD2_GCR_CONTROL_15_0(gcr_cntl) |
-			SDMA_PKT_GCR_REQ_PAYLOAD2_BASE_VA_47_32(0));
-	amdgpu_ring_write(ring, SDMA_PKT_GCR_REQ_PAYLOAD3_LIMIT_VA_31_7(0) |
-			SDMA_PKT_GCR_REQ_PAYLOAD3_GCR_CONTROL_18_16(gcr_cntl >> 16));
-	amdgpu_ring_write(ring, SDMA_PKT_GCR_REQ_PAYLOAD4_LIMIT_VA_47_32(0) |
-			SDMA_PKT_GCR_REQ_PAYLOAD4_VMID(0));
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_GCR_REQ),
+			  SDMA_PKT_GCR_REQ_PAYLOAD1_BASE_VA_31_7(0),
+			  SDMA_PKT_GCR_REQ_PAYLOAD2_GCR_CONTROL_15_0(gcr_cntl) |
+			  SDMA_PKT_GCR_REQ_PAYLOAD2_BASE_VA_47_32(0),
+			  SDMA_PKT_GCR_REQ_PAYLOAD3_LIMIT_VA_31_7(0) |
+			  SDMA_PKT_GCR_REQ_PAYLOAD3_GCR_CONTROL_18_16(gcr_cntl >> 16),
+			  SDMA_PKT_GCR_REQ_PAYLOAD4_LIMIT_VA_47_32(0) |
+			  SDMA_PKT_GCR_REQ_PAYLOAD4_VMID(0));
 }
 
 /**
@@ -331,14 +334,15 @@ static void sdma_v5_2_ring_emit_hdp_flush(struct amdgpu_ring *ring)
 	} else {
 		ref_and_mask = nbio_hf_reg->ref_and_mask_sdma0 << ring->me;
 
-		amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_POLL_REGMEM) |
+		amdgpu_ring_write(ring,
+				  SDMA_PKT_HEADER_OP(SDMA_OP_POLL_REGMEM) |
 				  SDMA_PKT_POLL_REGMEM_HEADER_HDP_FLUSH(1) |
-				  SDMA_PKT_POLL_REGMEM_HEADER_FUNC(3)); /* == */
-		amdgpu_ring_write(ring, (adev->nbio.funcs->get_hdp_flush_done_offset(adev)) << 2);
-		amdgpu_ring_write(ring, (adev->nbio.funcs->get_hdp_flush_req_offset(adev)) << 2);
-		amdgpu_ring_write(ring, ref_and_mask); /* reference */
-		amdgpu_ring_write(ring, ref_and_mask); /* mask */
-		amdgpu_ring_write(ring, SDMA_PKT_POLL_REGMEM_DW5_RETRY_COUNT(0xfff) |
+				  SDMA_PKT_POLL_REGMEM_HEADER_FUNC(3), /* == */
+				  adev->nbio.funcs->get_hdp_flush_done_offset(adev) << 2,
+				  adev->nbio.funcs->get_hdp_flush_req_offset(adev) << 2,
+				  ref_and_mask, /* reference */
+				  ref_and_mask, /* mask */
+				  SDMA_PKT_POLL_REGMEM_DW5_RETRY_COUNT(0xfff) |
 				  SDMA_PKT_POLL_REGMEM_DW5_INTERVAL(10)); /* retry count, poll interval */
 	}
 }
@@ -359,33 +363,35 @@ static void sdma_v5_2_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 se
 				      unsigned flags)
 {
 	bool write64bit = flags & AMDGPU_FENCE_FLAG_64BIT;
-	/* write the fence */
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
-			  SDMA_PKT_FENCE_HEADER_MTYPE(0x3)); /* Ucached(UC) */
-	/* zero in first two bits */
+
 	BUG_ON(addr & 0x3);
-	amdgpu_ring_write(ring, lower_32_bits(addr));
-	amdgpu_ring_write(ring, upper_32_bits(addr));
-	amdgpu_ring_write(ring, lower_32_bits(seq));
+
+	/* write the fence */
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
+			  SDMA_PKT_FENCE_HEADER_MTYPE(0x3), /* Ucached(UC) */
+			  lower_32_bits(addr),
+			  upper_32_bits(addr),
+			  lower_32_bits(seq));
 
 	/* optionally write high bits as well */
 	if (write64bit) {
 		addr += 4;
-		amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
-				  SDMA_PKT_FENCE_HEADER_MTYPE(0x3));
-		/* zero in first two bits */
-		BUG_ON(addr & 0x3);
-		amdgpu_ring_write(ring, lower_32_bits(addr));
-		amdgpu_ring_write(ring, upper_32_bits(addr));
-		amdgpu_ring_write(ring, upper_32_bits(seq));
+		amdgpu_ring_write(ring,
+				  SDMA_PKT_HEADER_OP(SDMA_OP_FENCE) |
+				  SDMA_PKT_FENCE_HEADER_MTYPE(0x3),
+				  lower_32_bits(addr),
+				  upper_32_bits(addr),
+				  upper_32_bits(seq));
 	}
 
 	if ((flags & AMDGPU_FENCE_FLAG_INT)) {
 		uint32_t ctx = ring->is_mes_queue ?
 			(ring->hw_queue_id | AMDGPU_FENCE_MES_QUEUE_FLAG) : 0;
 		/* generate an interrupt */
-		amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_TRAP));
-		amdgpu_ring_write(ring, SDMA_PKT_TRAP_INT_CONTEXT_INT_CONTEXT(ctx));
+		amdgpu_ring_write(ring,
+				  SDMA_PKT_HEADER_OP(SDMA_OP_TRAP),
+				  SDMA_PKT_TRAP_INT_CONTEXT_INT_CONTEXT(ctx));
 	}
 }
 
@@ -920,12 +926,13 @@ static int sdma_v5_2_ring_test_ring(struct amdgpu_ring *ring)
 		return r;
 	}
 
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_WRITE) |
-			  SDMA_PKT_HEADER_SUB_OP(SDMA_SUBOP_WRITE_LINEAR));
-	amdgpu_ring_write(ring, lower_32_bits(gpu_addr));
-	amdgpu_ring_write(ring, upper_32_bits(gpu_addr));
-	amdgpu_ring_write(ring, SDMA_PKT_WRITE_UNTILED_DW_3_COUNT(0));
-	amdgpu_ring_write(ring, 0xDEADBEEF);
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_WRITE) |
+			  SDMA_PKT_HEADER_SUB_OP(SDMA_SUBOP_WRITE_LINEAR),
+			  lower_32_bits(gpu_addr),
+			  upper_32_bits(gpu_addr),
+			  SDMA_PKT_WRITE_UNTILED_DW_3_COUNT(0),
+			  0xDEADBEEF);
 	amdgpu_ring_commit(ring);
 
 	for (i = 0; i < adev->usec_timeout; i++) {
@@ -1171,15 +1178,16 @@ static void sdma_v5_2_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 	uint64_t addr = ring->fence_drv.gpu_addr;
 
 	/* wait for idle */
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_POLL_REGMEM) |
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_POLL_REGMEM) |
 			  SDMA_PKT_POLL_REGMEM_HEADER_HDP_FLUSH(0) |
 			  SDMA_PKT_POLL_REGMEM_HEADER_FUNC(3) | /* equal */
-			  SDMA_PKT_POLL_REGMEM_HEADER_MEM_POLL(1));
-	amdgpu_ring_write(ring, addr & 0xfffffffc);
-	amdgpu_ring_write(ring, upper_32_bits(addr) & 0xffffffff);
-	amdgpu_ring_write(ring, seq); /* reference */
-	amdgpu_ring_write(ring, 0xffffffff); /* mask */
-	amdgpu_ring_write(ring, SDMA_PKT_POLL_REGMEM_DW5_RETRY_COUNT(0xfff) |
+			  SDMA_PKT_POLL_REGMEM_HEADER_MEM_POLL(1),
+			  addr & 0xfffffffc,
+			  upper_32_bits(addr) & 0xffffffff,
+			  seq, /* reference */
+			  0xffffffff, /* mask */
+			  SDMA_PKT_POLL_REGMEM_DW5_RETRY_COUNT(0xfff) |
 			  SDMA_PKT_POLL_REGMEM_DW5_INTERVAL(4)); /* retry count, poll interval */
 }
 
@@ -1213,10 +1221,9 @@ static void sdma_v5_2_ring_emit_vm_flush(struct amdgpu_ring *ring,
 			  SDMA_PKT_VM_INVALIDATION_HEADER_OP(SDMA_OP_POLL_REGMEM) |
 			  SDMA_PKT_VM_INVALIDATION_HEADER_SUB_OP(SDMA_SUBOP_VM_INVALIDATION) |
 			  SDMA_PKT_VM_INVALIDATION_HEADER_GFX_ENG_ID(ring->vm_inv_eng) |
-			  SDMA_PKT_VM_INVALIDATION_HEADER_MM_ENG_ID(0x1f));
-	amdgpu_ring_write(ring, req);
-	amdgpu_ring_write(ring, 0xFFFFFFFF);
-	amdgpu_ring_write(ring,
+			  SDMA_PKT_VM_INVALIDATION_HEADER_MM_ENG_ID(0x1f),
+			  req,
+			  0xFFFFFFFF,
 			  SDMA_PKT_VM_INVALIDATION_ADDRESSRANGEHI_INVALIDATEACK(1 << vmid) |
 			  SDMA_PKT_VM_INVALIDATION_ADDRESSRANGEHI_ADDRESSRANGEHI(0x1F));
 }
@@ -1224,23 +1231,25 @@ static void sdma_v5_2_ring_emit_vm_flush(struct amdgpu_ring *ring,
 static void sdma_v5_2_ring_emit_wreg(struct amdgpu_ring *ring,
 				     uint32_t reg, uint32_t val)
 {
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_SRBM_WRITE) |
-			  SDMA_PKT_SRBM_WRITE_HEADER_BYTE_EN(0xf));
-	amdgpu_ring_write(ring, reg);
-	amdgpu_ring_write(ring, val);
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_SRBM_WRITE) |
+			  SDMA_PKT_SRBM_WRITE_HEADER_BYTE_EN(0xf),
+			  reg,
+			  val);
 }
 
 static void sdma_v5_2_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
 					 uint32_t val, uint32_t mask)
 {
-	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_POLL_REGMEM) |
+	amdgpu_ring_write(ring,
+			  SDMA_PKT_HEADER_OP(SDMA_OP_POLL_REGMEM) |
 			  SDMA_PKT_POLL_REGMEM_HEADER_HDP_FLUSH(0) |
-			  SDMA_PKT_POLL_REGMEM_HEADER_FUNC(3)); /* equal */
-	amdgpu_ring_write(ring, reg << 2);
-	amdgpu_ring_write(ring, 0);
-	amdgpu_ring_write(ring, val); /* reference */
-	amdgpu_ring_write(ring, mask); /* mask */
-	amdgpu_ring_write(ring, SDMA_PKT_POLL_REGMEM_DW5_RETRY_COUNT(0xfff) |
+			  SDMA_PKT_POLL_REGMEM_HEADER_FUNC(3), /* equal */
+			  reg << 2,
+			  0,
+			  val, /* reference */
+			  mask, /* mask */
+			  SDMA_PKT_POLL_REGMEM_DW5_RETRY_COUNT(0xfff) |
 			  SDMA_PKT_POLL_REGMEM_DW5_INTERVAL(10));
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding
  2024-12-27 11:19 ` [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding Tvrtko Ursulin
@ 2025-01-02 13:45   ` Christian König
  2025-01-03 14:40     ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2025-01-02 13:45 UTC (permalink / raw)
  To: Tvrtko Ursulin, amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin, Sunil Khatri



Am 27.12.24 um 12:19 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Use memset32 instead of open coding it, just because it is
> that bit nicer.

In general looks mostly good, my only concern is that we already had to 
switch to memset_io() on some platforms for clearing buffers.

Now an IB should in theory always be in system memory, but it would be 
nice to handle that cleanly.

Regards,
Christian.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Sunil Khatri <sunil.khatri@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index a6e28fe3f8d6..a27e32f48f99 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -136,8 +136,16 @@ void amdgpu_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
>    */
>   void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
>   {
> -	while (ib->length_dw & ring->funcs->align_mask)
> -		ib->ptr[ib->length_dw++] = ring->funcs->nop;
> +	u32 align_mask = ring->funcs->align_mask;
> +	u32 count = ib->length_dw & align_mask;
> +
> +	if (count) {
> +		count = align_mask + 1 - count;
> +
> +		memset32(&ib->ptr[ib->length_dw], ring->funcs->nop, count);
> +
> +		ib->length_dw += count;
> +	}
>   }
>   
>   /**


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/12] drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs
  2024-12-27 11:19 ` [PATCH 04/12] drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs Tvrtko Ursulin
@ 2025-01-02 13:49   ` Christian König
  2025-01-03 14:25     ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2025-01-02 13:49 UTC (permalink / raw)
  To: Tvrtko Ursulin, amd-gfx
  Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

Am 27.12.24 um 12:19 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> A lot of the hardware generations apparently uses the same nop insertion
> logic, just with different masks and shifts.
>
> We can consolidate if we store those shifts and mask in the ring and
> shrink both the source and binary.

The shift and mask should be identical for most HW generations anyway. 
Only SI might be a bit different.

So probably better to have a special function for SI and a general for 
all other HW generations.

Regards,
Christian.

>
> Though it appears a bit of a departure from the existing "duplicate
> everything" state so we will see how this is received.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Sunil Khatri <sunil.khatri@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 18 ++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c   | 19 ++++---------------
>   drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c   | 19 ++++---------------
>   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 23 +++++++----------------
>   drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 23 +++++++----------------
>   drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   | 19 ++++---------------
>   drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 19 ++++---------------
>   drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   | 19 ++++---------------
>   drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c   | 19 ++++---------------
>   11 files changed, 62 insertions(+), 122 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 429c77db920f..4c0861ebc77a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -303,6 +303,11 @@ struct amdgpu_ring {
>   	struct {
>   		struct amdgpu_sdma_instance *instance;
>   		int			     index;
> +
> +		struct {
> +			u32		mask;
> +			unsigned int	shift;
> +		} nop_pkt;
>   	} sdma;
>   
>   	/* used for mes */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> index d43dfec82624..148413f01875 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> @@ -34,6 +34,24 @@
>    * GPU SDMA IP block helpers function.
>    */
>   
> +void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count)
> +{
> +	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> +	const u32 nop = ring->funcs->nop;
> +	u32 i;
> +
> +	if (!count)
> +		return;
> +
> +	if (sdma && sdma->burst_nop)
> +		amdgpu_ring_write(ring, nop |
> +					(--count & ring->sdma.nop_pkt.mask) <<
> +					ring->sdma.nop_pkt.shift);
> +
> +	for (i = 0; i < count; i++)
> +		amdgpu_ring_write(ring, nop);
> +}
> +
>   uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring,
>   				     unsigned int vmid)
>   {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> index 7debf3ed0b46..d2642a9113ae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> @@ -159,6 +159,7 @@ struct amdgpu_buffer_funcs {
>   #define amdgpu_emit_copy_buffer(adev, ib, s, d, b, t) (adev)->mman.buffer_funcs->emit_copy_buffer((ib),  (s), (d), (b), (t))
>   #define amdgpu_emit_fill_buffer(adev, ib, s, d, b) (adev)->mman.buffer_funcs->emit_fill_buffer((ib), (s), (d), (b))
>   
> +void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count);
>   uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring, unsigned vmid);
>   int amdgpu_sdma_ras_late_init(struct amdgpu_device *adev,
>   			      struct ras_common_if *ras_block);
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
> index 2e844dba4ad5..e0ed65eca431 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
> @@ -221,19 +221,6 @@ static void sdma_v2_4_ring_set_wptr(struct amdgpu_ring *ring)
>   	WREG32(mmSDMA0_GFX_RB_WPTR + sdma_offsets[ring->me], ring->wptr << 2);
>   }
>   
> -static void sdma_v2_4_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /**
>    * sdma_v2_4_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -252,7 +239,7 @@ static void sdma_v2_4_ring_emit_ib(struct amdgpu_ring *ring,
>   	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>   
>   	/* IB packet must end on a 8 DW boundary */
> -	sdma_v2_4_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -866,6 +853,8 @@ static int sdma_v2_4_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   	}
>   
>   	return r;
> @@ -1137,7 +1126,7 @@ static const struct amdgpu_ring_funcs sdma_v2_4_ring_funcs = {
>   	.emit_hdp_flush = sdma_v2_4_ring_emit_hdp_flush,
>   	.test_ring = sdma_v2_4_ring_test_ring,
>   	.test_ib = sdma_v2_4_ring_test_ib,
> -	.insert_nop = sdma_v2_4_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v2_4_ring_pad_ib,
>   	.emit_wreg = sdma_v2_4_ring_emit_wreg,
>   };
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
> index 104fd1214c4c..8a644ea28589 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
> @@ -397,19 +397,6 @@ static void sdma_v3_0_ring_set_wptr(struct amdgpu_ring *ring)
>   	}
>   }
>   
> -static void sdma_v3_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /**
>    * sdma_v3_0_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -428,7 +415,7 @@ static void sdma_v3_0_ring_emit_ib(struct amdgpu_ring *ring,
>   	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>   
>   	/* IB packet must end on a 8 DW boundary */
> -	sdma_v3_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -1152,6 +1139,8 @@ static int sdma_v3_0_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   	}
>   
>   	return r;
> @@ -1579,7 +1568,7 @@ static const struct amdgpu_ring_funcs sdma_v3_0_ring_funcs = {
>   	.emit_hdp_flush = sdma_v3_0_ring_emit_hdp_flush,
>   	.test_ring = sdma_v3_0_ring_test_ring,
>   	.test_ib = sdma_v3_0_ring_test_ib,
> -	.insert_nop = sdma_v3_0_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v3_0_ring_pad_ib,
>   	.emit_wreg = sdma_v3_0_ring_emit_wreg,
>   };
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> index c91d05a4593e..0f0f05a03cd4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> @@ -781,19 +781,6 @@ static void sdma_v4_0_page_ring_set_wptr(struct amdgpu_ring *ring)
>   	}
>   }
>   
> -static void sdma_v4_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /**
>    * sdma_v4_0_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -812,7 +799,7 @@ static void sdma_v4_0_ring_emit_ib(struct amdgpu_ring *ring,
>   	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>   
>   	/* IB packet must end on a 8 DW boundary */
> -	sdma_v4_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -1876,6 +1863,8 @@ static int sdma_v4_0_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   
>   		if (adev->sdma.has_page_queue) {
>   			ring = &adev->sdma.instance[i].page;
> @@ -1917,6 +1906,8 @@ static int sdma_v4_0_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   			ring->sdma.instance = &adev->sdma.instance[i];
>   			ring->sdma.index = i;
> +			ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +			ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   		}
>   	}
>   
> @@ -2443,7 +2434,7 @@ static const struct amdgpu_ring_funcs sdma_v4_0_ring_funcs = {
>   	.emit_hdp_flush = sdma_v4_0_ring_emit_hdp_flush,
>   	.test_ring = sdma_v4_0_ring_test_ring,
>   	.test_ib = sdma_v4_0_ring_test_ib,
> -	.insert_nop = sdma_v4_0_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v4_0_ring_pad_ib,
>   	.emit_wreg = sdma_v4_0_ring_emit_wreg,
>   	.emit_reg_wait = sdma_v4_0_ring_emit_reg_wait,
> @@ -2475,7 +2466,7 @@ static const struct amdgpu_ring_funcs sdma_v4_0_page_ring_funcs = {
>   	.emit_hdp_flush = sdma_v4_0_ring_emit_hdp_flush,
>   	.test_ring = sdma_v4_0_ring_test_ring,
>   	.test_ib = sdma_v4_0_ring_test_ib,
> -	.insert_nop = sdma_v4_0_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v4_0_ring_pad_ib,
>   	.emit_wreg = sdma_v4_0_ring_emit_wreg,
>   	.emit_reg_wait = sdma_v4_0_ring_emit_reg_wait,
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> index d1d21a3951f8..f68b7f2e0a40 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> @@ -341,19 +341,6 @@ static void sdma_v4_4_2_page_ring_set_wptr(struct amdgpu_ring *ring)
>   	}
>   }
>   
> -static void sdma_v4_4_2_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /**
>    * sdma_v4_4_2_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -372,7 +359,7 @@ static void sdma_v4_4_2_ring_emit_ib(struct amdgpu_ring *ring,
>   	unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>   
>   	/* IB packet must end on a 8 DW boundary */
> -	sdma_v4_4_2_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -1431,6 +1418,8 @@ static int sdma_v4_4_2_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   
>   		if (adev->sdma.has_page_queue) {
>   			ring = &adev->sdma.instance[i].page;
> @@ -1455,6 +1444,8 @@ static int sdma_v4_4_2_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   			ring->sdma.instance = &adev->sdma.instance[i];
>   			ring->sdma.index = i;
> +			ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +			ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   		}
>   	}
>   
> @@ -2013,7 +2004,7 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = {
>   	.emit_hdp_flush = sdma_v4_4_2_ring_emit_hdp_flush,
>   	.test_ring = sdma_v4_4_2_ring_test_ring,
>   	.test_ib = sdma_v4_4_2_ring_test_ib,
> -	.insert_nop = sdma_v4_4_2_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v4_4_2_ring_pad_ib,
>   	.emit_wreg = sdma_v4_4_2_ring_emit_wreg,
>   	.emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
> @@ -2045,7 +2036,7 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_page_ring_funcs = {
>   	.emit_hdp_flush = sdma_v4_4_2_ring_emit_hdp_flush,
>   	.test_ring = sdma_v4_4_2_ring_test_ring,
>   	.test_ib = sdma_v4_4_2_ring_test_ib,
> -	.insert_nop = sdma_v4_4_2_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v4_4_2_ring_pad_ib,
>   	.emit_wreg = sdma_v4_4_2_ring_emit_wreg,
>   	.emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> index 97536f82dfcd..bbf60bfa4b1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> @@ -433,19 +433,6 @@ static void sdma_v5_0_ring_set_wptr(struct amdgpu_ring *ring)
>   	}
>   }
>   
> -static void sdma_v5_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /**
>    * sdma_v5_0_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -472,7 +459,7 @@ static void sdma_v5_0_ring_emit_ib(struct amdgpu_ring *ring,
>   	 * (wptr + 6 + x) % 8 = 0.
>   	 * The expression below, is a solution of x.
>   	 */
> -	sdma_v5_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -1453,6 +1440,8 @@ static int sdma_v5_0_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   	}
>   
>   	adev->sdma.supported_reset =
> @@ -1991,7 +1980,7 @@ static const struct amdgpu_ring_funcs sdma_v5_0_ring_funcs = {
>   	.emit_hdp_flush = sdma_v5_0_ring_emit_hdp_flush,
>   	.test_ring = sdma_v5_0_ring_test_ring,
>   	.test_ib = sdma_v5_0_ring_test_ib,
> -	.insert_nop = sdma_v5_0_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v5_0_ring_pad_ib,
>   	.emit_wreg = sdma_v5_0_ring_emit_wreg,
>   	.emit_reg_wait = sdma_v5_0_ring_emit_reg_wait,
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index 8eaddee1d97d..69b88db32117 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -250,19 +250,6 @@ static void sdma_v5_2_ring_set_wptr(struct amdgpu_ring *ring)
>   	}
>   }
>   
> -static void sdma_v5_2_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /**
>    * sdma_v5_2_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -289,7 +276,7 @@ static void sdma_v5_2_ring_emit_ib(struct amdgpu_ring *ring,
>   	 * (wptr + 6 + x) % 8 = 0.
>   	 * The expression below, is a solution of x.
>   	 */
> -	sdma_v5_2_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -1358,6 +1345,8 @@ static int sdma_v5_2_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   	}
>   
>   	adev->sdma.supported_reset =
> @@ -1986,7 +1975,7 @@ static const struct amdgpu_ring_funcs sdma_v5_2_ring_funcs = {
>   	.emit_hdp_flush = sdma_v5_2_ring_emit_hdp_flush,
>   	.test_ring = sdma_v5_2_ring_test_ring,
>   	.test_ib = sdma_v5_2_ring_test_ib,
> -	.insert_nop = sdma_v5_2_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v5_2_ring_pad_ib,
>   	.begin_use = sdma_v5_2_ring_begin_use,
>   	.end_use = sdma_v5_2_ring_end_use,
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> index 3ead269eccdc..aa3992fd5313 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> @@ -235,19 +235,6 @@ static void sdma_v6_0_ring_set_wptr(struct amdgpu_ring *ring)
>   	}
>   }
>   
> -static void sdma_v6_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /*
>    * sdma_v6_0_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -274,7 +261,7 @@ static void sdma_v6_0_ring_emit_ib(struct amdgpu_ring *ring,
>   	 * (wptr + 6 + x) % 8 = 0.
>   	 * The expression below, is a solution of x.
>   	 */
> -	sdma_v6_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_COPY_LINEAR_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -1351,6 +1338,8 @@ static int sdma_v6_0_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   	}
>   
>   	adev->sdma.supported_reset =
> @@ -1707,7 +1696,7 @@ static const struct amdgpu_ring_funcs sdma_v6_0_ring_funcs = {
>   	.emit_hdp_flush = sdma_v6_0_ring_emit_hdp_flush,
>   	.test_ring = sdma_v6_0_ring_test_ring,
>   	.test_ib = sdma_v6_0_ring_test_ib,
> -	.insert_nop = sdma_v6_0_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v6_0_ring_pad_ib,
>   	.emit_wreg = sdma_v6_0_ring_emit_wreg,
>   	.emit_reg_wait = sdma_v6_0_ring_emit_reg_wait,
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> index 5fadaf35a03a..4ccc00248a09 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> @@ -267,19 +267,6 @@ static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring)
>   	}
>   }
>   
> -static void sdma_v7_0_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count)
> -{
> -	struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
> -	int i;
> -
> -	for (i = 0; i < count; i++)
> -		if (sdma && sdma->burst_nop && (i == 0))
> -			amdgpu_ring_write(ring, ring->funcs->nop |
> -				SDMA_PKT_NOP_HEADER_COUNT(count - 1));
> -		else
> -			amdgpu_ring_write(ring, ring->funcs->nop);
> -}
> -
>   /**
>    * sdma_v7_0_ring_emit_ib - Schedule an IB on the DMA engine
>    *
> @@ -306,7 +293,7 @@ static void sdma_v7_0_ring_emit_ib(struct amdgpu_ring *ring,
>   	 * (wptr + 6 + x) % 8 = 0.
>   	 * The expression below, is a solution of x.
>   	 */
> -	sdma_v7_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
> +	amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 7);
>   
>   	amdgpu_ring_write(ring, SDMA_PKT_COPY_LINEAR_HEADER_OP(SDMA_OP_INDIRECT) |
>   			  SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
> @@ -1365,6 +1352,8 @@ static int sdma_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
>   
>   		ring->sdma.instance = &adev->sdma.instance[i];
>   		ring->sdma.index = i;
> +		ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
> +		ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>   	}
>   
>   	adev->sdma.supported_reset =
> @@ -1687,7 +1676,7 @@ static const struct amdgpu_ring_funcs sdma_v7_0_ring_funcs = {
>   	.emit_hdp_flush = sdma_v7_0_ring_emit_hdp_flush,
>   	.test_ring = sdma_v7_0_ring_test_ring,
>   	.test_ib = sdma_v7_0_ring_test_ib,
> -	.insert_nop = sdma_v7_0_ring_insert_nop,
> +	.insert_nop = amdgpu_sdma_ring_insert_nop,
>   	.pad_ib = sdma_v7_0_ring_pad_ib,
>   	.emit_wreg = sdma_v7_0_ring_emit_wreg,
>   	.emit_reg_wait = sdma_v7_0_ring_emit_reg_wait,


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu: Optimise amdgpu_ring_write()
  2024-12-27 11:19 ` [PATCH 09/12] drm/amdgpu: Optimise amdgpu_ring_write() Tvrtko Ursulin
@ 2025-01-02 13:55   ` Christian König
  2025-01-03 13:02     ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2025-01-02 13:55 UTC (permalink / raw)
  To: Tvrtko Ursulin, amd-gfx
  Cc: kernel-dev, Tvrtko Ursulin, Christian König, Sunil Khatri

Am 27.12.24 um 12:19 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> There are more than 2000 calls to amdgpu_ring_write() in the driver and
> the majority is multiple sequential calls which the compiler cannot
> optimise much.
>
> Lets make this helper variadic via some pre-processor magic which allows
> merging those sequential calls and in turn enables the compiler to emit
> much less code.

I've played around with the same idea before abandoning it for this 
patch here: 
https://lore.kernel.org/all/20241008181134.1350-2-christian.koenig@amd.com/

Basically we don't need to update count_dw nor mask the WPTR which has 
the same effect as this optimization here and far less code change.

The problem is that some code for Polaris HW generations seem to use the 
WPTR or count_dw directly for some calculation. I didn't had time to 
clean that up and push the resulting change.

Regards,
Christian.

>
> If we then would convert some call sites to use this macro, lets see on
> the example of amdgpu_vce_ring_emit_ib(), what results that would bring.
> Before (but after the wptr local caching, before it is even worse):
>
>    173c39:       48 89 f8                mov    %rdi,%rax
>    173c3c:       48 89 d1                mov    %rdx,%rcx
>    173c3f:       48 8b 97 c8 01 00 00    mov    0x1c8(%rdi),%rdx
>    173c46:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
>    173c4d:       89 d7                   mov    %edx,%edi
>    173c4f:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
>    173c55:       48 83 c2 01             add    $0x1,%rdx
>    173c59:       c7 04 be 02 00 00 00    movl   $0x2,(%rsi,%rdi,4)
>    173c60:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>    173c67:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>    173c6e:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
>    173c75:       89 d7                   mov    %edx,%edi
>    173c77:       48 83 c2 01             add    $0x1,%rdx
>    173c7b:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>    173c82:       4c 8b 41 10             mov    0x10(%rcx),%r8
>    173c86:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
>    173c8c:       44 89 04 be             mov    %r8d,(%rsi,%rdi,4)
>    173c90:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>    173c97:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>    173c9e:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
>    173ca5:       89 d7                   mov    %edx,%edi
>    173ca7:       48 83 c2 01             add    $0x1,%rdx
>    173cab:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>    173cb2:       44 8b 41 14             mov    0x14(%rcx),%r8d
>    173cb6:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
>    173cbc:       44 89 04 be             mov    %r8d,(%rsi,%rdi,4)
>    173cc0:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>    173cc7:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>    173cce:       89 d6                   mov    %edx,%esi
>    173cd0:       23 b0 f8 01 00 00       and    0x1f8(%rax),%esi
>    173cd6:       48 83 c2 01             add    $0x1,%rdx
>    173cda:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>    173ce1:       8b 79 08                mov    0x8(%rcx),%edi
>    173ce4:       48 8b 88 a8 01 00 00    mov    0x1a8(%rax),%rcx
>    173ceb:       89 3c b1                mov    %edi,(%rcx,%rsi,4)
>    173cee:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>    173cf5:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>    173cfc:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>    173d03:       c3                      ret
>
> And after:
>
>      1579:       48 89 f8                mov    %rdi,%rax
>      157c:       44 8b 4a 08             mov    0x8(%rdx),%r9d
>      1580:       48 8b 7a 10             mov    0x10(%rdx),%rdi
>      1584:       48 8b 90 c8 01 00 00    mov    0x1c8(%rax),%rdx
>      158b:       8b b0 f8 01 00 00       mov    0x1f8(%rax),%esi
>      1591:       48 8b 88 a8 01 00 00    mov    0x1a8(%rax),%rcx
>      1598:       49 89 d0                mov    %rdx,%r8
>      159b:       49 21 f0                and    %rsi,%r8
>      159e:       42 c7 04 81 02 00 00    movl   $0x2,(%rcx,%r8,4)
>      15a5:       00
>      15a6:       4c 8d 42 01             lea    0x1(%rdx),%r8
>      15aa:       49 21 f0                and    %rsi,%r8
>      15ad:       42 89 3c 81             mov    %edi,(%rcx,%r8,4)
>      15b1:       4c 8d 42 02             lea    0x2(%rdx),%r8
>      15b5:       48 c1 ef 20             shr    $0x20,%rdi
>      15b9:       49 21 f0                and    %rsi,%r8
>      15bc:       42 89 3c 81             mov    %edi,(%rcx,%r8,4)
>      15c0:       48 8d 7a 03             lea    0x3(%rdx),%rdi
>      15c4:       48 83 c2 04             add    $0x4,%rdx
>      15c8:       48 21 fe                and    %rdi,%rsi
>      15cb:       44 89 0c b1             mov    %r9d,(%rcx,%rsi,4)
>      15cf:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>      15d6:       83 a8 e0 01 00 00 04    subl   $0x4,0x1e0(%rax)
>      15dd:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>      15e4:       c3                      ret
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Sunil Khatri <sunil.khatri@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 222 ++++++++++++++++++++++-
>   1 file changed, 220 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index b57951d8c997..4f467864ed09 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -383,15 +383,233 @@ static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
>   	memset32(ring->ring, ring->funcs->nop, ring->buf_mask + 1);
>   }
>   
> -static inline void amdgpu_ring_write(struct amdgpu_ring *ring, uint32_t v)
> +static inline void
> +amdgpu_ring_write1(struct amdgpu_ring *ring,
> +		   u32 v1)
>   {
> +	const u32 buf_mask = ring->buf_mask;
>   	u64 wptr = ring->wptr;
>   
> -	ring->ring[wptr++ & ring->buf_mask] = v;
> +	ring->ring[wptr++ & buf_mask] = v1;
>   	ring->wptr = wptr & ring->ptr_mask;
>   	ring->count_dw--;
>   }
>   
> +static inline void
> +amdgpu_ring_write2(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 2;
> +}
> +
> +static inline void
> +amdgpu_ring_write3(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 3;
> +}
> +
> +static inline void
> +amdgpu_ring_write4(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 4;
> +}
> +
> +static inline void
> +amdgpu_ring_write5(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +	r[wptr++ & buf_mask] = v5;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 5;
> +}
> +
> +static inline void
> +amdgpu_ring_write6(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +	r[wptr++ & buf_mask] = v5;
> +	r[wptr++ & buf_mask] = v6;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 6;
> +}
> +
> +static inline void
> +amdgpu_ring_write7(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +	r[wptr++ & buf_mask] = v5;
> +	r[wptr++ & buf_mask] = v6;
> +	r[wptr++ & buf_mask] = v7;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 7;
> +}
> +
> +static inline void
> +amdgpu_ring_write8(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
> +		   u32 v8)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +	r[wptr++ & buf_mask] = v5;
> +	r[wptr++ & buf_mask] = v6;
> +	r[wptr++ & buf_mask] = v7;
> +	r[wptr++ & buf_mask] = v8;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 8;
> +}
> +
> +static inline void
> +amdgpu_ring_write9(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
> +		   u32 v8, u32 v9)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +	r[wptr++ & buf_mask] = v5;
> +	r[wptr++ & buf_mask] = v6;
> +	r[wptr++ & buf_mask] = v7;
> +	r[wptr++ & buf_mask] = v8;
> +	r[wptr++ & buf_mask] = v9;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 9;
> +}
> +
> +static inline void
> +amdgpu_ring_write10(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
> +		   u32 v8, u32 v9, u32 v10)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +	r[wptr++ & buf_mask] = v5;
> +	r[wptr++ & buf_mask] = v6;
> +	r[wptr++ & buf_mask] = v7;
> +	r[wptr++ & buf_mask] = v8;
> +	r[wptr++ & buf_mask] = v9;
> +	r[wptr++ & buf_mask] = v10;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 10;
> +}
> +
> +static inline void
> +amdgpu_ring_write11(struct amdgpu_ring *ring,
> +		   u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
> +		   u32 v8, u32 v9, u32 v10, u32 v11)
> +{
> +	const u32 buf_mask = ring->buf_mask;
> +	u64 wptr = ring->wptr;
> +	u32 *r = ring->ring;
> +
> +	r[wptr++ & buf_mask] = v1;
> +	r[wptr++ & buf_mask] = v2;
> +	r[wptr++ & buf_mask] = v3;
> +	r[wptr++ & buf_mask] = v4;
> +	r[wptr++ & buf_mask] = v5;
> +	r[wptr++ & buf_mask] = v6;
> +	r[wptr++ & buf_mask] = v7;
> +	r[wptr++ & buf_mask] = v8;
> +	r[wptr++ & buf_mask] = v9;
> +	r[wptr++ & buf_mask] = v10;
> +	r[wptr++ & buf_mask] = v11;
> +
> +	ring->wptr = wptr & ring->ptr_mask;
> +	ring->count_dw -= 11;
> +}
> +
> +#define AMDGPU_RING_WRITE(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, NAME, ...) NAME
> +#define amdgpu_ring_write(...) \
> +	AMDGPU_RING_WRITE(__VA_ARGS__, \
> +			  amdgpu_ring_write11, \
> +			  amdgpu_ring_write10, \
> +			  amdgpu_ring_write9, \
> +			  amdgpu_ring_write8, \
> +			  amdgpu_ring_write7, \
> +			  amdgpu_ring_write6, \
> +			  amdgpu_ring_write5, \
> +			  amdgpu_ring_write4, \
> +			  amdgpu_ring_write3, \
> +			  amdgpu_ring_write2, \
> +			  amdgpu_ring_write1, \
> +			  NULL)(__VA_ARGS__)
> +
>   static inline void amdgpu_ring_write_multiple(struct amdgpu_ring *ring,
>   					      void *src, int count_dw)
>   {


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu: Optimise amdgpu_ring_write()
  2025-01-02 13:55   ` Christian König
@ 2025-01-03 13:02     ` Tvrtko Ursulin
  0 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2025-01-03 13:02 UTC (permalink / raw)
  To: Christian König, Tvrtko Ursulin, amd-gfx
  Cc: kernel-dev, Christian König, Sunil Khatri


On 02/01/2025 13:55, Christian König wrote:
> Am 27.12.24 um 12:19 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> There are more than 2000 calls to amdgpu_ring_write() in the driver and
>> the majority is multiple sequential calls which the compiler cannot
>> optimise much.
>>
>> Lets make this helper variadic via some pre-processor magic which allows
>> merging those sequential calls and in turn enables the compiler to emit
>> much less code.
> 
> I've played around with the same idea before abandoning it for this 
> patch here: 
> https://lore.kernel.org/all/20241008181134.1350-2-christian.koenig@amd.com/

That would be nicer indeed. Okay I will leave you to it. And if you see 
something in this series which does not overlap with your work and looks 
interesting let me know.

> Basically we don't need to update count_dw nor mask the WPTR which has 
> the same effect as this optimization here and far less code change.
> 
> The problem is that some code for Polaris HW generations seem to use the 
> WPTR or count_dw directly for some calculation. I didn't had time to 
> clean that up and push the resulting change.

Is there a public mapping table for figuring out which IP block versions 
are Polaris?

I can't grep any count_dw usage and in case of wptr the places I found 
which access it directly (but I was only converting a small subset of 
files) where sdma_v5_2_ring_init_cond_exec, 
sdma_v6_0_ring_init_cond_exec and sdma_v7_0_ring_init_cond_exec. Not 
sure if it is those you had in mind or there is something else?

Regards,

Tvrtko

>> If we then would convert some call sites to use this macro, lets see on
>> the example of amdgpu_vce_ring_emit_ib(), what results that would bring.
>> Before (but after the wptr local caching, before it is even worse):
>>
>>    173c39:       48 89 f8                mov    %rdi,%rax
>>    173c3c:       48 89 d1                mov    %rdx,%rcx
>>    173c3f:       48 8b 97 c8 01 00 00    mov    0x1c8(%rdi),%rdx
>>    173c46:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
>>    173c4d:       89 d7                   mov    %edx,%edi
>>    173c4f:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
>>    173c55:       48 83 c2 01             add    $0x1,%rdx
>>    173c59:       c7 04 be 02 00 00 00    movl   $0x2,(%rsi,%rdi,4)
>>    173c60:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>>    173c67:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>>    173c6e:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
>>    173c75:       89 d7                   mov    %edx,%edi
>>    173c77:       48 83 c2 01             add    $0x1,%rdx
>>    173c7b:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>>    173c82:       4c 8b 41 10             mov    0x10(%rcx),%r8
>>    173c86:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
>>    173c8c:       44 89 04 be             mov    %r8d,(%rsi,%rdi,4)
>>    173c90:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>>    173c97:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>>    173c9e:       48 8b b0 a8 01 00 00    mov    0x1a8(%rax),%rsi
>>    173ca5:       89 d7                   mov    %edx,%edi
>>    173ca7:       48 83 c2 01             add    $0x1,%rdx
>>    173cab:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>>    173cb2:       44 8b 41 14             mov    0x14(%rcx),%r8d
>>    173cb6:       23 b8 f8 01 00 00       and    0x1f8(%rax),%edi
>>    173cbc:       44 89 04 be             mov    %r8d,(%rsi,%rdi,4)
>>    173cc0:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>>    173cc7:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>>    173cce:       89 d6                   mov    %edx,%esi
>>    173cd0:       23 b0 f8 01 00 00       and    0x1f8(%rax),%esi
>>    173cd6:       48 83 c2 01             add    $0x1,%rdx
>>    173cda:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>>    173ce1:       8b 79 08                mov    0x8(%rcx),%edi
>>    173ce4:       48 8b 88 a8 01 00 00    mov    0x1a8(%rax),%rcx
>>    173ceb:       89 3c b1                mov    %edi,(%rcx,%rsi,4)
>>    173cee:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>>    173cf5:       83 a8 e0 01 00 00 01    subl   $0x1,0x1e0(%rax)
>>    173cfc:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>>    173d03:       c3                      ret
>>
>> And after:
>>
>>      1579:       48 89 f8                mov    %rdi,%rax
>>      157c:       44 8b 4a 08             mov    0x8(%rdx),%r9d
>>      1580:       48 8b 7a 10             mov    0x10(%rdx),%rdi
>>      1584:       48 8b 90 c8 01 00 00    mov    0x1c8(%rax),%rdx
>>      158b:       8b b0 f8 01 00 00       mov    0x1f8(%rax),%esi
>>      1591:       48 8b 88 a8 01 00 00    mov    0x1a8(%rax),%rcx
>>      1598:       49 89 d0                mov    %rdx,%r8
>>      159b:       49 21 f0                and    %rsi,%r8
>>      159e:       42 c7 04 81 02 00 00    movl   $0x2,(%rcx,%r8,4)
>>      15a5:       00
>>      15a6:       4c 8d 42 01             lea    0x1(%rdx),%r8
>>      15aa:       49 21 f0                and    %rsi,%r8
>>      15ad:       42 89 3c 81             mov    %edi,(%rcx,%r8,4)
>>      15b1:       4c 8d 42 02             lea    0x2(%rdx),%r8
>>      15b5:       48 c1 ef 20             shr    $0x20,%rdi
>>      15b9:       49 21 f0                and    %rsi,%r8
>>      15bc:       42 89 3c 81             mov    %edi,(%rcx,%r8,4)
>>      15c0:       48 8d 7a 03             lea    0x3(%rdx),%rdi
>>      15c4:       48 83 c2 04             add    $0x4,%rdx
>>      15c8:       48 21 fe                and    %rdi,%rsi
>>      15cb:       44 89 0c b1             mov    %r9d,(%rcx,%rsi,4)
>>      15cf:       48 23 90 f0 01 00 00    and    0x1f0(%rax),%rdx
>>      15d6:       83 a8 e0 01 00 00 04    subl   $0x4,0x1e0(%rax)
>>      15dd:       48 89 90 c8 01 00 00    mov    %rdx,0x1c8(%rax)
>>      15e4:       c3                      ret
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 222 ++++++++++++++++++++++-
>>   1 file changed, 220 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index b57951d8c997..4f467864ed09 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -383,15 +383,233 @@ static inline void 
>> amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
>>       memset32(ring->ring, ring->funcs->nop, ring->buf_mask + 1);
>>   }
>> -static inline void amdgpu_ring_write(struct amdgpu_ring *ring, 
>> uint32_t v)
>> +static inline void
>> +amdgpu_ring_write1(struct amdgpu_ring *ring,
>> +           u32 v1)
>>   {
>> +    const u32 buf_mask = ring->buf_mask;
>>       u64 wptr = ring->wptr;
>> -    ring->ring[wptr++ & ring->buf_mask] = v;
>> +    ring->ring[wptr++ & buf_mask] = v1;
>>       ring->wptr = wptr & ring->ptr_mask;
>>       ring->count_dw--;
>>   }
>> +static inline void
>> +amdgpu_ring_write2(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 2;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write3(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 3;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write4(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 4;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write5(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4, u32 v5)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +    r[wptr++ & buf_mask] = v5;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 5;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write6(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +    r[wptr++ & buf_mask] = v5;
>> +    r[wptr++ & buf_mask] = v6;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 6;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write7(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +    r[wptr++ & buf_mask] = v5;
>> +    r[wptr++ & buf_mask] = v6;
>> +    r[wptr++ & buf_mask] = v7;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 7;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write8(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
>> +           u32 v8)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +    r[wptr++ & buf_mask] = v5;
>> +    r[wptr++ & buf_mask] = v6;
>> +    r[wptr++ & buf_mask] = v7;
>> +    r[wptr++ & buf_mask] = v8;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 8;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write9(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
>> +           u32 v8, u32 v9)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +    r[wptr++ & buf_mask] = v5;
>> +    r[wptr++ & buf_mask] = v6;
>> +    r[wptr++ & buf_mask] = v7;
>> +    r[wptr++ & buf_mask] = v8;
>> +    r[wptr++ & buf_mask] = v9;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 9;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write10(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
>> +           u32 v8, u32 v9, u32 v10)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +    r[wptr++ & buf_mask] = v5;
>> +    r[wptr++ & buf_mask] = v6;
>> +    r[wptr++ & buf_mask] = v7;
>> +    r[wptr++ & buf_mask] = v8;
>> +    r[wptr++ & buf_mask] = v9;
>> +    r[wptr++ & buf_mask] = v10;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 10;
>> +}
>> +
>> +static inline void
>> +amdgpu_ring_write11(struct amdgpu_ring *ring,
>> +           u32 v1, u32 v2, u32 v3, u32 v4, u32 v5, u32 v6, u32 v7,
>> +           u32 v8, u32 v9, u32 v10, u32 v11)
>> +{
>> +    const u32 buf_mask = ring->buf_mask;
>> +    u64 wptr = ring->wptr;
>> +    u32 *r = ring->ring;
>> +
>> +    r[wptr++ & buf_mask] = v1;
>> +    r[wptr++ & buf_mask] = v2;
>> +    r[wptr++ & buf_mask] = v3;
>> +    r[wptr++ & buf_mask] = v4;
>> +    r[wptr++ & buf_mask] = v5;
>> +    r[wptr++ & buf_mask] = v6;
>> +    r[wptr++ & buf_mask] = v7;
>> +    r[wptr++ & buf_mask] = v8;
>> +    r[wptr++ & buf_mask] = v9;
>> +    r[wptr++ & buf_mask] = v10;
>> +    r[wptr++ & buf_mask] = v11;
>> +
>> +    ring->wptr = wptr & ring->ptr_mask;
>> +    ring->count_dw -= 11;
>> +}
>> +
>> +#define AMDGPU_RING_WRITE(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, 
>> _11, _12, NAME, ...) NAME
>> +#define amdgpu_ring_write(...) \
>> +    AMDGPU_RING_WRITE(__VA_ARGS__, \
>> +              amdgpu_ring_write11, \
>> +              amdgpu_ring_write10, \
>> +              amdgpu_ring_write9, \
>> +              amdgpu_ring_write8, \
>> +              amdgpu_ring_write7, \
>> +              amdgpu_ring_write6, \
>> +              amdgpu_ring_write5, \
>> +              amdgpu_ring_write4, \
>> +              amdgpu_ring_write3, \
>> +              amdgpu_ring_write2, \
>> +              amdgpu_ring_write1, \
>> +              NULL)(__VA_ARGS__)
>> +
>>   static inline void amdgpu_ring_write_multiple(struct amdgpu_ring *ring,
>>                             void *src, int count_dw)
>>   {
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/12] drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs
  2025-01-02 13:49   ` Christian König
@ 2025-01-03 14:25     ` Tvrtko Ursulin
  0 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2025-01-03 14:25 UTC (permalink / raw)
  To: Christian König, Tvrtko Ursulin, amd-gfx
  Cc: kernel-dev, Christian König, Sunil Khatri


On 02/01/2025 13:49, Christian König wrote:
> Am 27.12.24 um 12:19 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> A lot of the hardware generations apparently uses the same nop insertion
>> logic, just with different masks and shifts.
>>
>> We can consolidate if we store those shifts and mask in the ring and
>> shrink both the source and binary.
> 
> The shift and mask should be identical for most HW generations anyway. 
> Only SI might be a bit different.
> 
> So probably better to have a special function for SI and a general for 
> all other HW generations.

I see all sdma_v*.c use the same pattern ie. I don't see which would be 
the other flavour?

In case you are suggesting a helper with no shift and mask stored in the 
ring, how do you think that could be implemented cleanly? My concern is 
multiple headers define the values in the same namespace so it would be 
fragile to pick an arbitrary one.

Regards,

Tvrtko

>> Though it appears a bit of a departure from the existing "duplicate
>> everything" state so we will see how this is received.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  5 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 18 ++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |  1 +
>>   drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c   | 19 ++++---------------
>>   drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c   | 19 ++++---------------
>>   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 23 +++++++----------------
>>   drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 23 +++++++----------------
>>   drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   | 19 ++++---------------
>>   drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 19 ++++---------------
>>   drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   | 19 ++++---------------
>>   drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c   | 19 ++++---------------
>>   11 files changed, 62 insertions(+), 122 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index 429c77db920f..4c0861ebc77a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -303,6 +303,11 @@ struct amdgpu_ring {
>>       struct {
>>           struct amdgpu_sdma_instance *instance;
>>           int                 index;
>> +
>> +        struct {
>> +            u32        mask;
>> +            unsigned int    shift;
>> +        } nop_pkt;
>>       } sdma;
>>       /* used for mes */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
>> index d43dfec82624..148413f01875 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
>> @@ -34,6 +34,24 @@
>>    * GPU SDMA IP block helpers function.
>>    */
>> +void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count)
>> +{
>> +    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> +    const u32 nop = ring->funcs->nop;
>> +    u32 i;
>> +
>> +    if (!count)
>> +        return;
>> +
>> +    if (sdma && sdma->burst_nop)
>> +        amdgpu_ring_write(ring, nop |
>> +                    (--count & ring->sdma.nop_pkt.mask) <<
>> +                    ring->sdma.nop_pkt.shift);
>> +
>> +    for (i = 0; i < count; i++)
>> +        amdgpu_ring_write(ring, nop);
>> +}
>> +
>>   uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring,
>>                        unsigned int vmid)
>>   {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
>> index 7debf3ed0b46..d2642a9113ae 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
>> @@ -159,6 +159,7 @@ struct amdgpu_buffer_funcs {
>>   #define amdgpu_emit_copy_buffer(adev, ib, s, d, b, t) 
>> (adev)->mman.buffer_funcs->emit_copy_buffer((ib),  (s), (d), (b), (t))
>>   #define amdgpu_emit_fill_buffer(adev, ib, s, d, b) 
>> (adev)->mman.buffer_funcs->emit_fill_buffer((ib), (s), (d), (b))
>> +void amdgpu_sdma_ring_insert_nop(struct amdgpu_ring *ring, u32 count);
>>   uint64_t amdgpu_sdma_get_csa_mc_addr(struct amdgpu_ring *ring, 
>> unsigned vmid);
>>   int amdgpu_sdma_ras_late_init(struct amdgpu_device *adev,
>>                     struct ras_common_if *ras_block);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
>> index 2e844dba4ad5..e0ed65eca431 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
>> @@ -221,19 +221,6 @@ static void sdma_v2_4_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       WREG32(mmSDMA0_GFX_RB_WPTR + sdma_offsets[ring->me], ring->wptr 
>> << 2);
>>   }
>> -static void sdma_v2_4_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /**
>>    * sdma_v2_4_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -252,7 +239,7 @@ static void sdma_v2_4_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>       unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>>       /* IB packet must end on a 8 DW boundary */
>> -    sdma_v2_4_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 
>> 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -866,6 +853,8 @@ static int sdma_v2_4_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>       }
>>       return r;
>> @@ -1137,7 +1126,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v2_4_ring_funcs = {
>>       .emit_hdp_flush = sdma_v2_4_ring_emit_hdp_flush,
>>       .test_ring = sdma_v2_4_ring_test_ring,
>>       .test_ib = sdma_v2_4_ring_test_ib,
>> -    .insert_nop = sdma_v2_4_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v2_4_ring_pad_ib,
>>       .emit_wreg = sdma_v2_4_ring_emit_wreg,
>>   };
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
>> index 104fd1214c4c..8a644ea28589 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
>> @@ -397,19 +397,6 @@ static void sdma_v3_0_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       }
>>   }
>> -static void sdma_v3_0_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /**
>>    * sdma_v3_0_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -428,7 +415,7 @@ static void sdma_v3_0_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>       unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>>       /* IB packet must end on a 8 DW boundary */
>> -    sdma_v3_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 
>> 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -1152,6 +1139,8 @@ static int sdma_v3_0_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>       }
>>       return r;
>> @@ -1579,7 +1568,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v3_0_ring_funcs = {
>>       .emit_hdp_flush = sdma_v3_0_ring_emit_hdp_flush,
>>       .test_ring = sdma_v3_0_ring_test_ring,
>>       .test_ib = sdma_v3_0_ring_test_ib,
>> -    .insert_nop = sdma_v3_0_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v3_0_ring_pad_ib,
>>       .emit_wreg = sdma_v3_0_ring_emit_wreg,
>>   };
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
>> index c91d05a4593e..0f0f05a03cd4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
>> @@ -781,19 +781,6 @@ static void sdma_v4_0_page_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       }
>>   }
>> -static void sdma_v4_0_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /**
>>    * sdma_v4_0_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -812,7 +799,7 @@ static void sdma_v4_0_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>       unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>>       /* IB packet must end on a 8 DW boundary */
>> -    sdma_v4_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 
>> 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -1876,6 +1863,8 @@ static int sdma_v4_0_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>           if (adev->sdma.has_page_queue) {
>>               ring = &adev->sdma.instance[i].page;
>> @@ -1917,6 +1906,8 @@ static int sdma_v4_0_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>               ring->sdma.instance = &adev->sdma.instance[i];
>>               ring->sdma.index = i;
>> +            ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +            ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>           }
>>       }
>> @@ -2443,7 +2434,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v4_0_ring_funcs = {
>>       .emit_hdp_flush = sdma_v4_0_ring_emit_hdp_flush,
>>       .test_ring = sdma_v4_0_ring_test_ring,
>>       .test_ib = sdma_v4_0_ring_test_ib,
>> -    .insert_nop = sdma_v4_0_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v4_0_ring_pad_ib,
>>       .emit_wreg = sdma_v4_0_ring_emit_wreg,
>>       .emit_reg_wait = sdma_v4_0_ring_emit_reg_wait,
>> @@ -2475,7 +2466,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v4_0_page_ring_funcs = {
>>       .emit_hdp_flush = sdma_v4_0_ring_emit_hdp_flush,
>>       .test_ring = sdma_v4_0_ring_test_ring,
>>       .test_ib = sdma_v4_0_ring_test_ib,
>> -    .insert_nop = sdma_v4_0_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v4_0_ring_pad_ib,
>>       .emit_wreg = sdma_v4_0_ring_emit_wreg,
>>       .emit_reg_wait = sdma_v4_0_ring_emit_reg_wait,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>> index d1d21a3951f8..f68b7f2e0a40 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>> @@ -341,19 +341,6 @@ static void sdma_v4_4_2_page_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       }
>>   }
>> -static void sdma_v4_4_2_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /**
>>    * sdma_v4_4_2_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -372,7 +359,7 @@ static void sdma_v4_4_2_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>       unsigned vmid = AMDGPU_JOB_GET_VMID(job);
>>       /* IB packet must end on a 8 DW boundary */
>> -    sdma_v4_4_2_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -1431,6 +1418,8 @@ static int sdma_v4_4_2_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>           if (adev->sdma.has_page_queue) {
>>               ring = &adev->sdma.instance[i].page;
>> @@ -1455,6 +1444,8 @@ static int sdma_v4_4_2_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>               ring->sdma.instance = &adev->sdma.instance[i];
>>               ring->sdma.index = i;
>> +            ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +            ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>           }
>>       }
>> @@ -2013,7 +2004,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v4_4_2_ring_funcs = {
>>       .emit_hdp_flush = sdma_v4_4_2_ring_emit_hdp_flush,
>>       .test_ring = sdma_v4_4_2_ring_test_ring,
>>       .test_ib = sdma_v4_4_2_ring_test_ib,
>> -    .insert_nop = sdma_v4_4_2_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v4_4_2_ring_pad_ib,
>>       .emit_wreg = sdma_v4_4_2_ring_emit_wreg,
>>       .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>> @@ -2045,7 +2036,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v4_4_2_page_ring_funcs = {
>>       .emit_hdp_flush = sdma_v4_4_2_ring_emit_hdp_flush,
>>       .test_ring = sdma_v4_4_2_ring_test_ring,
>>       .test_ib = sdma_v4_4_2_ring_test_ib,
>> -    .insert_nop = sdma_v4_4_2_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v4_4_2_ring_pad_ib,
>>       .emit_wreg = sdma_v4_4_2_ring_emit_wreg,
>>       .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>> index 97536f82dfcd..bbf60bfa4b1b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>> @@ -433,19 +433,6 @@ static void sdma_v5_0_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       }
>>   }
>> -static void sdma_v5_0_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /**
>>    * sdma_v5_0_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -472,7 +459,7 @@ static void sdma_v5_0_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>        * (wptr + 6 + x) % 8 = 0.
>>        * The expression below, is a solution of x.
>>        */
>> -    sdma_v5_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 
>> 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -1453,6 +1440,8 @@ static int sdma_v5_0_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>       }
>>       adev->sdma.supported_reset =
>> @@ -1991,7 +1980,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v5_0_ring_funcs = {
>>       .emit_hdp_flush = sdma_v5_0_ring_emit_hdp_flush,
>>       .test_ring = sdma_v5_0_ring_test_ring,
>>       .test_ib = sdma_v5_0_ring_test_ib,
>> -    .insert_nop = sdma_v5_0_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v5_0_ring_pad_ib,
>>       .emit_wreg = sdma_v5_0_ring_emit_wreg,
>>       .emit_reg_wait = sdma_v5_0_ring_emit_reg_wait,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>> index 8eaddee1d97d..69b88db32117 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>> @@ -250,19 +250,6 @@ static void sdma_v5_2_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       }
>>   }
>> -static void sdma_v5_2_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /**
>>    * sdma_v5_2_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -289,7 +276,7 @@ static void sdma_v5_2_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>        * (wptr + 6 + x) % 8 = 0.
>>        * The expression below, is a solution of x.
>>        */
>> -    sdma_v5_2_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 
>> 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, SDMA_PKT_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -1358,6 +1345,8 @@ static int sdma_v5_2_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>       }
>>       adev->sdma.supported_reset =
>> @@ -1986,7 +1975,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v5_2_ring_funcs = {
>>       .emit_hdp_flush = sdma_v5_2_ring_emit_hdp_flush,
>>       .test_ring = sdma_v5_2_ring_test_ring,
>>       .test_ib = sdma_v5_2_ring_test_ib,
>> -    .insert_nop = sdma_v5_2_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v5_2_ring_pad_ib,
>>       .begin_use = sdma_v5_2_ring_begin_use,
>>       .end_use = sdma_v5_2_ring_end_use,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>> index 3ead269eccdc..aa3992fd5313 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>> @@ -235,19 +235,6 @@ static void sdma_v6_0_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       }
>>   }
>> -static void sdma_v6_0_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /*
>>    * sdma_v6_0_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -274,7 +261,7 @@ static void sdma_v6_0_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>        * (wptr + 6 + x) % 8 = 0.
>>        * The expression below, is a solution of x.
>>        */
>> -    sdma_v6_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 
>> 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, 
>> SDMA_PKT_COPY_LINEAR_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -1351,6 +1338,8 @@ static int sdma_v6_0_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>       }
>>       adev->sdma.supported_reset =
>> @@ -1707,7 +1696,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v6_0_ring_funcs = {
>>       .emit_hdp_flush = sdma_v6_0_ring_emit_hdp_flush,
>>       .test_ring = sdma_v6_0_ring_test_ring,
>>       .test_ib = sdma_v6_0_ring_test_ib,
>> -    .insert_nop = sdma_v6_0_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v6_0_ring_pad_ib,
>>       .emit_wreg = sdma_v6_0_ring_emit_wreg,
>>       .emit_reg_wait = sdma_v6_0_ring_emit_reg_wait,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>> index 5fadaf35a03a..4ccc00248a09 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>> @@ -267,19 +267,6 @@ static void sdma_v7_0_ring_set_wptr(struct 
>> amdgpu_ring *ring)
>>       }
>>   }
>> -static void sdma_v7_0_ring_insert_nop(struct amdgpu_ring *ring, 
>> uint32_t count)
>> -{
>> -    struct amdgpu_sdma_instance *sdma = ring->sdma.instance;
>> -    int i;
>> -
>> -    for (i = 0; i < count; i++)
>> -        if (sdma && sdma->burst_nop && (i == 0))
>> -            amdgpu_ring_write(ring, ring->funcs->nop |
>> -                SDMA_PKT_NOP_HEADER_COUNT(count - 1));
>> -        else
>> -            amdgpu_ring_write(ring, ring->funcs->nop);
>> -}
>> -
>>   /**
>>    * sdma_v7_0_ring_emit_ib - Schedule an IB on the DMA engine
>>    *
>> @@ -306,7 +293,7 @@ static void sdma_v7_0_ring_emit_ib(struct 
>> amdgpu_ring *ring,
>>        * (wptr + 6 + x) % 8 = 0.
>>        * The expression below, is a solution of x.
>>        */
>> -    sdma_v7_0_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) & 
>> 7);
>> +    amdgpu_sdma_ring_insert_nop(ring, (2 - lower_32_bits(ring->wptr)) 
>> & 7);
>>       amdgpu_ring_write(ring, 
>> SDMA_PKT_COPY_LINEAR_HEADER_OP(SDMA_OP_INDIRECT) |
>>                 SDMA_PKT_INDIRECT_HEADER_VMID(vmid & 0xf));
>> @@ -1365,6 +1352,8 @@ static int sdma_v7_0_sw_init(struct 
>> amdgpu_ip_block *ip_block)
>>           ring->sdma.instance = &adev->sdma.instance[i];
>>           ring->sdma.index = i;
>> +        ring->sdma.nop_pkt.mask = SDMA_PKT_NOP_HEADER_count_mask;
>> +        ring->sdma.nop_pkt.shift = SDMA_PKT_NOP_HEADER_count_shift;
>>       }
>>       adev->sdma.supported_reset =
>> @@ -1687,7 +1676,7 @@ static const struct amdgpu_ring_funcs 
>> sdma_v7_0_ring_funcs = {
>>       .emit_hdp_flush = sdma_v7_0_ring_emit_hdp_flush,
>>       .test_ring = sdma_v7_0_ring_test_ring,
>>       .test_ib = sdma_v7_0_ring_test_ib,
>> -    .insert_nop = sdma_v7_0_ring_insert_nop,
>> +    .insert_nop = amdgpu_sdma_ring_insert_nop,
>>       .pad_ib = sdma_v7_0_ring_pad_ib,
>>       .emit_wreg = sdma_v7_0_ring_emit_wreg,
>>       .emit_reg_wait = sdma_v7_0_ring_emit_reg_wait,
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding
  2025-01-02 13:45   ` Christian König
@ 2025-01-03 14:40     ` Tvrtko Ursulin
  0 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2025-01-03 14:40 UTC (permalink / raw)
  To: Christian König, Tvrtko Ursulin, amd-gfx; +Cc: kernel-dev, Sunil Khatri


On 02/01/2025 13:45, Christian König wrote:
> 
> 
> Am 27.12.24 um 12:19 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> Use memset32 instead of open coding it, just because it is
>> that bit nicer.
> 
> In general looks mostly good, my only concern is that we already had to 
> switch to memset_io() on some platforms for clearing buffers.
> 
> Now an IB should in theory always be in system memory, but it would be 
> nice to handle that cleanly.

I could only find kmap on the sub-allocation path which leads into IB. 
What am I missing?

Or if one day IO mapped objects will be added as backing store for SA 
then I think it shouldn't be too difficult to somehow propagate that 
fact up. Maybe all the way to struct amdgpu_ib which would then be able 
to decide here which flavour of memset to use.

Regards,

Tvrtko
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 12 ++++++++++--
>>   1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> index a6e28fe3f8d6..a27e32f48f99 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> @@ -136,8 +136,16 @@ void amdgpu_ring_insert_nop(struct amdgpu_ring 
>> *ring, uint32_t count)
>>    */
>>   void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct 
>> amdgpu_ib *ib)
>>   {
>> -    while (ib->length_dw & ring->funcs->align_mask)
>> -        ib->ptr[ib->length_dw++] = ring->funcs->nop;
>> +    u32 align_mask = ring->funcs->align_mask;
>> +    u32 count = ib->length_dw & align_mask;
>> +
>> +    if (count) {
>> +        count = align_mask + 1 - count;
>> +
>> +        memset32(&ib->ptr[ib->length_dw], ring->funcs->nop, count);
>> +
>> +        ib->length_dw += count;
>> +    }
>>   }
>>   /**
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-01-04 16:58 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-27 11:19 [RFC v4 00/12] Ring padding micro-optimisation etc Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 01/12] drm/amdgpu: Use memset32 for IB padding Tvrtko Ursulin
2025-01-02 13:45   ` Christian König
2025-01-03 14:40     ` Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 02/12] drm/amdgpu: Use memset32 for ring clearing Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 03/12] drm/amdgpu: Cache SDMA instance and index in the ring Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 04/12] drm/amdgpu: Consolidate a bunch of similar sdma insert nop vfuncs Tvrtko Ursulin
2025-01-02 13:49   ` Christian König
2025-01-03 14:25     ` Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 05/12] drm/amdgpu: Use memset32 for sdma insert nops Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 06/12] drm/amdgpu: Use amdgpu_ring_fill() for VPE padding Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 07/12] drm/amdgpu: Convert JPEG, VCE and UVD to more efficient ring padding Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 08/12] drm/amdgpu: Cache some values in ring emission helpers Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 09/12] drm/amdgpu: Optimise amdgpu_ring_write() Tvrtko Ursulin
2025-01-02 13:55   ` Christian König
2025-01-03 13:02     ` Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 10/12] drm/amdgpu: Convert VCE to variadic amdgpu_ring_write() Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 11/12] drm/amdgpu: Convert GFX v10.0 " Tvrtko Ursulin
2024-12-27 11:19 ` [PATCH 12/12] drm/amdgpu: Convert SDMA v5.2 " Tvrtko Ursulin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.