public inbox for stable@vger.kernel.org
* [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
@ 2024-08-27 14:10 Alex Deucher
From: Alex Deucher @ 2024-08-27 14:10 UTC
  To: stable, gregkh, sashal; +Cc: Jack Xiao, Alex Deucher

From: Jack Xiao <Jack.Xiao@amd.com>

wait memory room until enough before writing mes packets
to avoid ring buffer overflow.

v2: squash in sched_hw_submission fix

Backport from 6.11.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 34e087e8920e635c62e2ed6a758b0cd27f836d13)
Cc: stable@vger.kernel.org # 6.10.x
(cherry picked from commit 11752c013f562a1124088a35bd314aa0e9f0e88f)
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c   | 18 ++++++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 06f0a6534a94..88ffb15e25cc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -212,6 +212,8 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
 	 */
 	if (ring->funcs->type == AMDGPU_RING_TYPE_KIQ)
 		sched_hw_submission = max(sched_hw_submission, 256);
+	if (ring->funcs->type == AMDGPU_RING_TYPE_MES)
+		sched_hw_submission = 8;
 	else if (ring == &adev->sdma.instance[0].page)
 		sched_hw_submission = 256;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 32d4519541c6..e1a66d585f5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -163,7 +163,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	const char *op_str, *misc_op_str;
 	unsigned long flags;
 	u64 status_gpu_addr;
-	u32 status_offset;
+	u32 seq, status_offset;
 	u64 *status_ptr;
 	signed long r;
 	int ret;
@@ -191,6 +191,13 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	if (r)
 		goto error_unlock_free;
 
+	seq = ++ring->fence_drv.sync_seq;
+	r = amdgpu_fence_wait_polling(ring,
+				      seq - ring->fence_drv.num_fences_mask,
+				      timeout);
+	if (r < 1)
+		goto error_undo;
+
 	api_status = (struct MES_API_STATUS *)((char *)pkt + api_status_off);
 	api_status->api_completion_fence_addr = status_gpu_addr;
 	api_status->api_completion_fence_value = 1;
@@ -203,8 +210,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	mes_status_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;
 	mes_status_pkt.api_status.api_completion_fence_addr =
 		ring->fence_drv.gpu_addr;
-	mes_status_pkt.api_status.api_completion_fence_value =
-		++ring->fence_drv.sync_seq;
+	mes_status_pkt.api_status.api_completion_fence_value = seq;
 
 	amdgpu_ring_write_multiple(ring, &mes_status_pkt,
 				   sizeof(mes_status_pkt) / 4);
@@ -224,7 +230,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 		dev_dbg(adev->dev, "MES msg=%d was emitted\n",
 			x_pkt->header.opcode);
 
-	r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq, timeout);
+	r = amdgpu_fence_wait_polling(ring, seq, timeout);
 	if (r < 1 || !*status_ptr) {
 
 		if (misc_op_str)
@@ -247,6 +253,10 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	amdgpu_device_wb_free(adev, status_offset);
 	return 0;
 
+error_undo:
+	dev_err(adev->dev, "MES ring buffer is full.\n");
+	amdgpu_ring_undo(ring);
+
 error_unlock_free:
 	spin_unlock_irqrestore(&mes->ring_lock, flags);
 
-- 
2.46.0
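The core of the mes_v11_0.c change is sequence-number flow control. The patch reserves the fence value first (`seq = ++ring->fence_drv.sync_seq`). It then polls until the fence for `seq - num_fences_mask` has signalled, so no more than `num_fences_mask` MES submissions are outstanding when new packets are written. The sketch below is a hypothetical user-space model of that idea, not amdgpu code. The names `struct model_fence_drv`, `model_reserve_seq()`, `model_wait_polling()` and `model_wait_for_room()` are invented, the GPU is simulated by a counter, and the timeout is counted in poll iterations rather than microseconds. The return convention (remaining budget on success, 0 on timeout) is assumed to match how the patch tests `r < 1` after `amdgpu_fence_wait_polling()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of ring->fence_drv; not the real amdgpu structure. */
struct model_fence_drv {
	uint32_t sync_seq;        /* last sequence number handed to a submission */
	uint32_t completed;       /* last sequence number the simulated GPU signalled */
	uint32_t num_fences_mask; /* assumed: 2 * sched_hw_submission - 1 */
};

/* Wrap-safe check that sequence number a has reached b. */
static bool seq_reached(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Reserve the next sequence number up front, like seq = ++sync_seq in the patch. */
static uint32_t model_reserve_seq(struct model_fence_drv *drv)
{
	return ++drv->sync_seq;
}

/*
 * Poll until fence wait_seq has signalled or the budget runs out.
 * Returns the remaining budget (> 0) on success and 0 on timeout.
 * If gpu_running is true, one outstanding fence retires per poll.
 */
static long model_wait_polling(struct model_fence_drv *drv, uint32_t wait_seq,
			       long timeout, bool gpu_running)
{
	while (!seq_reached(drv->completed, wait_seq) && timeout > 0) {
		if (gpu_running && drv->completed != drv->sync_seq)
			drv->completed++;
		timeout--;
	}
	return timeout > 0 ? timeout : 0;
}

/*
 * Room check before writing the packets for seq: the fence that is
 * num_fences_mask submissions older must already have signalled.
 */
static long model_wait_for_room(struct model_fence_drv *drv, uint32_t seq,
				long timeout, bool gpu_running)
{
	/* A zero mask would wait for the very fence we are about to emit. */
	assert(drv->num_fences_mask != 0);
	return model_wait_polling(drv, seq - drv->num_fences_mask, timeout,
				  gpu_running);
}
```

Assuming, as the amdgpu fence driver appears to do, that `num_fences_mask` is `2 * sched_hw_submission - 1`, the `sched_hw_submission = 8` cap added in amdgpu_ring.c would give a mask of 15. In the model with the GPU stalled, fifteen submissions find room immediately and the sixteenth times out. That is the case the patch now reports as "MES ring buffer is full." and unwinds with `amdgpu_ring_undo()` instead of overwriting packets the MES has not consumed yet.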



* Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
@ 2024-08-27 14:21 ` Greg KH
From: Greg KH @ 2024-08-27 14:21 UTC
  To: Alex Deucher; +Cc: stable, sashal, Jack Xiao

On Tue, Aug 27, 2024 at 10:10:25AM -0400, Alex Deucher wrote:
> From: Jack Xiao <Jack.Xiao@amd.com>
> 
> wait memory room until enough before writing mes packets
> to avoid ring buffer overflow.
> 
> v2: squash in sched_hw_submission fix
> 
> Backport from 6.11.
> 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
> Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
> Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")

These commits are in 6.11-rc1.

> Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
> Acked-by: Alex Deucher <alexander.deucher@amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> (cherry picked from commit 34e087e8920e635c62e2ed6a758b0cd27f836d13)
> Cc: stable@vger.kernel.org # 6.10.x

So why does this need to go to 6.10.y?

confused,

greg k-h


* RE: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
@ 2024-08-27 15:01   ` Deucher, Alexander
From: Deucher, Alexander @ 2024-08-27 15:01 UTC
  To: Greg KH; +Cc: stable@vger.kernel.org, sashal@kernel.org, Xiao, Jack

[Public]

> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, August 27, 2024 10:21 AM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org; sashal@kernel.org; Xiao, Jack
> <Jack.Xiao@amd.com>
> Subject: Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
>
> On Tue, Aug 27, 2024 at 10:10:25AM -0400, Alex Deucher wrote:
> > From: Jack Xiao <Jack.Xiao@amd.com>
> >
> > wait memory room until enough before writing mes packets to avoid ring
> > buffer overflow.
> >
> > v2: squash in sched_hw_submission fix
> >
> > Backport from 6.11.
> >
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
> > Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
> > Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")
>
> These commits are in 6.11-rc1.

de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
was ported to 6.10 as well:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c?h=linux-6.10.y&id=e356d321d0240663a09b139fa3658ddbca163e27
So this fix is applicable there.

Alex

>
> > Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
> > Acked-by: Alex Deucher <alexander.deucher@amd.com>
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > (cherry picked from commit 34e087e8920e635c62e2ed6a758b0cd27f836d13)
> > Cc: stable@vger.kernel.org # 6.10.x
>
> So why does this need to go to 6.10.y?
>
> confused,
>
> greg k-h


* Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
@ 2024-08-27 16:13     ` Greg KH
From: Greg KH @ 2024-08-27 16:13 UTC
  To: Deucher, Alexander; +Cc: stable@vger.kernel.org, sashal@kernel.org, Xiao, Jack

On Tue, Aug 27, 2024 at 03:01:54PM +0000, Deucher, Alexander wrote:
> [Public]
> 
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Tuesday, August 27, 2024 10:21 AM
> > To: Deucher, Alexander <Alexander.Deucher@amd.com>
> > Cc: stable@vger.kernel.org; sashal@kernel.org; Xiao, Jack
> > <Jack.Xiao@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
> >
> > On Tue, Aug 27, 2024 at 10:10:25AM -0400, Alex Deucher wrote:
> > > From: Jack Xiao <Jack.Xiao@amd.com>
> > >
> > > wait memory room until enough before writing mes packets to avoid ring
> > > buffer overflow.
> > >
> > > v2: squash in sched_hw_submission fix
> > >
> > > Backport from 6.11.
> > >
> > > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
> > > Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
> > > Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")
> >
> > These commits are in 6.11-rc1.
> 
> de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
> was ported to 6.10 as well:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c?h=linux-6.10.y&id=e356d321d0240663a09b139fa3658ddbca163e27
> So this fix is applicable there.

No, commit e356d321d024 ("drm/amdgpu: cleanup MES11 command submission")
is in the 6.10 release, but commit de3246254156 ("drm/amdgpu: cleanup
MES11 command submission") is in 6.11-rc1!

So how in the world are we supposed to know anything here?

See how broken this all is?

I give up.

If you all want any AMD patches applied to stable trees, manually send
us a set of backported patches, AND be sure to get the git ids right.

I'll leave what I have right now in the queues, but after this round of
-rc releases, all AMD patches with cc: stable are going to be
automatically dropped and ignored.  I NEED you all to manually send them
to me now as this is just insane.

Time to go buy a Intel gpu card as there's no way this is going to work
out well over time...

{sigh}

greg k-h
