* [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
@ 2024-03-07 19:04 Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Joshua Ashton @ 2024-03-07 19:04 UTC (permalink / raw)
To: amd-gfx
Cc: Joshua Ashton, Friedrich Vock, Bas Nieuwenhuizen,
Christian König, André Almeida, stable
As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting hanging
command buffers cascading us to a hard reset.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Christian König <christian.koenig@amd.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
struct dma_fence *fence = NULL;
int r;
- /* Ignore soft recovered fences here */
r = drm_sched_entity_error(s_entity);
- if (r && r != -ENODATA)
+ if (r)
goto error;
if (!fence && job->gang_submit)
--
2.44.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
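For context, the single filtered errno matters here: -ENODATA is the error amdgpu records on a fence when a job is soft-recovered (the removed comment "Ignore soft recovered fences here" confirms this), so dropping the `r != -ENODATA` exception makes later submissions on that entity fail instead of silently queueing more doomed work. A minimal user-space model of the before/after check (kernel helpers mocked; names are illustrative, not the real driver API):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mock of drm_sched_entity_error(): the last error recorded on the
 * entity's fence, or 0 if none. A soft-recovered job leaves -ENODATA. */
int mock_entity_error(int last_error)
{
    return last_error;
}

/* Old check: soft-recovery errors (-ENODATA) were explicitly ignored,
 * so a context that already hung could keep submitting. */
bool old_prepare_rejects(int last_error)
{
    int r = mock_entity_error(last_error);
    return r && r != -ENODATA;
}

/* New check from the patch: any recorded error aborts job preparation. */
bool new_prepare_rejects(int last_error)
{
    return mock_entity_error(last_error) != 0;
}
```

With the old check a soft-recovered entity (error -ENODATA) sailed through, while any other error was rejected; with the new one it is rejected like any other error, which is what lets userspace notice the hang and stop resubmitting.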
* [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
@ 2024-03-07 19:04 ` Joshua Ashton
2024-03-08 8:23 ` Christian König
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2 siblings, 1 reply; 11+ messages in thread
From: Joshua Ashton @ 2024-03-07 19:04 UTC (permalink / raw)
To: amd-gfx
Cc: Joshua Ashton, Friedrich Vock, Bas Nieuwenhuizen,
Christian König, André Almeida, stable
Otherwise we are determining this timeout based on
a time before we go into some unrelated spinlock,
which is bad.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Christian König <christian.koenig@amd.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5505d646f43a..57c94901ed0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
if (unlikely(ring->adev->debug_disable_soft_recovery))
return false;
- deadline = ktime_add_us(ktime_get(), 10000);
-
if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
return false;
@@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
spin_unlock_irqrestore(fence->lock, flags);
atomic_inc(&ring->adev->gpu_reset_counter);
+ deadline = ktime_add_us(ktime_get(), 10000);
while (!dma_fence_is_signaled(fence) &&
ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
ring->funcs->soft_recovery(ring, vmid);
--
2.44.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
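The point of the move is that ktime_get() is sampled exactly once to form the deadline, so everything that runs between that sample and the recovery loop silently eats the 10ms budget. A small user-space model (plain integers standing in for ktime values) of how much budget the early placement loses:

```c
#include <assert.h>
#include <stdint.h>

#define BUDGET_NS (10 * 1000 * 1000LL)  /* 10ms, as in the original code */

/* Remaining recovery budget when the deadline was computed at time
 * 'sampled_at' but the recovery loop only starts at time 'loop_start'. */
int64_t remaining_budget_ns(int64_t sampled_at, int64_t loop_start)
{
    int64_t deadline = sampled_at + BUDGET_NS;
    return deadline - loop_start;
}
```

If the fence spinlock (or anything else) burns 5ms between the sample and the loop, the early placement leaves only 5ms of the nominal 10ms; sampling right before the loop, as the patch does, always starts with the full budget.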
* [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
@ 2024-03-07 19:04 ` Joshua Ashton
2024-03-08 8:29 ` Christian König
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2 siblings, 1 reply; 11+ messages in thread
From: Joshua Ashton @ 2024-03-07 19:04 UTC (permalink / raw)
To: amd-gfx
Cc: Joshua Ashton, Friedrich Vock, Bas Nieuwenhuizen,
Christian König, André Almeida, stable
Results in much more reliable soft recovery on
Steam Deck.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Christian König <christian.koenig@amd.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
spin_unlock_irqrestore(fence->lock, flags);
atomic_inc(&ring->adev->gpu_reset_counter);
- deadline = ktime_add_us(ktime_get(), 10000);
+ deadline = ktime_add_ms(ktime_get(), 500);
while (!dma_fence_is_signaled(fence) &&
ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
ring->funcs->soft_recovery(ring, vmid);
--
2.44.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
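For scale, the two deadlines differ by 50x: ktime_add_us(t, 10000) is 10ms, while ktime_add_ms(t, 500) is 500ms. The loop itself is unchanged; a condensed user-space model of it, with iterations standing in for elapsed time and a mock fence in place of dma_fence:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock fence: signals after a fixed number of soft-recovery kicks. */
struct mock_fence {
    int kicks_until_signal;
};

bool fence_signaled(const struct mock_fence *f)
{
    return f->kicks_until_signal <= 0;
}

/* Stand-in for ring->funcs->soft_recovery(ring, vmid). */
void soft_recovery_kick(struct mock_fence *f)
{
    f->kicks_until_signal--;
}

/* Model of amdgpu_ring_soft_recovery()'s loop: keep kicking until the
 * fence signals or the budget (time, modeled as iterations) runs out. */
bool soft_recover(struct mock_fence *f, int budget)
{
    while (!fence_signaled(f) && budget-- > 0)
        soft_recovery_kick(f);
    return fence_signaled(f);
}
```

The debate in the replies is purely about how large the budget should be: a hang that needs 30 units of work to break out of succeeds with a 500-unit budget but fails with a 10-unit one.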
* Re: [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
@ 2024-03-08 8:23 ` Christian König
0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2024-03-08 8:23 UTC (permalink / raw)
To: Joshua Ashton, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 07.03.24 20:04, Joshua Ashton wrote:
> Otherwise we are determining this timeout based on
> a time before we go into some unrelated spinlock,
> which is bad.
Actually, I don't think this is a good idea.
The spinlock is the fence processing lock, so when fence processing is
blocking us with activity, it is perfectly valid and desirable that the
timeout is decreased.
Regards,
Christian.
>
> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: stable@vger.kernel.org
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 5505d646f43a..57c94901ed0a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> if (unlikely(ring->adev->debug_disable_soft_recovery))
> return false;
>
> - deadline = ktime_add_us(ktime_get(), 10000);
> -
> if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
> return false;
>
> @@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> spin_unlock_irqrestore(fence->lock, flags);
>
> atomic_inc(&ring->adev->gpu_reset_counter);
> + deadline = ktime_add_us(ktime_get(), 10000);
> while (!dma_fence_is_signaled(fence) &&
> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> ring->funcs->soft_recovery(ring, vmid);
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
@ 2024-03-08 8:29 ` Christian König
2024-03-08 22:31 ` Joshua Ashton
0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2024-03-08 8:29 UTC (permalink / raw)
To: Joshua Ashton, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 07.03.24 20:04, Joshua Ashton wrote:
> Results in much more reliable soft recovery on
> Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could
increase the 10ms to something like 20ms, but I really wouldn't go much
over that.
This just kills shaders that are in an endless loop; when that takes
longer than 10-20ms, we really have a hardware problem that needs a full
reset to resolve.
Regards,
Christian.
>
> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: stable@vger.kernel.org
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 57c94901ed0a..be99db0e077e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> spin_unlock_irqrestore(fence->lock, flags);
>
> atomic_inc(&ring->adev->gpu_reset_counter);
> - deadline = ktime_add_us(ktime_get(), 10000);
> + deadline = ktime_add_ms(ktime_get(), 500);
> while (!dma_fence_is_signaled(fence) &&
> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> ring->funcs->soft_recovery(ring, vmid);
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
@ 2024-03-08 8:33 ` Christian König
2024-03-09 16:27 ` Marek Olšák
2024-08-01 15:17 ` Friedrich Vock
2 siblings, 2 replies; 11+ messages in thread
From: Christian König @ 2024-03-08 8:33 UTC (permalink / raw)
To: Joshua Ashton, amd-gfx, Olsak, Marek
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 07.03.24 20:04, Joshua Ashton wrote:
> As we discussed before[1], soft recovery should be
> forwarded to userspace, or we can get into a really
> bad state where apps will keep submitting hanging
> command buffers cascading us to a hard reset.
Marek, you have been in favor of this forever. So I would like to
request you to put your Reviewed-by on it, and I will just push it into
our internal kernel branch.
Regards,
Christian.
>
> 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: stable@vger.kernel.org
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 4b3000c21ef2..aebf59855e9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
> struct dma_fence *fence = NULL;
> int r;
>
> - /* Ignore soft recovered fences here */
> r = drm_sched_entity_error(s_entity);
> - if (r && r != -ENODATA)
> + if (r)
> goto error;
>
> if (!fence && job->gang_submit)
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-08 8:29 ` Christian König
@ 2024-03-08 22:31 ` Joshua Ashton
2024-03-11 6:46 ` Christian König
0 siblings, 1 reply; 11+ messages in thread
From: Joshua Ashton @ 2024-03-08 22:31 UTC (permalink / raw)
To: Christian König, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
It definitely takes much longer than 10-20ms in some instances.
Some of these instances can even be shown in Friedrich's hang test suite
-- specifically when there are a lot of page faults going on.
The work (or parts of the work) could also be pending and not in any
wave yet, just hanging out in the ring. There may be a better solution
to that, but I don't know it.
Raising it to .5s still makes sense to me.
- Joshie 🐸✨
On 3/8/24 08:29, Christian König wrote:
> On 07.03.24 20:04, Joshua Ashton wrote:
>> Results in much more reliable soft recovery on
>> Steam Deck.
>
> Waiting 500ms for a locked-up shader is way too long, I think. We could
> increase the 10ms to something like 20ms, but I really wouldn't go much
> over that.
>
> This just kills shaders that are in an endless loop; when that takes
> longer than 10-20ms, we really have a hardware problem that needs a full
> reset to resolve.
>
> Regards,
> Christian.
>
>>
>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>
>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: stable@vger.kernel.org
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> index 57c94901ed0a..be99db0e077e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> @@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring
>> *ring, unsigned int vmid,
>> spin_unlock_irqrestore(fence->lock, flags);
>> atomic_inc(&ring->adev->gpu_reset_counter);
>> - deadline = ktime_add_us(ktime_get(), 10000);
>> + deadline = ktime_add_ms(ktime_get(), 500);
>> while (!dma_fence_is_signaled(fence) &&
>> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>> ring->funcs->soft_recovery(ring, vmid);
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
@ 2024-03-09 16:27 ` Marek Olšák
2024-08-01 15:17 ` Friedrich Vock
1 sibling, 0 replies; 11+ messages in thread
From: Marek Olšák @ 2024-03-09 16:27 UTC (permalink / raw)
To: Christian König
Cc: Joshua Ashton, amd-gfx, Olsak, Marek, Friedrich Vock,
Bas Nieuwenhuizen, André Almeida, stable
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Marek
On Fri, Mar 8, 2024 at 3:43 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 07.03.24 20:04, Joshua Ashton wrote:
> > As we discussed before[1], soft recovery should be
> > forwarded to userspace, or we can get into a really
> > bad state where apps will keep submitting hanging
> > command buffers cascading us to a hard reset.
>
> Marek, you have been in favor of this forever. So I would like to
> request you to put your Reviewed-by on it, and I will just push it into
> our internal kernel branch.
>
> Regards,
> Christian.
>
> >
> > 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
> > Signed-off-by: Joshua Ashton <joshua@froggi.es>
> >
> > Cc: Friedrich Vock <friedrich.vock@gmx.de>
> > Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: André Almeida <andrealmeid@igalia.com>
> > Cc: stable@vger.kernel.org
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 4b3000c21ef2..aebf59855e9f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
> > struct dma_fence *fence = NULL;
> > int r;
> >
> > - /* Ignore soft recovered fences here */
> > r = drm_sched_entity_error(s_entity);
> > - if (r && r != -ENODATA)
> > + if (r)
> > goto error;
> >
> > if (!fence && job->gang_submit)
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-08 22:31 ` Joshua Ashton
@ 2024-03-11 6:46 ` Christian König
0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2024-03-11 6:46 UTC (permalink / raw)
To: Joshua Ashton, Christian König, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 08.03.24 23:31, Joshua Ashton wrote:
> It definitely takes much longer than 10-20ms in some instances.
>
> Some of these instances can even be shown in Friedrich's hang test
> suite -- specifically when there are a lot of page faults going on.
Exactly; that's the part I want to avoid. The context-based recovery is
there to break out of shaders with endless loops.
When there are page faults going on, I would rather recommend a hard
reset of the GPU.
>
> The work (or parts of the work) could also be pending and not in any
> wave yet, just hanging out in the ring. There may be a better solution
> to that, but I don't know it.
Yeah, but killing anything like that should never take longer than what
the original submission was supposed to take.
In other words, when we assume that we should have at least 20fps, then
we should never go over 50ms. And even at this point we have already
waited much longer than that for the shader to complete.
If you really want to raise it this high, I would rather make it
configurable.
Regards,
Christian.
>
> Raising it to .5s still makes sense to me.
>
> - Joshie 🐸✨
>
> On 3/8/24 08:29, Christian König wrote:
>> On 07.03.24 20:04, Joshua Ashton wrote:
>>> Results in much more reliable soft recovery on
>>> Steam Deck.
>>
>> Waiting 500ms for a locked-up shader is way too long, I think. We could
>> increase the 10ms to something like 20ms, but I really wouldn't go
>> much over that.
>>
>> This just kills shaders that are in an endless loop; when that takes
>> longer than 10-20ms, we really have a hardware problem that needs a
>> full reset to resolve.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>>
>>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: André Almeida <andrealmeid@igalia.com>
>>> Cc: stable@vger.kernel.org
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> index 57c94901ed0a..be99db0e077e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> @@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct
>>> amdgpu_ring *ring, unsigned int vmid,
>>> spin_unlock_irqrestore(fence->lock, flags);
>>> atomic_inc(&ring->adev->gpu_reset_counter);
>>> - deadline = ktime_add_us(ktime_get(), 10000);
>>> + deadline = ktime_add_ms(ktime_get(), 500);
>>> while (!dma_fence_is_signaled(fence) &&
>>> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>> ring->funcs->soft_recovery(ring, vmid);
>>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
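Christian's 50ms figure above is simply the per-frame time budget at his assumed minimum frame rate: 1000ms / 20fps = 50ms. As a trivial check of that arithmetic:

```c
#include <assert.h>

/* Per-frame budget in milliseconds for a minimum acceptable frame rate. */
unsigned int frame_budget_ms(unsigned int fps)
{
    return 1000 / fps;
}
```

At 20fps that is 50ms per frame, and at 60fps only ~16ms, which is why he treats 500ms as an order of magnitude past the point where a full reset is the better answer.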
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2024-03-09 16:27 ` Marek Olšák
@ 2024-08-01 15:17 ` Friedrich Vock
2024-08-02 8:30 ` Christian König
1 sibling, 1 reply; 11+ messages in thread
From: Friedrich Vock @ 2024-08-01 15:17 UTC (permalink / raw)
To: Christian König, Joshua Ashton, amd-gfx, Olsak, Marek
Cc: Bas Nieuwenhuizen, André Almeida, stable
Hi,
I happened to come across an issue again just now where soft recovery
fails to get reported to userspace properly, causing apps to submit
hanging work in a loop (which ended up hanging the entire machine). It
seems like this patch never made it into amd-staging-drm-next. Given
that it has a Reviewed-by and everything, was this just an oversight, or
are there some blockers to pushing it that I missed?
If not, I'd be grateful if the patch could get merged.
Thanks,
Friedrich
On 08.03.24 09:33, Christian König wrote:
> On 07.03.24 20:04, Joshua Ashton wrote:
>> As we discussed before[1], soft recovery should be
>> forwarded to userspace, or we can get into a really
>> bad state where apps will keep submitting hanging
>> command buffers cascading us to a hard reset.
>
> Marek, you have been in favor of this forever. So I would like to
> request you to put your Reviewed-by on it, and I will just push it into
> our internal kernel branch.
>
> Regards,
> Christian.
>
>>
>> 1:
>> https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>
>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: stable@vger.kernel.org
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 4b3000c21ef2..aebf59855e9f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job
>> *sched_job,
>> struct dma_fence *fence = NULL;
>> int r;
>> - /* Ignore soft recovered fences here */
>> r = drm_sched_entity_error(s_entity);
>> - if (r && r != -ENODATA)
>> + if (r)
>> goto error;
>> if (!fence && job->gang_submit)
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-08-01 15:17 ` Friedrich Vock
@ 2024-08-02 8:30 ` Christian König
0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2024-08-02 8:30 UTC (permalink / raw)
To: Friedrich Vock, Christian König, Joshua Ashton, amd-gfx,
Olsak, Marek
Cc: Bas Nieuwenhuizen, André Almeida, stable
On 01.08.24 17:17, Friedrich Vock wrote:
> Hi,
>
> I happened to come across an issue again just now where soft recovery
> fails to get reported to userspace properly, causing apps to submit
> hanging work in a loop (which ended up hanging the entire machine). It
> seems like this patch never made it into amd-staging-drm-next. Given
> that it has a Reviewed-by and everything, was this just an oversight, or
> are there some blockers to pushing it that I missed?
>
> If not, I'd be grateful if the patch could get merged.
Sorry, that was my fault; I forgot about it because Alex usually
picks up stuff for amd-staging-drm-next.
Thanks for the reminder; just pushed it.
Regards,
Christian.
>
> Thanks,
> Friedrich
>
> On 08.03.24 09:33, Christian König wrote:
> On 07.03.24 20:04, Joshua Ashton wrote:
>>> As we discussed before[1], soft recovery should be
>>> forwarded to userspace, or we can get into a really
>>> bad state where apps will keep submitting hanging
>>> command buffers cascading us to a hard reset.
>>
>> Marek, you have been in favor of this forever. So I would like to
>> request you to put your Reviewed-by on it, and I will just push it into
>> our internal kernel branch.
>>
>> Regards,
>> Christian.
>>
>>>
>>> 1:
>>> https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
>>>
>>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>>
>>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: André Almeida <andrealmeid@igalia.com>
>>> Cc: stable@vger.kernel.org
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
>>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 4b3000c21ef2..aebf59855e9f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job
>>> *sched_job,
>>> struct dma_fence *fence = NULL;
>>> int r;
>>> - /* Ignore soft recovered fences here */
>>> r = drm_sched_entity_error(s_entity);
>>> - if (r && r != -ENODATA)
>>> + if (r)
>>> goto error;
>>> if (!fence && job->gang_submit)
>>
^ permalink raw reply [flat|nested] 11+ messages in thread
Thread overview: 11+ messages
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
2024-03-08 8:23 ` Christian König
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
2024-03-08 8:29 ` Christian König
2024-03-08 22:31 ` Joshua Ashton
2024-03-11 6:46 ` Christian König
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2024-03-09 16:27 ` Marek Olšák
2024-08-01 15:17 ` Friedrich Vock
2024-08-02 8:30 ` Christian König