* [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
@ 2024-03-07 19:04 Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Joshua Ashton @ 2024-03-07 19:04 UTC (permalink / raw)
To: amd-gfx
Cc: Joshua Ashton, Friedrich Vock, Bas Nieuwenhuizen,
Christian König, André Almeida, stable
As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting hanging
command buffers cascading us to a hard reset.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Christian König <christian.koenig@amd.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
struct dma_fence *fence = NULL;
int r;
- /* Ignore soft recovered fences here */
r = drm_sched_entity_error(s_entity);
- if (r && r != -ENODATA)
+ if (r)
goto error;
if (!fence && job->gang_submit)
--
2.44.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
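For context, the single filtered errno matters here: -ENODATA is the error amdgpu records on a fence when a job is soft-recovered (the removed comment "Ignore soft recovered fences here" confirms this), so dropping the `r != -ENODATA` exception makes later submissions on that entity fail instead of silently queueing more doomed work. A minimal user-space model of the before/after check (kernel helpers mocked; names are illustrative, not the real driver API):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mock of drm_sched_entity_error(): the last error recorded on the
 * entity's fence, or 0 if none. A soft-recovered job leaves -ENODATA. */
int mock_entity_error(int last_error)
{
    return last_error;
}

/* Old check: soft-recovery errors (-ENODATA) were explicitly ignored,
 * so a context that already hung could keep submitting. */
bool old_prepare_rejects(int last_error)
{
    int r = mock_entity_error(last_error);
    return r && r != -ENODATA;
}

/* New check from the patch: any recorded error aborts job preparation. */
bool new_prepare_rejects(int last_error)
{
    return mock_entity_error(last_error) != 0;
}
```

With the old check a soft-recovered entity (error -ENODATA) sailed through, while any other error was rejected; with the new one it is rejected like any other error, which is what lets userspace notice the hang and stop resubmitting.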
* [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
@ 2024-03-07 19:04 ` Joshua Ashton
2024-03-08 8:23 ` Christian König
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2 siblings, 1 reply; 11+ messages in thread
From: Joshua Ashton @ 2024-03-07 19:04 UTC (permalink / raw)
To: amd-gfx
Cc: Joshua Ashton, Friedrich Vock, Bas Nieuwenhuizen,
Christian König, André Almeida, stable
Otherwise we are determining this timeout based on
a time before we go into some unrelated spinlock,
which is bad.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Christian König <christian.koenig@amd.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5505d646f43a..57c94901ed0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
if (unlikely(ring->adev->debug_disable_soft_recovery))
return false;
- deadline = ktime_add_us(ktime_get(), 10000);
-
if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
return false;
@@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
spin_unlock_irqrestore(fence->lock, flags);
atomic_inc(&ring->adev->gpu_reset_counter);
+ deadline = ktime_add_us(ktime_get(), 10000);
while (!dma_fence_is_signaled(fence) &&
ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
ring->funcs->soft_recovery(ring, vmid);
--
2.44.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
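The point of the move is that ktime_get() is sampled exactly once to form the deadline, so everything that runs between that sample and the recovery loop silently eats the 10ms budget. A small user-space model (plain integers standing in for ktime values) of how much budget the early placement loses:

```c
#include <assert.h>
#include <stdint.h>

#define BUDGET_NS (10 * 1000 * 1000LL)  /* 10ms, as in the original code */

/* Remaining recovery budget when the deadline was computed at time
 * 'sampled_at' but the recovery loop only starts at time 'loop_start'. */
int64_t remaining_budget_ns(int64_t sampled_at, int64_t loop_start)
{
    int64_t deadline = sampled_at + BUDGET_NS;
    return deadline - loop_start;
}
```

If the fence spinlock (or anything else) burns 5ms between the sample and the loop, the early placement leaves only 5ms of the nominal 10ms; sampling right before the loop, as the patch does, always starts with the full budget.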
* [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
@ 2024-03-07 19:04 ` Joshua Ashton
2024-03-08 8:29 ` Christian König
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2 siblings, 1 reply; 11+ messages in thread
From: Joshua Ashton @ 2024-03-07 19:04 UTC (permalink / raw)
To: amd-gfx
Cc: Joshua Ashton, Friedrich Vock, Bas Nieuwenhuizen,
Christian König, André Almeida, stable
Results in much more reliable soft recovery on
Steam Deck.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Christian König <christian.koenig@amd.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
spin_unlock_irqrestore(fence->lock, flags);
atomic_inc(&ring->adev->gpu_reset_counter);
- deadline = ktime_add_us(ktime_get(), 10000);
+ deadline = ktime_add_ms(ktime_get(), 500);
while (!dma_fence_is_signaled(fence) &&
ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
ring->funcs->soft_recovery(ring, vmid);
--
2.44.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
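For scale, the two deadlines differ by 50x: ktime_add_us(t, 10000) is 10ms, while ktime_add_ms(t, 500) is 500ms. The loop itself is unchanged; a condensed user-space model of it, with iterations standing in for elapsed time and a mock fence in place of dma_fence:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock fence: signals after a fixed number of soft-recovery kicks. */
struct mock_fence {
    int kicks_until_signal;
};

bool fence_signaled(const struct mock_fence *f)
{
    return f->kicks_until_signal <= 0;
}

/* Stand-in for ring->funcs->soft_recovery(ring, vmid). */
void soft_recovery_kick(struct mock_fence *f)
{
    f->kicks_until_signal--;
}

/* Model of amdgpu_ring_soft_recovery()'s loop: keep kicking until the
 * fence signals or the budget (time, modeled as iterations) runs out. */
bool soft_recover(struct mock_fence *f, int budget)
{
    while (!fence_signaled(f) && budget-- > 0)
        soft_recovery_kick(f);
    return fence_signaled(f);
}
```

The debate in the replies is purely about how large the budget should be: a hang that needs 30 units of work to break out of succeeds with a 500-unit budget but fails with a 10-unit one.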
* Re: [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
@ 2024-03-08 8:23 ` Christian König
0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2024-03-08 8:23 UTC (permalink / raw)
To: Joshua Ashton, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 07.03.24 20:04, Joshua Ashton wrote:
> Otherwise we are determining this timeout based on
> a time before we go into some unrelated spinlock,
> which is bad.
Actually, I don't think this is a good idea.
The spinlock is the fence processing lock, so when fence processing is
blocking us with activity, it is perfectly valid and desirable that the
timeout is decreased.
Regards,
Christian.
>
> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: stable@vger.kernel.org
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 5505d646f43a..57c94901ed0a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> if (unlikely(ring->adev->debug_disable_soft_recovery))
> return false;
>
> - deadline = ktime_add_us(ktime_get(), 10000);
> -
> if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
> return false;
>
> @@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> spin_unlock_irqrestore(fence->lock, flags);
>
> atomic_inc(&ring->adev->gpu_reset_counter);
> + deadline = ktime_add_us(ktime_get(), 10000);
> while (!dma_fence_is_signaled(fence) &&
> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> ring->funcs->soft_recovery(ring, vmid);
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
@ 2024-03-08 8:29 ` Christian König
2024-03-08 22:31 ` Joshua Ashton
0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2024-03-08 8:29 UTC (permalink / raw)
To: Joshua Ashton, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 07.03.24 20:04, Joshua Ashton wrote:
> Results in much more reliable soft recovery on
> Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could
increase the 10ms to something like 20ms, but I really wouldn't go much
over that.
This just kills shaders that are in an endless loop; when that takes
longer than 10-20ms, we really have a hardware problem that needs a full
reset to resolve.
Regards,
Christian.
>
> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: stable@vger.kernel.org
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 57c94901ed0a..be99db0e077e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> spin_unlock_irqrestore(fence->lock, flags);
>
> atomic_inc(&ring->adev->gpu_reset_counter);
> - deadline = ktime_add_us(ktime_get(), 10000);
> + deadline = ktime_add_ms(ktime_get(), 500);
> while (!dma_fence_is_signaled(fence) &&
> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> ring->funcs->soft_recovery(ring, vmid);
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
@ 2024-03-08 8:33 ` Christian König
2024-03-09 16:27 ` Marek Olšák
2024-08-01 15:17 ` Friedrich Vock
2 siblings, 2 replies; 11+ messages in thread
From: Christian König @ 2024-03-08 8:33 UTC (permalink / raw)
To: Joshua Ashton, amd-gfx, Olsak, Marek
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 07.03.24 20:04, Joshua Ashton wrote:
> As we discussed before[1], soft recovery should be
> forwarded to userspace, or we can get into a really
> bad state where apps will keep submitting hanging
> command buffers cascading us to a hard reset.
Marek, you have been in favor of this forever. So I would like to
request you to put your Reviewed-by on it, and I will just push it into
our internal kernel branch.
Regards,
Christian.
>
> 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: stable@vger.kernel.org
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 4b3000c21ef2..aebf59855e9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
> struct dma_fence *fence = NULL;
> int r;
>
> - /* Ignore soft recovered fences here */
> r = drm_sched_entity_error(s_entity);
> - if (r && r != -ENODATA)
> + if (r)
> goto error;
>
> if (!fence && job->gang_submit)
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-08 8:29 ` Christian König
@ 2024-03-08 22:31 ` Joshua Ashton
2024-03-11 6:46 ` Christian König
0 siblings, 1 reply; 11+ messages in thread
From: Joshua Ashton @ 2024-03-08 22:31 UTC (permalink / raw)
To: Christian König, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
It definitely takes much longer than 10-20ms in some instances.
Some of these instances can even be shown in Friedrich's hang test suite
-- specifically when there are a lot of page faults going on.
The work (or parts of the work) could also be pending and not in any
wave yet, just hanging out in the ring. There may be a better solution
to that, but I don't know it.
Raising it to .5s still makes sense to me.
- Joshie 🐸✨
On 3/8/24 08:29, Christian König wrote:
> On 07.03.24 20:04, Joshua Ashton wrote:
>> Results in much more reliable soft recovery on
>> Steam Deck.
>
> Waiting 500ms for a locked-up shader is way too long, I think. We could
> increase the 10ms to something like 20ms, but I really wouldn't go much
> over that.
>
> This just kills shaders that are in an endless loop; when that takes
> longer than 10-20ms, we really have a hardware problem that needs a full
> reset to resolve.
>
> Regards,
> Christian.
>
>>
>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>
>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: stable@vger.kernel.org
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> index 57c94901ed0a..be99db0e077e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> @@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring
>> *ring, unsigned int vmid,
>> spin_unlock_irqrestore(fence->lock, flags);
>> atomic_inc(&ring->adev->gpu_reset_counter);
>> - deadline = ktime_add_us(ktime_get(), 10000);
>> + deadline = ktime_add_ms(ktime_get(), 500);
>> while (!dma_fence_is_signaled(fence) &&
>> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>> ring->funcs->soft_recovery(ring, vmid);
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
@ 2024-03-09 16:27 ` Marek Olšák
2024-08-01 15:17 ` Friedrich Vock
1 sibling, 0 replies; 11+ messages in thread
From: Marek Olšák @ 2024-03-09 16:27 UTC (permalink / raw)
To: Christian König
Cc: Joshua Ashton, amd-gfx, Olsak, Marek, Friedrich Vock,
Bas Nieuwenhuizen, André Almeida, stable
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Marek
On Fri, Mar 8, 2024 at 3:43 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 07.03.24 20:04, Joshua Ashton wrote:
> > As we discussed before[1], soft recovery should be
> > forwarded to userspace, or we can get into a really
> > bad state where apps will keep submitting hanging
> > command buffers cascading us to a hard reset.
>
> Marek, you have been in favor of this forever. So I would like to
> request you to put your Reviewed-by on it, and I will just push it into
> our internal kernel branch.
>
> Regards,
> Christian.
>
> >
> > 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
> > Signed-off-by: Joshua Ashton <joshua@froggi.es>
> >
> > Cc: Friedrich Vock <friedrich.vock@gmx.de>
> > Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: André Almeida <andrealmeid@igalia.com>
> > Cc: stable@vger.kernel.org
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 4b3000c21ef2..aebf59855e9f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
> > struct dma_fence *fence = NULL;
> > int r;
> >
> > - /* Ignore soft recovered fences here */
> > r = drm_sched_entity_error(s_entity);
> > - if (r && r != -ENODATA)
> > + if (r)
> > goto error;
> >
> > if (!fence && job->gang_submit)
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s
2024-03-08 22:31 ` Joshua Ashton
@ 2024-03-11 6:46 ` Christian König
0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2024-03-11 6:46 UTC (permalink / raw)
To: Joshua Ashton, Christian König, amd-gfx
Cc: Friedrich Vock, Bas Nieuwenhuizen, André Almeida, stable
On 08.03.24 23:31, Joshua Ashton wrote:
> It definitely takes much longer than 10-20ms in some instances.
>
> Some of these instances can even be shown in Friedrich's hang test
> suite -- specifically when there are a lot of page faults going on.
Exactly; that's the part I want to avoid. The context-based recovery is
there to break out of shaders with endless loops.
When there are page faults going on, I would rather recommend a hard
reset of the GPU.
>
> The work (or parts of the work) could also be pending and not in any
> wave yet, just hanging out in the ring. There may be a better solution
> to that, but I don't know it.
Yeah, but killing anything like that should never take longer than what
the original submission was supposed to take.
In other words, when we assume that we should have at least 20fps, then
we should never go over 50ms. And even at this point we have already
waited much longer than that for the shader to complete.
If you really want to raise it this high, I would rather make it
configurable.
Regards,
Christian.
>
> Raising it to .5s still makes sense to me.
>
> - Joshie 🐸✨
>
> On 3/8/24 08:29, Christian König wrote:
>> On 07.03.24 20:04, Joshua Ashton wrote:
>>> Results in much more reliable soft recovery on
>>> Steam Deck.
>>
>> Waiting 500ms for a locked-up shader is way too long, I think. We could
>> increase the 10ms to something like 20ms, but I really wouldn't go
>> much over that.
>>
>> This just kills shaders that are in an endless loop; when that takes
>> longer than 10-20ms, we really have a hardware problem that needs a
>> full reset to resolve.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>>
>>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: André Almeida <andrealmeid@igalia.com>
>>> Cc: stable@vger.kernel.org
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> index 57c94901ed0a..be99db0e077e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> @@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct
>>> amdgpu_ring *ring, unsigned int vmid,
>>> spin_unlock_irqrestore(fence->lock, flags);
>>> atomic_inc(&ring->adev->gpu_reset_counter);
>>> - deadline = ktime_add_us(ktime_get(), 10000);
>>> + deadline = ktime_add_ms(ktime_get(), 500);
>>> while (!dma_fence_is_signaled(fence) &&
>>> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>> ring->funcs->soft_recovery(ring, vmid);
>>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
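Christian's 50ms figure above is simply the per-frame time budget at his assumed minimum frame rate: 1000ms / 20fps = 50ms. As a trivial check of that arithmetic:

```c
#include <assert.h>

/* Per-frame budget in milliseconds for a minimum acceptable frame rate. */
unsigned int frame_budget_ms(unsigned int fps)
{
    return 1000 / fps;
}
```

At 20fps that is 50ms per frame, and at 60fps only ~16ms, which is why he treats 500ms as an order of magnitude past the point where a full reset is the better answer.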
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2024-03-09 16:27 ` Marek Olšák
@ 2024-08-01 15:17 ` Friedrich Vock
2024-08-02 8:30 ` Christian König
1 sibling, 1 reply; 11+ messages in thread
From: Friedrich Vock @ 2024-08-01 15:17 UTC (permalink / raw)
To: Christian König, Joshua Ashton, amd-gfx, Olsak, Marek
Cc: Bas Nieuwenhuizen, André Almeida, stable
Hi,
I happened to come across an issue again just now where soft recovery
fails to get reported to userspace properly, causing apps to submit
hanging work in a loop (which ended up hanging the entire machine). It
seems like this patch never made it into amd-staging-drm-next. Given
that it has a Reviewed-by and everything, was this just an oversight, or
are there some blockers to pushing it that I missed?
If not, I'd be grateful if the patch could get merged.
Thanks,
Friedrich
On 08.03.24 09:33, Christian König wrote:
> On 07.03.24 20:04, Joshua Ashton wrote:
>> As we discussed before[1], soft recovery should be
>> forwarded to userspace, or we can get into a really
>> bad state where apps will keep submitting hanging
>> command buffers cascading us to a hard reset.
>
> Marek, you have been in favor of this forever. So I would like to
> request you to put your Reviewed-by on it, and I will just push it into
> our internal kernel branch.
>
> Regards,
> Christian.
>
>>
>> 1:
>> https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>
>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: stable@vger.kernel.org
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 4b3000c21ef2..aebf59855e9f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job
>> *sched_job,
>> struct dma_fence *fence = NULL;
>> int r;
>> - /* Ignore soft recovered fences here */
>> r = drm_sched_entity_error(s_entity);
>> - if (r && r != -ENODATA)
>> + if (r)
>> goto error;
>> if (!fence && job->gang_submit)
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace
2024-08-01 15:17 ` Friedrich Vock
@ 2024-08-02 8:30 ` Christian König
0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2024-08-02 8:30 UTC (permalink / raw)
To: Friedrich Vock, Christian König, Joshua Ashton, amd-gfx,
Olsak, Marek
Cc: Bas Nieuwenhuizen, André Almeida, stable
On 01.08.24 17:17, Friedrich Vock wrote:
> Hi,
>
> I happened to come across an issue again just now where soft recovery
> fails to get reported to userspace properly, causing apps to submit
> hanging work in a loop (which ended up hanging the entire machine). It
> seems like this patch never made it into amd-staging-drm-next. Given
> that it has a Reviewed-by and everything, was this just an oversight, or
> are there some blockers to pushing it that I missed?
>
> If not, I'd be grateful if the patch could get merged.
Sorry, that was my fault; I forgot about it because Alex usually
picks up stuff for amd-staging-drm-next.
Thanks for the reminder; just pushed it.
Regards,
Christian.
>
> Thanks,
> Friedrich
>
> On 08.03.24 09:33, Christian König wrote:
> On 07.03.24 20:04, Joshua Ashton wrote:
>>> As we discussed before[1], soft recovery should be
>>> forwarded to userspace, or we can get into a really
>>> bad state where apps will keep submitting hanging
>>> command buffers cascading us to a hard reset.
>>
>> Marek, you have been in favor of this forever. So I would like to
>> request you to put your Reviewed-by on it, and I will just push it into
>> our internal kernel branch.
>>
>> Regards,
>> Christian.
>>
>>>
>>> 1:
>>> https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
>>>
>>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>>
>>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>>> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: André Almeida <andrealmeid@igalia.com>
>>> Cc: stable@vger.kernel.org
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
>>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 4b3000c21ef2..aebf59855e9f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job
>>> *sched_job,
>>> struct dma_fence *fence = NULL;
>>> int r;
>>> - /* Ignore soft recovered fences here */
>>> r = drm_sched_entity_error(s_entity);
>>> - if (r && r != -ENODATA)
>>> + if (r)
>>> goto error;
>>> if (!fence && job->gang_submit)
>>
^ permalink raw reply [flat|nested] 11+ messages in thread
Thread overview: 11+ messages
2024-03-07 19:04 [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Joshua Ashton
2024-03-07 19:04 ` [PATCH 2/3] drm/amdgpu: Determine soft recovery deadline next to usage Joshua Ashton
2024-03-08 8:23 ` Christian König
2024-03-07 19:04 ` [PATCH 3/3] drm/amdgpu: Increase soft recovery timeout to .5s Joshua Ashton
2024-03-08 8:29 ` Christian König
2024-03-08 22:31 ` Joshua Ashton
2024-03-11 6:46 ` Christian König
2024-03-08 8:33 ` [PATCH 1/3] drm/amdgpu: Forward soft recovery errors to userspace Christian König
2024-03-09 16:27 ` Marek Olšák
2024-08-01 15:17 ` Friedrich Vock
2024-08-02 8:30 ` Christian König