* [PATCH 0/1] drm/amdgpu: Use device wedged event
@ 2024-12-12 19:09 André Almeida
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: André Almeida @ 2024-12-12 19:09 UTC (permalink / raw)
To: Raag Jadav, airlied, simona, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
christian.koenig
Cc: intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, andrealmeid, amd-gfx,
kernel-dev
Raag Jadav is introducing a new DRM API for generating "device wedged events",
to notify userspace when the device needs userspace intervention after a GPU
reset[1]. I did a simple patch to add support for it for amdgpu for the
telemetry aspect of the event. Tested in Steam Deck. This patch requires [1] to
be applied.
Raag, if you are sending a v11 of your work you can carry this patch as well if
you think it makes sense.
[1] https://lore.kernel.org/dri-devel/20241128153707.1294347-1-raag.jadav@intel.com/
André Almeida (1):
drm/amdgpu: Use device wedged event
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
1 file changed, 3 insertions(+)
--
2.47.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-12 19:09 [PATCH 0/1] drm/amdgpu: Use device wedged event André Almeida
@ 2024-12-12 19:09 ` André Almeida
2024-12-13 7:34 ` Christian König
` (3 more replies)
2024-12-13 14:10 ` [PATCH 0/1] " Lucas De Marchi
2024-12-13 14:25 ` ✗ Fi.CI.BUILD: failure for " Patchwork
2 siblings, 4 replies; 22+ messages in thread
From: André Almeida @ 2024-12-12 19:09 UTC (permalink / raw)
To: Raag Jadav, airlied, simona, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
christian.koenig
Cc: intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, andrealmeid, amd-gfx,
kernel-dev
Use DRM's device wedged event to notify userspace that a reset had
happened. For now, only use `none` method meant for telemetry
capture.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 96316111300a..19e1a5493778 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
atomic_set(&adev->reset_domain->reset_res, r);
+
+ drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+
return r;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
@ 2024-12-13 7:34 ` Christian König
2024-12-13 7:46 ` Sharma, Shashank
2024-12-13 14:15 ` André Almeida
2024-12-13 15:31 ` Lazar, Lijo
` (2 subsequent siblings)
3 siblings, 2 replies; 22+ messages in thread
From: Christian König @ 2024-12-13 7:34 UTC (permalink / raw)
To: André Almeida, Raag Jadav, airlied, simona, lucas.demarchi,
rodrigo.vivi, jani.nikula, andriy.shevchenko, lina,
michal.wajdeczko, Sharma, Shashank
Cc: intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev
Am 12.12.24 um 20:09 schrieb André Almeida:
> Use DRM's device wedged event to notify userspace that a reset had
> happened. For now, only use `none` method meant for telemetry
> capture.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 96316111300a..19e1a5493778 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>
> atomic_set(&adev->reset_domain->reset_res, r);
> +
> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
That looks really good in general. I would just make the
DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
@Shashank any objections? IIRC you have worked on the AMD specific event
we never upstreamed.
Regards,
Christian.
> +
> return r;
> }
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-13 7:34 ` Christian König
@ 2024-12-13 7:46 ` Sharma, Shashank
2024-12-13 14:15 ` André Almeida
1 sibling, 0 replies; 22+ messages in thread
From: Sharma, Shashank @ 2024-12-13 7:46 UTC (permalink / raw)
To: Koenig, Christian, André Almeida, Raag Jadav,
airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
andriy.shevchenko@linux.intel.com, lina@asahilina.net,
michal.wajdeczko@intel.com
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
himal.prasad.ghimiray@intel.com,
aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
Deucher, Alexander, amd-gfx@lists.freedesktop.org,
kernel-dev@igalia.com
[-- Attachment #1: Type: text/plain, Size: 2585 bytes --]
[AMD Official Use Only - AMD Internal Distribution Only]
Looks good to me as well, With Christian's comment accommodated:
Acked-by: Shashank Sharma <shashank.sharma@amd.com>
Regards
Shashank
________________________________
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Friday, December 13, 2024 8:34 AM
To: André Almeida <andrealmeid@igalia.com>; Raag Jadav <raag.jadav@intel.com>; airlied@gmail.com <airlied@gmail.com>; simona@ffwll.ch <simona@ffwll.ch>; lucas.demarchi@intel.com <lucas.demarchi@intel.com>; rodrigo.vivi@intel.com <rodrigo.vivi@intel.com>; jani.nikula@linux.intel.com <jani.nikula@linux.intel.com>; andriy.shevchenko@linux.intel.com <andriy.shevchenko@linux.intel.com>; lina@asahilina.net <lina@asahilina.net>; michal.wajdeczko@intel.com <michal.wajdeczko@intel.com>; Sharma, Shashank <Shashank.Sharma@amd.com>
Cc: intel-gfx@lists.freedesktop.org <intel-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; himal.prasad.ghimiray@intel.com <himal.prasad.ghimiray@intel.com>; aravind.iddamsetty@linux.intel.com <aravind.iddamsetty@linux.intel.com>; anshuman.gupta@intel.com <anshuman.gupta@intel.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; kernel-dev@igalia.com <kernel-dev@igalia.com>
Subject: Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
Am 12.12.24 um 20:09 schrieb André Almeida:
> Use DRM's device wedged event to notify userspace that a reset had
> happened. For now, only use `none` method meant for telemetry
> capture.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 96316111300a..19e1a5493778 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>
> atomic_set(&adev->reset_domain->reset_res, r);
> +
> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
That looks really good in general. I would just make the
DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
@Shashank any objections? IIRC you have worked on the AMD specific event
we never upstreamed.
Regards,
Christian.
> +
> return r;
> }
>
[-- Attachment #2: Type: text/html, Size: 4805 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/1] drm/amdgpu: Use device wedged event
2024-12-12 19:09 [PATCH 0/1] drm/amdgpu: Use device wedged event André Almeida
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
@ 2024-12-13 14:10 ` Lucas De Marchi
2024-12-13 14:25 ` ✗ Fi.CI.BUILD: failure for " Patchwork
2 siblings, 0 replies; 22+ messages in thread
From: Lucas De Marchi @ 2024-12-13 14:10 UTC (permalink / raw)
To: André Almeida
Cc: Raag Jadav, airlied, simona, rodrigo.vivi, jani.nikula,
andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev
On Thu, Dec 12, 2024 at 04:09:08PM -0300, André Almeida wrote:
>Raag Jadav is introducing a new DRM API for generating "device wedged events",
>to notify userspace when the device needs userspace intervention after a GPU
>reset[1]. I did a simple patch to add support for it for amdgpu for the
>telemetry aspect of the event. Tested in Steam Deck. This patch requires [1] to
>be applied.
>
>Raag, if you are sending a v11 of your work you can carry this patch as well if
>you think it makes sense.
great, more users
when this started the plan was to merge it through drm-xe since xe was
the only user. Now we already have 3, so I think it makes sense to go
through drm-misc... I looked at the possible conflicts in i915 and xe
and they are minimal.
thanks
Lucas De Marchi
>
>[1] https://lore.kernel.org/dri-devel/20241128153707.1294347-1-raag.jadav@intel.com/
>
>André Almeida (1):
> drm/amdgpu: Use device wedged event
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>--
>2.47.1
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-13 7:34 ` Christian König
2024-12-13 7:46 ` Sharma, Shashank
@ 2024-12-13 14:15 ` André Almeida
2024-12-13 14:36 ` Raag Jadav
1 sibling, 1 reply; 22+ messages in thread
From: André Almeida @ 2024-12-13 14:15 UTC (permalink / raw)
To: Christian König, Raag Jadav, airlied, simona, lucas.demarchi,
rodrigo.vivi, jani.nikula, andriy.shevchenko, lina,
michal.wajdeczko, Sharma, Shashank
Cc: intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev
Hi Christian,
Em 13/12/2024 04:34, Christian König escreveu:
> Am 12.12.24 um 20:09 schrieb André Almeida:
>> Use DRM's device wedged event to notify userspace that a reset had
>> happened. For now, only use `none` method meant for telemetry
>> capture.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/
>> drm/amd/amdgpu/amdgpu_device.c
>> index 96316111300a..19e1a5493778 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>> amdgpu_device *adev,
>> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>> atomic_set(&adev->reset_domain->reset_res, r);
>> +
>> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>
> That looks really good in general. I would just make the
> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>
Why depend or `r`? A reset was triggered anyway, regardless of the
success of it, shouldn't we tell userspace?
^ permalink raw reply [flat|nested] 22+ messages in thread
* ✗ Fi.CI.BUILD: failure for drm/amdgpu: Use device wedged event
2024-12-12 19:09 [PATCH 0/1] drm/amdgpu: Use device wedged event André Almeida
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
2024-12-13 14:10 ` [PATCH 0/1] " Lucas De Marchi
@ 2024-12-13 14:25 ` Patchwork
2 siblings, 0 replies; 22+ messages in thread
From: Patchwork @ 2024-12-13 14:25 UTC (permalink / raw)
To: André Almeida; +Cc: intel-gfx
== Series Details ==
Series: drm/amdgpu: Use device wedged event
URL : https://patchwork.freedesktop.org/series/142564/
State : failure
== Summary ==
Error: make failed
CALL scripts/checksyscalls.sh
DESCEND objtool
INSTALL libsubcmd_headers
CC [M] drivers/gpu/drm/amd/amdgpu/amdgpu_device.o
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function ‘amdgpu_device_gpu_recover’:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:9: error: implicit declaration of function ‘drm_dev_wedged_event’ [-Werror=implicit-function-declaration]
6061 | drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
| ^~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: error: ‘DRM_WEDGE_RECOVERY_NONE’ undeclared (first use in this function); did you mean ‘DRM_MODE_ENCODER_NONE’?
6061 | drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
| ^~~~~~~~~~~~~~~~~~~~~~~
| DRM_MODE_ENCODER_NONE
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors
make[6]: *** [scripts/Makefile.build:194: drivers/gpu/drm/amd/amdgpu/amdgpu_device.o] Error 1
make[5]: *** [scripts/Makefile.build:440: drivers/gpu/drm/amd/amdgpu] Error 2
make[4]: *** [scripts/Makefile.build:440: drivers/gpu/drm] Error 2
make[3]: *** [scripts/Makefile.build:440: drivers/gpu] Error 2
make[2]: *** [scripts/Makefile.build:440: drivers] Error 2
make[1]: *** [/home/kbuild/kernel/Makefile:1989: .] Error 2
make: *** [Makefile:251: __sub-make] Error 2
Build failed, no error log produced
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-13 14:15 ` André Almeida
@ 2024-12-13 14:36 ` Raag Jadav
2024-12-13 15:56 ` André Almeida
0 siblings, 1 reply; 22+ messages in thread
From: Raag Jadav @ 2024-12-13 14:36 UTC (permalink / raw)
To: André Almeida
Cc: Christian König, airlied, simona, lucas.demarchi,
rodrigo.vivi, jani.nikula, andriy.shevchenko, lina,
michal.wajdeczko, Sharma, Shashank, intel-gfx, dri-devel,
himal.prasad.ghimiray, aravind.iddamsetty, anshuman.gupta,
alexander.deucher, amd-gfx, kernel-dev
On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> Hi Christian,
>
> Em 13/12/2024 04:34, Christian König escreveu:
> > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > Use DRM's device wedged event to notify userspace that a reset had
> > > happened. For now, only use `none` method meant for telemetry
> > > capture.
> > >
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > index 96316111300a..19e1a5493778 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > amdgpu_device *adev,
> > > dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> > > atomic_set(&adev->reset_domain->reset_res, r);
> > > +
> > > + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> >
> > That looks really good in general. I would just make the
> > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> >
>
> Why depend or `r`? A reset was triggered anyway, regardless of the success
> of it, shouldn't we tell userspace?
A failed reset would perhaps result in wedging, atleast that's how i915
is handling it.
Raag
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
2024-12-13 7:34 ` Christian König
@ 2024-12-13 15:31 ` Lazar, Lijo
2024-12-20 13:31 ` kernel test robot
2024-12-20 14:06 ` kernel test robot
3 siblings, 0 replies; 22+ messages in thread
From: Lazar, Lijo @ 2024-12-13 15:31 UTC (permalink / raw)
To: amd-gfx
On 12/13/2024 12:39 AM, André Almeida wrote:
> Use DRM's device wedged event to notify userspace that a reset had
> happened. For now, only use `none` method meant for telemetry
> capture.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 96316111300a..19e1a5493778 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>
> atomic_set(&adev->reset_domain->reset_res, r);
> +
> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
The usage we have in amd with a different mechanism also provides with
the user a reason of the reset. Is that mechanism available in this? It
seems this only states device has taken a reset with no other context.
Thanks,
Lijo
> +
> return r;
> }
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-13 14:36 ` Raag Jadav
@ 2024-12-13 15:56 ` André Almeida
2024-12-16 10:18 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: André Almeida @ 2024-12-13 15:56 UTC (permalink / raw)
To: Raag Jadav, Christian König
Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
andriy.shevchenko, lina, michal.wajdeczko, Sharma, Shashank,
intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev
Em 13/12/2024 11:36, Raag Jadav escreveu:
> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>> Hi Christian,
>>
>> Em 13/12/2024 04:34, Christian König escreveu:
>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>> happened. For now, only use `none` method meant for telemetry
>>>> capture.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>> index 96316111300a..19e1a5493778 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>> amdgpu_device *adev,
>>>> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>> +
>>>> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>>>
>>> That looks really good in general. I would just make the
>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>
>>
>> Why depend or `r`? A reset was triggered anyway, regardless of the success
>> of it, shouldn't we tell userspace?
>
> A failed reset would perhaps result in wedging, atleast that's how i915
> is handling it.
>
Right, and I think this raises the question of what wedge recovery
method should I add for amdgpu... Christian?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-13 15:56 ` André Almeida
@ 2024-12-16 10:18 ` Christian König
2024-12-16 10:38 ` Lazar, Lijo
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2024-12-16 10:18 UTC (permalink / raw)
To: André Almeida, Raag Jadav
Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
andriy.shevchenko, lina, michal.wajdeczko, Sharma, Shashank,
intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev
Am 13.12.24 um 16:56 schrieb André Almeida:
> Em 13/12/2024 11:36, Raag Jadav escreveu:
>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>> Hi Christian,
>>>
>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>> happened. For now, only use `none` method meant for telemetry
>>>>> capture.
>>>>>
>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>> 1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>> index 96316111300a..19e1a5493778 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>> amdgpu_device *adev,
>>>>> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>> +
>>>>> + drm_dev_wedged_event(adev_to_drm(adev),
>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>
>>>> That looks really good in general. I would just make the
>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>
>>>
>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>> success
>>> of it, shouldn't we tell userspace?
>>
>> A failed reset would perhaps result in wedging, atleast that's how i915
>> is handling it.
>>
>
> Right, and I think this raises the question of what wedge recovery
> method should I add for amdgpu... Christian?
>
In theory a rebind should be enough to get the device going again, our
BOCO does a bus reset on driver load anyway.
Regards,
Christian.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 10:18 ` Christian König
@ 2024-12-16 10:38 ` Lazar, Lijo
2024-12-16 13:04 ` André Almeida
0 siblings, 1 reply; 22+ messages in thread
From: Lazar, Lijo @ 2024-12-16 10:38 UTC (permalink / raw)
To: Christian König, André Almeida, Raag Jadav
Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
andriy.shevchenko, lina, michal.wajdeczko, Sharma, Shashank,
intel-gfx, dri-devel, himal.prasad.ghimiray, aravind.iddamsetty,
anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev
On 12/16/2024 3:48 PM, Christian König wrote:
> Am 13.12.24 um 16:56 schrieb André Almeida:
>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>> Hi Christian,
>>>>
>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>> capture.
>>>>>>
>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>> 1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>> amdgpu_device *adev,
>>>>>> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>> +
>>>>>> + drm_dev_wedged_event(adev_to_drm(adev),
>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>
>>>>> That looks really good in general. I would just make the
>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>
>>>>
>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>> success
>>>> of it, shouldn't we tell userspace?
>>>
>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>> is handling it.
>>>
>>
>> Right, and I think this raises the question of what wedge recovery
>> method should I add for amdgpu... Christian?
>>
>
> In theory a rebind should be enough to get the device going again, our
> BOCO does a bus reset on driver load anyway.
>
The behavior varies between SOCs. In certain ones, if driver reset
fails, that means it's really in a bad state and it would need system
reboot.
I had asked earlier about the utility of this one here. If this is just
to inform userspace that driver has done a reset and recovered, it would
need some additional context also. We have a mechanism in KFD which
sends the context in which a reset has to be done. Currently, that's
restricted to compute applications, but if this is in a similar line, we
would like to pass some additional info like job timeout, RAS error etc.
Thanks,
Lijo
> Regards,
> Christian.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 10:38 ` Lazar, Lijo
@ 2024-12-16 13:04 ` André Almeida
2024-12-16 13:10 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: André Almeida @ 2024-12-16 13:04 UTC (permalink / raw)
To: Lazar, Lijo
Cc: airlied, simona, Raag Jadav, Christian König, lucas.demarchi,
rodrigo.vivi, jani.nikula, andriy.shevchenko, lina,
michal.wajdeczko, Sharma, Shashank, intel-gfx, dri-devel,
himal.prasad.ghimiray, aravind.iddamsetty, anshuman.gupta,
alexander.deucher, amd-gfx, kernel-dev
Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>
>
> On 12/16/2024 3:48 PM, Christian König wrote:
>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>> Hi Christian,
>>>>>
>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>> capture.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>> amdgpu_device *adev,
>>>>>>> dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>> +
>>>>>>> + drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>
>>>>>> That looks really good in general. I would just make the
>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>
>>>>>
>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>> success
>>>>> of it, shouldn't we tell userspace?
>>>>
>>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>>> is handling it.
>>>>
>>>
>>> Right, and I think this raises the question of what wedge recovery
>>> method should I add for amdgpu... Christian?
>>>
>>
>> In theory a rebind should be enough to get the device going again, our
>> BOCO does a bus reset on driver load anyway.
>>
>
> The behavior varies between SOCs. In certain ones, if driver reset
> fails, that means it's really in a bad state and it would need system
> reboot.
>
Is this documented somewhere? Then I could even add a
DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
> I had asked earlier about the utility of this one here. If this is just
> to inform userspace that driver has done a reset and recovered, it would
> need some additional context also. We have a mechanism in KFD which
> sends the context in which a reset has to be done. Currently, that's
> restricted to compute applications, but if this is in a similar line, we
> would like to pass some additional info like job timeout, RAS error etc.
>
DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
reset and recovered, but additional data about like which job timeout,
RAS error and such belong to devcoredump I guess, where all data is
gathered and collected later.
> Thanks,
> Lijo
>
>> Regards,
>> Christian.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 13:04 ` André Almeida
@ 2024-12-16 13:10 ` Christian König
2024-12-16 13:15 ` André Almeida
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2024-12-16 13:10 UTC (permalink / raw)
To: André Almeida, Lazar, Lijo
Cc: airlied, simona, Raag Jadav, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
Sharma, Shashank, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
kernel-dev
Am 16.12.24 um 14:04 schrieb André Almeida:
> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>
>>
>> On 12/16/2024 3:48 PM, Christian König wrote:
>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>> capture.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>> dev_info(adev->dev, "GPU reset end with ret =
>>>>>>>> %d\n", r);
>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>> +
>>>>>>>> + drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>
>>>>>>> That looks really good in general. I would just make the
>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>
>>>>>>
>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>> success
>>>>>> of it, shouldn't we tell userspace?
>>>>>
>>>>> A failed reset would perhaps result in wedging, atleast that's how
>>>>> i915
>>>>> is handling it.
>>>>>
>>>>
>>>> Right, and I think this raises the question of what wedge recovery
>>>> method should I add for amdgpu... Christian?
>>>>
>>>
>>> In theory a rebind should be enough to get the device going again, our
>>> BOCO does a bus reset on driver load anyway.
>>>
>>
>> The behavior varies between SOCs. In certain ones, if driver reset
>> fails, that means it's really in a bad state and it would need system
>> reboot.
>>
>
> Is this documented somewhere? Then I could even add a
> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
Not publicly as far as I know. But indeed a driver reset has basically
the same chance of succeeding than a driver reload.
I think the use case we have here is more that the administrator
intentionally disabled the reset to allow HW investigation.
So far we did that with a rather broken we don't do anything at all
approach.
>> I had asked earlier about the utility of this one here. If this is just
>> to inform userspace that driver has done a reset and recovered, it would
>> need some additional context also. We have a mechanism in KFD which
>> sends the context in which a reset has to be done. Currently, that's
>> restricted to compute applications, but if this is in a similar line, we
>> would like to pass some additional info like job timeout, RAS error etc.
>>
>
> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
> reset and recovered, but additional data about like which job timeout,
> RAS error and such belong to devcoredump I guess, where all data is
> gathered and collected later.
I think somebody else mentioned it as well that the source of the issue,
e.g. the PID of the submitting process would be helpful as well for
supervising daemons which need to restart processes when they caused
some issue.
We just postponed adding that till later.
Regards,
Christian.
>
>> Thanks,
>> Lijo
>>
>>> Regards,
>>> Christian.
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 13:10 ` Christian König
@ 2024-12-16 13:15 ` André Almeida
2024-12-16 13:36 ` Christian König
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: André Almeida @ 2024-12-16 13:15 UTC (permalink / raw)
To: Christian König, Lazar, Lijo
Cc: airlied, simona, Raag Jadav, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
Sharma, Shashank, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
kernel-dev
Em 16/12/2024 10:10, Christian König escreveu:
> Am 16.12.24 um 14:04 schrieb André Almeida:
>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>
>>>
>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>> capture.
>>>>>>>>>
>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>> dev_info(adev->dev, "GPU reset end with ret =
>>>>>>>>> %d\n", r);
>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>> +
>>>>>>>>> + drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>
>>>>>>>> That looks really good in general. I would just make the
>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>
>>>>>>>
>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>> success
>>>>>>> of it, shouldn't we tell userspace?
>>>>>>
>>>>>> A failed reset would perhaps result in wedging, atleast that's how
>>>>>> i915
>>>>>> is handling it.
>>>>>>
>>>>>
>>>>> Right, and I think this raises the question of what wedge recovery
>>>>> method should I add for amdgpu... Christian?
>>>>>
>>>>
>>>> In theory a rebind should be enough to get the device going again, our
>>>> BOCO does a bus reset on driver load anyway.
>>>>
>>>
>>> The behavior varies between SOCs. In certain ones, if driver reset
>>> fails, that means it's really in a bad state and it would need system
>>> reboot.
>>>
>>
>> Is this documented somewhere? Then I could even add a
>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>
> Not publicly as far as I know. But indeed a driver reset has basically
> the same chance of succeeding than a driver reload.
>
> I think the use case we have here is more that the administrator
> intentionally disabled the reset to allow HW investigation.
>
> So far we did that with a rather broken we don't do anything at all
> approach.
OK.
>
>>> I had asked earlier about the utility of this one here. If this is just
>>> to inform userspace that driver has done a reset and recovered, it would
>>> need some additional context also. We have a mechanism in KFD which
>>> sends the context in which a reset has to be done. Currently, that's
>>> restricted to compute applications, but if this is in a similar line, we
>>> would like to pass some additional info like job timeout, RAS error etc.
>>>
>>
>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>> reset and recovered, but additional data about like which job timeout,
>> RAS error and such belong to devcoredump I guess, where all data is
>> gathered and collected later.
>
> I think somebody else mentioned it as well that the source of the issue,
> e.g. the PID of the submitting process would be helpful as well for
> supervising daemons which need to restart processes when they caused
> some issue.
>
It was me :) we have a use case that we would need the PID for the
daemon indeed, but the daemon doesn't need to know what's the RAS error
or the job name that timeouted, there's no immediate action to be taken
with this information, contrary to the PID that we need to know.
> We just postponed adding that till later.
>
> Regards,
> Christian.
>
>>
>>> Thanks,
>>> Lijo
>>>
>>>> Regards,
>>>> Christian.
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 13:15 ` André Almeida
@ 2024-12-16 13:36 ` Christian König
2024-12-16 13:36 ` Lazar, Lijo
2024-12-16 13:57 ` Raag Jadav
2 siblings, 0 replies; 22+ messages in thread
From: Christian König @ 2024-12-16 13:36 UTC (permalink / raw)
To: André Almeida, Lazar, Lijo
Cc: airlied, simona, Raag Jadav, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
Sharma, Shashank, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
kernel-dev
Am 16.12.24 um 14:15 schrieb André Almeida:
> Em 16/12/2024 10:10, Christian König escreveu:
>> Am 16.12.24 um 14:04 schrieb André Almeida:
>>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>>
>>>>
>>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>>> Use DRM's device wedged event to notify userspace that a
>>>>>>>>>> reset had
>>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>>> capture.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>> dev_info(adev->dev, "GPU reset end with ret =
>>>>>>>>>> %d\n", r);
>>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>>> +
>>>>>>>>>> + drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>>
>>>>>>>>> That looks really good in general. I would just make the
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>>> success
>>>>>>>> of it, shouldn't we tell userspace?
>>>>>>>
>>>>>>> A failed reset would perhaps result in wedging, atleast that's
>>>>>>> how i915
>>>>>>> is handling it.
>>>>>>>
>>>>>>
>>>>>> Right, and I think this raises the question of what wedge recovery
>>>>>> method should I add for amdgpu... Christian?
>>>>>>
>>>>>
>>>>> In theory a rebind should be enough to get the device going again,
>>>>> our
>>>>> BOCO does a bus reset on driver load anyway.
>>>>>
>>>>
>>>> The behavior varies between SOCs. In certain ones, if driver reset
>>>> fails, that means it's really in a bad state and it would need system
>>>> reboot.
>>>>
>>>
>>> Is this documented somewhere? Then I could even add a
>>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>>
>> Not publicly as far as I know. But indeed a driver reset has
>> basically the same chance of succeeding than a driver reload.
>>
>> I think the use case we have here is more that the administrator
>> intentionally disabled the reset to allow HW investigation.
>>
>> So far we did that with a rather broken we don't do anything at all
>> approach.
>
> OK.
>
>>
>>>> I had asked earlier about the utility of this one here. If this is
>>>> just
>>>> to inform userspace that driver has done a reset and recovered, it
>>>> would
>>>> need some additional context also. We have a mechanism in KFD which
>>>> sends the context in which a reset has to be done. Currently, that's
>>>> restricted to compute applications, but if this is in a similar
>>>> line, we
>>>> would like to pass some additional info like job timeout, RAS error
>>>> etc.
>>>>
>>>
>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done
>>> a reset and recovered, but additional data about like which job
>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>> data is gathered and collected later.
>>
>> I think somebody else mentioned it as well that the source of the
>> issue, e.g. the PID of the submitting process would be helpful as
>> well for supervising daemons which need to restart processes when
>> they caused some issue.
>>
>
> It was me :) we have a use case that we would need the PID for the
> daemon indeed, but the daemon doesn't need to know what's the RAS
> error or the job name that timeouted, there's no immediate action to
> be taken with this information, contrary to the PID that we need to know.
Yeah, that's all stuff for the device core dump I think.
But if you want to add the PID for high level control then that would
make sense I think.
Regards,
Christian.
>
>> We just postponed adding that till later.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 13:15 ` André Almeida
2024-12-16 13:36 ` Christian König
@ 2024-12-16 13:36 ` Lazar, Lijo
2024-12-16 13:39 ` Christian König
2024-12-16 13:57 ` Raag Jadav
2 siblings, 1 reply; 22+ messages in thread
From: Lazar, Lijo @ 2024-12-16 13:36 UTC (permalink / raw)
To: André Almeida, Christian König
Cc: airlied, simona, Raag Jadav, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
Sharma, Shashank, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
kernel-dev
On 12/16/2024 6:45 PM, André Almeida wrote:
>
>
> Em 16/12/2024 10:10, Christian König escreveu:
>> Am 16.12.24 um 14:04 schrieb André Almeida:
>>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>>
>>>>
>>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset
>>>>>>>>>> had
>>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>>> capture.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>> dev_info(adev->dev, "GPU reset end with ret =
>>>>>>>>>> %d\n", r);
>>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>>> +
>>>>>>>>>> + drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>>
>>>>>>>>> That looks really good in general. I would just make the
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>>> success
>>>>>>>> of it, shouldn't we tell userspace?
>>>>>>>
>>>>>>> A failed reset would perhaps result in wedging, atleast that's
>>>>>>> how i915
>>>>>>> is handling it.
>>>>>>>
>>>>>>
>>>>>> Right, and I think this raises the question of what wedge recovery
>>>>>> method should I add for amdgpu... Christian?
>>>>>>
>>>>>
>>>>> In theory a rebind should be enough to get the device going again, our
>>>>> BOCO does a bus reset on driver load anyway.
>>>>>
>>>>
>>>> The behavior varies between SOCs. In certain ones, if driver reset
>>>> fails, that means it's really in a bad state and it would need system
>>>> reboot.
>>>>
>>>
>>> Is this documented somewhere? Then I could even add a
>>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>>
>> Not publicly as far as I know. But indeed a driver reset has basically
>> the same chance of succeeding than a driver reload.
>>
>> I think the use case we have here is more that the administrator
>> intentionally disabled the reset to allow HW investigation.
>>
>> So far we did that with a rather broken we don't do anything at all
>> approach.
>
> OK.
>
>>
>>>> I had asked earlier about the utility of this one here. If this is just
>>>> to inform userspace that driver has done a reset and recovered, it
>>>> would
>>>> need some additional context also. We have a mechanism in KFD which
>>>> sends the context in which a reset has to be done. Currently, that's
>>>> restricted to compute applications, but if this is in a similar
>>>> line, we
>>>> would like to pass some additional info like job timeout, RAS error
>>>> etc.
>>>>
>>>
>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>> reset and recovered, but additional data about like which job
>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>> data is gathered and collected later.
>>
>> I think somebody else mentioned it as well that the source of the
>> issue, e.g. the PID of the submitting process would be helpful as well
>> for supervising daemons which need to restart processes when they
>> caused some issue.
>>
>
> It was me :) we have a use case that we would need the PID for the
> daemon indeed, but the daemon doesn't need to know what's the RAS error
> or the job name that timeouted, there's no immediate action to be taken
> with this information, contrary to the PID that we need to know.
>
Regarding devcoredump - it's not done every time. For ex: RAS errors
have a different way to identify the source of error, hence we don't
need a coredump in such cases.
The intention is only to let the user know the reason for reset at a
high level, and probably add more things later like the engines or
queues that have reset etc.
Thanks,
Lijo
>> We just postponed adding that till later.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 13:36 ` Lazar, Lijo
@ 2024-12-16 13:39 ` Christian König
2024-12-16 13:44 ` Lazar, Lijo
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2024-12-16 13:39 UTC (permalink / raw)
To: Lazar, Lijo, André Almeida
Cc: airlied, simona, Raag Jadav, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
Sharma, Shashank, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
kernel-dev
[-- Attachment #1: Type: text/plain, Size: 2097 bytes --]
Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>> to inform userspace that driver has done a reset and recovered, it
>>>>> would
>>>>> need some additional context also. We have a mechanism in KFD which
>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>> restricted to compute applications, but if this is in a similar
>>>>> line, we
>>>>> would like to pass some additional info like job timeout, RAS error
>>>>> etc.
>>>>>
>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>>> reset and recovered, but additional data about like which job
>>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>>> data is gathered and collected later.
>>> I think somebody else mentioned it as well that the source of the
>>> issue, e.g. the PID of the submitting process would be helpful as well
>>> for supervising daemons which need to restart processes when they
>>> caused some issue.
>>>
>> It was me :) we have a use case that we would need the PID for the
>> daemon indeed, but the daemon doesn't need to know what's the RAS error
>> or the job name that timeouted, there's no immediate action to be taken
>> with this information, contrary to the PID that we need to know.
>>
> Regarding devcoredump - it's not done every time. For ex: RAS errors
> have a different way to identify the source of error, hence we don't
> need a coredump in such cases.
>
> The intention is only to let the user know the reason for reset at a
> high level, and probably add more things later like the engines or
> queues that have reset etc.
Well what is the use case for that? That doesn't looks valuable to me.
RAS errors should generally be reported to the application who issued
the submission.
As a system wide event they are only useful in things like logfiles I think.
Regards,
Christian.
>
> Thanks,
> Lijo
>
>>> We just postponed adding that till later.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> Regards,
>>>>>> Christian.
[-- Attachment #2: Type: text/html, Size: 3782 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 13:39 ` Christian König
@ 2024-12-16 13:44 ` Lazar, Lijo
0 siblings, 0 replies; 22+ messages in thread
From: Lazar, Lijo @ 2024-12-16 13:44 UTC (permalink / raw)
To: Christian König, André Almeida
Cc: airlied, simona, Raag Jadav, lucas.demarchi, rodrigo.vivi,
jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
Sharma, Shashank, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
kernel-dev
On 12/16/2024 7:09 PM, Christian König wrote:
> Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>>> to inform userspace that driver has done a reset and recovered, it
>>>>>> would
>>>>>> need some additional context also. We have a mechanism in KFD which
>>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>>> restricted to compute applications, but if this is in a similar
>>>>>> line, we
>>>>>> would like to pass some additional info like job timeout, RAS error
>>>>>> etc.
>>>>>>
>>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>>>> reset and recovered, but additional data about like which job
>>>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>>>> data is gathered and collected later.
>>>> I think somebody else mentioned it as well that the source of the
>>>> issue, e.g. the PID of the submitting process would be helpful as well
>>>> for supervising daemons which need to restart processes when they
>>>> caused some issue.
>>>>
>>> It was me :) we have a use case that we would need the PID for the
>>> daemon indeed, but the daemon doesn't need to know what's the RAS error
>>> or the job name that timeouted, there's no immediate action to be taken
>>> with this information, contrary to the PID that we need to know.
>>>
>> Regarding devcoredump - it's not done every time. For ex: RAS errors
>> have a different way to identify the source of error, hence we don't
>> need a coredump in such cases.
>>
>> The intention is only to let the user know the reason for reset at a
>> high level, and probably add more things later like the engines or
>> queues that have reset etc.
>
> Well what is the use case for that? That doesn't looks valuable to me.
It's mostly for in-band telemetry reporting through tools like amd-smi -
more for admin purpose rather than any debug.
Thanks,
Lijo
>
> RAS errors should generally be reported to the application who issued
> the submission.
>
> As a system wide event they are only useful in things like logfiles I think.
>
> Regards,
> Christian.
>
>> Thanks,
>> Lijo
>>
>>>> We just postponed adding that till later.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-16 13:15 ` André Almeida
2024-12-16 13:36 ` Christian König
2024-12-16 13:36 ` Lazar, Lijo
@ 2024-12-16 13:57 ` Raag Jadav
2 siblings, 0 replies; 22+ messages in thread
From: Raag Jadav @ 2024-12-16 13:57 UTC (permalink / raw)
To: André Almeida
Cc: Christian König, Lazar, Lijo, airlied, simona,
lucas.demarchi, rodrigo.vivi, jani.nikula, andriy.shevchenko,
lina, michal.wajdeczko, Sharma, Shashank, intel-gfx, dri-devel,
himal.prasad.ghimiray, aravind.iddamsetty, anshuman.gupta,
alexander.deucher, amd-gfx, kernel-dev
On Mon, Dec 16, 2024 at 10:15:00AM -0300, André Almeida wrote:
> Em 16/12/2024 10:10, Christian König escreveu:
> > Am 16.12.24 um 14:04 schrieb André Almeida:
> > > Em 16/12/2024 07:38, Lazar, Lijo escreveu:
> > > >
> > > >
> > > > On 12/16/2024 3:48 PM, Christian König wrote:
> > > > > Am 13.12.24 um 16:56 schrieb André Almeida:
> > > > > > Em 13/12/2024 11:36, Raag Jadav escreveu:
> > > > > > > On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> > > > > > > > Hi Christian,
> > > > > > > >
> > > > > > > > Em 13/12/2024 04:34, Christian König escreveu:
> > > > > > > > > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > > > > > > > > Use DRM's device wedged event to notify userspace that a reset had
> > > > > > > > > > happened. For now, only use `none` method meant for telemetry
> > > > > > > > > > capture.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > > > > > > > > ---
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > > > > > > > > > 1 file changed, 3 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > index 96316111300a..19e1a5493778 100644
> > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > > > > > > > > amdgpu_device *adev,
> > > > > > > > > > dev_info(adev->dev, "GPU
> > > > > > > > > > reset end with ret = %d\n", r);
> > > > > > > > > > atomic_set(&adev->reset_domain->reset_res, r);
> > > > > > > > > > +
> > > > > > > > > > + drm_dev_wedged_event(adev_to_drm(adev),
> > > > > > > > > > DRM_WEDGE_RECOVERY_NONE);
> > > > > > > > >
> > > > > > > > > That looks really good in general. I would just make the
> > > > > > > > > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > > > > > > > >
> > > > > > > >
> > > > > > > > Why depend or `r`? A reset was triggered anyway, regardless of the
> > > > > > > > success
> > > > > > > > of it, shouldn't we tell userspace?
> > > > > > >
> > > > > > > A failed reset would perhaps result in wedging,
> > > > > > > atleast that's how i915
> > > > > > > is handling it.
> > > > > > >
> > > > > >
> > > > > > Right, and I think this raises the question of what wedge recovery
> > > > > > method should I add for amdgpu... Christian?
> > > > > >
> > > > >
> > > > > In theory a rebind should be enough to get the device going again, our
> > > > > BOCO does a bus reset on driver load anyway.
> > > > >
> > > >
> > > > The behavior varies between SOCs. In certain ones, if driver reset
> > > > fails, that means it's really in a bad state and it would need system
> > > > reboot.
> > > >
> > >
> > > Is this documented somewhere? Then I could even add a
> > > DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
This was present in drafts v5 through v7 but later dropped with the
understanding that it is unwise to let a drm device make system level
decisions and rather have something like WEDGED=unknown to let admin/user
decide how to deal with it.
https://patchwork.freedesktop.org/series/138069/
> > Not publicly as far as I know. But indeed a driver reset has basically
> > the same chance of succeeding than a driver reload.
> >
> > I think the use case we have here is more that the administrator
> > intentionally disabled the reset to allow HW investigation.
> >
> > So far we did that with a rather broken we don't do anything at all
> > approach.
>
> OK.
>
> >
> > > > I had asked earlier about the utility of this one here. If this is just
> > > > to inform userspace that driver has done a reset and recovered, it would
> > > > need some additional context also. We have a mechanism in KFD which
> > > > sends the context in which a reset has to be done. Currently, that's
> > > > restricted to compute applications, but if this is in a similar line, we
> > > > would like to pass some additional info like job timeout, RAS error etc.
> > > >
> > >
> > > DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done
> > > a reset and recovered, but additional data about like which job
> > > timeout, RAS error and such belong to devcoredump I guess, where all
> > > data is gathered and collected later.
> >
> > I think somebody else mentioned it as well that the source of the issue,
> > e.g. the PID of the submitting process would be helpful as well for
> > supervising daemons which need to restart processes when they caused
> > some issue.
> >
>
> It was me :) we have a use case that we would need the PID for the daemon
> indeed, but the daemon doesn't need to know what's the RAS error or the job
> name that timeouted, there's no immediate action to be taken with this
> information, contrary to the PID that we need to know.
I think this calls for the standardization of telemetry (devcoredump, syslog
etc) but since each driver has its own way of doing it, it'd be quite an uphill
battle.
Raag
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
2024-12-13 7:34 ` Christian König
2024-12-13 15:31 ` Lazar, Lijo
@ 2024-12-20 13:31 ` kernel test robot
2024-12-20 14:06 ` kernel test robot
3 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2024-12-20 13:31 UTC (permalink / raw)
To: André Almeida, Raag Jadav, airlied, simona, lucas.demarchi,
rodrigo.vivi, jani.nikula, andriy.shevchenko, lina,
michal.wajdeczko, christian.koenig
Cc: llvm, oe-kbuild-all, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher,
andrealmeid, amd-gfx, kernel-dev
Hi André,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.13-rc3 next-20241220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Use-device-wedged-event/20241213-031134
base: linus/master
patch link: https://lore.kernel.org/r/20241212190909.28559-2-andrealmeid%40igalia.com
patch subject: [PATCH 1/1] drm/amdgpu: Use device wedged event
config: arm-randconfig-001-20241220 (https://download.01.org/0day-ci/archive/20241220/202412202104.IwpOz5t5-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412202104.IwpOz5t5-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412202104.IwpOz5t5-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:35:
In file included from include/linux/iommu.h:10:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:2223:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:2: error: call to undeclared function 'drm_dev_wedged_event'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
6061 | drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
| ^
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:42: error: use of undeclared identifier 'DRM_WEDGE_RECOVERY_NONE'
6061 | drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
| ^
1 warning and 2 errors generated.
vim +/drm_dev_wedged_event +6061 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
5794
5795 /**
5796 * amdgpu_device_gpu_recover - reset the asic and recover scheduler
5797 *
5798 * @adev: amdgpu_device pointer
5799 * @job: which job trigger hang
5800 * @reset_context: amdgpu reset context pointer
5801 *
5802 * Attempt to reset the GPU if it has hung (all asics).
5803 * Attempt to do soft-reset or full-reset and reinitialize Asic
5804 * Returns 0 for success or an error on failure.
5805 */
5806
5807 int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
5808 struct amdgpu_job *job,
5809 struct amdgpu_reset_context *reset_context)
5810 {
5811 struct list_head device_list, *device_list_handle = NULL;
5812 bool job_signaled = false;
5813 struct amdgpu_hive_info *hive = NULL;
5814 struct amdgpu_device *tmp_adev = NULL;
5815 int i, r = 0;
5816 bool need_emergency_restart = false;
5817 bool audio_suspended = false;
5818 int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
5819
5820 /*
5821 * Special case: RAS triggered and full reset isn't supported
5822 */
5823 need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
5824
5825 /*
5826 * Flush RAM to disk so that after reboot
5827 * the user can read log and see why the system rebooted.
5828 */
5829 if (need_emergency_restart && amdgpu_ras_get_context(adev) &&
5830 amdgpu_ras_get_context(adev)->reboot) {
5831 DRM_WARN("Emergency reboot.");
5832
5833 ksys_sync_helper();
5834 emergency_restart();
5835 }
5836
5837 dev_info(adev->dev, "GPU %s begin!\n",
5838 need_emergency_restart ? "jobs stop":"reset");
5839
5840 if (!amdgpu_sriov_vf(adev))
5841 hive = amdgpu_get_xgmi_hive(adev);
5842 if (hive)
5843 mutex_lock(&hive->hive_lock);
5844
5845 reset_context->job = job;
5846 reset_context->hive = hive;
5847 /*
5848 * Build list of devices to reset.
5849 * In case we are in XGMI hive mode, resort the device list
5850 * to put adev in the 1st position.
5851 */
5852 INIT_LIST_HEAD(&device_list);
5853 if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
5854 list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
5855 list_add_tail(&tmp_adev->reset_list, &device_list);
5856 if (adev->shutdown)
5857 tmp_adev->shutdown = true;
5858 }
5859 if (!list_is_first(&adev->reset_list, &device_list))
5860 list_rotate_to_front(&adev->reset_list, &device_list);
5861 device_list_handle = &device_list;
5862 } else {
5863 list_add_tail(&adev->reset_list, &device_list);
5864 device_list_handle = &device_list;
5865 }
5866
5867 if (!amdgpu_sriov_vf(adev)) {
5868 r = amdgpu_device_health_check(device_list_handle);
5869 if (r)
5870 goto end_reset;
5871 }
5872
5873 /* We need to lock reset domain only once both for XGMI and single device */
5874 tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
5875 reset_list);
5876 amdgpu_device_lock_reset_domain(tmp_adev->reset_domain);
5877
5878 /* block all schedulers and reset given job's ring */
5879 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5880
5881 amdgpu_device_set_mp1_state(tmp_adev);
5882
5883 /*
5884 * Try to put the audio codec into suspend state
5885 * before gpu reset started.
5886 *
5887 * Due to the power domain of the graphics device
5888 * is shared with AZ power domain. Without this,
5889 * we may change the audio hardware from behind
5890 * the audio driver's back. That will trigger
5891 * some audio codec errors.
5892 */
5893 if (!amdgpu_device_suspend_display_audio(tmp_adev))
5894 audio_suspended = true;
5895
5896 amdgpu_ras_set_error_query_ready(tmp_adev, false);
5897
5898 cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
5899
5900 amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
5901
5902 /*
5903 * Mark these ASICs to be reseted as untracked first
5904 * And add them back after reset completed
5905 */
5906 amdgpu_unregister_gpu_instance(tmp_adev);
5907
5908 drm_client_dev_suspend(adev_to_drm(tmp_adev), false);
5909
5910 /* disable ras on ALL IPs */
5911 if (!need_emergency_restart &&
5912 amdgpu_device_ip_need_full_reset(tmp_adev))
5913 amdgpu_ras_suspend(tmp_adev);
5914
5915 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
5916 struct amdgpu_ring *ring = tmp_adev->rings[i];
5917
5918 if (!amdgpu_ring_sched_ready(ring))
5919 continue;
5920
5921 drm_sched_stop(&ring->sched, job ? &job->base : NULL);
5922
5923 if (need_emergency_restart)
5924 amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
5925 }
5926 atomic_inc(&tmp_adev->gpu_reset_counter);
5927 }
5928
5929 if (need_emergency_restart)
5930 goto skip_sched_resume;
5931
5932 /*
5933 * Must check guilty signal here since after this point all old
5934 * HW fences are force signaled.
5935 *
5936 * job->base holds a reference to parent fence
5937 */
5938 if (job && dma_fence_is_signaled(&job->hw_fence)) {
5939 job_signaled = true;
5940 dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
5941 goto skip_hw_reset;
5942 }
5943
5944 retry: /* Rest of adevs pre asic reset from XGMI hive. */
5945 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5946 r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
5947 /*TODO Should we stop ?*/
5948 if (r) {
5949 dev_err(tmp_adev->dev, "GPU pre asic reset failed with err, %d for drm dev, %s ",
5950 r, adev_to_drm(tmp_adev)->unique);
5951 tmp_adev->asic_reset_res = r;
5952 }
5953 }
5954
5955 /* Actual ASIC resets if needed.*/
5956 /* Host driver will handle XGMI hive reset for SRIOV */
5957 if (amdgpu_sriov_vf(adev)) {
5958 if (amdgpu_ras_get_fed_status(adev) || amdgpu_virt_rcvd_ras_interrupt(adev)) {
5959 dev_dbg(adev->dev, "Detected RAS error, wait for FLR completion\n");
5960 amdgpu_ras_set_fed(adev, true);
5961 set_bit(AMDGPU_HOST_FLR, &reset_context->flags);
5962 }
5963
5964 r = amdgpu_device_reset_sriov(adev, reset_context);
5965 if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
5966 amdgpu_virt_release_full_gpu(adev, true);
5967 goto retry;
5968 }
5969 if (r)
5970 adev->asic_reset_res = r;
5971 } else {
5972 r = amdgpu_do_asic_reset(device_list_handle, reset_context);
5973 if (r && r == -EAGAIN)
5974 goto retry;
5975 }
5976
5977 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5978 /*
5979 * Drop any pending non scheduler resets queued before reset is done.
5980 * Any reset scheduled after this point would be valid. Scheduler resets
5981 * were already dropped during drm_sched_stop and no new ones can come
5982 * in before drm_sched_start.
5983 */
5984 amdgpu_device_stop_pending_resets(tmp_adev);
5985 }
5986
5987 skip_hw_reset:
5988
5989 /* Post ASIC reset for all devs .*/
5990 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5991
5992 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
5993 struct amdgpu_ring *ring = tmp_adev->rings[i];
5994
5995 if (!amdgpu_ring_sched_ready(ring))
5996 continue;
5997
5998 drm_sched_start(&ring->sched, 0);
5999 }
6000
6001 if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
6002 drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
6003
6004 if (tmp_adev->asic_reset_res)
6005 r = tmp_adev->asic_reset_res;
6006
6007 tmp_adev->asic_reset_res = 0;
6008
6009 if (r) {
6010 /* bad news, how to tell it to userspace ?
6011 * for ras error, we should report GPU bad status instead of
6012 * reset failure
6013 */
6014 if (reset_context->src != AMDGPU_RESET_SRC_RAS ||
6015 !amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
6016 dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
6017 atomic_read(&tmp_adev->gpu_reset_counter));
6018 amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
6019 } else {
6020 dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
6021 if (amdgpu_acpi_smart_shift_update(adev_to_drm(tmp_adev), AMDGPU_SS_DEV_D0))
6022 DRM_WARN("smart shift update failed\n");
6023 }
6024 }
6025
6026 skip_sched_resume:
6027 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
6028 /* unlock kfd: SRIOV would do it separately */
6029 if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
6030 amdgpu_amdkfd_post_reset(tmp_adev);
6031
6032 /* kfd_post_reset will do nothing if kfd device is not initialized,
6033 * need to bring up kfd here if it's not be initialized before
6034 */
6035 if (!adev->kfd.init_complete)
6036 amdgpu_amdkfd_device_init(adev);
6037
6038 if (audio_suspended)
6039 amdgpu_device_resume_display_audio(tmp_adev);
6040
6041 amdgpu_device_unset_mp1_state(tmp_adev);
6042
6043 amdgpu_ras_set_error_query_ready(tmp_adev, true);
6044 }
6045
6046 tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
6047 reset_list);
6048 amdgpu_device_unlock_reset_domain(tmp_adev->reset_domain);
6049
6050 end_reset:
6051 if (hive) {
6052 mutex_unlock(&hive->hive_lock);
6053 amdgpu_put_xgmi_hive(hive);
6054 }
6055
6056 if (r)
6057 dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
6058
6059 atomic_set(&adev->reset_domain->reset_res, r);
6060
> 6061 drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
6062
6063 return r;
6064 }
6065
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
` (2 preceding siblings ...)
2024-12-20 13:31 ` kernel test robot
@ 2024-12-20 14:06 ` kernel test robot
3 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2024-12-20 14:06 UTC (permalink / raw)
To: André Almeida, Raag Jadav, airlied, simona, lucas.demarchi,
rodrigo.vivi, jani.nikula, andriy.shevchenko, lina,
michal.wajdeczko, christian.koenig
Cc: oe-kbuild-all, intel-gfx, dri-devel, himal.prasad.ghimiray,
aravind.iddamsetty, anshuman.gupta, alexander.deucher,
andrealmeid, amd-gfx, kernel-dev
Hi André,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip v6.13-rc3 next-20241220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Use-device-wedged-event/20241213-031134
base: linus/master
patch link: https://lore.kernel.org/r/20241212190909.28559-2-andrealmeid%40igalia.com
patch subject: [PATCH 1/1] drm/amdgpu: Use device wedged event
config: s390-randconfig-001-20241220 (https://download.01.org/0day-ci/archive/20241220/202412202144.SalviKDh-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412202144.SalviKDh-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412202144.SalviKDh-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 'amdgpu_device_gpu_recover':
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:9: error: implicit declaration of function 'drm_dev_wedged_event' [-Wimplicit-function-declaration]
6061 | drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
| ^~~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: error: 'DRM_WEDGE_RECOVERY_NONE' undeclared (first use in this function); did you mean 'DRM_MODE_ENCODER_NONE'?
6061 | drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
| ^~~~~~~~~~~~~~~~~~~~~~~
| DRM_MODE_ENCODER_NONE
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: note: each undeclared identifier is reported only once for each function it appears in
vim +/drm_dev_wedged_event +6061 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
5794
5795 /**
5796 * amdgpu_device_gpu_recover - reset the asic and recover scheduler
5797 *
5798 * @adev: amdgpu_device pointer
5799 * @job: which job trigger hang
5800 * @reset_context: amdgpu reset context pointer
5801 *
5802 * Attempt to reset the GPU if it has hung (all asics).
5803 * Attempt to do soft-reset or full-reset and reinitialize Asic
5804 * Returns 0 for success or an error on failure.
5805 */
5806
5807 int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
5808 struct amdgpu_job *job,
5809 struct amdgpu_reset_context *reset_context)
5810 {
5811 struct list_head device_list, *device_list_handle = NULL;
5812 bool job_signaled = false;
5813 struct amdgpu_hive_info *hive = NULL;
5814 struct amdgpu_device *tmp_adev = NULL;
5815 int i, r = 0;
5816 bool need_emergency_restart = false;
5817 bool audio_suspended = false;
5818 int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
5819
5820 /*
5821 * Special case: RAS triggered and full reset isn't supported
5822 */
5823 need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
5824
5825 /*
5826 * Flush RAM to disk so that after reboot
5827 * the user can read log and see why the system rebooted.
5828 */
5829 if (need_emergency_restart && amdgpu_ras_get_context(adev) &&
5830 amdgpu_ras_get_context(adev)->reboot) {
5831 DRM_WARN("Emergency reboot.");
5832
5833 ksys_sync_helper();
5834 emergency_restart();
5835 }
5836
5837 dev_info(adev->dev, "GPU %s begin!\n",
5838 need_emergency_restart ? "jobs stop":"reset");
5839
5840 if (!amdgpu_sriov_vf(adev))
5841 hive = amdgpu_get_xgmi_hive(adev);
5842 if (hive)
5843 mutex_lock(&hive->hive_lock);
5844
5845 reset_context->job = job;
5846 reset_context->hive = hive;
5847 /*
5848 * Build list of devices to reset.
5849 * In case we are in XGMI hive mode, resort the device list
5850 * to put adev in the 1st position.
5851 */
5852 INIT_LIST_HEAD(&device_list);
5853 if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
5854 list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
5855 list_add_tail(&tmp_adev->reset_list, &device_list);
5856 if (adev->shutdown)
5857 tmp_adev->shutdown = true;
5858 }
5859 if (!list_is_first(&adev->reset_list, &device_list))
5860 list_rotate_to_front(&adev->reset_list, &device_list);
5861 device_list_handle = &device_list;
5862 } else {
5863 list_add_tail(&adev->reset_list, &device_list);
5864 device_list_handle = &device_list;
5865 }
5866
5867 if (!amdgpu_sriov_vf(adev)) {
5868 r = amdgpu_device_health_check(device_list_handle);
5869 if (r)
5870 goto end_reset;
5871 }
5872
5873 /* We need to lock reset domain only once both for XGMI and single device */
5874 tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
5875 reset_list);
5876 amdgpu_device_lock_reset_domain(tmp_adev->reset_domain);
5877
5878 /* block all schedulers and reset given job's ring */
5879 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5880
5881 amdgpu_device_set_mp1_state(tmp_adev);
5882
5883 /*
5884 * Try to put the audio codec into suspend state
5885 * before gpu reset started.
5886 *
5887 * Due to the power domain of the graphics device
5888 * is shared with AZ power domain. Without this,
5889 * we may change the audio hardware from behind
5890 * the audio driver's back. That will trigger
5891 * some audio codec errors.
5892 */
5893 if (!amdgpu_device_suspend_display_audio(tmp_adev))
5894 audio_suspended = true;
5895
5896 amdgpu_ras_set_error_query_ready(tmp_adev, false);
5897
5898 cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
5899
5900 amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
5901
5902 /*
5903 * Mark these ASICs to be reseted as untracked first
5904 * And add them back after reset completed
5905 */
5906 amdgpu_unregister_gpu_instance(tmp_adev);
5907
5908 drm_client_dev_suspend(adev_to_drm(tmp_adev), false);
5909
5910 /* disable ras on ALL IPs */
5911 if (!need_emergency_restart &&
5912 amdgpu_device_ip_need_full_reset(tmp_adev))
5913 amdgpu_ras_suspend(tmp_adev);
5914
5915 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
5916 struct amdgpu_ring *ring = tmp_adev->rings[i];
5917
5918 if (!amdgpu_ring_sched_ready(ring))
5919 continue;
5920
5921 drm_sched_stop(&ring->sched, job ? &job->base : NULL);
5922
5923 if (need_emergency_restart)
5924 amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
5925 }
5926 atomic_inc(&tmp_adev->gpu_reset_counter);
5927 }
5928
5929 if (need_emergency_restart)
5930 goto skip_sched_resume;
5931
5932 /*
5933 * Must check guilty signal here since after this point all old
5934 * HW fences are force signaled.
5935 *
5936 * job->base holds a reference to parent fence
5937 */
5938 if (job && dma_fence_is_signaled(&job->hw_fence)) {
5939 job_signaled = true;
5940 dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
5941 goto skip_hw_reset;
5942 }
5943
5944 retry: /* Rest of adevs pre asic reset from XGMI hive. */
5945 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5946 r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
5947 /*TODO Should we stop ?*/
5948 if (r) {
5949 dev_err(tmp_adev->dev, "GPU pre asic reset failed with err, %d for drm dev, %s ",
5950 r, adev_to_drm(tmp_adev)->unique);
5951 tmp_adev->asic_reset_res = r;
5952 }
5953 }
5954
5955 /* Actual ASIC resets if needed.*/
5956 /* Host driver will handle XGMI hive reset for SRIOV */
5957 if (amdgpu_sriov_vf(adev)) {
5958 if (amdgpu_ras_get_fed_status(adev) || amdgpu_virt_rcvd_ras_interrupt(adev)) {
5959 dev_dbg(adev->dev, "Detected RAS error, wait for FLR completion\n");
5960 amdgpu_ras_set_fed(adev, true);
5961 set_bit(AMDGPU_HOST_FLR, &reset_context->flags);
5962 }
5963
5964 r = amdgpu_device_reset_sriov(adev, reset_context);
5965 if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
5966 amdgpu_virt_release_full_gpu(adev, true);
5967 goto retry;
5968 }
5969 if (r)
5970 adev->asic_reset_res = r;
5971 } else {
5972 r = amdgpu_do_asic_reset(device_list_handle, reset_context);
5973 if (r && r == -EAGAIN)
5974 goto retry;
5975 }
5976
5977 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5978 /*
5979 * Drop any pending non scheduler resets queued before reset is done.
5980 * Any reset scheduled after this point would be valid. Scheduler resets
5981 * were already dropped during drm_sched_stop and no new ones can come
5982 * in before drm_sched_start.
5983 */
5984 amdgpu_device_stop_pending_resets(tmp_adev);
5985 }
5986
5987 skip_hw_reset:
5988
5989 /* Post ASIC reset for all devs .*/
5990 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
5991
5992 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
5993 struct amdgpu_ring *ring = tmp_adev->rings[i];
5994
5995 if (!amdgpu_ring_sched_ready(ring))
5996 continue;
5997
5998 drm_sched_start(&ring->sched, 0);
5999 }
6000
6001 if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
6002 drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
6003
6004 if (tmp_adev->asic_reset_res)
6005 r = tmp_adev->asic_reset_res;
6006
6007 tmp_adev->asic_reset_res = 0;
6008
6009 if (r) {
6010 /* bad news, how to tell it to userspace ?
6011 * for ras error, we should report GPU bad status instead of
6012 * reset failure
6013 */
6014 if (reset_context->src != AMDGPU_RESET_SRC_RAS ||
6015 !amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
6016 dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
6017 atomic_read(&tmp_adev->gpu_reset_counter));
6018 amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
6019 } else {
6020 dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
6021 if (amdgpu_acpi_smart_shift_update(adev_to_drm(tmp_adev), AMDGPU_SS_DEV_D0))
6022 DRM_WARN("smart shift update failed\n");
6023 }
6024 }
6025
6026 skip_sched_resume:
6027 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
6028 /* unlock kfd: SRIOV would do it separately */
6029 if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
6030 amdgpu_amdkfd_post_reset(tmp_adev);
6031
6032 /* kfd_post_reset will do nothing if kfd device is not initialized,
6033 * need to bring up kfd here if it's not be initialized before
6034 */
6035 if (!adev->kfd.init_complete)
6036 amdgpu_amdkfd_device_init(adev);
6037
6038 if (audio_suspended)
6039 amdgpu_device_resume_display_audio(tmp_adev);
6040
6041 amdgpu_device_unset_mp1_state(tmp_adev);
6042
6043 amdgpu_ras_set_error_query_ready(tmp_adev, true);
6044 }
6045
6046 tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
6047 reset_list);
6048 amdgpu_device_unlock_reset_domain(tmp_adev->reset_domain);
6049
6050 end_reset:
6051 if (hive) {
6052 mutex_unlock(&hive->hive_lock);
6053 amdgpu_put_xgmi_hive(hive);
6054 }
6055
6056 if (r)
6057 dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
6058
6059 atomic_set(&adev->reset_domain->reset_res, r);
6060
> 6061 drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
6062
6063 return r;
6064 }
6065
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2024-12-20 14:07 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-12 19:09 [PATCH 0/1] drm/amdgpu: Use device wedged event André Almeida
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
2024-12-13 7:34 ` Christian König
2024-12-13 7:46 ` Sharma, Shashank
2024-12-13 14:15 ` André Almeida
2024-12-13 14:36 ` Raag Jadav
2024-12-13 15:56 ` André Almeida
2024-12-16 10:18 ` Christian König
2024-12-16 10:38 ` Lazar, Lijo
2024-12-16 13:04 ` André Almeida
2024-12-16 13:10 ` Christian König
2024-12-16 13:15 ` André Almeida
2024-12-16 13:36 ` Christian König
2024-12-16 13:36 ` Lazar, Lijo
2024-12-16 13:39 ` Christian König
2024-12-16 13:44 ` Lazar, Lijo
2024-12-16 13:57 ` Raag Jadav
2024-12-13 15:31 ` Lazar, Lijo
2024-12-20 13:31 ` kernel test robot
2024-12-20 14:06 ` kernel test robot
2024-12-13 14:10 ` [PATCH 0/1] " Lucas De Marchi
2024-12-13 14:25 ` ✗ Fi.CI.BUILD: failure for " Patchwork
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.