[RFC PATCH 0/1] drm: Add doc about GPU reset

* [RFC PATCH 0/1] drm: Add doc about GPU reset
@ 2023-01-23 20:26 André Almeida
  2023-01-23 20:26 ` [RFC PATCH] drm: Create documentation about device resets André Almeida
  0 siblings, 1 reply; 5+ messages in thread
From: André Almeida @ 2023-01-23 20:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, contactshashanksharma,
	amaranath.somalapuram, christian.koenig,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Pierre-Loup A . Griffais, André Almeida

Due to the complexity of its stack and the apps that we run on it, GPU resets
are a given. What's left for driver developers is to make resets as smooth an
experience as possible. While some OSes can recover or show an error
message in such cases, Linux is more hit-and-miss due to its lack of
standardization and guidelines on what to do in such cases.

That is the goal of this document: to properly define what should happen after
a GPU reset so that developers can start building on top of it. An IGT test
should be created to validate this for each driver.

Initially my approach was to expose a uevent for GPU resets, as can be seen
here[1]. However, even if a uevent can be useful for some use cases (e.g.
telemetry and error reporting), for the "OS integration" case of GPU resets
it would be more productive to have something defined throughout the stack.

Thanks,
	André

[1] https://lore.kernel.org/amd-gfx/20221125175203.52481-1-andrealmeid@igalia.com/

André Almeida (1):
  drm: Create documentation about device resets

 Documentation/gpu/drm-reset.rst | 51 +++++++++++++++++++++++++++++++++
 Documentation/gpu/index.rst     |  1 +
 2 files changed, 52 insertions(+)
 create mode 100644 Documentation/gpu/drm-reset.rst

-- 
2.39.1

* [RFC PATCH] drm: Create documentation about device resets
  2023-01-23 20:26 [RFC PATCH 0/1] drm: Add doc about GPU reset André Almeida
@ 2023-01-23 20:26 ` André Almeida
  2023-01-23 20:38   ` Christian König
  0 siblings, 1 reply; 5+ messages in thread
From: André Almeida @ 2023-01-23 20:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, contactshashanksharma,
	amaranath.somalapuram, christian.koenig,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Pierre-Loup A . Griffais, André Almeida

Create a document that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 Documentation/gpu/drm-reset.rst | 51 +++++++++++++++++++++++++++++++++
 Documentation/gpu/index.rst     |  1 +
 2 files changed, 52 insertions(+)
 create mode 100644 Documentation/gpu/drm-reset.rst

diff --git a/Documentation/gpu/drm-reset.rst b/Documentation/gpu/drm-reset.rst
new file mode 100644
index 000000000000..0dd11a469cf9
--- /dev/null
+++ b/Documentation/gpu/drm-reset.rst
@@ -0,0 +1,51 @@
+================
+DRM Device Reset
+================
+
+The GPU stack is complex and prone to errors, from hardware bugs and faulty
+applications to everything in the many layers in between. To recover from
+this kind of state, it is sometimes necessary to reset the GPU. Improper
+handling of GPU resets can lead to an unstable userspace. This page describes
+the behaviour expected from DRM drivers in those situations, and from usermode
+drivers and compositors as well.
+
+Robustness
+----------
+
+First of all, application robustness APIs should be used when available. This
+allows the application to correctly recover and continue to run after a reset.
+Apps that don't use them should be promptly killed when the kernel driver
+detects that they are in a broken state. Specific guidelines for some APIs:
+
+- OpenGL: During a reset, KMD kills processes that don't have ARB Robustness
+  enabled, assuming they can't recover.
+- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``,
+  so KMD doesn't kill any. If an app doesn't handle it correctly, it's considered
+  a broken application and UMD will deal with it.
+
+Kernel mode driver
+------------------
+
+The KMD should be able to detect that something is wrong with the application
+and that a reset needs to take place to recover the device (e.g. an endless
+wait). It needs to properly track the context that is broken and mark it as
+dead, so that any further syscalls to that context are rejected. The
+other contexts should be preserved when possible, to avoid crashing the rest of
+userspace. KMD can ban a file descriptor that keeps causing resets, as it's
+likely in a broken loop.
+
+User mode driver
+----------------
+
+During a reset, UMD should be aware that rejected syscalls indicate that the
+context is broken, and for robust apps the recovery should happen for that
+context. Non-robust apps would already be terminated by KMD. If no new context
+is created for some time, it is assumed that the recovery didn't work, so UMD
+should terminate the app.
+
+Compositors
+-----------
+
+(In the long term) compositors should be robust as well to properly deal with
+errors. Init systems should be aware of the compositor status and reset it if it
+is broken.
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index b99dede9a5b1..300b2529bd39 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -9,6 +9,7 @@ Linux GPU Driver Developer's Guide
    drm-mm
    drm-kms
    drm-kms-helpers
+   drm-reset
    drm-uapi
    drm-usage-stats
    driver-uapi
-- 
2.39.1
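
To illustrate the "Kernel mode driver" section of the patch above, here is a
minimal userspace sketch in C of the bookkeeping it describes: the guilty
context is marked dead, later submissions to it are rejected, sibling contexts
keep working, and a file descriptor that keeps causing resets is banned. Every
name here (gpu_file, gpu_context, mark_guilty, submit, RESET_BAN_THRESHOLD) and
both error codes are hypothetical and chosen only for illustration; this is not
the DRM API and no real driver is claimed to work this way.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed value: how many resets one file may cause before it is banned. */
#define RESET_BAN_THRESHOLD 3

/* Hypothetical stand-in for an open DRM file descriptor. */
struct gpu_file {
    unsigned int resets_caused;
    bool banned;
};

/* Hypothetical stand-in for a GPU context owned by a file. */
struct gpu_context {
    struct gpu_file *owner;
    bool dead;              /* set once this context was blamed for a reset */
};

/* Called when a reset is attributed to ctx: only that context is marked dead. */
static void mark_guilty(struct gpu_context *ctx)
{
    assert(ctx && ctx->owner);
    ctx->dead = true;
    if (++ctx->owner->resets_caused >= RESET_BAN_THRESHOLD)
        ctx->owner->banned = true;
}

/* Submission entry point: rejects banned files and dead contexts. */
static int submit(const struct gpu_context *ctx)
{
    assert(ctx && ctx->owner);
    if (ctx->owner->banned)
        return -EACCES;     /* assumed error code for a banned file */
    if (ctx->dead)
        return -ECANCELED;  /* assumed error code for a dead context */
    return 0;
}
```

In this model a second context on the same file still submits successfully
after the first one is marked dead, which matches the "other contexts should be
preserved when possible" wording; only the third blamed context trips the ban.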


* Re: [RFC PATCH] drm: Create documentation about device resets
  2023-01-23 20:26 ` [RFC PATCH] drm: Create documentation about device resets André Almeida
@ 2023-01-23 20:38   ` Christian König
  2023-02-07 13:30     ` Pekka Paalanen
  0 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2023-01-23 20:38 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, contactshashanksharma,
	amaranath.somalapuram, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Pekka Paalanen, Daniel Vetter, Daniel Stone,
	'Marek Olšák', Dave Airlie,
	Pierre-Loup A . Griffais

Am 23.01.23 um 21:26 schrieb André Almeida:
> Create a document that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>   Documentation/gpu/drm-reset.rst | 51 +++++++++++++++++++++++++++++++++
>   Documentation/gpu/index.rst     |  1 +
>   2 files changed, 52 insertions(+)
>   create mode 100644 Documentation/gpu/drm-reset.rst
>
> diff --git a/Documentation/gpu/drm-reset.rst b/Documentation/gpu/drm-reset.rst
> new file mode 100644
> index 000000000000..0dd11a469cf9
> --- /dev/null
> +++ b/Documentation/gpu/drm-reset.rst
> @@ -0,0 +1,51 @@
> +================
> +DRM Device Reset
> +================
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in the many layers in between. To recover
> +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
> +of GPU resets can lead to an unstable userspace. This page describes what's the
> +expected behaviour from DRM drivers to do in those situations, from usermode
> +drivers and compositors as well.
> +
> +Robustness
> +----------
> +
> +First of all, application robust APIs, when available, should be used. This
> +allows the application to correctly recover and continue to run after a reset.
> +Apps that doesn't use this should be promptly killed when the kernel driver
> +detects that it's in broken state. Specifically guidelines for some APIs:
> +

> +- OpenGL: During a reset, KMD kill processes that haven't ARB Robustness
> +  enabled, assuming they can't recover.

This is a pretty clear NAK from my side to this approach. The KMD should 
never mess with a userspace process directly in such a way.

Instead use something like this "OpenGL: KMD signals the abortion of 
submitted commands and the UMD should then react accordingly and abort 
the application.".
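
As a rough illustration of that division of labour, the C sketch below models
the UMD side only: the KMD reports that submitted commands were aborted, and the
UMD decides what to do based on whether the context is robust. The names
(umd_action, umd_handle_submit) are hypothetical, and using -ECANCELED as the
"commands were aborted" signal is an assumption made for this example, not a
statement about any particular driver. A Vulkan context would be modelled as
always robust, since the error is surfaced as ``VK_ERROR_DEVICE_LOST``.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum umd_action {
    UMD_CONTINUE,       /* submission accepted, nothing to do */
    UMD_REPORT_RESET,   /* robust context: surface the reset to the app */
    UMD_ABORT_APP,      /* non-robust context: the app cannot recover */
    UMD_OTHER_ERROR     /* unrelated failure, handled elsewhere */
};

/*
 * submit_err is 0 or a negative errno from a hypothetical submission call.
 * The kernel never touches the process; the UMD picks the reaction.
 */
static enum umd_action umd_handle_submit(int submit_err, bool robust)
{
    assert(submit_err <= 0);
    if (submit_err == 0)
        return UMD_CONTINUE;
    if (submit_err != -ECANCELED)   /* assumed "commands aborted" code */
        return UMD_OTHER_ERROR;
    return robust ? UMD_REPORT_RESET : UMD_ABORT_APP;
}
```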

> +- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``,
> +  so KMD doesn't kill any. If it doesn't do it right, it's considered a broken
> +  application and UMD will deal with it.

Again, please remove the "KMD kill" reference.

> +
> +Kernel mode driver
> +------------------
> +
> +The KMD should be able to detect that something is wrong with the application

Please replace *should* with *must* here; this is mandatory, as otherwise
core memory management can run into deadlocks during reclaim.

Regards,
Christian.

> +and that a reset is needed to take place to recover the device (e.g. an endless
> +wait). It needs to properly track the context that is broken and mark it as
> +dead, so any other syscalls to that context should be further rejected. The
> +other contexts should be preserved when possible, avoid crashing the rest of
> +userspace. KMD can ban a file descriptor that keeps causing resets, as it's
> +likely in a broken loop.
> +
> +User mode driver
> +----------------
> +
> +During a reset, UMD should be aware that rejected syscalls indicates that the
> +context is broken and for robust apps the recovery should happen for the
> +context. Non-robust apps would be already terminated by KMD. If no new context
> +is created for some time, it is assumed that the recovery didn't work, so UMD
> +should terminate it.
> +
> +Compositors
> +-----------
> +
> +(In the long term) compositors should be robust as well to properly deal with it
> +errors. Init systems should be aware of the compositor status and reset it if is
> +broken.
> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
> index b99dede9a5b1..300b2529bd39 100644
> --- a/Documentation/gpu/index.rst
> +++ b/Documentation/gpu/index.rst
> @@ -9,6 +9,7 @@ Linux GPU Driver Developer's Guide
>      drm-mm
>      drm-kms
>      drm-kms-helpers
> +   drm-reset
>      drm-uapi
>      drm-usage-stats
>      driver-uapi


* Re: [RFC PATCH] drm: Create documentation about device resets
  2023-01-23 20:38   ` Christian König
@ 2023-02-07 13:30     ` Pekka Paalanen
  2023-02-07 14:58       ` Michel Dänzer
  0 siblings, 1 reply; 5+ messages in thread
From: Pekka Paalanen @ 2023-02-07 13:30 UTC (permalink / raw)
  To: André Almeida
  Cc: Christian König, dri-devel, amd-gfx, linux-kernel,
	kernel-dev, alexander.deucher, contactshashanksharma,
	amaranath.somalapuram, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák', Dave Airlie,
	Pierre-Loup A . Griffais

On Mon, 23 Jan 2023 21:38:11 +0100
Christian König <christian.koenig@amd.com> wrote:

> Am 23.01.23 um 21:26 schrieb André Almeida:
> > Create a document that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >   Documentation/gpu/drm-reset.rst | 51 +++++++++++++++++++++++++++++++++
> >   Documentation/gpu/index.rst     |  1 +
> >   2 files changed, 52 insertions(+)
> >   create mode 100644 Documentation/gpu/drm-reset.rst
> >
> > diff --git a/Documentation/gpu/drm-reset.rst b/Documentation/gpu/drm-reset.rst
> > new file mode 100644
> > index 000000000000..0dd11a469cf9
> > --- /dev/null
> > +++ b/Documentation/gpu/drm-reset.rst
> > @@ -0,0 +1,51 @@
> > +================
> > +DRM Device Reset
> > +================
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in the many layers in between. To recover
> > +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
> > +of GPU resets can lead to an unstable userspace. This page describes what's the
> > +expected behaviour from DRM drivers to do in those situations, from usermode
> > +drivers and compositors as well.
> > +
> > +Robustness
> > +----------
> > +
> > +First of all, application robust APIs, when available, should be used. This
> > +allows the application to correctly recover and continue to run after a reset.
> > +Apps that doesn't use this should be promptly killed when the kernel driver
> > +detects that it's in broken state. Specifically guidelines for some APIs:
> > +  
> 
> > +- OpenGL: During a reset, KMD kill processes that haven't ARB Robustness
> > +  enabled, assuming they can't recover.  
> 
> This is a pretty clear NAK from my side to this approach. The KMD should 
> never mess with a userspace process directly in such a way.
> 
> Instead use something like this "OpenGL: KMD signals the abortion of 
> submitted commands and the UMD should then react accordingly and abort 
> the application.".
> 
> > +- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``,
> > +  so KMD doesn't kill any. If it doesn't do it right, it's considered a broken
> > +  application and UMD will deal with it.  
> 
> Again, please remove the "KMD kill" reference.
> 
> > +
> > +Kernel mode driver
> > +------------------
> > +
> > +The KMD should be able to detect that something is wrong with the application  
> 
> Please replace *should* with *must* here; this is mandatory, as otherwise
> core memory management can run into deadlocks during reclaim.
> 
> Regards,
> Christian.
> 
> > +and that a reset is needed to take place to recover the device (e.g. an endless
> > +wait). It needs to properly track the context that is broken and mark it as
> > +dead, so any other syscalls to that context should be further rejected. The
> > +other contexts should be preserved when possible, avoid crashing the rest of
> > +userspace. KMD can ban a file descriptor that keeps causing resets, as it's
> > +likely in a broken loop.
> > +
> > +User mode driver
> > +----------------
> > +
> > +During a reset, UMD should be aware that rejected syscalls indicates that the
> > +context is broken and for robust apps the recovery should happen for the
> > +context. Non-robust apps would be already terminated by KMD. If no new context
> > +is created for some time, it is assumed that the recovery didn't work, so UMD
> > +should terminate it.

Hi,

what Christian said, plus I would not assume that robust programs will
always respond by creating a new context. They could also switch
to a software renderer, or simply not do graphics again until something
else happens.

> > +
> > +Compositors
> > +-----------
> > +
> > +(In the long term) compositors should be robust as well to properly deal with it
> > +errors. Init systems should be aware of the compositor status and reset it if is
> > +broken.

I don't know how init systems could do that, or what difference does it
make to an init system whether the display server is robust or not.
Display servers can get stuck for other reasons as well. They may also
be live-stuck, where they respond to keepalive, serve clients, and
deliver input events, but still do not update the screen. You can't
tell if that's a malfunction or expected.



Have you checked
https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
that you are consistent with hot-unplug plans?


Thanks,
pq

> > diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
> > index b99dede9a5b1..300b2529bd39 100644
> > --- a/Documentation/gpu/index.rst
> > +++ b/Documentation/gpu/index.rst
> > @@ -9,6 +9,7 @@ Linux GPU Driver Developer's Guide
> >      drm-mm
> >      drm-kms
> >      drm-kms-helpers
> > +   drm-reset
> >      drm-uapi
> >      drm-usage-stats
> >      driver-uapi  
> 



* Re: [RFC PATCH] drm: Create documentation about device resets
  2023-02-07 13:30     ` Pekka Paalanen
@ 2023-02-07 14:58       ` Michel Dänzer
  0 siblings, 0 replies; 5+ messages in thread
From: Michel Dänzer @ 2023-02-07 14:58 UTC (permalink / raw)
  To: Pekka Paalanen, André Almeida
  Cc: pierre-eric.pelloux-prayer, 'Marek Olšák',
	amaranath.somalapuram, linux-kernel, amd-gfx, dri-devel,
	kernel-dev, alexander.deucher, contactshashanksharma,
	Christian König, Pierre-Loup A . Griffais

On 2/7/23 14:30, Pekka Paalanen wrote:
> On Mon, 23 Jan 2023 21:38:11 +0100
> Christian König <christian.koenig@amd.com> wrote:
>> Am 23.01.23 um 21:26 schrieb André Almeida:
>>>
>>> diff --git a/Documentation/gpu/drm-reset.rst b/Documentation/gpu/drm-reset.rst
>>> new file mode 100644
>>> index 000000000000..0dd11a469cf9
>>> --- /dev/null
>>> +++ b/Documentation/gpu/drm-reset.rst
>>> @@ -0,0 +1,51 @@
>>> +================
>>> +DRM Device Reset
>>> +================
>>> +
>>> +The GPU stack is really complex and is prone to errors, from hardware bugs,
>>> +faulty applications and everything in the many layers in between. To recover
>>> +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
>>> +of GPU resets can lead to an unstable userspace. This page describes what's the
>>> +expected behaviour from DRM drivers to do in those situations, from usermode
>>> +drivers and compositors as well.
>>> +
>>> +Robustness
>>> +----------
>>> +
>>> +First of all, application robust APIs, when available, should be used. This
>>> +allows the application to correctly recover and continue to run after a reset.
>>> +Apps that doesn't use this should be promptly killed when the kernel driver
>>> +detects that it's in broken state. Specifically guidelines for some APIs:
>>> +  
>>
>>> +- OpenGL: During a reset, KMD kill processes that haven't ARB Robustness
>>> +  enabled, assuming they can't recover.  
>>
>> This is a pretty clear NAK from my side to this approach. The KMD should 
>> never mess with a userspace process directly in such a way.
>>
>> Instead use something like this "OpenGL: KMD signals the abortion of 
>> submitted commands and the UMD should then react accordingly and abort 
>> the application.".
> 
> what Christian said, plus I would not assume that robust programs will
> always respond by creating a new context. They could also switch
> to a software renderer, [...]

That is indeed what Firefox does.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

