* [PATCH v4 0/1] drm/doc: Document DRM device reset expectations
@ 2023-06-26 18:33 André Almeida
2023-06-26 18:33 ` [PATCH v4 1/1] " André Almeida
0 siblings, 1 reply; 4+ messages in thread
From: André Almeida @ 2023-06-26 18:33 UTC (permalink / raw)
To: dri-devel, amd-gfx, linux-kernel
Cc: kernel-dev, alexander.deucher, christian.koenig,
pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
Daniel Vetter, Daniel Stone, 'Marek Olšák',
Dave Airlie, Michel Dänzer, Samuel Pitoiset,
Timur Kristóf, Bas Nieuwenhuizen, André Almeida
This v4 removes the common DRM ioctl and adds just the documentation for now,
given that the lack of a common "DRM context" infrastructure makes it hard to implement.
v3: https://lore.kernel.org/lkml/20230621005719.836857-1-andrealmeid@igalia.com/
Changes:
- Drop the ioctl
- Addressed comments from Pekka, such as making the documentation clearer and
more consistent.
André Almeida (1):
drm/doc: Document DRM device reset expectations
Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
--
2.41.0
* [PATCH v4 1/1] drm/doc: Document DRM device reset expectations
2023-06-26 18:33 [PATCH v4 0/1] drm/doc: Document DRM device reset expectations André Almeida
@ 2023-06-26 18:33 ` André Almeida
2023-06-26 21:32 ` Randy Dunlap
2023-06-27 7:29 ` Pekka Paalanen
0 siblings, 2 replies; 4+ messages in thread
From: André Almeida @ 2023-06-26 18:33 UTC (permalink / raw)
To: dri-devel, amd-gfx, linux-kernel
Cc: kernel-dev, alexander.deucher, christian.koenig,
pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
Daniel Vetter, Daniel Stone, 'Marek Olšák',
Dave Airlie, Michel Dänzer, Samuel Pitoiset,
Timur Kristóf, Bas Nieuwenhuizen, André Almeida
Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..25a11b9b98fa 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
mmapped regular files. Threads cause additional pain with signal
handling as well.
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+section describes what is the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hang is detected when a job gets stuck executing. KMD
+should keep track of resets, because userspace can query any time about the
+reset stats for an specific context. This is needed to propagate to the rest of
+the stack that a reset has happened. Currently, this is implemented by each
+driver separately, with no common DRM interface.
+
+User Mode Driver
+----------------
+
+The UMD should check before submitting new commands to the KMD if the device has
+been reset, and this can be checked more often if it requires to. After
+detecting a reset, UMD will then proceed to report it to the application using
+the appropriated API error code, as explained in the below section about
+robustness.
+
+Robustness
+----------
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to application to deal with device resets. However,
+there is no guarantee that the app will be correctly using such features, and
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keeps blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL should use the available robust interfaces, like the
+extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
+interface tells if a reset has happened, and if so, all the context state is
+considered lost and the app proceeds by creating new ones. If is possible to
+determine that robustness is not in use, UMD will terminate the app when a reset
+is detected, giving that the contexts are lost and the app won't be able to
+figure this out and recreate the contexts.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting resets causes
+-----------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
.. _drm_driver_ioctl:
IOCTL Support on Device Nodes
--
2.41.0
* Re: [PATCH v4 1/1] drm/doc: Document DRM device reset expectations
2023-06-26 18:33 ` [PATCH v4 1/1] " André Almeida
@ 2023-06-26 21:32 ` Randy Dunlap
2023-06-27 7:29 ` Pekka Paalanen
1 sibling, 0 replies; 4+ messages in thread
From: Randy Dunlap @ 2023-06-26 21:32 UTC (permalink / raw)
To: André Almeida, dri-devel, amd-gfx, linux-kernel
Cc: kernel-dev, alexander.deucher, christian.koenig,
pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
Daniel Vetter, Daniel Stone, 'Marek Olšák',
Dave Airlie, Michel Dänzer, Samuel Pitoiset,
Timur Kristóf, Bas Nieuwenhuizen
Hi,
On 6/26/23 11:33, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..25a11b9b98fa 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +section describes what is the expectations for DRM and usermode drivers when a
   section describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if it requires to. After
                                       more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriated API error code, as explained in the below section about
       appropriate                                  the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to application to deal with device resets. However,
                               to applications
> +there is no guarantee that the app will be correctly using such features, and
                                      will use such features correctly, and a     // or "and the"
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keeps blocking
                                                                    keep
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If is possible to
                                                              If it is possible to
> +determine that robustness is not in use, UMD will terminate the app when a reset
                                            the UMD
> +is detected, giving that the contexts are lost and the app won't be able to
> +figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting resets causes
That's an awkward heading. How about:
Reporting causes of resets
--------------------------
> +-----------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
thanks for the documentation.
--
~Randy
* Re: [PATCH v4 1/1] drm/doc: Document DRM device reset expectations
2023-06-26 18:33 ` [PATCH v4 1/1] " André Almeida
2023-06-26 21:32 ` Randy Dunlap
@ 2023-06-27 7:29 ` Pekka Paalanen
1 sibling, 0 replies; 4+ messages in thread
From: Pekka Paalanen @ 2023-06-27 7:29 UTC (permalink / raw)
To: André Almeida
Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
Rob Clark, Daniel Vetter, Daniel Stone,
'Marek Olšák', Dave Airlie, Michel Dänzer,
Samuel Pitoiset, Timur Kristóf, Bas Nieuwenhuizen
On Mon, 26 Jun 2023 15:33:47 -0300
André Almeida <andrealmeid@igalia.com> wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
Hi,
grammar nitpicks notwithstanding, I'm happy with the contents now, so
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Thanks,
pq