From: Pekka Paalanen <ppaalanen@gmail.com>
To: "André Almeida" <andrealmeid@igalia.com>
Cc: pierre-eric.pelloux-prayer@amd.com,
"Samuel Pitoiset" <samuel.pitoiset@gmail.com>,
"Daniel Vetter" <daniel@ffwll.ch>,
"'Marek Olšák'" <maraeo@gmail.com>,
"Michel Dänzer" <michel.daenzer@mailbox.org>,
"Simon Ser" <contact@emersion.fr>,
"Timur Kristóf" <timur.kristof@gmail.com>,
linux-kernel@vger.kernel.org, amd-gfx@lists.freedesktop.org,
"Rob Clark" <robdclark@gmail.com>,
dri-devel@lists.freedesktop.org, kernel-dev@igalia.com,
"Bas Nieuwenhuizen" <bas@basnieuwenhuizen.nl>,
alexander.deucher@amd.com, "Daniel Stone" <daniel@fooishbar.org>,
"Dave Airlie" <airlied@gmail.com>,
christian.koenig@amd.com
Subject: Re: [PATCH v4 1/1] drm/doc: Document DRM device reset expectations
Date: Tue, 27 Jun 2023 10:29:55 +0300
Message-ID: <20230627102955.6a2c5796@eldfell>
In-Reply-To: <20230626183347.55118-2-andrealmeid@igalia.com>
On Mon, 26 Jun 2023 15:33:47 -0300
André Almeida <andrealmeid@igalia.com> wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
Hi,
grammar nitpicks notwithstanding, I'm happy with the contents now, so
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Thanks,
pq
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..25a11b9b98fa 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs
> +to faulty applications and everything in between across its many layers. Some
> +errors require resetting the device in order to make it usable again. This
> +section describes the expectations for DRM and usermode drivers when a
> +device resets, and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking whether the device needs a reset, and for
> +performing it as needed. Usually a hang is detected when a job gets stuck
> +executing. The KMD should keep track of resets, because userspace can query
> +the reset stats for a specific context at any time. This is needed to tell the
> +rest of the stack that a reset has happened. Currently, this is implemented by
> +each driver separately, with no common DRM interface.
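To make the bookkeeping described above concrete, here is a minimal sketch of
one way a driver could track resets per context. It is not taken from any
driver: every name with the sketch_ prefix is hypothetical, and the single
device-wide counter is only an assumption for illustration. Real drivers each
have their own interface, as the quoted paragraph says.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device state, not a real DRM structure. */
struct sketch_device {
	uint64_t reset_count;	/* bumped once per device reset */
};

/* Hypothetical per-context state. */
struct sketch_context {
	const struct sketch_device *dev;
	uint64_t reset_count_at_create;	/* snapshot taken at creation */
};

void sketch_context_init(struct sketch_context *ctx,
			 const struct sketch_device *dev)
{
	assert(ctx && dev);
	ctx->dev = dev;
	ctx->reset_count_at_create = dev->reset_count;
}

/* Called by the (hypothetical) hang handler after resetting the hardware. */
void sketch_device_note_reset(struct sketch_device *dev)
{
	assert(dev);
	dev->reset_count++;
}

/* What a per-context reset query could report back to userspace. */
bool sketch_context_lost(const struct sketch_context *ctx)
{
	assert(ctx && ctx->dev);
	return ctx->dev->reset_count != ctx->reset_count_at_create;
}
```

With this model, a context created before a reset reports itself as lost
afterwards, while a context created after the reset does not, so a UMD could
call the query before each submission.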
> +
> +User Mode Driver
> +----------------
> +
> +Before submitting new commands to the KMD, the UMD should check whether the
> +device has been reset; it may check more often if it needs to. After
> +detecting a reset, the UMD reports it to the application using the
> +appropriate API error code, as explained in the section about robustness
> +below.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways for applications to deal with device resets.
> +However, there is no guarantee that the app will use such features correctly,
> +and the UMD can implement policies to close the app if it is a repeat offender,
> +likely stuck in a broken loop. This is done to ensure that it does not keep
> +blocking the user interface from being correctly displayed. This should be done
> +even if the app is correct but happens to trigger a bug in the hardware/driver.
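The text leaves the actual policy open. As one possible shape for it, the
sketch below flags an app once it has caused more than a given number of
resets within a time window. The threshold, the window length and all sketch_
names are made up for illustration, and timestamps are assumed to come from a
monotonic clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_RESETS 3	/* hypothetical: resets tolerated per window */
#define SKETCH_WINDOW_SEC 60	/* hypothetical: window length in seconds */

/* Remembers the timestamps of the last SKETCH_MAX_RESETS resets. */
struct sketch_offender_tracker {
	uint64_t stamps[SKETCH_MAX_RESETS];
	unsigned int next;	/* slot holding the oldest timestamp */
	unsigned int filled;	/* number of valid slots */
};

/*
 * Record a reset seen at now_sec. Returns true when this reset is one more
 * than SKETCH_MAX_RESETS within SKETCH_WINDOW_SEC, i.e. when the app looks
 * like a repeat offender and the UMD may decide to close it.
 */
bool sketch_record_reset(struct sketch_offender_tracker *t, uint64_t now_sec)
{
	bool full;
	uint64_t oldest;

	assert(t);
	full = t->filled == SKETCH_MAX_RESETS;
	oldest = t->stamps[t->next];

	/* Overwrite the oldest entry with the new timestamp. */
	t->stamps[t->next] = now_sec;
	t->next = (t->next + 1) % SKETCH_MAX_RESETS;
	if (!full)
		t->filled++;

	return full && (now_sec - oldest) <= SKETCH_WINDOW_SEC;
}
```

A zero-initialized tracker per application is enough to start with. Resets
that are spread far apart never trip the check, which keeps an app that only
occasionally hits a hardware or driver bug from being closed.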
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new contexts. If it is possible
> +to determine that robustness is not in use, the UMD will terminate the app when
> +a reset is detected, given that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` on submissions.
> +This error code means, among other things, that a device reset has happened and
> +that the app needs to recreate its contexts to keep going.
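On the application side, the recovery described here boils down to a "submit,
and on device loss recreate and retry" loop. The sketch below shows that
control flow only. It does not call Vulkan: SKETCH_DEVICE_LOST stands in for
``VK_ERROR_DEVICE_LOST``, the two callbacks stand in for the app's own submit
and recreate code, and all sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in result codes; a real app would look at the API's own codes. */
enum sketch_result { SKETCH_SUCCESS, SKETCH_DEVICE_LOST };

struct sketch_app {
	enum sketch_result (*submit)(void *user);	/* submit one batch */
	bool (*recreate)(void *user);	/* rebuild everything that was lost */
	void *user;
};

/*
 * Submit once, recreating and retrying after each device loss, at most
 * max_recoveries times. Returns true if a submission finally succeeded.
 */
bool sketch_submit_with_recovery(const struct sketch_app *app,
				 unsigned int max_recoveries)
{
	unsigned int attempt;

	assert(app && app->submit && app->recreate);
	for (attempt = 0; ; attempt++) {
		if (app->submit(app->user) == SKETCH_SUCCESS)
			return true;
		if (attempt == max_recoveries)
			return false;	/* give up instead of looping forever */
		if (!app->recreate(app->user))
			return false;
	}
}

/* Fake backend for trying the loop out: loses the device a few times. */
struct sketch_fake {
	int losses_left;
	int recreations;
};

enum sketch_result sketch_fake_submit(void *user)
{
	struct sketch_fake *f = user;

	if (f->losses_left > 0) {
		f->losses_left--;
		return SKETCH_DEVICE_LOST;
	}
	return SKETCH_SUCCESS;
}

bool sketch_fake_recreate(void *user)
{
	struct sketch_fake *f = user;

	f->recreations++;
	return true;
}
```

Bounding the number of recoveries is a deliberate choice in this sketch: an
app that keeps losing the device is better off giving up than retrying
indefinitely.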
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset
> +in the first place. DRM devices should make use of devcoredump to store
> +relevant information about the reset, so this information can be added to user
> +bug reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes