From: Pekka Paalanen <ppaalanen@gmail.com>
To: "André Almeida" <andrealmeid@igalia.com>
Cc: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
linux-kernel@vger.kernel.org, kernel-dev@igalia.com,
alexander.deucher@amd.com, christian.koenig@amd.com,
pierre-eric.pelloux-prayer@amd.com,
"Simon Ser" <contact@emersion.fr>,
"Rob Clark" <robdclark@gmail.com>,
"Daniel Vetter" <daniel@ffwll.ch>,
"Daniel Stone" <daniel@fooishbar.org>,
"'Marek Olšák'" <maraeo@gmail.com>,
"Dave Airlie" <airlied@gmail.com>,
"Michel Dänzer" <michel.daenzer@mailbox.org>,
"Samuel Pitoiset" <samuel.pitoiset@gmail.com>,
"Timur Kristóf" <timur.kristof@gmail.com>,
"Bas Nieuwenhuizen" <bas@basnieuwenhuizen.nl>
Subject: Re: [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations
Date: Wed, 21 Jun 2023 10:58:42 +0300
Message-ID: <20230621105842.0c21b161@eldfell>
In-Reply-To: <20230621005719.836857-2-andrealmeid@igalia.com>
On Tue, 20 Jun 2023 21:57:16 -0300
André Almeida <andrealmeid@igalia.com> wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
Hi André,
nice to see this! I ended up giving lots of grammar comments, but I'm
not a native speaker. Generally it looks good to me.
> ---
> Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
> 1 file changed, 65 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..da4f8a694d8d 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. To recover
> +from this kind of state, sometimes is needed to reset the device. This section
It seems unclear what "this kind of state" refers to, so maybe just
write "errors"? Maybe:

	Some errors require resetting the device in order to make the
	device usable again.

I presume that recovery does not mean that the failed job could recover.
> +describes what's the expectations for DRM and usermode drivers when a device
> +resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD
s/hung/hang/ ?
> +then update it's internal reset tracking to be ready when userspace asks the
s/update it's/updates its/
"update reset tracking"... do you mean that KMD records information
about the reset in case userspace asks for it later?
> +kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
> +for that.
At this point, I'm not sure what "reset tracking" or "reset
information" entails. Could something be said about those?
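To make the question concrete, here is a minimal sketch of one thing
"reset tracking" could mean: a per-device reset counter plus a
per-context snapshot of it. Everything in it (the sketch_* names, the
counter scheme, the "guilty" flag) is an assumption for illustration
only; it is not what the patch or DRM_IOCTL_GET_RESET defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device state: bumped once per completed reset. */
struct sketch_device {
	uint64_t reset_count;
};

/* Hypothetical per-context state kept by the KMD. */
struct sketch_context {
	const struct sketch_device *dev;
	uint64_t seen_reset_count;	/* device counter at the last query */
	bool guilty;			/* this context's job caused a reset */
};

/* Hypothetical answer handed back to userspace on a query. */
struct sketch_reset_info {
	uint64_t resets_since_last_query;
	bool guilty;
};

void sketch_context_init(struct sketch_context *ctx,
			 const struct sketch_device *dev)
{
	assert(ctx != NULL && dev != NULL);
	ctx->dev = dev;
	ctx->seen_reset_count = dev->reset_count;
	ctx->guilty = false;
}

/* Reset path: record the reset and, if known, blame one context. */
void sketch_device_reset(struct sketch_device *dev,
			 struct sketch_context *guilty_ctx)
{
	assert(dev != NULL);
	dev->reset_count++;
	if (guilty_ctx != NULL)
		guilty_ctx->guilty = true;
}

/* Query path: report resets since the previous query, then resync. */
struct sketch_reset_info sketch_query_reset(struct sketch_context *ctx)
{
	struct sketch_reset_info info;

	assert(ctx != NULL && ctx->dev != NULL);
	info.resets_since_last_query =
		ctx->dev->reset_count - ctx->seen_reset_count;
	info.guilty = ctx->guilty;
	ctx->seen_reset_count = ctx->dev->reset_count;
	return info;
}
```

If the documentation intends something along these lines, saying so
explicitly (what is recorded, per device or per context, and what a
later query returns) would answer the question.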
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if it requires to. The
> +DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriated API error code, as explained in the bellow section about
s/bellow/below/
> +robustness.
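A rough sketch of how I read this paragraph, in case it helps to check
the wording against the intent. The query callback stands in for
whatever the UMD would really call (the patch proposes
DRM_IOCTL_GET_RESET); all sketch_* names and the error code are
hypothetical, and the sticky "lost" flag is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical API-level result codes reported to the application. */
enum sketch_result {
	SKETCH_OK = 0,
	SKETCH_ERROR_DEVICE_LOST = -1,
};

/* Stand-in for the kernel query; a real UMD would issue an ioctl. */
typedef bool (*sketch_reset_query_fn)(void *kmd_handle);

struct sketch_umd_context {
	void *kmd_handle;
	sketch_reset_query_fn was_reset;
	bool lost;			/* sticky once a reset was seen */
	unsigned int submitted;		/* jobs actually handed to the KMD */
};

/* Fake query for experimentation: the handle points at a bool flag. */
bool sketch_fake_query(void *kmd_handle)
{
	return *(bool *)kmd_handle;
}

/* Check for a reset first; only submit if the context is still good. */
enum sketch_result sketch_umd_submit(struct sketch_umd_context *ctx)
{
	assert(ctx != NULL && ctx->was_reset != NULL);

	if (!ctx->lost && ctx->was_reset(ctx->kmd_handle))
		ctx->lost = true;

	if (ctx->lost)
		return SKETCH_ERROR_DEVICE_LOST;

	ctx->submitted++;
	return SKETCH_OK;
}
```

In this reading the context stays lost even if a later query reports no
new reset, so the application keeps getting the error until it
recreates the context.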
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that is using.
that it is using.
> +
> +Graphical APIs provide ways to application to deal with device resets. However,
provide ways for applications to deal with
> +there's no guarantee that the app will be correctly using such features, and UMD
> +can implement policies to close the app if it's a repeating offender, likely in
> +a broken loop. This is done to ensure that it doesn't keeps blocking the user
does not keep
I think contractions are usually avoided in documents, but I'm not
bothering to flag them all.
> +interface to be correctly displayed.
interface from being correctly displayed.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
> +tells if a reset has happened, and if so, all the context state is considered
> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
> +will terminate the app when a reset is detected, giving that the contexts are
> +lost and the app won't be able to figure this out and recreate the contexts.
What about GL ES? Is GL_ARB_robustness implemented or even defined
there?

What about EGL returning errors like EGL_CONTEXT_LOST? Would handling
that not be enough on the app side? The documented expectation is: "The
application must destroy all contexts and reinitialise OpenGL ES state
and objects to continue rendering."
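The app-side behaviour that the quoted expectation describes could look
roughly like the sketch below. It does not call EGL at all: the status
enum only plays the role of an EGL_CONTEXT_LOST error, and every
sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of presenting a frame. */
enum sketch_swap_status {
	SKETCH_SWAP_OK,
	SKETCH_SWAP_CONTEXT_LOST,	/* plays the role of EGL_CONTEXT_LOST */
};

/* Hypothetical application state. */
struct sketch_app {
	bool have_context;
	bool have_objects;		/* textures, buffers, shaders, ... */
	unsigned int context_generation;
	unsigned int frames_presented;
};

/* Create a context and (re)upload all objects into it. */
static void sketch_app_create_gfx(struct sketch_app *app)
{
	app->have_context = true;
	app->have_objects = true;
	app->context_generation++;
}

/* Drop every context and every object that lived in it. */
static void sketch_app_destroy_gfx(struct sketch_app *app)
{
	app->have_objects = false;
	app->have_context = false;
}

void sketch_app_start(struct sketch_app *app)
{
	assert(app != NULL);
	app->context_generation = 0;
	app->frames_presented = 0;
	sketch_app_create_gfx(app);
}

/* On context loss: tear everything down, rebuild, and skip the frame. */
void sketch_app_after_swap(struct sketch_app *app,
			   enum sketch_swap_status status)
{
	assert(app != NULL);

	if (status == SKETCH_SWAP_CONTEXT_LOST) {
		sketch_app_destroy_gfx(app);
		sketch_app_create_gfx(app);
		return;
	}
	app->frames_presented++;
}
```

If an app that does this is considered robust enough, the document
could say so next to the GL_ARB_robustness text.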
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting resets causes
> +-----------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
reset wipe that buffer? How would that be communicated?
The dmabuf may have originated in another process.
Thanks,
pq