* [PATCH v7] drm/doc: Document DRM device reset expectations
@ 2023-08-18 20:06 André Almeida
2023-08-22 10:26 ` Sebastian Wick
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: André Almeida @ 2023-08-18 20:06 UTC (permalink / raw)
To: dri-devel, amd-gfx, linux-kernel
Cc: kernel-dev, alexander.deucher, christian.koenig,
pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
Daniel Vetter, Daniel Stone, 'Marek Olšák',
Dave Airlie, Michel Dänzer, Samuel Pitoiset,
Timur Kristóf, Bas Nieuwenhuizen, Randy Dunlap,
André Almeida
Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
v7 changes:
- s/application/graphical API contex/ in the robustness part (Michel)
- Grammar fixes (Randy)
v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@igalia.com/
v6 changes:
- Due to substantial changes in the content, dropped Pekka's Acked-by
- Grammar fixes (Randy)
- Add paragraph about disabling device resets
- Add note about integrating reset tracking in drm/sched
- Add note that KMD should return failure for contexts affected by
resets and UMD should check for this
- Add note about lack of consensus around what to do about non-robust
apps
v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
---
Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
1 file changed, 77 insertions(+)
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..3694bdb977f5 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
mmapped regular files. Threads cause additional pain with signal
handling as well.
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+section describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Device resets can not be disabled without tainting the kernel, which can lead to
+hanging the entire kernel through shrinkers/mmu_notifiers. Userspace role in
+device resets is to propagate the message to the application and apply any
+special policy for blocking guilty applications, if any. Corollary is that
+debugging a hung GPU context require hardware support to be able to preempt such
+a GPU context while it's stopped.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hang is detected when a job gets stuck executing. KMD
+should keep track of resets, because userspace can query any time about the
+reset status for a specific context. This is needed to propagate to the rest of
+the stack that a reset has happened. Currently, this is implemented by each
+driver separately, with no common DRM interface. Ideally this should be properly
+integrated at DRM scheduler to provide a common ground for all drivers. After a
+reset, KMD should reject new command submissions for affected contexts.
+
+User Mode Driver
+----------------
+
+After command submission, UMD should check if the submission was accepted or
+rejected. After a reset, KMD should reject submissions, and UMD can issue an
+ioctl to the KMD to check the reset status, and this can be checked more often
+if the UMD requires it. After detecting a reset, UMD will then proceed to report
+it to the application using the appropriate API error code, as explained in the
+section below about robustness.
+
+Robustness
+----------
+
+The only way to try to keep a graphical API context working after a reset is if
+it complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. However,
+there is no guarantee that the app will use such features correctly, and a
+userspace that doesn't support robust interfaces (like a non-robust
+OpenGL context or API without any robustness support like libva) leave the
+robustness handling entirely to the userspace driver. There is no strong
+community consensus on what the userspace driver should do in that case,
+since all reasonable approaches have some clear downsides.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL should use the available robust interfaces, like the
+extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
+interface tells if a reset has happened, and if so, all the context state is
+considered lost and the app proceeds by creating new ones. There's no consensus
+on what to do to if robustness is not in use.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting causes of resets
+--------------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+the first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
.. _drm_driver_ioctl:
IOCTL Support on Device Nodes
--
2.41.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v7] drm/doc: Document DRM device reset expectations
2023-08-18 20:06 [PATCH v7] drm/doc: Document DRM device reset expectations André Almeida
@ 2023-08-22 10:26 ` Sebastian Wick
2023-08-23 17:31 ` Rodrigo Vivi
2023-08-24 11:30 ` Pekka Paalanen
2 siblings, 0 replies; 5+ messages in thread
From: Sebastian Wick @ 2023-08-22 10:26 UTC (permalink / raw)
To: André Almeida
Cc: dri-devel, amd-gfx, linux-kernel, pierre-eric.pelloux-prayer,
Randy Dunlap, 'Marek Olšák', Michel Dänzer,
Timur Kristóf, Pekka Paalanen, Samuel Pitoiset, kernel-dev,
alexander.deucher, christian.koenig
On Fri, Aug 18, 2023 at 05:06:42PM -0300, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>
> ---
>
> v7 changes:
> - s/application/graphical API contex/ in the robustness part (Michel)
> - Grammar fixes (Randy)
>
> v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@igalia.com/
>
> v6 changes:
> - Due to substantial changes in the content, dropped Pekka's Acked-by
> - Grammar fixes (Randy)
> - Add paragraph about disabling device resets
> - Add note about integrating reset tracking in drm/sched
> - Add note that KMD should return failure for contexts affected by
> resets and UMD should check for this
> - Add note about lack of consensus around what to do about non-robust
> apps
>
> v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
> ---
> Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
> 1 file changed, 77 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3694bdb977f5 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +section describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Device resets can not be disabled without tainting the kernel, which can lead to
> +hanging the entire kernel through shrinkers/mmu_notifiers. Userspace role in
> +device resets is to propagate the message to the application and apply any
> +special policy for blocking guilty applications, if any. Corollary is that
> +debugging a hung GPU context require hardware support to be able to preempt such
> +a GPU context while it's stopped.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset status for a specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface. Ideally this should be properly
> +integrated at DRM scheduler to provide a common ground for all drivers. After a
> +reset, KMD should reject new command submissions for affected contexts.
> +
> +User Mode Driver
> +----------------
> +
> +After command submission, UMD should check if the submission was accepted or
> +rejected. After a reset, KMD should reject submissions, and UMD can issue an
> +ioctl to the KMD to check the reset status, and this can be checked more often
> +if the UMD requires it. After detecting a reset, UMD will then proceed to report
> +it to the application using the appropriate API error code, as explained in the
> +section below about robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep a graphical API context working after a reset is if
> +it complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and a
> +userspace that doesn't support robust interfaces (like a non-robust
> +OpenGL context or API without any robustness support like libva) leave the
> +robustness handling entirely to the userspace driver. There is no strong
> +community consensus on what the userspace driver should do in that case,
> +since all reasonable approaches have some clear downsides.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. There's no consensus
> +on what to do to if robustness is not in use.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +the first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
Acked-by: Sebastian Wick <sebastian.wick@redhat.com>
> --
> 2.41.0
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v7] drm/doc: Document DRM device reset expectations
2023-08-18 20:06 [PATCH v7] drm/doc: Document DRM device reset expectations André Almeida
2023-08-22 10:26 ` Sebastian Wick
@ 2023-08-23 17:31 ` Rodrigo Vivi
2023-08-23 18:07 ` André Almeida
2023-08-24 11:30 ` Pekka Paalanen
2 siblings, 1 reply; 5+ messages in thread
From: Rodrigo Vivi @ 2023-08-23 17:31 UTC (permalink / raw)
To: André Almeida
Cc: dri-devel, amd-gfx, linux-kernel, pierre-eric.pelloux-prayer,
Randy Dunlap, 'Marek Olšák', Michel Dänzer,
Timur Kristóf, Pekka Paalanen, Samuel Pitoiset, kernel-dev,
alexander.deucher, christian.koenig
On Fri, Aug 18, 2023 at 05:06:42PM -0300, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>
> ---
>
> v7 changes:
> - s/application/graphical API contex/ in the robustness part (Michel)
> - Grammar fixes (Randy)
>
> v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@igalia.com/
>
> v6 changes:
> - Due to substantial changes in the content, dropped Pekka's Acked-by
> - Grammar fixes (Randy)
> - Add paragraph about disabling device resets
> - Add note about integrating reset tracking in drm/sched
> - Add note that KMD should return failure for contexts affected by
> resets and UMD should check for this
> - Add note about lack of consensus around what to do about non-robust
> apps
>
> v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
> ---
> Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
> 1 file changed, 77 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3694bdb977f5 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +section describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Device resets can not be disabled without tainting the kernel, which can lead to
> +hanging the entire kernel through shrinkers/mmu_notifiers. Userspace role in
> +device resets is to propagate the message to the application and apply any
> +special policy for blocking guilty applications, if any. Corollary is that
> +debugging a hung GPU context require hardware support to be able to preempt such
> +a GPU context while it's stopped.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset status for a specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface. Ideally this should be properly
> +integrated at DRM scheduler to provide a common ground for all drivers. After a
> +reset, KMD should reject new command submissions for affected contexts.
is there any consensus around what exactly 'affected contexts' might mean?
I see i915 pin-point only the context that was at execution with head pointing
at it and doesn't blame the queued ones, while on Xe it looks like we are
blaming all the queued context. Not sure what other drivers are doing for the
'affected contexts'.
> +
> +User Mode Driver
> +----------------
> +
> +After command submission, UMD should check if the submission was accepted or
> +rejected. After a reset, KMD should reject submissions, and UMD can issue an
> +ioctl to the KMD to check the reset status, and this can be checked more often
> +if the UMD requires it. After detecting a reset, UMD will then proceed to report
> +it to the application using the appropriate API error code, as explained in the
> +section below about robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep a graphical API context working after a reset is if
> +it complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and a
> +userspace that doesn't support robust interfaces (like a non-robust
> +OpenGL context or API without any robustness support like libva) leave the
> +robustness handling entirely to the userspace driver. There is no strong
> +community consensus on what the userspace driver should do in that case,
> +since all reasonable approaches have some clear downsides.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. There's no consensus
> +on what to do to if robustness is not in use.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +the first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
> --
> 2.41.0
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v7] drm/doc: Document DRM device reset expectations
2023-08-23 17:31 ` Rodrigo Vivi
@ 2023-08-23 18:07 ` André Almeida
0 siblings, 0 replies; 5+ messages in thread
From: André Almeida @ 2023-08-23 18:07 UTC (permalink / raw)
To: Rodrigo Vivi
Cc: dri-devel, amd-gfx, linux-kernel, pierre-eric.pelloux-prayer,
Randy Dunlap, 'Marek Olšák', Michel Dänzer,
Timur Kristóf, Pekka Paalanen, Samuel Pitoiset, kernel-dev,
alexander.deucher, christian.koenig
Hi Rodrigo,
Em 23/08/2023 14:31, Rodrigo Vivi escreveu:
> On Fri, Aug 18, 2023 at 05:06:42PM -0300, André Almeida wrote:
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>
>> ---
>>
>> v7 changes:
>> - s/application/graphical API contex/ in the robustness part (Michel)
>> - Grammar fixes (Randy)
>>
>> v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@igalia.com/
>>
>> v6 changes:
>> - Due to substantial changes in the content, dropped Pekka's Acked-by
>> - Grammar fixes (Randy)
>> - Add paragraph about disabling device resets
>> - Add note about integrating reset tracking in drm/sched
>> - Add note that KMD should return failure for contexts affected by
>> resets and UMD should check for this
>> - Add note about lack of consensus around what to do about non-robust
>> apps
>>
>> v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
>> ---
>> Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
>> 1 file changed, 77 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..3694bdb977f5 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>> mmapped regular files. Threads cause additional pain with signal
>> handling as well.
>>
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware bugs,
>> +faulty applications and everything in between the many layers. Some errors
>> +require resetting the device in order to make the device usable again. This
>> +section describes the expectations for DRM and usermode drivers when a
>> +device resets and how to propagate the reset status.
>> +
>> +Device resets can not be disabled without tainting the kernel, which can lead to
>> +hanging the entire kernel through shrinkers/mmu_notifiers. Userspace role in
>> +device resets is to propagate the message to the application and apply any
>> +special policy for blocking guilty applications, if any. Corollary is that
>> +debugging a hung GPU context require hardware support to be able to preempt such
>> +a GPU context while it's stopped.
>> +
>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and to perform
>> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
>> +should keep track of resets, because userspace can query any time about the
>> +reset status for a specific context. This is needed to propagate to the rest of
>> +the stack that a reset has happened. Currently, this is implemented by each
>> +driver separately, with no common DRM interface. Ideally this should be properly
>> +integrated at DRM scheduler to provide a common ground for all drivers. After a
>> +reset, KMD should reject new command submissions for affected contexts.
>
> is there any consensus around what exactly 'affected contexts' might mean?
> I see i915 pin-point only the context that was at execution with head pointing
> at it and doesn't blame the queued ones, while on Xe it looks like we are
> blaming all the queued context. Not sure what other drivers are doing for the
> 'affected contexts'.
>
"Affected contexts" is a generic term indeed, giving the differences
from each driver as you already pointed out. amdgpu also tends to affect
all queued contexts during a reset. This wording was used to fit how
different drivers works.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v7] drm/doc: Document DRM device reset expectations
2023-08-18 20:06 [PATCH v7] drm/doc: Document DRM device reset expectations André Almeida
2023-08-22 10:26 ` Sebastian Wick
2023-08-23 17:31 ` Rodrigo Vivi
@ 2023-08-24 11:30 ` Pekka Paalanen
2 siblings, 0 replies; 5+ messages in thread
From: Pekka Paalanen @ 2023-08-24 11:30 UTC (permalink / raw)
To: André Almeida
Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
Rob Clark, Daniel Vetter, Daniel Stone,
'Marek Olšák', Dave Airlie, Michel Dänzer,
Samuel Pitoiset, Timur Kristóf, Bas Nieuwenhuizen,
Randy Dunlap
[-- Attachment #1: Type: text/plain, Size: 5730 bytes --]
On Fri, 18 Aug 2023 17:06:42 -0300
André Almeida <andrealmeid@igalia.com> wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>
> ---
>
> v7 changes:
> - s/application/graphical API contex/ in the robustness part (Michel)
> - Grammar fixes (Randy)
>
> v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@igalia.com/
>
> v6 changes:
> - Due to substantial changes in the content, dropped Pekka's Acked-by
> - Grammar fixes (Randy)
> - Add paragraph about disabling device resets
> - Add note about integrating reset tracking in drm/sched
> - Add note that KMD should return failure for contexts affected by
> resets and UMD should check for this
> - Add note about lack of consensus around what to do about non-robust
> apps
>
> v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
> ---
> Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
> 1 file changed, 77 insertions(+)
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
It's a good introduction, even if answers to the most interesting
questions were not found yet.
Does it still answer the questions you had originally that you set out
to document?
Thanks,
pq
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3694bdb977f5 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +section describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Device resets can not be disabled without tainting the kernel, which can lead to
> +hanging the entire kernel through shrinkers/mmu_notifiers. Userspace role in
> +device resets is to propagate the message to the application and apply any
> +special policy for blocking guilty applications, if any. Corollary is that
> +debugging a hung GPU context require hardware support to be able to preempt such
> +a GPU context while it's stopped.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset status for a specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface. Ideally this should be properly
> +integrated at DRM scheduler to provide a common ground for all drivers. After a
> +reset, KMD should reject new command submissions for affected contexts.
> +
> +User Mode Driver
> +----------------
> +
> +After command submission, UMD should check if the submission was accepted or
> +rejected. After a reset, KMD should reject submissions, and UMD can issue an
> +ioctl to the KMD to check the reset status, and this can be checked more often
> +if the UMD requires it. After detecting a reset, UMD will then proceed to report
> +it to the application using the appropriate API error code, as explained in the
> +section below about robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep a graphical API context working after a reset is if
> +it complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and a
> +userspace that doesn't support robust interfaces (like a non-robust
> +OpenGL context or API without any robustness support like libva) leave the
> +robustness handling entirely to the userspace driver. There is no strong
> +community consensus on what the userspace driver should do in that case,
> +since all reasonable approaches have some clear downsides.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. There's no consensus
> +on what to do to if robustness is not in use.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +the first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-08-24 11:31 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-18 20:06 [PATCH v7] drm/doc: Document DRM device reset expectations André Almeida
2023-08-22 10:26 ` Sebastian Wick
2023-08-23 17:31 ` Rodrigo Vivi
2023-08-23 18:07 ` André Almeida
2023-08-24 11:30 ` Pekka Paalanen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox