Intel-GFX Archive on lore.kernel.org
From: Raag Jadav <raag.jadav@intel.com>
To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	michal.wajdeczko@intel.com, christian.koenig@amd.com
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
	aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	alexander.deucher@amd.com, andrealmeid@igalia.com,
	amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com,
	Raag Jadav <raag.jadav@intel.com>
Subject: [PATCH v8 2/4] drm/doc: Document device wedged event
Date: Fri, 25 Oct 2024 14:18:15 +0530
Message-ID: <20241025084817.144621-3-raag.jadav@intel.com>
In-Reply-To: <20241025084817.144621-1-raag.jadav@intel.com>

Add documentation for the device wedged event in a new 'Device wedging'
chapter. The chapter describes basic definitions and consumer expectations
along with an example.

v8: Improve documentation (Christian, Rodrigo)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 370d820be248..11a7446233b5 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -362,6 +362,81 @@ the first place. DRM devices should make use of devcoredump to store relevant
 information about the reset, so this information can be added to user bug
 reports.
 
+Device wedging
+==============
+
+Drivers can optionally make use of the device wedged event (implemented as
+drm_dev_wedged_event() in the DRM subsystem), which notifies userspace of the
+wedged (hung/unusable) state of the DRM device through a uevent. This is
+especially useful when the device is no longer operating as expected even
+after a reset and has become unrecoverable from driver context. The purpose
+of this implementation is to give drivers a generic way to recover with the
+help of userspace intervention, without taking any drastic measures in the
+driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification sent to userspace, along with a hint about what
+could be attempted to recover the device and bring it back to a usable
+state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, hence the vendor-agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for recovery and which of the available methods to use for it.
+
+Recovery
+--------
+
+The current implementation defines two recovery methods, of which drivers can
+use either one, both or none. The method(s) of choice are sent in the uevent
+environment as ``WEDGED=<method1>[,<method2>]``, ordered from fewer to more
+side effects. If the driver is unsure about recovery or the method is unknown
+(like reboot, firmware flashing, hardware replacement or any other procedure
+which can't be attempted on the fly), ``WEDGED=none`` is sent instead.
+
+It is the responsibility of the driver to perform required cleanups (like
+disabling system memory access or signalling dma_fences) and prepare itself
+for the recovery before sending the event. Once the event is sent, the driver
+should block all IOCTLs with an error code. The error code signifies the
+reason for wedging, which can be reported to the application if needed.
+
+Userspace consumers can parse this event and attempt recovery as per the
+expectations below.
+
+    =============== ==================================
+    Recovery method Consumer expectations
+    =============== ==================================
+    rebind          unbind + rebind driver
+    bus-reset       unbind + reset bus device + rebind
+    none            admin/user policy
+    =============== ==================================
+
+Example for rebind
+~~~~~~~~~~~~~~~~~~
+
+Udev rule::
+
+    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", \
+    RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+Recovery script::
+
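+    #!/bin/sh
+
+    DEVPATH=$(readlink -f "/sys/$1/device")   # parent bus device of the card
+    DEVICE=$(basename "$DEVPATH")             # device name for bind/unbind
+    DRIVER=$(readlink -f "$DEVPATH/driver")   # driver currently bound to it
+
+    printf '%s' "$DEVICE" > "$DRIVER/unbind"
+    sleep 1
+    printf '%s' "$DEVICE" > "$DRIVER/bind"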
+
+Although scripts are simple enough for basic recovery, admins/users can define
+customized policies around the recovery action. For example, if the driver
+supports multiple recovery methods, consumers can pick a suitable one based on
+policy. Consumers can also gather telemetry (devcoredump, syslog) or keep the
+device available for further debugging and data collection before performing
+the recovery. This is especially useful when the driver is unsure about
+recovery or the method is unknown.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
-- 
2.34.1
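
A consumer that receives ``WEDGED=<method1>[,<method2>]`` has to split the
list before acting on it. The following is a minimal sketch of that step in
POSIX sh, assuming the value is handed over as the first argument (for example
through ``$env{WEDGED}`` in a udev rule). ``pick_method`` is a hypothetical
helper name and not part of this patch. Since the methods are listed from
fewer to more side effects, the sketch returns the first one it recognises.

```shell
#!/bin/sh
# Sketch only: pick a recovery method from a WEDGED uevent value.
# pick_method is a hypothetical helper, not something the patch defines.

# $1: value of WEDGED, e.g. "rebind,bus-reset" or "none".
# Prints the first recognised method, or "none" if there is none.
pick_method() {
    wedged=$1
    while [ -n "$wedged" ]; do
        method=${wedged%%,*}              # first entry of the list
        case $method in
            rebind|bus-reset)
                printf '%s\n' "$method"
                return 0
                ;;
        esac
        case $wedged in
            *,*) wedged=${wedged#*,} ;;   # drop the first entry
            *)   wedged= ;;               # last entry consumed
        esac
    done
    printf 'none\n'
}

# Only act when called with an argument, as a udev RUN+= program would be.
if [ $# -gt 0 ]; then
    pick_method "$1"
fi
```

With this, ``pick_method "rebind,bus-reset"`` prints ``rebind``, and a value
without a recognised method falls back to ``none``, which leaves the decision
to admin/user policy.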


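The patch gives a script for ``rebind`` only. For ``bus-reset`` the consumer
expectation is unbind + reset bus device + rebind, which could look roughly
like the sketch below. It assumes a PCI device that exposes the sysfs
``reset`` attribute; other buses would need their own reset mechanism.
``bus_reset`` is a hypothetical helper name, not part of this patch.

```shell
#!/bin/sh
# Sketch only: bus-reset recovery for a PCI-backed DRM device.
# Assumption: the bus device exposes the PCI sysfs "reset" attribute.
# bus_reset is a hypothetical helper, not something the patch defines.

# $1: sysfs directory of the DRM card, e.g. /sys/$DEVPATH from the uevent.
bus_reset() {
    dev=$(readlink -f "$1/device")      # parent bus device of the card
    name=$(basename "$dev")             # device name for bind/unbind
    drv=$(readlink -f "$dev/driver")    # resolve before unbind drops the link

    printf '%s' "$name" > "$drv/unbind" || return 1
    printf '1' > "$dev/reset" || return 1   # ask the PCI core to reset it
    printf '%s' "$name" > "$drv/bind"
}

# Only act when called with a DEVPATH argument, as a udev RUN+= program would be.
if [ $# -gt 0 ]; then
    bus_reset "/sys/$1"
fi
```

The driver path is resolved before unbinding because the ``driver`` symlink
goes away once the device is unbound.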
Thread overview: 17+ messages
2024-10-25  8:48 [PATCH v8 0/4] Introduce DRM device wedged event Raag Jadav
2024-10-25  8:48 ` [PATCH v8 1/4] drm: Introduce " Raag Jadav
2024-10-25  9:08   ` Jani Nikula
2024-10-25 14:45     ` Andy Shevchenko
2024-10-26 15:27       ` Raag Jadav
2024-10-28  8:50     ` Jani Nikula
2024-10-26  6:15   ` kernel test robot
2024-10-26 10:10   ` kernel test robot
2024-10-25  8:48 ` Raag Jadav [this message]
2024-10-29  9:51   ` [PATCH v8 2/4] drm/doc: Document " Christian König
2024-11-01  4:26     ` Raag Jadav
2024-10-25  8:48 ` [PATCH v8 3/4] drm/xe: Use " Raag Jadav
2024-10-25  8:48 ` [PATCH v8 4/4] drm/i915: " Raag Jadav
2024-10-25 10:08 ` ✗ Fi.CI.CHECKPATCH: warning for Introduce DRM device wedged event (rev6) Patchwork
2024-10-25 10:08 ` ✗ Fi.CI.SPARSE: " Patchwork
2024-10-25 10:28 ` ✓ Fi.CI.BAT: success " Patchwork
2024-10-25 14:44 ` ✗ Fi.CI.IGT: failure " Patchwork
