AMD-GFX Archive on lore.kernel.org
From: Raag Jadav <raag.jadav@intel.com>
To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	michal.wajdeczko@intel.com, christian.koenig@amd.com
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
	aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	alexander.deucher@amd.com, andrealmeid@igalia.com,
	amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com,
	Raag Jadav <raag.jadav@intel.com>
Subject: [PATCH v8 1/4] drm: Introduce device wedged event
Date: Fri, 25 Oct 2024 14:18:14 +0530
Message-ID: <20241025084817.144621-2-raag.jadav@intel.com>
In-Reply-To: <20241025084817.144621-1-raag.jadav@intel.com>

Introduce device wedged event, which notifies userspace of the wedged
(hung/unusable) state of the DRM device through a uevent. This is
especially useful in cases where the device is no longer operating as
expected even after a reset and has become unrecoverable from driver
context. The purpose of this implementation is to give drivers a generic
way to recover with the help of userspace intervention, without taking
any drastic measures in the driver.

A 'wedged' device is basically a dead device that needs attention.
The uevent is the notification sent to userspace, along with a hint
about what could be attempted to recover the device and bring it back
to a usable state. Different drivers may have different ideas of a
'wedged' device depending on their hardware implementation, hence the
vendor-agnostic nature of the event. It is up to the drivers to decide
when they see the need for recovery and which of the available methods
they want to use.

Recovery
--------

The current implementation defines two recovery methods, of which
drivers can use either one, both or none. The method(s) of choice are
sent in the uevent environment as ``WEDGED=<method1>[,<method2>]``
in order of fewer to more side effects. If the driver is unsure about
recovery or the method is unknown (like soft/hard reboot, firmware
flashing, hardware replacement or any other procedure which can't
be attempted on the fly), ``WEDGED=none`` is sent instead.
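
As a rough illustration of the value format described above, a consumer
could split the comma-separated list and take the first entry as the
least disruptive option. This is only a sketch: the helper names
wedged_methods and wedged_first_method are hypothetical and not part of
this patch, and empty fields are simply dropped.

```shell
#!/bin/sh

# Print one recovery method per line from a WEDGED value such as
# "rebind,bus-reset". Empty fields are dropped.
wedged_methods() {
    printf '%s\n' "$1" | tr ',' '\n' | sed '/^$/d'
}

# Print the first listed method, which is the one with the fewest side
# effects according to the ordering described above. Falls back to
# "none" when the value is empty.
wedged_first_method() {
    first=$(wedged_methods "$1" | head -n 1)
    printf '%s\n' "${first:-none}"
}
```

For example, ``wedged_first_method "$WEDGED"`` could be called from a
script started by a udev rule, where udev exports the uevent environment
to the program it runs.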

It is the responsibility of the driver to perform the required cleanups
(like disabling system memory access or signalling dma_fences) and
prepare itself for recovery before sending the event. Once the
event is sent, the driver should block all IOCTLs with an error code.
The error code signifies the reason for wedging, which can be reported
to the application if needed.

Userspace consumers can parse this event and attempt recovery as per
the expectations below.

    =============== ==================================
    Recovery method Consumer expectations
    =============== ==================================
    rebind          unbind + rebind driver
    bus-reset       unbind + reset bus device + rebind
    none            admin/user policy
    =============== ==================================

Example for rebind
~~~~~~~~~~~~~~~~~~

Udev rule::

    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", \
    RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

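    #!/bin/sh

    # $1 is the DEVPATH of the DRM card, passed in by the udev rule.
    DEVPATH=$(readlink -f "/sys/$1/device")
    DEVICE=$(basename "$DEVPATH")
    DRIVER=$(readlink -f "$DEVPATH/driver")

    # Unbind the driver, give it a moment to settle, then bind it again.
    echo -n "$DEVICE" > "$DRIVER/unbind"
    sleep 1
    echo -n "$DEVICE" > "$DRIVER/bind"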
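
The bus-reset row of the table follows the same pattern with a reset
step in between. The sketch below only prints the sysfs writes it would
perform, so it can be inspected safely. It assumes a PCI device that
exposes the sysfs ``reset`` attribute; the function name plan_recovery
and its arguments are hypothetical and not part of this patch.

```shell
#!/bin/sh

# Print the sysfs writes for a recovery method without performing them.
# $1: method taken from WEDGED
# $2: device name, e.g. a PCI address
# $3: driver directory in sysfs
plan_recovery() {
    case "$1" in
        rebind)
            printf '%s\n' "echo -n $2 > $3/unbind"
            printf '%s\n' "echo -n $2 > $3/bind"
            ;;
        bus-reset)
            printf '%s\n' "echo -n $2 > $3/unbind"
            # Assumption: the device sits on PCI and supports the
            # sysfs "reset" attribute.
            printf '%s\n' "echo 1 > /sys/bus/pci/devices/$2/reset"
            printf '%s\n' "echo -n $2 > $3/bind"
            ;;
        *)
            # "none" or an unrecognized method: left to admin/user policy.
            return 1
            ;;
    esac
}
```

A non-zero return for ``none`` lets the caller fall through to whatever
the local policy prescribes instead of touching the device.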

Although scripts are simple enough for basic recovery, admins/users
can define customized policies around the recovery action. For example,
if the driver supports multiple recovery methods, consumers can opt for
the suitable one based on policy definition. Consumers can also take
additional steps like gathering telemetry information (devcoredump,
syslog), or keep the device available for further debugging and data
collection before performing the recovery. This is especially useful
when the driver is unsure about recovery or the method is unknown.
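
One possible shape for such a policy, as a minimal sketch: pick the
first advertised method that also appears in an admin-defined allow
list. The function name choose_recovery and the allow-list argument are
hypothetical and not part of this patch.

```shell
#!/bin/sh

# Print the first method advertised in WEDGED that the policy allows,
# or "none" if there is no match.
# $1: WEDGED value, e.g. "rebind,bus-reset"
# $2: comma-separated allow list defined by the admin
choose_recovery() {
    for method in $(printf '%s' "$1" | tr ',' ' '); do
        case ",$2," in
            *",$method,"*)
                printf '%s\n' "$method"
                return 0
                ;;
        esac
    done
    printf 'none\n'
}
```

Because the advertised list is ordered from fewer to more side effects,
taking the first allowed match picks the least disruptive method the
policy permits.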

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases
v8: Allow sending multiple methods with uevent (Lucas, Michal)
    static_assert() globally (Andy)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 51 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  |  7 ++++++
 include/drm/drm_drv.h     |  1 +
 3 files changed, 59 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..ded6327fc242 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,8 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/array_size.h>
+#include <linux/build_bug.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/module.h>
@@ -33,6 +35,7 @@
 #include <linux/mount.h>
 #include <linux/pseudo_fs.h>
 #include <linux/slab.h>
+#include <linux/sprintf.h>
 #include <linux/srcu.h>
 #include <linux/xarray.h>
 
@@ -70,6 +73,16 @@ static struct dentry *drm_debugfs_root;
 
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *const drm_wedge_recovery_opts[] = {
+	[ffs(DRM_WEDGE_RECOVERY_REBIND) - 1]	= "rebind",
+	[ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1]	= "bus-reset",
+};
+static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +510,44 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method(s) to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method from drm_wedge_recovery_opts[] is sent in the uevent
+ * environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more
+ * side-effects. If caller is unsure about recovery or @method is unknown (0),
+ * ``WEDGED=none`` will be sent instead.
+ *
+ * Returns: 0 on success, negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+{
+	unsigned int len, opt, size = ARRAY_SIZE(drm_wedge_recovery_opts);
+	const char *recovery = NULL;
+	/* Event string length up to 24+ characters with available methods */
+	char event_string[32];
+	char *envp[] = { event_string, NULL };
+
+	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
+
+	for_each_set_bit(opt, &method, size) {
+		len += scnprintf(event_string + len, sizeof(event_string) - len,
+				 "%s%s", recovery ? "," : "", drm_wedge_recovery_opts[opt]);
+		recovery = drm_wedge_recovery_opts[opt];
+	}
+
+	if (!recovery)
+		/* Caller is unsure about recovery, do the best we can at this point. */
+		scnprintf(event_string + len, sizeof(event_string) - len, "%s", "none");
+
+	drm_info(dev, "device wedged, needs recovery\n");
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..edf8b200891d 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -21,6 +21,13 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+/*
+ * Recovery methods for wedged device in order of less to more side-effects.
+ * To be used with drm_dev_wedged_event() as recovery @method. Callers can
+ * use any one, multiple (or'd) or none depending on their needs.
+ */
+#define DRM_WEDGE_RECOVERY_REBIND	BIT(0)	/* unbind + rebind driver */
+#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(1)	/* unbind + reset bus device + rebind */
 
 /**
  * enum switch_power_state - power state of drm device
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 02ea4e3248fd..cc7bcb94ad6a 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -461,6 +461,7 @@ void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
-- 
2.34.1

