[PATCH v7 0/5] Introduce DRM device wedged event

* [PATCH v7 0/5] Introduce DRM device wedged event
@ 2024-09-30  7:38 Raag Jadav
  2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Raag Jadav @ 2024-09-30  7:38 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, Raag Jadav

This series introduces a device wedged event in the DRM subsystem and uses
it in the xe and i915 drivers. A detailed description is in the commit
messages.

This was earlier attempted as xe specific uevent in v1 and v2.
https://patchwork.freedesktop.org/series/136909/

v2: Change authorship to Himal (Aravind)
    Add uevent for all device wedged cases (Aravind)

v3: Generic re-implementation in DRM subsystem (Lucas)

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
    Change authorship to Raag (Aravind)

v5: Send recovery method with uevent (Lina)
    Expose supported recovery methods via sysfs (Lucas)

v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)

v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases
    Add documentation to drm-uapi.rst (Sima)

Raag Jadav (5):
  drm: Introduce device wedged event
  drm: Expose wedge recovery methods
  drm/doc: Document device wedged event
  drm/xe: Use device wedged event
  drm/i915: Use device wedged event

 Documentation/gpu/drm-uapi.rst        | 42 +++++++++++++++
 drivers/gpu/drm/drm_drv.c             | 77 +++++++++++++++++++++++++++
 drivers/gpu/drm/drm_sysfs.c           | 22 ++++++++
 drivers/gpu/drm/i915/gt/intel_reset.c |  2 +
 drivers/gpu/drm/i915/i915_driver.c    | 10 ++++
 drivers/gpu/drm/xe/xe_device.c        | 17 +++++-
 drivers/gpu/drm/xe/xe_device.h        |  1 +
 drivers/gpu/drm/xe/xe_pci.c           |  2 +
 include/drm/drm_device.h              | 23 ++++++++
 include/drm/drm_drv.h                 |  3 ++
 10 files changed, 197 insertions(+), 2 deletions(-)

-- 
2.34.1



* [PATCH v7 1/5] drm: Introduce device wedged event
  2024-09-30  7:38 [PATCH v7 0/5] Introduce DRM device wedged event Raag Jadav
@ 2024-09-30  7:38 ` Raag Jadav
  2024-09-30 12:59   ` Andy Shevchenko
                     ` (3 more replies)
  2024-09-30  7:38 ` [PATCH v7 2/5] drm: Expose wedge recovery methods Raag Jadav
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 34+ messages in thread
From: Raag Jadav @ 2024-09-30  7:38 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, Raag Jadav

Introduce device wedged event, which will notify userspace of wedged
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected even after a hardware reset and has become unrecoverable from
driver context.

Purpose of this implementation is to provide drivers a generic way to
recover with the help of userspace intervention. Different drivers may
have different ideas of a "wedged device" depending on their hardware
implementation, and hence the vendor agnostic nature of the event.
It is up to the drivers to decide when they see the need for recovery
and how they want to recover from the available methods.

Current implementation defines three recovery methods, out of which,
drivers can choose to support any one or multiple of them. Preferred
recovery method will be sent in the uevent environment as WEDGED=<method>.
Userspace consumers (sysadmin) can define udev rules to parse this event
and take respective action to recover the device.

    =============== ==================================
    Recovery method Consumer expectations
    =============== ==================================
    rebind          unbind + rebind driver
    bus-reset       unbind + reset bus device + rebind
    reboot          reboot system
    =============== ==================================

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  | 23 ++++++++++++
 include/drm/drm_drv.h     |  3 ++
 3 files changed, 103 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..cfe9600da2ee 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,8 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/array_size.h>
+#include <linux/build_bug.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/module.h>
@@ -33,6 +35,7 @@
 #include <linux/mount.h>
 #include <linux/pseudo_fs.h>
 #include <linux/slab.h>
+#include <linux/sprintf.h>
 #include <linux/srcu.h>
 #include <linux/xarray.h>
 
@@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
 
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *const drm_wedge_recovery_opts[] = {
+	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
+	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
+	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
+};
+
+static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
+{
+	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
+
+	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
+}
+
+/**
+ * drm_wedge_recovery_name - provide wedge recovery name
+ * @method: method to be used for recovery
+ *
+ * This validates wedge recovery @method against the available ones in
+ * drm_wedge_recovery_opts[] and provides respective recovery name in string
+ * format if found valid.
+ *
+ * Returns: pointer to const recovery string on success, NULL otherwise.
+ */
+const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
+{
+	if (drm_wedge_recovery_is_valid(method))
+		return drm_wedge_recovery_opts[method];
+
+	return NULL;
+}
+EXPORT_SYMBOL(drm_wedge_recovery_name);
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
+ * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
+ * userspace may take respective action to recover the device.
+ *
+ * Returns: 0 on success, or negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
+{
+	/* Event string length up to 16+ characters with available methods */
+	char event_string[32] = {};
+	char *envp[] = { event_string, NULL };
+	const char *recovery;
+
+	recovery = drm_wedge_recovery_name(method);
+	if (!recovery) {
+		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
+		return -EINVAL;
+	}
+
+	if (!test_bit(method, &dev->wedge_recovery)) {
+		drm_err(dev, "device wedged, %s based recovery not supported\n",
+			drm_wedge_recovery_name(method));
+		return -EOPNOTSUPP;
+	}
+
+	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
+
+	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..fed6f20e52fb 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -40,6 +40,26 @@ enum switch_power_state {
 	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
 };
 
+/**
+ * enum drm_wedge_recovery - Recovery method for wedged device in order of
+ * severity. To be set as bit fields in drm_device.wedge_recovery variable.
+ * Drivers can choose to support any one or multiple of them depending on
+ * their needs.
+ */
+enum drm_wedge_recovery {
+	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
+	DRM_WEDGE_RECOVERY_REBIND,
+
+	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
+	DRM_WEDGE_RECOVERY_BUS_RESET,
+
+	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
+	DRM_WEDGE_RECOVERY_REBOOT,
+
+	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
+	DRM_WEDGE_RECOVERY_MAX
+};
+
 /**
  * struct drm_device - DRM device structure
  *
@@ -317,6 +337,9 @@ struct drm_device {
 	 * Root directory for debugfs files.
 	 */
 	struct dentry *debugfs_root;
+
+	/** @wedge_recovery: Supported recovery methods for wedged device */
+	unsigned long wedge_recovery;
 };
 
 #endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 02ea4e3248fd..d8dbc77010b0 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
 
+const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
+int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
+
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
  * @dev: DRM device
-- 
2.34.1
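
A minimal sketch of the consumer side described in the commit message above,
assuming the consumer reads raw kernel uevents (for example from a
NETLINK_KOBJECT_UEVENT socket), whose payload is a sequence of NUL-separated
KEY=VALUE strings. The enum and function names below are hypothetical and not
part of this series; only the WEDGED=<method> key and the three method strings
come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of the kernel's enum drm_wedge_recovery. */
enum wedge_method {
	WEDGE_METHOD_REBIND,
	WEDGE_METHOD_BUS_RESET,
	WEDGE_METHOD_REBOOT,
	WEDGE_METHOD_UNKNOWN,	/* WEDGED= missing or value not recognised */
};

/* Method strings as sent by drm_dev_wedged_event() in this patch. */
static const char *const wedge_method_names[] = {
	[WEDGE_METHOD_REBIND] = "rebind",
	[WEDGE_METHOD_BUS_RESET] = "bus-reset",
	[WEDGE_METHOD_REBOOT] = "reboot",
};

/*
 * Scan a uevent payload (NUL-separated strings, @len bytes in total) for a
 * WEDGED=<method> entry and map its value to enum wedge_method.
 */
enum wedge_method wedge_method_from_uevent(const char *buf, size_t len)
{
	static const char key[] = "WEDGED=";
	const size_t key_len = sizeof(key) - 1;
	const size_t count = sizeof(wedge_method_names) / sizeof(wedge_method_names[0]);
	size_t pos = 0;

	assert(buf != NULL);

	while (pos < len) {
		const char *entry = buf + pos;
		const char *end = memchr(entry, '\0', len - pos);
		/* The last entry may lack a terminating NUL. */
		size_t entry_len = end ? (size_t)(end - entry) : len - pos;

		if (entry_len > key_len && !memcmp(entry, key, key_len)) {
			const char *value = entry + key_len;
			size_t value_len = entry_len - key_len;

			for (size_t i = 0; i < count; i++) {
				const char *name = wedge_method_names[i];

				if (strlen(name) == value_len && !memcmp(value, name, value_len))
					return (enum wedge_method)i;
			}
			return WEDGE_METHOD_UNKNOWN;
		}
		pos += entry_len + 1;
	}

	return WEDGE_METHOD_UNKNOWN;
}
```

A udev rule matching on ENV{WEDGED} would likely be the more common route for
a sysadmin; the parser is only meant to show how the value maps back onto the
methods in the table.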



* [PATCH v7 2/5] drm: Expose wedge recovery methods
  2024-09-30  7:38 [PATCH v7 0/5] Introduce DRM device wedged event Raag Jadav
  2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
@ 2024-09-30  7:38 ` Raag Jadav
  2024-09-30 13:01   ` Andy Shevchenko
  2024-09-30  7:38 ` [PATCH v7 3/5] drm/doc: Document device wedged event Raag Jadav
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: Raag Jadav @ 2024-09-30  7:38 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, Raag Jadav

Now that we have device wedged event in place, add wedge_recovery sysfs
attribute which will expose recovery methods supported by the DRM device.
This is useful for userspace consumers in cases where the device supports
multiple recovery methods which can be used as fallbacks.

  $ cat /sys/class/drm/card<N>/wedge_recovery
  rebind
  bus-reset
  reboot

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_sysfs.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index fb3bbb6adcd1..bd77b35ceb8a 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -24,6 +24,7 @@
 #include <drm/drm_accel.h>
 #include <drm/drm_connector.h>
 #include <drm/drm_device.h>
+#include <drm/drm_drv.h>
 #include <drm/drm_file.h>
 #include <drm/drm_modes.h>
 #include <drm/drm_print.h>
@@ -508,6 +509,26 @@ void drm_sysfs_connector_property_event(struct drm_connector *connector,
 }
 EXPORT_SYMBOL(drm_sysfs_connector_property_event);
 
+static ssize_t wedge_recovery_show(struct device *device,
+				   struct device_attribute *attr, char *buf)
+{
+	struct drm_minor *minor = to_drm_minor(device);
+	struct drm_device *dev = minor->dev;
+	unsigned int method, count = DRM_WEDGE_RECOVERY_REBIND;
+
+	for_each_set_bit(method, &dev->wedge_recovery, DRM_WEDGE_RECOVERY_MAX)
+		count += sysfs_emit_at(buf, count, "%s\n", drm_wedge_recovery_name(method));
+
+	return count;
+}
+static DEVICE_ATTR_RO(wedge_recovery);
+
+static struct attribute *minor_dev_attrs[] = {
+	&dev_attr_wedge_recovery.attr,
+	NULL
+};
+ATTRIBUTE_GROUPS(minor_dev);
+
 struct device *drm_sysfs_minor_alloc(struct drm_minor *minor)
 {
 	const char *minor_str;
@@ -532,6 +553,7 @@ struct device *drm_sysfs_minor_alloc(struct drm_minor *minor)
 		kdev->devt = MKDEV(DRM_MAJOR, minor->index);
 		kdev->class = drm_class;
 		kdev->type = &drm_sysfs_device_minor;
+		kdev->groups = minor_dev_groups;
 	}
 
 	kdev->parent = minor->dev->dev;
-- 
2.34.1
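
A rough sketch of how a userspace consumer might use the wedge_recovery
attribute for fallbacks, assuming it has already read the attribute text
(one method name per line, as in the example output above) into a string.
The function names and the fallback policy (try the next more severe
supported method) are assumptions for illustration and are not defined by
this series; the only thing taken from the patches is that the enum lists
the methods in order of severity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of the kernel's enum drm_wedge_recovery. */
enum wedge_method {
	WEDGE_METHOD_REBIND,
	WEDGE_METHOD_BUS_RESET,
	WEDGE_METHOD_REBOOT,
	WEDGE_METHOD_MAX,
};

static const char *const wedge_method_names[] = {
	[WEDGE_METHOD_REBIND] = "rebind",
	[WEDGE_METHOD_BUS_RESET] = "bus-reset",
	[WEDGE_METHOD_REBOOT] = "reboot",
};

/* Turn the newline-separated attribute text into a bitmask of methods. */
unsigned long wedge_recovery_parse(const char *text)
{
	unsigned long mask = 0;

	assert(text != NULL);

	while (*text) {
		size_t line_len = strcspn(text, "\n");

		for (int m = 0; m < WEDGE_METHOD_MAX; m++) {
			const char *name = wedge_method_names[m];

			if (strlen(name) == line_len && !memcmp(text, name, line_len))
				mask |= 1UL << m;
		}

		text += line_len;
		if (*text == '\n')
			text++;
	}

	return mask;
}

/*
 * Return the least severe supported method that is more severe than @failed,
 * or WEDGE_METHOD_MAX if nothing is left to try.
 */
int wedge_recovery_next(unsigned long mask, int failed)
{
	for (int m = failed + 1; m < WEDGE_METHOD_MAX; m++) {
		if (mask & (1UL << m))
			return m;
	}

	return WEDGE_METHOD_MAX;
}
```

With the xe and i915 patches in this series, the attribute would list rebind
and bus-reset, so a failed rebind would fall back to bus-reset and there would
be no further fallback after that.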



* [PATCH v7 3/5] drm/doc: Document device wedged event
  2024-09-30  7:38 [PATCH v7 0/5] Introduce DRM device wedged event Raag Jadav
  2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
  2024-09-30  7:38 ` [PATCH v7 2/5] drm: Expose wedge recovery methods Raag Jadav
@ 2024-09-30  7:38 ` Raag Jadav
  2024-09-30  7:38 ` [PATCH v7 4/5] drm/xe: Use " Raag Jadav
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-09-30  7:38 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, Raag Jadav

Add documentation for device wedged event along with its consumer
expectations. For now it is amended to 'Device reset' chapter, but
with extended functionality in the future it can be refactored into
its own chapter.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 370d820be248..c1186dfd283d 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -313,6 +313,22 @@ driver separately, with no common DRM interface. Ideally this should be properly
 integrated at DRM scheduler to provide a common ground for all drivers. After a
 reset, KMD should reject new command submissions for affected contexts.
 
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected even
+after a hardware reset and has become unrecoverable from driver context.
+Purpose of this implementation is to provide drivers a generic way to recover
+with the help of userspace intervention, and hence the vendor agnostic nature
+of the event.
+
+Different drivers may have different ideas of a "wedged device" depending on
+their hardware implementation. It is up to the drivers to decide when they see
+the need for recovery and how they want to recover from the available methods.
+Current implementation defines three recovery methods, out of which, drivers
+can choose to support any one or multiple of them. Preferred recovery method
+will be sent in the uevent environment as WEDGED=<method>.
+
 User Mode Driver
 ----------------
 
@@ -323,6 +339,32 @@ if the UMD requires it. After detecting a reset, UMD will then proceed to report
 it to the application using the appropriate API error code, as explained in the
 section below about robustness.
 
+On device wedged scenario, userspace will receive a uevent from KMD with
+its preferred recovery method in the uevent environment as WEDGED=<method>.
+Userspace consumers (sysadmin) can define udev rules to parse this event
+and take respective action to recover the device.
+
+.. table:: Wedged Device Recovery
+
+    =============== ==================================
+    Recovery method Consumer expectations
+    =============== ==================================
+    rebind          unbind + rebind driver
+    bus-reset       unbind + reset bus device + rebind
+    reboot          reboot system
+    =============== ==================================
+
+Userspace consumers can optionally read the recovery methods supported by the
+device via ``wedge_recovery`` sysfs attribute::
+
+  $ cat /sys/class/drm/card<N>/wedge_recovery
+  rebind
+  bus-reset
+  reboot
+
+This is useful in cases where the device supports multiple recovery methods
+which can be used as fallbacks.
+
 Robustness
 ----------
 
-- 
2.34.1



* [PATCH v7 4/5] drm/xe: Use device wedged event
  2024-09-30  7:38 [PATCH v7 0/5] Introduce DRM device wedged event Raag Jadav
                   ` (2 preceding siblings ...)
  2024-09-30  7:38 ` [PATCH v7 3/5] drm/doc: Document device wedged event Raag Jadav
@ 2024-09-30  7:38 ` Raag Jadav
  2024-09-30  7:38 ` [PATCH v7 5/5] drm/i915: " Raag Jadav
  2024-09-30  7:47 ` ✗ CI.Patch_applied: failure for Introduce DRM device wedged event (rev5) Patchwork
  5 siblings, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-09-30  7:38 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, Raag Jadav

This was previously attempted as xe specific reset uevent but dropped
in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
as part of refactoring.

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place userspace will be notified of wedged device, on
the basis of which, userspace may take respective action to recover
the device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
    Add uevent for all device wedged cases (Aravind)
v3: Generic re-implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c | 17 +++++++++++++++--
 drivers/gpu/drm/xe/xe_device.h |  1 +
 drivers/gpu/drm/xe/xe_pci.c    |  2 ++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 8e9b551c7033..bbf2052a91ba 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -785,6 +785,15 @@ int xe_device_probe(struct xe_device *xe)
 	return err;
 }
 
+void xe_setup_wedge_recovery(struct xe_device *xe)
+{
+	struct drm_device *dev = &xe->drm;
+
+	/* Support both driver rebind and bus-reset based recovery. */
+	set_bit(DRM_WEDGE_RECOVERY_REBIND, &dev->wedge_recovery);
+	set_bit(DRM_WEDGE_RECOVERY_BUS_RESET, &dev->wedge_recovery);
+}
+
 static void xe_device_remove_display(struct xe_device *xe)
 {
 	xe_display_unregister(xe);
@@ -991,11 +1000,12 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a mudule
+ * This is a final state that can only be cleared with a module
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * by a DRM uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1025,6 +1035,9 @@ void xe_device_declare_wedged(struct xe_device *xe)
 			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
 			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
 			dev_name(xe->drm.dev));
+
+		/* Notify userspace of wedged device */
+		drm_dev_wedged_event(&xe->drm, DRM_WEDGE_RECOVERY_BUS_RESET);
 	}
 
 	for_each_gt(gt, xe, id)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 4c3f0ebe78a9..ca4b3935a982 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -186,6 +186,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
 	return atomic_read(&xe->wedged.flag);
 }
 
+void xe_setup_wedge_recovery(struct xe_device *xe);
 void xe_device_declare_wedged(struct xe_device *xe);
 
 struct xe_file *xe_file_get(struct xe_file *xef);
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index edaeefd2d648..e7a1d59c40a9 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -860,6 +860,8 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto err_driver_cleanup;
 
+	xe_setup_wedge_recovery(xe);
+
 	drm_dbg(&xe->drm, "d3cold: capable=%s\n",
 		str_yes_no(xe->d3cold.capable));
 
-- 
2.34.1
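
For a consumer acting on the uevent shown in the commit message above, one
step is finding the device to unbind and rebind. A minimal sketch, assuming
the DRM card node sits directly under its parent bus device in DEVPATH (as it
does in the udevadm output above); the function name is hypothetical and not
part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the name of the device that owns a DRM card node, i.e. the path
 * component right before "/drm/", into @out. Returns 0 on success, -1 if
 * @devpath has no such component or @out is too small.
 */
int wedge_parent_from_devpath(const char *devpath, char *out, size_t out_len)
{
	const char *drm, *start;
	size_t n;

	assert(devpath != NULL && out != NULL);

	drm = strstr(devpath, "/drm/");
	if (!drm || drm == devpath)
		return -1;

	/* Walk back to the '/' that starts the parent component. */
	start = drm;
	while (start > devpath && start[-1] != '/')
		start--;

	n = (size_t)(drm - start);
	if (n == 0 || n >= out_len)
		return -1;

	memcpy(out, start, n);
	out[n] = '\0';
	return 0;
}
```

For the event above this yields 0000:03:00.0, which is the string a consumer
would write to the driver's unbind and bind files in sysfs to perform the
"rebind" method.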



* [PATCH v7 5/5] drm/i915: Use device wedged event
  2024-09-30  7:38 [PATCH v7 0/5] Introduce DRM device wedged event Raag Jadav
                   ` (3 preceding siblings ...)
  2024-09-30  7:38 ` [PATCH v7 4/5] drm/xe: Use " Raag Jadav
@ 2024-09-30  7:38 ` Raag Jadav
  2024-09-30  7:47 ` ✗ CI.Patch_applied: failure for Introduce DRM device wedged event (rev5) Patchwork
  5 siblings, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-09-30  7:38 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, Raag Jadav

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place, userspace will be notified of wedged device on
gt reset failure.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c |  2 ++
 drivers/gpu/drm/i915/i915_driver.c    | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 8f1ea95471ef..02f357d4e4fb 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1418,6 +1418,8 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
 	if (!test_bit(I915_WEDGED, &gt->reset.flags))
 		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+	else
+		drm_dev_wedged_event(&gt->i915->drm, DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index fe905d65ddf7..389d9fc67eeb 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -711,6 +711,15 @@ static void i915_welcome_messages(struct drm_i915_private *dev_priv)
 			 "DRM_I915_DEBUG_RUNTIME_PM enabled\n");
 }
 
+static void i915_setup_wedge_recovery(struct drm_i915_private *i915)
+{
+	struct drm_device *dev = &i915->drm;
+
+	/* Support both driver rebind and bus-reset based recovery. */
+	set_bit(DRM_WEDGE_RECOVERY_REBIND, &dev->wedge_recovery);
+	set_bit(DRM_WEDGE_RECOVERY_BUS_RESET, &dev->wedge_recovery);
+}
+
 static struct drm_i915_private *
 i915_driver_create(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
@@ -812,6 +821,7 @@ int i915_driver_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	enable_rpm_wakeref_asserts(&i915->runtime_pm);
 
+	i915_setup_wedge_recovery(i915);
 	i915_welcome_messages(i915);
 
 	i915->do_release = true;
-- 
2.34.1



* ✗ CI.Patch_applied: failure for Introduce DRM device wedged event (rev5)
  2024-09-30  7:38 [PATCH v7 0/5] Introduce DRM device wedged event Raag Jadav
                   ` (4 preceding siblings ...)
  2024-09-30  7:38 ` [PATCH v7 5/5] drm/i915: " Raag Jadav
@ 2024-09-30  7:47 ` Patchwork
  5 siblings, 0 replies; 34+ messages in thread
From: Patchwork @ 2024-09-30  7:47 UTC (permalink / raw)
  To: Raag Jadav; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM device wedged event (rev5)
URL   : https://patchwork.freedesktop.org/series/138070/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: cd8c80767801 drm-tip: 2024y-09m-30d-06h-48m-45s UTC integration manifest
=== git am output follows ===
error: patch failed: Documentation/gpu/drm-uapi.rst:313
error: Documentation/gpu/drm-uapi.rst: patch does not apply
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Applying: drm: Introduce device wedged event
Applying: drm: Expose wedge recovery methods
Applying: drm/doc: Document device wedged event
Patch failed at 0003 drm/doc: Document device wedged event
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
@ 2024-09-30 12:59   ` Andy Shevchenko
  2024-10-01  5:08     ` Raag Jadav
  2024-10-01 12:20   ` Michal Wajdeczko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Andy Shevchenko @ 2024-09-30 12:59 UTC (permalink / raw)
  To: Raag Jadav
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
>     =============== ==================================
>     Recovery method Consumer expectations
>     =============== ==================================
>     rebind          unbind + rebind driver
>     bus-reset       unbind + reset bus device + rebind
>     reboot          reboot system
>     =============== ==================================

...

> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};

Place for static_assert() is here, as it closer to the actual data we test...

> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);

...it doesn't fully belong to this function (or only to this function).

> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}

Why do we need this one-liner (after above comment being addressed) as a
separate function?

-- 
With Best Regards,
Andy Shevchenko
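
One possible reading of the two review comments above, written as a
standalone userspace sketch rather than as the actual follow-up patch: the
static_assert() sits right after the array it checks, and the range check is
folded into the name helper instead of living in a separate one-liner. The
local ARRAY_SIZE() definition and the unsigned cast are assumptions made so
the snippet builds outside the kernel tree.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's ARRAY_SIZE() from <linux/array_size.h>. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX
};

static const char *const drm_wedge_recovery_opts[] = {
	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
};
/* Checked next to the data it guards, as suggested in the review. */
static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX,
	      "drm_wedge_recovery_opts[] must cover every recovery method");

const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
{
	/* The unsigned cast rejects negative values as well as values >= MAX. */
	if ((unsigned int)method < DRM_WEDGE_RECOVERY_MAX)
		return drm_wedge_recovery_opts[method];

	return NULL;
}
```

Whether the helper should stay separate is exactly what the review asks, so
this only shows what the merged form could look like.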




* Re: [PATCH v7 2/5] drm: Expose wedge recovery methods
  2024-09-30  7:38 ` [PATCH v7 2/5] drm: Expose wedge recovery methods Raag Jadav
@ 2024-09-30 13:01   ` Andy Shevchenko
  2024-10-01  5:23     ` Raag Jadav
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Shevchenko @ 2024-09-30 13:01 UTC (permalink / raw)
  To: Raag Jadav
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Mon, Sep 30, 2024 at 01:08:42PM +0530, Raag Jadav wrote:
> Now that we have device wedged event in place, add wedge_recovery sysfs
> attribute which will expose recovery methods supported by the DRM device.
> This is useful for userspace consumers in cases where the device supports
> multiple recovery methods which can be used as fallbacks.
> 
>   $ cat /sys/class/drm/card<N>/wedge_recovery
>   rebind
>   bus-reset
>   reboot

...

> +static ssize_t wedge_recovery_show(struct device *device,
> +				   struct device_attribute *attr, char *buf)

Looking at the below line it seems you are fine with 100 limit, so, why two
lines above if they perfectly fit 100?

> +{
> +	struct drm_minor *minor = to_drm_minor(device);
> +	struct drm_device *dev = minor->dev;
> +	unsigned int method, count = DRM_WEDGE_RECOVERY_REBIND;
> +
> +	for_each_set_bit(method, &dev->wedge_recovery, DRM_WEDGE_RECOVERY_MAX)
> +		count += sysfs_emit_at(buf, count, "%s\n", drm_wedge_recovery_name(method));
> +
> +	return count;
> +}

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-09-30 12:59   ` Andy Shevchenko
@ 2024-10-01  5:08     ` Raag Jadav
  2024-10-01 12:07       ` Andy Shevchenko
  0 siblings, 1 reply; 34+ messages in thread
From: Raag Jadav @ 2024-10-01  5:08 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >     =============== ==================================
> >     Recovery method Consumer expectations
> >     =============== ==================================
> >     rebind          unbind + rebind driver
> >     bus-reset       unbind + reset bus device + rebind
> >     reboot          reboot system
> >     =============== ==================================
> 
> ...
> 
> > +/*
> > + * Available recovery methods for wedged device. To be sent along with device
> > + * wedged uevent.
> > + */
> > +static const char *const drm_wedge_recovery_opts[] = {
> > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > +};
> 
> Place for static_assert() is here, as it closer to the actual data we test...

Shouldn't it be at the point of access?
If not, why do we care about the data when it's not being used?

> > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > +{
> > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> 
> ...it doesn't fully belong to this function (or only to this function).

The purpose of having a helper is to have a single point of access, no?

Side note: It also goes well with is_valid() semantic IMHO.

> > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > +}
> 
> Why do we need this one-liner (after above comment being addressed) as a
> separate function?

I'm not sure if I'm following you. Method is not a constant here, we'll get it
on the stack.

Raag


* Re: [PATCH v7 2/5] drm: Expose wedge recovery methods
  2024-09-30 13:01   ` Andy Shevchenko
@ 2024-10-01  5:23     ` Raag Jadav
  0 siblings, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-10-01  5:23 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Mon, Sep 30, 2024 at 04:01:38PM +0300, Andy Shevchenko wrote:
> On Mon, Sep 30, 2024 at 01:08:42PM +0530, Raag Jadav wrote:
> > Now that we have device wedged event in place, add wedge_recovery sysfs
> > attribute which will expose recovery methods supported by the DRM device.
> > This is useful for userspace consumers in cases where the device supports
> > multiple recovery methods which can be used as fallbacks.
> > 
> >   $ cat /sys/class/drm/card<N>/wedge_recovery
> >   rebind
> >   bus-reset
> >   reboot
> 
> ...
> 
> > +static ssize_t wedge_recovery_show(struct device *device,
> > +				   struct device_attribute *attr, char *buf)
> 
> Looking at the below line it seems you are fine with 100 limit, so, why two
> lines above if they perfectly fit 100?

Just trying to avoid another bikeshed about conventions ;)

Raag


* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-01  5:08     ` Raag Jadav
@ 2024-10-01 12:07       ` Andy Shevchenko
  2024-10-01 14:18         ` Raag Jadav
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Shevchenko @ 2024-10-01 12:07 UTC (permalink / raw)
  To: Raag Jadav
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:

...

> > > +static const char *const drm_wedge_recovery_opts[] = {
> > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > +};
> > 
> > Place for static_assert() is here, as it closer to the actual data we test...
> 
> Shouldn't it be at the point of access?

No, the idea of static_assert() is in the word 'static', meaning it's allowed
to be used at global scope.

> If no, why do we care about the data when it's not being used?

What is this supposed to mean? The assertion enforces boundaries that are
defined by different means (the size constant and the real size of the
array).
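
For illustration, a minimal userspace sketch of the placement being argued for
here: the assertion sits at file scope, right after the table it checks. The
names below are hypothetical stand-ins for the identifiers in the patch, and
ARRAY_SIZE() is redefined locally because this is not kernel code.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>

/* Hypothetical stand-in for enum drm_wedge_recovery from the patch. */
enum wedge_recovery {
	WEDGE_RECOVERY_REBIND,
	WEDGE_RECOVERY_BUS_RESET,
	WEDGE_RECOVERY_REBOOT,
	WEDGE_RECOVERY_MAX	/* for bounds checking only */
};

/* Local replacement for the kernel's ARRAY_SIZE() macro. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *const wedge_recovery_opts[] = {
	[WEDGE_RECOVERY_REBIND] = "rebind",
	[WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[WEDGE_RECOVERY_REBOOT] = "reboot",
};
/*
 * File scope: evaluated at compile time next to the data it guards,
 * whether or not any function ever reads the table.
 */
static_assert(ARRAY_SIZE(wedge_recovery_opts) == WEDGE_RECOVERY_MAX,
	      "recovery name table out of sync with enum");
```

Adding an enumerator before WEDGE_RECOVERY_MAX without a matching table entry
would then fail the build at the table itself, not inside a helper.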

...

> > > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > > +{
> > > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> > 
> > ...it doesn't fully belong to this function (or only to this function).
> 
> The purpose of having a helper is to have a single point of access, no?

What single point of access are you talking about? It seems you are trying to
solve a non-existent issue. There is a function that is used exactly once and
it's a one-liner. There is no point in keeping it separate (at least right
now).

> Side note: It also goes well with is_valid() semantic IMHO.

It doesn't matter at all, it's unrelated to the point.

> > > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > > +}
> > 
> > Why do we need this one-liner (after above comment being addressed) as a
> > separate function?
> 
> I'm not sure if I'm following you. Method is not a constant here, we'll get it
> on the stack.

I elaborated above.

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
  2024-09-30 12:59   ` Andy Shevchenko
@ 2024-10-01 12:20   ` Michal Wajdeczko
  2024-10-03 12:23     ` Raag Jadav
  2024-10-17  2:47   ` Raag Jadav
  2024-10-17 19:16   ` André Almeida
  3 siblings, 1 reply; 34+ messages in thread
From: Michal Wajdeczko @ 2024-10-01 12:20 UTC (permalink / raw)
  To: Raag Jadav, airlied, simona, lucas.demarchi, thomas.hellstrom,
	rodrigo.vivi, jani.nikula, andriy.shevchenko, joonas.lahtinen,
	tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper

Hi,

sorry for the late comments,

On 30.09.2024 09:38, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.

what about when the driver just wants to say that it is in an unusable state,
but the recovery method is unknown or recovery is not possible?

> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.

could this be something like below instead:

	WEDGED=<reason>
	RECOVERY=<method>[,<method>]

then the driver will have a chance to tell what happened _and_ additionally
provide hint(s) on how to recover from that situation
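
A rough userspace sketch of how the two proposed environment strings could be
formatted, assuming this two-variable layout were adopted. The helper name and
the reason value used in the usage note are hypothetical; the thread does not
define any reason strings.

```c
#include <stdio.h>

/*
 * Hypothetical helper: fills two buffers with the proposed
 * WEDGED=<reason> and RECOVERY=<method>[,<method>] strings.
 * Returns 0 on success, -1 if either string would be truncated.
 */
static int wedge_format_env(char *wedged, size_t wlen,
			    char *recovery, size_t rlen,
			    const char *reason, const char *methods)
{
	int n;

	n = snprintf(wedged, wlen, "WEDGED=%s", reason);
	if (n < 0 || (size_t)n >= wlen)
		return -1;

	n = snprintf(recovery, rlen, "RECOVERY=%s", methods);
	if (n < 0 || (size_t)n >= rlen)
		return -1;

	return 0;
}
```

Called with a made-up reason such as "example-reason" and the method list
"rebind,bus-reset", this yields the strings WEDGED=example-reason and
RECOVERY=rebind,bus-reset, which a kernel caller would place in the envp array
ahead of the terminating NULL.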

> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
>     =============== ==================================
>     Recovery method Consumer expectations
>     =============== ==================================
>     rebind          unbind + rebind driver
>     bus-reset       unbind + reset bus device + rebind
>     reboot          reboot system

btw, what if the driver detects really broken HW, or has no clue what will
help here, shouldn't we have a "none" method?

>     =============== ==================================
> 
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
>     Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
>     Aesthetic adjustments (Andy)
>     Handle invalid method cases
> 
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h  | 23 ++++++++++++
>  include/drm/drm_drv.h     |  3 ++
>  3 files changed, 103 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..cfe9600da2ee 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
>   * DEALINGS IN THE SOFTWARE.
>   */
>  
> +#include <linux/array_size.h>
> +#include <linux/build_bug.h>
>  #include <linux/debugfs.h>
>  #include <linux/fs.h>
>  #include <linux/module.h>
> @@ -33,6 +35,7 @@
>  #include <linux/mount.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/slab.h>
> +#include <linux/sprintf.h>
>  #include <linux/srcu.h>
>  #include <linux/xarray.h>
>  
> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>  
>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>  
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};
> +
> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> +
> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}
> +
> +/**
> + * drm_wedge_recovery_name - provide wedge recovery name
> + * @method: method to be used for recovery
> + *
> + * This validates wedge recovery @method against the available ones in

do we really need to validate an enum? maybe the problem is that there
is MAX enumerator that just shouldn't be there?

> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> + * format if found valid.
> + *
> + * Returns: pointer to const recovery string on success, NULL otherwise.
> + */
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> +{
> +	if (drm_wedge_recovery_is_valid(method))
> +		return drm_wedge_recovery_opts[method];

as we only have 3 methods, maybe simple switch() will do the work?
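
A sketch of what the switch() variant might look like, written as standalone C
with hypothetical names, not the patch's identifiers. It assumes the MAX
enumerator is dropped, as questioned above.

```c
#include <stddef.h>

/* Hypothetical stand-in for enum drm_wedge_recovery, without a MAX entry. */
enum wedge_recovery {
	WEDGE_RECOVERY_REBIND,
	WEDGE_RECOVERY_BUS_RESET,
	WEDGE_RECOVERY_REBOOT
};

/* Returns the recovery name, or NULL for a value outside the enum. */
static const char *wedge_recovery_name(enum wedge_recovery method)
{
	switch (method) {
	case WEDGE_RECOVERY_REBIND:
		return "rebind";
	case WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case WEDGE_RECOVERY_REBOOT:
		return "reboot";
	}

	return NULL;
}
```

With no default label, compilers that warn about unhandled enumerators would
flag a newly added method that lacks a case, so neither the name table nor the
size assertion is needed in this form.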

> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(drm_wedge_recovery_name);
> +
>  /*
>   * DRM Minors
>   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_dev_unplug);
>  
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + * @method: method to be used for recovery
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev.
> + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)

typo

> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> + * userspace may take respective action to recover the device.
> + *
> + * Returns: 0 on success, or negative error code otherwise.
> + */
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> +{
> +	/* Event string length up to 16+ characters with available methods */
> +	char event_string[32] = {};

magic 32 here, and it likely doesn't need to be initialized with { }
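
One possible way to avoid the magic number, sketched in userspace C: derive the
buffer size from the prefix and the longest method name. The macro and function
names are hypothetical, and the longest name has to be kept in sync by hand in
this form.

```c
#include <stdio.h>

#define WEDGE_EVENT_PREFIX	"WEDGED="
/* Longest of "rebind", "bus-reset" and "reboot". */
#define WEDGE_METHOD_LONGEST	"bus-reset"
/* sizeof on a string literal includes the terminating NUL. */
#define WEDGE_EVENT_LEN		sizeof(WEDGE_EVENT_PREFIX WEDGE_METHOD_LONGEST)

/* Returns 0 on success, -1 if the event string would be truncated. */
static int wedge_event_string(char *buf, size_t len, const char *method)
{
	int n = snprintf(buf, len, WEDGE_EVENT_PREFIX "%s", method);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Because snprintf() NUL-terminates its output whenever the size is non-zero, a
buffer that is always written before being read needs no { } initializer.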

> +	char *envp[] = { event_string, NULL };
> +	const char *recovery;
> +
> +	recovery = drm_wedge_recovery_name(method);
> +	if (!recovery) {
> +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);

maybe use drm_WARN() to see who is abusing the API ?

> +		return -EINVAL;

but shouldn't we still trigger an event with "none" recovery?

> +	}
> +
> +	if (!test_bit(method, &dev->wedge_recovery)) {
> +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> +			drm_wedge_recovery_name(method));

do we really need this kind of guard? it is driver code that will call this
function, so it likely knows better what will work to recover

> +		return -EOPNOTSUPP;
> +	}
> +
> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> +
> +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);

nit:
	drm_info(dev, "device wedged, needs %s to recover\n", recovery);

> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
>  /*
>   * DRM internal mount
>   * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d..fed6f20e52fb 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -40,6 +40,26 @@ enum switch_power_state {
>  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
>  };
>  
> +/**
> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> + * Drivers can choose to support any one or multiple of them depending on
> + * their needs.
> + */
> +enum drm_wedge_recovery {
> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> +	DRM_WEDGE_RECOVERY_REBIND,
> +
> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> +	DRM_WEDGE_RECOVERY_BUS_RESET,
> +
> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> +	DRM_WEDGE_RECOVERY_REBOOT,
> +
> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> +	DRM_WEDGE_RECOVERY_MAX
> +};
> +
>  /**
>   * struct drm_device - DRM device structure
>   *
> @@ -317,6 +337,9 @@ struct drm_device {
>  	 * Root directory for debugfs files.
>  	 */
>  	struct dentry *debugfs_root;
> +
> +	/** @wedge_recovery: Supported recovery methods for wedged device */
> +	unsigned long wedge_recovery;

hmm, so before the driver can ask for a reboot as a recovery method from
wedge it has to somehow add 'reboot' as an available method? why is that?

and if you insist that this is useful then at least document how this
should be initialized (so as not to force developers to look at the
drm_dev_wedged_event code where it's used)
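
As an illustration of the kind of initialization that would need documenting,
here is a small userspace model of the bitmask, with one bit per enum value.
All names are hypothetical; in-kernel code would more likely use set_bit() and
test_bit() on dev->wedge_recovery, as the guard in drm_dev_wedged_event() does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for enum drm_wedge_recovery. */
enum wedge_recovery {
	WEDGE_RECOVERY_REBIND,
	WEDGE_RECOVERY_BUS_RESET,
	WEDGE_RECOVERY_REBOOT,
	WEDGE_RECOVERY_MAX
};

/* Hypothetical model of the one field of struct drm_device that matters here. */
struct fake_drm_device {
	unsigned long wedge_recovery;	/* bit N set => method N supported */
};

/* What a driver would do once per supported method, e.g. at probe time. */
static void wedge_recovery_enable(struct fake_drm_device *dev,
				  enum wedge_recovery method)
{
	dev->wedge_recovery |= 1UL << method;
}

/* The check applied before an event for @method is sent. */
static bool wedge_recovery_supported(const struct fake_drm_device *dev,
				     enum wedge_recovery method)
{
	return (dev->wedge_recovery & (1UL << method)) != 0;
}
```

A device that enables only rebind and reboot ends up with the mask 0x5, and a
request for bus-reset on it fails the support check.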

>  };
>  
>  #endif
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 02ea4e3248fd..d8dbc77010b0 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
>  
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> +
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
>   * @dev: DRM device



* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-01 12:07       ` Andy Shevchenko
@ 2024-10-01 14:18         ` Raag Jadav
  2024-10-01 14:54           ` Andy Shevchenko
  0 siblings, 1 reply; 34+ messages in thread
From: Raag Jadav @ 2024-10-01 14:18 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> 
> ...
> 
> > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > +};
> > > 
> > > Place for static_assert() is here, as it closer to the actual data we test...
> > 
> > Shouldn't it be at the point of access?
> 
> No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> used in the global space.
> 
> > If no, why do we care about the data when it's not being used?
> 
> What does this suppose to mean? The assertion is for enforcing the boundaries
> that are defined by different means (constant of the size and real size of
> an array).

The point was to simply not assert without an active user of the array, which is
not the case now but may be possible with growing functionality in the future.

Raag


* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-01 14:18         ` Raag Jadav
@ 2024-10-01 14:54           ` Andy Shevchenko
  2024-10-01 16:42             ` Raag Jadav
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Shevchenko @ 2024-10-01 14:54 UTC (permalink / raw)
  To: Raag Jadav
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Tue, Oct 01, 2024 at 05:18:33PM +0300, Raag Jadav wrote:
> On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:

...

> > > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > > +};
> > > > 
> > > > Place for static_assert() is here, as it closer to the actual data we test...
> > > 
> > > Shouldn't it be at the point of access?
> > 
> > No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> > used in the global space.
> > 
> > > If no, why do we care about the data when it's not being used?
> > 
> > What does this suppose to mean? The assertion is for enforcing the boundaries
> > that are defined by different means (constant of the size and real size of
> > an array).
> 
> The point was to simply not assert without an active user of the array, which is
> not the case now but may be possible with growing functionality in the future.

static_assert() is a compile-time check. How is it even related to this?
So, i.o.w., you are contradicting yourself in this code: on one hand you want
compile-time static checker, on the other you do not want it and rely on the
usage of the function.

Possible solutions:
1) remove static_assert() completely;
2) move it as I said.

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-01 14:54           ` Andy Shevchenko
@ 2024-10-01 16:42             ` Raag Jadav
  0 siblings, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-10-01 16:42 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, joonas.lahtinen, tursulin, lina, intel-xe, intel-gfx,
	dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Tue, Oct 01, 2024 at 05:54:46PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 01, 2024 at 05:18:33PM +0300, Raag Jadav wrote:
> > On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > > > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> 
> ...
> 
> > > > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > > > +};
> > > > > 
> > > > > Place for static_assert() is here, as it closer to the actual data we test...
> > > > 
> > > > Shouldn't it be at the point of access?
> > > 
> > > No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> > > used in the global space.
> > > 
> > > > If no, why do we care about the data when it's not being used?
> > > 
> > > What does this suppose to mean? The assertion is for enforcing the boundaries
> > > that are defined by different means (constant of the size and real size of
> > > an array).
> > 
> > The point was to simply not assert without an active user of the array, which is
> > not the case now but may be possible with growing functionality in the future.
> 
> static_assert() is a compile-time check. How is it even related to this?

Yes, I understand. Semantically it made more sense to me is all, since core
helpers can always end up in config based ifdeffery.

Anyway, I'll update.

Raag


* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-01 12:20   ` Michal Wajdeczko
@ 2024-10-03 12:23     ` Raag Jadav
  2024-10-08 15:02       ` Raag Jadav
  0 siblings, 1 reply; 34+ messages in thread
From: Raag Jadav @ 2024-10-03 12:23 UTC (permalink / raw)
  To: Michal Wajdeczko
  Cc: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper

On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> Hi,
> 
> sorry for late comments,

Sure, no problem.

> On 30.09.2024 09:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> 
> what about when driver just wants to tell that it is in unusable state,
> but recovery method is unknown or not possible?

Interesting... However, what would be the consumer expectation for it?
If the expectation is to not recover, why send an event at all?

> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> 
> could this be something like below instead:
> 
> 	WEDGED=<reason>
> 	RECOVERY=<method>[,<method>]
> 
> then driver will have a chance to tell what happen _and_ additionally
> provide a hint(s) how to recover from that situation

Documentation/gpu/drm-uapi.rst +337

UMD can issue an ioctl to the KMD to check the reset status

...or <reason> for wedging, which KMD will signify with an error code...

UMD will then proceed to report it to the application using the appropriate
API error code

(should've explicitly added, sorry)

> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >     =============== ==================================
> >     Recovery method Consumer expectations
> >     =============== ==================================
> >     rebind          unbind + rebind driver
> >     bus-reset       unbind + reset bus device + rebind
> >     reboot          reboot system
> 
> btw, what if driver detects a really broken HW, or has no clue what will
> help here, shouldn't we have a "none" method?

Sure. But same as above, we have to define expectations.

> >     =============== ==================================
> > 
> > v4: s/drm_dev_wedged/drm_dev_wedged_event
> >     Use drm_info() (Jani)
> >     Kernel doc adjustment (Aravind)
> > v5: Send recovery method with uevent (Lina)
> > v6: Access wedge_recovery_opts[] using helper function (Jani)
> >     Use snprintf() (Jani)
> > v7: Convert recovery helpers into regular functions (Andy, Jani)
> >     Aesthetic adjustments (Andy)
> >     Handle invalid method cases
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
> >  include/drm/drm_device.h  | 23 ++++++++++++
> >  include/drm/drm_drv.h     |  3 ++
> >  3 files changed, 103 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index ac30b0ec9d93..cfe9600da2ee 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -26,6 +26,8 @@
> >   * DEALINGS IN THE SOFTWARE.
> >   */
> >  
> > +#include <linux/array_size.h>
> > +#include <linux/build_bug.h>
> >  #include <linux/debugfs.h>
> >  #include <linux/fs.h>
> >  #include <linux/module.h>
> > @@ -33,6 +35,7 @@
> >  #include <linux/mount.h>
> >  #include <linux/pseudo_fs.h>
> >  #include <linux/slab.h>
> > +#include <linux/sprintf.h>
> >  #include <linux/srcu.h>
> >  #include <linux/xarray.h>
> >  
> > @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
> >  
> >  DEFINE_STATIC_SRCU(drm_unplug_srcu);
> >  
> > +/*
> > + * Available recovery methods for wedged device. To be sent along with device
> > + * wedged uevent.
> > + */
> > +static const char *const drm_wedge_recovery_opts[] = {
> > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > +};
> > +
> > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > +{
> > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> > +
> > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > +}
> > +
> > +/**
> > + * drm_wedge_recovery_name - provide wedge recovery name
> > + * @method: method to be used for recovery
> > + *
> > + * This validates wedge recovery @method against the available ones in
> 
> do we really need to validate an enum?

I'm all for trusting the drivers explicitly, but since this is a core feature
I thought we'd have some guard rails (for abusers).

> maybe the problem is that there is MAX enumerator that just shouldn't be there?

With MAX in place we won't need to adjust the helpers to match with enum
modifications in the future (if any).

> > + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> > + * format if found valid.
> > + *
> > + * Returns: pointer to const recovery string on success, NULL otherwise.
> > + */
> > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> > +{
> > +	if (drm_wedge_recovery_is_valid(method))
> > +		return drm_wedge_recovery_opts[method];
> 
> as we only have 3 methods, maybe simple switch() will do the work?

Sure.

> > +
> > +	return NULL;
> > +}
> > +EXPORT_SYMBOL(drm_wedge_recovery_name);
> > +
> >  /*
> >   * DRM Minors
> >   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> > @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
> >  }
> >  EXPORT_SYMBOL(drm_dev_unplug);
> >  
> > +/**
> > + * drm_dev_wedged_event - generate a device wedged uevent
> > + * @dev: DRM device
> > + * @method: method to be used for recovery
> > + *
> > + * This generates a device wedged uevent for the DRM device specified by @dev.
> > + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)
> 
> typo

Good catch.

> > + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> > + * userspace may take respective action to recover the device.
> > + *
> > + * Returns: 0 on success, or negative error code otherwise.
> > + */
> > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> > +{
> > +	/* Event string length up to 16+ characters with available methods */
> > +	char event_string[32] = {};
> 
> magic 32 here

Anything to add to the event string length comment above?

> > +	char *envp[] = { event_string, NULL };
> > +	const char *recovery;
> > +
> > +	recovery = drm_wedge_recovery_name(method);
> > +	if (!recovery) {
> > +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
> 
> maybe use drm_WARN() to see who is abusing the API ?

Sure.

> > +		return -EINVAL;
> 
> but shouldn't we still trigger an event with "none" recovery?

Explained above.

> > +	}
> > +
> > +	if (!test_bit(method, &dev->wedge_recovery)) {
> > +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> > +			drm_wedge_recovery_name(method));
> 
> do we really need this kind of guard? it will be a driver code that will
> call this function, so likely it knows better what will work to recover

Agreed, although an unsupported method could cause undefined behaviour.

> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> > +
> > +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
> 
> nit:
> 	drm_info(dev, "device wedged, needs %s to recover\n", recovery);

Sure.

> > +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> > +}
> > +EXPORT_SYMBOL(drm_dev_wedged_event);
> > +
> >  /*
> >   * DRM internal mount
> >   * We want to be able to allocate our own "struct address_space" to control
> > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > index c91f87b5242d..fed6f20e52fb 100644
> > --- a/include/drm/drm_device.h
> > +++ b/include/drm/drm_device.h
> > @@ -40,6 +40,26 @@ enum switch_power_state {
> >  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
> >  };
> >  
> > +/**
> > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > + * Drivers can choose to support any one or multiple of them depending on
> > + * their needs.
> > + */
> > +enum drm_wedge_recovery {
> > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > +	DRM_WEDGE_RECOVERY_REBIND,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > +	DRM_WEDGE_RECOVERY_REBOOT,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > +	DRM_WEDGE_RECOVERY_MAX
> > +};
> > +
> >  /**
> >   * struct drm_device - DRM device structure
> >   *
> > @@ -317,6 +337,9 @@ struct drm_device {
> >  	 * Root directory for debugfs files.
> >  	 */
> >  	struct dentry *debugfs_root;
> > +
> > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > +	unsigned long wedge_recovery;
> 
> hmm, so before the driver can ask for a reboot as a recovery method from
> wedge it has to somehow add 'reboot' as an available method? why is that?

It's for consumers to use as fallbacks in case the preferred recovery method
(sent along with uevent) doesn't work out. (patch 2/5)

> and if you insist that this is useful then at least document how this
> should be initialized (to avoid forcing developers to look at
> drm_dev_wedged_event code where it's used)

Sure.

Raag

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-03 12:23     ` Raag Jadav
@ 2024-10-08 15:02       ` Raag Jadav
  2024-10-10 13:02         ` Lucas De Marchi
  0 siblings, 1 reply; 34+ messages in thread
From: Raag Jadav @ 2024-10-08 15:02 UTC (permalink / raw)
  To: Michal Wajdeczko, lucas.demarchi
  Cc: airlied, simona, thomas.hellstrom, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, joonas.lahtinen, tursulin, lina, intel-xe,
	intel-gfx, dri-devel, himal.prasad.ghimiray, francois.dugast,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper

On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
> On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> > On 30.09.2024 09:38, Raag Jadav wrote:
> > >  
> > > +/**
> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > + * Drivers can choose to support any one or multiple of them depending on
> > > + * their needs.
> > > + */
> > > +enum drm_wedge_recovery {
> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > +	DRM_WEDGE_RECOVERY_MAX
> > > +};
> > > +
> > >  /**
> > >   * struct drm_device - DRM device structure
> > >   *
> > > @@ -317,6 +337,9 @@ struct drm_device {
> > >  	 * Root directory for debugfs files.
> > >  	 */
> > >  	struct dentry *debugfs_root;
> > > +
> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > +	unsigned long wedge_recovery;
> > 
> > hmm, so before the driver can ask for a reboot as a recovery method from
> > wedge it has to somehow add 'reboot' as an available method? why is that?
> 
> It's for consumers to use as fallbacks in case the preferred recovery method
> (sent along with uevent) doesn't work out. (patch 2/5)

On second thought...

Lucas, do we have a convincing enough use case for fallback recovery?
If <method> were to fail, I would expect there to be even bigger problems
like a kernel crash or an unrecoverable hardware failure.

At that point, is it worth retrying?

Raag

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-08 15:02       ` Raag Jadav
@ 2024-10-10 13:02         ` Lucas De Marchi
  2024-10-11  8:47           ` Raag Jadav
  0 siblings, 1 reply; 34+ messages in thread
From: Lucas De Marchi @ 2024-10-10 13:02 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Michal Wajdeczko, airlied, simona, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper

On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote:
>On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
>> On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
>> > On 30.09.2024 09:38, Raag Jadav wrote:
>> > >
>> > > +/**
>> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
>> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
>> > > + * Drivers can choose to support any one or multiple of them depending on
>> > > + * their needs.
>> > > + */
>> > > +enum drm_wedge_recovery {
>> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
>> > > +	DRM_WEDGE_RECOVERY_REBIND,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
>> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
>> > > +	DRM_WEDGE_RECOVERY_REBOOT,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
>> > > +	DRM_WEDGE_RECOVERY_MAX
>> > > +};
>> > > +
>> > >  /**
>> > >   * struct drm_device - DRM device structure
>> > >   *
>> > > @@ -317,6 +337,9 @@ struct drm_device {
>> > >  	 * Root directory for debugfs files.
>> > >  	 */
>> > >  	struct dentry *debugfs_root;
>> > > +
>> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
>> > > +	unsigned long wedge_recovery;
>> >
>> > hmm, so before the driver can ask for a reboot as a recovery method from
>> > wedge it has to somehow add 'reboot' as an available method? why is that?
>>
>> It's for consumers to use as fallbacks in case the preferred recovery method
>> (sent along with uevent) doesn't work out. (patch 2/5)
>
>On second thought...
>
>Lucas, do we have a convincing enough use case for fallback recovery?
>If <method> were to fail, I would expect there to be even bigger problems
>like a kernel crash or an unrecoverable hardware failure.
>
>At that point, is it worth retrying?

When we were talking about this, I brought up allowing the driver to
report which wedge recovery mechanisms are supported when the
notification is sent. It was not intended as a fallback mechanism.

So if the driver sends a notification with:

	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT

it means any of these would be suitable, with the first being the option
with the fewest side effects. I don't think we are advising userspace to
use a fallback, just informing it of what the driver/device supports.
Depending on the error, the driver may leave only

	DRM_WEDGE_RECOVERY_REBOOT

That name could actually be DRM_WEDGE_RECOVERY_NONE, because in that
state the driver doesn't really know what can be done to recover.
With that we can drop _MAX and use _NONE for bounds checking. I think
we can also omit it in the notification, as it's clear:

	WEDGED
	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET

This means the driver can use any of these options to recover

	WEDGED
	DRM_WEDGE_RECOVERY_BUS_RESET

only bus reset would fix it

	WEDGED
	
driver doesn't know anything that could fix it. It may be a soft-reboot,
hard-reboot, firmware flashing etc... We just don't know.
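
To make the proposal above concrete, here is a rough sketch of how such a
notification string could be built from a bitmask of methods. The
comma-separated WEDGED=<list> format, the empty value for the "nothing
known" case, and the names wedge_method, wedge_names[] and wedged_format()
are all assumptions for illustration; none of this is in the patch, which
uses plain enum indices and a single method per event.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bit flags; the patch under review uses plain enum indices. */
enum wedge_method {
	WEDGE_REBIND    = 1 << 0,
	WEDGE_BUS_RESET = 1 << 1,
};

/* Ordered from least to most disruptive. */
static const struct {
	unsigned long bit;
	const char *name;
} wedge_names[] = {
	{ WEDGE_REBIND,    "rebind" },
	{ WEDGE_BUS_RESET, "bus-reset" },
};

/*
 * Format "WEDGED=<m1>,<m2>" into buf. An empty mask yields "WEDGED="
 * (no known recovery). Returns the string length, or -1 if buf is too small.
 */
static int wedged_format(unsigned long methods, char *buf, size_t size)
{
	size_t i;
	int len, n;

	assert(buf && size > 0);

	len = snprintf(buf, size, "WEDGED=");
	if (len < 0 || (size_t)len >= size)
		return -1;

	for (i = 0; i < sizeof(wedge_names) / sizeof(wedge_names[0]); i++) {
		if (!(methods & wedge_names[i].bit))
			continue;

		/* Separate entries with a comma after the first one. */
		n = snprintf(buf + len, size - len, "%s%s",
			     buf[len - 1] == '=' ? "" : ",", wedge_names[i].name);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += n;
	}

	return len;
}
```

With a list in the event itself, a consumer would not need a separate sysfs
file to discover which methods apply to a given wedge.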

Lucas De Marchi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-10 13:02         ` Lucas De Marchi
@ 2024-10-11  8:47           ` Raag Jadav
  0 siblings, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-10-11  8:47 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: Michal Wajdeczko, airlied, simona, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper

On Thu, Oct 10, 2024 at 08:02:10AM -0500, Lucas De Marchi wrote:
> On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote:
> > On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
> > > On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> > > > On 30.09.2024 09:38, Raag Jadav wrote:
> > > > >
> > > > > +/**
> > > > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > > > + * Drivers can choose to support any one or multiple of them depending on
> > > > > + * their needs.
> > > > > + */
> > > > > +enum drm_wedge_recovery {
> > > > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > > > +	DRM_WEDGE_RECOVERY_MAX
> > > > > +};
> > > > > +
> > > > >  /**
> > > > >   * struct drm_device - DRM device structure
> > > > >   *
> > > > > @@ -317,6 +337,9 @@ struct drm_device {
> > > > >  	 * Root directory for debugfs files.
> > > > >  	 */
> > > > >  	struct dentry *debugfs_root;
> > > > > +
> > > > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > > > +	unsigned long wedge_recovery;
> > > >
> > > > hmm, so before the driver can ask for a reboot as a recovery method from
> > > > wedge it has to somehow add 'reboot' as an available method? why is that?
> > > 
> > > It's for consumers to use as fallbacks in case the preferred recovery method
> > > (sent along with uevent) doesn't work out. (patch 2/5)
> > 
> > On second thought...
> > 
> > Lucas, do we have a convincing enough use case for fallback recovery?
> > If <method> were to fail, I would expect there to be even bigger problems
> > like a kernel crash or an unrecoverable hardware failure.
> > 
> > At that point, is it worth retrying?
> 
> When we were talking about this, I brought up allowing the driver to
> report which wedge recovery mechanisms are supported when the
> notification is sent. It was not intended as a fallback mechanism.
> 
> So if the driver sends a notification with:
> 
> 	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT
> 
> it means any of these would be suitable, with the first being the option
> with the fewest side effects. I don't think we are advising userspace to
> use a fallback, just informing it of what the driver/device supports.
> Depending on the error, the driver may leave only
> 
> 	DRM_WEDGE_RECOVERY_REBOOT
> 
> That name could actually be DRM_WEDGE_RECOVERY_NONE, because in that
> state the driver doesn't really know what can be done to recover.
> With that we can drop _MAX and use _NONE for bounds checking. I think
> we can also omit it in the notification, as it's clear:
> 
> 	WEDGED
> 	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET
> 
> This means the driver can use any of these options to recover
> 
> 	WEDGED
> 	DRM_WEDGE_RECOVERY_BUS_RESET
> 
> only bus reset would fix it
> 
> 	WEDGED
> 	
> driver doesn't know anything that could fix it. It may be a soft-reboot,
> hard-reboot, firmware flashing etc... We just don't know.

With this I think we can drop sysfs.
(Already too many ABIs to deal with)

Raag

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
  2024-09-30 12:59   ` Andy Shevchenko
  2024-10-01 12:20   ` Michal Wajdeczko
@ 2024-10-17  2:47   ` Raag Jadav
  2024-10-17  7:59     ` Christian König
  2024-10-17 19:16   ` André Almeida
  3 siblings, 1 reply; 34+ messages in thread
From: Raag Jadav @ 2024-10-17  2:47 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
	christian.koenig, friedrich.vock, michel, joshua,
	alexander.deucher, andrealmeid, amd-gfx

On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
>     =============== ==================================
>     Recovery method Consumer expectations
>     =============== ==================================
>     rebind          unbind + rebind driver
>     bus-reset       unbind + reset bus device + rebind
>     reboot          reboot system
>     =============== ==================================
> 
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
>     Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
>     Aesthetic adjustments (Andy)
>     Handle invalid method cases
> 
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---

Cc'ing amd, collabora and others as I found semi-related work at

https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/
https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/
https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/


Please share feedback about usefulness and adoption of this.
Improvements are welcome.
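
For anyone evaluating this from the consumer side, below is a minimal
sketch of how a userspace listener might map the WEDGED=<method> entry from
the uevent environment to an action. It assumes only the three method names
defined in this patch; the enum wedged_action and wedged_parse() names are
hypothetical, and how the environment entry is obtained (udev rule, netlink
listener, etc.) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side view of the three methods in this patch. */
enum wedged_action {
	WEDGED_ACTION_UNKNOWN = -1,
	WEDGED_ACTION_REBIND,
	WEDGED_ACTION_BUS_RESET,
	WEDGED_ACTION_REBOOT,
};

/* Map one uevent environment entry, e.g. "WEDGED=rebind", to an action. */
static enum wedged_action wedged_parse(const char *env)
{
	static const char prefix[] = "WEDGED=";
	/* Same order as enum wedged_action, so the index is the action. */
	static const char *const names[] = { "rebind", "bus-reset", "reboot" };
	size_t i;

	assert(env);

	/* Not a wedged entry at all. */
	if (strncmp(env, prefix, sizeof(prefix) - 1))
		return WEDGED_ACTION_UNKNOWN;

	env += sizeof(prefix) - 1;
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!strcmp(env, names[i]))
			return (enum wedged_action)i;
	}

	/* Unrecognized or empty method: let the caller decide what to do. */
	return WEDGED_ACTION_UNKNOWN;
}
```

Returning an explicit "unknown" value rather than guessing lets a consumer
fall back to logging or alerting when it sees a method it does not know.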

Raag

>  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h  | 23 ++++++++++++
>  include/drm/drm_drv.h     |  3 ++
>  3 files changed, 103 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..cfe9600da2ee 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
>   * DEALINGS IN THE SOFTWARE.
>   */
>  
> +#include <linux/array_size.h>
> +#include <linux/build_bug.h>
>  #include <linux/debugfs.h>
>  #include <linux/fs.h>
>  #include <linux/module.h>
> @@ -33,6 +35,7 @@
>  #include <linux/mount.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/slab.h>
> +#include <linux/sprintf.h>
>  #include <linux/srcu.h>
>  #include <linux/xarray.h>
>  
> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>  
>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>  
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};
> +
> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> +
> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}
> +
> +/**
> + * drm_wedge_recovery_name - provide wedge recovery name
> + * @method: method to be used for recovery
> + *
> + * This validates wedge recovery @method against the available ones in
> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> + * format if found valid.
> + *
> + * Returns: pointer to const recovery string on success, NULL otherwise.
> + */
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> +{
> +	if (drm_wedge_recovery_is_valid(method))
> +		return drm_wedge_recovery_opts[method];
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(drm_wedge_recovery_name);
> +
>  /*
>   * DRM Minors
>   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_dev_unplug);
>  
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + * @method: method to be used for recovery
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev.
> + * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> + * userspace may take respective action to recover the device.
> + *
> + * Returns: 0 on success, or negative error code otherwise.
> + */
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> +{
> +	/* Event string length up to 16+ characters with available methods */
> +	char event_string[32] = {};
> +	char *envp[] = { event_string, NULL };
> +	const char *recovery;
> +
> +	recovery = drm_wedge_recovery_name(method);
> +	if (!recovery) {
> +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
> +		return -EINVAL;
> +	}
> +
> +	if (!test_bit(method, &dev->wedge_recovery)) {
> +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> +			drm_wedge_recovery_name(method));
> +		return -EOPNOTSUPP;
> +	}
> +
> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> +
> +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
>  /*
>   * DRM internal mount
>   * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d..fed6f20e52fb 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -40,6 +40,26 @@ enum switch_power_state {
>  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
>  };
>  
> +/**
> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> + * Drivers can choose to support any one or multiple of them depending on
> + * their needs.
> + */
> +enum drm_wedge_recovery {
> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> +	DRM_WEDGE_RECOVERY_REBIND,
> +
> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> +	DRM_WEDGE_RECOVERY_BUS_RESET,
> +
> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> +	DRM_WEDGE_RECOVERY_REBOOT,
> +
> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> +	DRM_WEDGE_RECOVERY_MAX
> +};
> +
>  /**
>   * struct drm_device - DRM device structure
>   *
> @@ -317,6 +337,9 @@ struct drm_device {
>  	 * Root directory for debugfs files.
>  	 */
>  	struct dentry *debugfs_root;
> +
> +	/** @wedge_recovery: Supported recovery methods for wedged device */
> +	unsigned long wedge_recovery;
>  };
>  
>  #endif
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 02ea4e3248fd..d8dbc77010b0 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
>  
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> +
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
>   * @dev: DRM device
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-17  2:47   ` Raag Jadav
@ 2024-10-17  7:59     ` Christian König
  2024-10-17 16:43       ` Rodrigo Vivi
  0 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2024-10-17  7:59 UTC (permalink / raw)
  To: Raag Jadav, airlied, simona, lucas.demarchi, thomas.hellstrom,
	rodrigo.vivi, jani.nikula, andriy.shevchenko, joonas.lahtinen,
	tursulin, lina
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
	friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
	amd-gfx

Am 17.10.24 um 04:47 schrieb Raag Jadav:
> On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
>> Introduce device wedged event, which will notify userspace of wedged
>> (hanged/unusable) state of the DRM device through a uevent. This is
>> useful especially in cases where the device is no longer operating as
>> expected even after a hardware reset and has become unrecoverable from
>> driver context.

Well, "introduce" is probably the wrong wording, since i915 already has that
and amdgpu looked into it but never upstreamed the support.

I would rather say standardize.

>>
>> Purpose of this implementation is to provide drivers a generic way to
>> recover with the help of userspace intervention. Different drivers may
>> have different ideas of a "wedged device" depending on their hardware
>> implementation, and hence the vendor agnostic nature of the event.
>> It is up to the drivers to decide when they see the need for recovery
>> and how they want to recover from the available methods.
>>
>> Current implementation defines three recovery methods, out of which,
>> drivers can choose to support any one or multiple of them. Preferred
>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>> Userspace consumers (sysadmin) can define udev rules to parse this event
>> and take respective action to recover the device.
>>
>>      =============== ==================================
>>      Recovery method Consumer expectations
>>      =============== ==================================
>>      rebind          unbind + rebind driver
>>      bus-reset       unbind + reset bus device + rebind
>>      reboot          reboot system
>>      =============== ==================================

Well that sounds like userspace would need to be involved in recovery.

That in turn is a complete no-go, since we at least need to signal all
dma_fences to unblock the kernel. In other words, things like a bus reset
need to happen inside the kernel and *not* in userspace.

What we can do is signal to userspace: hey, a bus reset of device X
happened, maybe restart the container, daemon, or whatever service was
using this device.

Regards,
Christian.

>>
>> v4: s/drm_dev_wedged/drm_dev_wedged_event
>>      Use drm_info() (Jani)
>>      Kernel doc adjustment (Aravind)
>> v5: Send recovery method with uevent (Lina)
>> v6: Access wedge_recovery_opts[] using helper function (Jani)
>>      Use snprintf() (Jani)
>> v7: Convert recovery helpers into regular functions (Andy, Jani)
>>      Aesthetic adjustments (Andy)
>>      Handle invalid method cases
>>
>> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>> ---
> Cc'ing amd, collabora and others as I found semi-related work at
>
> https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
> https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/
> https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/
> https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/
>
>
> Please share feedback about usefulness and adoption of this.
> Improvements are welcome.
>
> Raag
>
>>   drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
>>   include/drm/drm_device.h  | 23 ++++++++++++
>>   include/drm/drm_drv.h     |  3 ++
>>   3 files changed, 103 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index ac30b0ec9d93..cfe9600da2ee 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -26,6 +26,8 @@
>>    * DEALINGS IN THE SOFTWARE.
>>    */
>>   
>> +#include <linux/array_size.h>
>> +#include <linux/build_bug.h>
>>   #include <linux/debugfs.h>
>>   #include <linux/fs.h>
>>   #include <linux/module.h>
>> @@ -33,6 +35,7 @@
>>   #include <linux/mount.h>
>>   #include <linux/pseudo_fs.h>
>>   #include <linux/slab.h>
>> +#include <linux/sprintf.h>
>>   #include <linux/srcu.h>
>>   #include <linux/xarray.h>
>>   
>> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>>   
>>   DEFINE_STATIC_SRCU(drm_unplug_srcu);
>>   
>> +/*
>> + * Available recovery methods for wedged device. To be sent along with device
>> + * wedged uevent.
>> + */
>> +static const char *const drm_wedge_recovery_opts[] = {
>> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
>> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
>> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
>> +};
>> +
>> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
>> +{
>> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
>> +
>> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
>> +}
>> +
>> +/**
>> + * drm_wedge_recovery_name - provide wedge recovery name
>> + * @method: method to be used for recovery
>> + *
>> + * This validates wedge recovery @method against the available ones in
>> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
>> + * format if found valid.
>> + *
>> + * Returns: pointer to const recovery string on success, NULL otherwise.
>> + */
>> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
>> +{
>> +	if (drm_wedge_recovery_is_valid(method))
>> +		return drm_wedge_recovery_opts[method];
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(drm_wedge_recovery_name);
>> +
>>   /*
>>    * DRM Minors
>>    * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
>> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>>   }
>>   EXPORT_SYMBOL(drm_dev_unplug);
>>   
>> +/**
>> + * drm_dev_wedged_event - generate a device wedged uevent
>> + * @dev: DRM device
>> + * @method: method to be used for recovery
>> + *
>> + * This generates a device wedged uevent for the DRM device specified by @dev.
>> + * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
>> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
>> + * userspace may take respective action to recover the device.
>> + *
>> + * Returns: 0 on success, or negative error code otherwise.
>> + */
>> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
>> +{
>> +	/* Event string length up to 16+ characters with available methods */
>> +	char event_string[32] = {};
>> +	char *envp[] = { event_string, NULL };
>> +	const char *recovery;
>> +
>> +	recovery = drm_wedge_recovery_name(method);
>> +	if (!recovery) {
>> +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!test_bit(method, &dev->wedge_recovery)) {
>> +		drm_err(dev, "device wedged, %s based recovery not supported\n",
>> +			drm_wedge_recovery_name(method));
>> +		return -EOPNOTSUPP;
>> +	}
>> +
>> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
>> +
>> +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
>> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>> +}
>> +EXPORT_SYMBOL(drm_dev_wedged_event);
>> +
>>   /*
>>    * DRM internal mount
>>    * We want to be able to allocate our own "struct address_space" to control
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index c91f87b5242d..fed6f20e52fb 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -40,6 +40,26 @@ enum switch_power_state {
>>   	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
>>   };
>>   
>> +/**
>> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
>> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
>> + * Drivers can choose to support any one or multiple of them depending on
>> + * their needs.
>> + */
>> +enum drm_wedge_recovery {
>> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
>> +	DRM_WEDGE_RECOVERY_REBIND,
>> +
>> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
>> +	DRM_WEDGE_RECOVERY_BUS_RESET,
>> +
>> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
>> +	DRM_WEDGE_RECOVERY_REBOOT,
>> +
>> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
>> +	DRM_WEDGE_RECOVERY_MAX
>> +};
>> +
>>   /**
>>    * struct drm_device - DRM device structure
>>    *
>> @@ -317,6 +337,9 @@ struct drm_device {
>>   	 * Root directory for debugfs files.
>>   	 */
>>   	struct dentry *debugfs_root;
>> +
>> +	/** @wedge_recovery: Supported recovery methods for wedged device */
>> +	unsigned long wedge_recovery;
>>   };
>>   
>>   #endif
>> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>> index 02ea4e3248fd..d8dbc77010b0 100644
>> --- a/include/drm/drm_drv.h
>> +++ b/include/drm/drm_drv.h
>> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
>>   void drm_dev_exit(int idx);
>>   void drm_dev_unplug(struct drm_device *dev);
>>   
>> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
>> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
>> +
>>   /**
>>    * drm_dev_is_unplugged - is a DRM device unplugged
>>    * @dev: DRM device
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-17  7:59     ` Christian König
@ 2024-10-17 16:43       ` Rodrigo Vivi
  2024-10-18 10:58         ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Rodrigo Vivi @ 2024-10-17 16:43 UTC (permalink / raw)
  To: Christian König
  Cc: Raag Jadav, airlied, simona, lucas.demarchi, thomas.hellstrom,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
	friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
	amd-gfx

On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> Am 17.10.24 um 04:47 schrieb Raag Jadav:
> > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> > > Introduce device wedged event, which will notify userspace of wedged
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected even after a hardware reset and has become unrecoverable from
> > > driver context.
> 
> Well introduce is probably the wrong wording since i915 already has that and
> amdgpu looked into it but never upstreamed the support.

In i915 we have the reset and error uevents, but not one specific for 'wedge'.
This would indeed be a new one.

> 
> I would rather say standardize.
> 
> > > 
> > > Purpose of this implementation is to provide drivers a generic way to
> > > recover with the help of userspace intervention. Different drivers may
> > > have different ideas of a "wedged device" depending on their hardware
> > > implementation, and hence the vendor agnostic nature of the event.
> > > It is up to the drivers to decide when they see the need for recovery
> > > and how they want to recover from the available methods.
> > > 
> > > Current implementation defines three recovery methods, out of which,
> > > drivers can choose to support any one or multiple of them. Preferred
> > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > and take respective action to recover the device.
> > > 
> > >      =============== ==================================
> > >      Recovery method Consumer expectations
> > >      =============== ==================================
> > >      rebind          unbind + rebind driver
> > >      bus-reset       unbind + reset bus device + rebind
> > >      reboot          reboot system
> > >      =============== ==================================
> 
> Well that sounds like userspace would need to be involved in recovery.
> 
> That in turn is a complete no-go since we at least need to signal all
> dma_fences to unblock the kernel. In other words things like bus reset needs
> to happen inside the kernel and *not* in userspace.
> 
> What we can do is to signal to userspace: Hey a bus reset of device X
> happened, maybe restart container, daemon, whatever service which was using
> this device.

Well, when we declare device 'wedged' it is because we don't want to take
any drastic measures inside the kernel and want to leave it in a protected
and unusable state. In a way that users wouldn't lose display for instance,
or at least the device is in a debuggable state.

Then, the instructions here are to tell what could possibly be attempted
from userspace to get the device to a usable state.

The 'wedge' mode (the one emitting this uevent) needs to be responsible
for signaling all the fences and everything needed for a clean unbind
and whatever next step might be indicated to userspace.

That should already be part of any wedged mode, regardless of the uevent
to inform the userspace here.
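
As a rough illustration of what could be attempted from userspace for the
bus-reset hint in the table quoted above, the sketch below only builds the
sysfs paths a consumer would write to, in order. It assumes a PCI device and
the standard PCI sysfs files (drivers/<name>/unbind, devices/<address>/reset,
drivers/<name>/bind). The struct, the function name and the example driver
and address are made up for this sketch and are not part of the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WEDGE_PATH_MAX 128

/* Hypothetical helper type: sysfs files a "bus-reset" consumer writes to. */
struct bus_reset_plan {
	char unbind[WEDGE_PATH_MAX]; /* step 1: write the PCI address here */
	char reset[WEDGE_PATH_MAX];  /* step 2: write "1" here */
	char bind[WEDGE_PATH_MAX];   /* step 3: write the PCI address here */
};

/*
 * Fill in the three paths for a PCI device at address @bdf bound to @driver.
 * Returns 0 on success, -1 if any path would not fit in its buffer.
 * Nothing is written to sysfs; this only prepares the strings.
 */
int bus_reset_plan_init(struct bus_reset_plan *p, const char *driver,
			const char *bdf)
{
	int n;

	assert(p != NULL && driver != NULL && bdf != NULL);

	n = snprintf(p->unbind, sizeof(p->unbind),
		     "/sys/bus/pci/drivers/%s/unbind", driver);
	if (n < 0 || (size_t)n >= sizeof(p->unbind))
		return -1;

	n = snprintf(p->reset, sizeof(p->reset),
		     "/sys/bus/pci/devices/%s/reset", bdf);
	if (n < 0 || (size_t)n >= sizeof(p->reset))
		return -1;

	n = snprintf(p->bind, sizeof(p->bind),
		     "/sys/bus/pci/drivers/%s/bind", driver);
	if (n < 0 || (size_t)n >= sizeof(p->bind))
		return -1;

	return 0;
}
```

A consumer would call something like bus_reset_plan_init(&p, "xe",
"0000:03:00.0"), then write the address to p.unbind, "1" to p.reset and the
address again to p.bind. The reset file only exists when the PCI core has a
reset method for the device, so a real tool would need to check for it first.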

> 
> Regards,
> Christian.
> 
> > > 
> > > v4: s/drm_dev_wedged/drm_dev_wedged_event
> > >      Use drm_info() (Jani)
> > >      Kernel doc adjustment (Aravind)
> > > v5: Send recovery method with uevent (Lina)
> > > v6: Access wedge_recovery_opts[] using helper function (Jani)
> > >      Use snprintf() (Jani)
> > > v7: Convert recovery helpers into regular functions (Andy, Jani)
> > >      Aesthetic adjustments (Andy)
> > >      Handle invalid method cases
> > > 
> > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > ---
> > Cc'ing amd, collabora and others as I found semi-related work at
> > 
> > https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
> > https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/
> > https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/
> > https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/
> > 
> > 
> > Please share feedback about usefulness and adoption of this.
> > Improvements are welcome.
> > 
> > Raag
> > 
> > >   drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
> > >   include/drm/drm_device.h  | 23 ++++++++++++
> > >   include/drm/drm_drv.h     |  3 ++
> > >   3 files changed, 103 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index ac30b0ec9d93..cfe9600da2ee 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -26,6 +26,8 @@
> > >    * DEALINGS IN THE SOFTWARE.
> > >    */
> > > +#include <linux/array_size.h>
> > > +#include <linux/build_bug.h>
> > >   #include <linux/debugfs.h>
> > >   #include <linux/fs.h>
> > >   #include <linux/module.h>
> > > @@ -33,6 +35,7 @@
> > >   #include <linux/mount.h>
> > >   #include <linux/pseudo_fs.h>
> > >   #include <linux/slab.h>
> > > +#include <linux/sprintf.h>
> > >   #include <linux/srcu.h>
> > >   #include <linux/xarray.h>
> > > @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
> > >   DEFINE_STATIC_SRCU(drm_unplug_srcu);
> > > +/*
> > > + * Available recovery methods for wedged device. To be sent along with device
> > > + * wedged uevent.
> > > + */
> > > +static const char *const drm_wedge_recovery_opts[] = {
> > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > +};
> > > +
> > > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > > +{
> > > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> > > +
> > > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > > +}
> > > +
> > > +/**
> > > + * drm_wedge_recovery_name - provide wedge recovery name
> > > + * @method: method to be used for recovery
> > > + *
> > > + * This validates wedge recovery @method against the available ones in
> > > + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> > > + * format if found valid.
> > > + *
> > > + * Returns: pointer to const recovery string on success, NULL otherwise.
> > > + */
> > > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> > > +{
> > > +	if (drm_wedge_recovery_is_valid(method))
> > > +		return drm_wedge_recovery_opts[method];
> > > +
> > > +	return NULL;
> > > +}
> > > +EXPORT_SYMBOL(drm_wedge_recovery_name);
> > > +
> > >   /*
> > >    * DRM Minors
> > >    * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> > > @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
> > >   }
> > >   EXPORT_SYMBOL(drm_dev_unplug);
> > > +/**
> > > + * drm_dev_wedged_event - generate a device wedged uevent
> > > + * @dev: DRM device
> > > + * @method: method to be used for recovery
> > > + *
> > > + * This generates a device wedged uevent for the DRM device specified by @dev.
> > > + * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
> > > + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> > > + * userspace may take respective action to recover the device.
> > > + *
> > > + * Returns: 0 on success, or negative error code otherwise.
> > > + */
> > > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> > > +{
> > > +	/* Event string length up to 16+ characters with available methods */
> > > +	char event_string[32] = {};
> > > +	char *envp[] = { event_string, NULL };
> > > +	const char *recovery;
> > > +
> > > +	recovery = drm_wedge_recovery_name(method);
> > > +	if (!recovery) {
> > > +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	if (!test_bit(method, &dev->wedge_recovery)) {
> > > +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> > > +			drm_wedge_recovery_name(method));
> > > +		return -EOPNOTSUPP;
> > > +	}
> > > +
> > > +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> > > +
> > > +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
> > > +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> > > +}
> > > +EXPORT_SYMBOL(drm_dev_wedged_event);
> > > +
> > >   /*
> > >    * DRM internal mount
> > >    * We want to be able to allocate our own "struct address_space" to control
> > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > index c91f87b5242d..fed6f20e52fb 100644
> > > --- a/include/drm/drm_device.h
> > > +++ b/include/drm/drm_device.h
> > > @@ -40,6 +40,26 @@ enum switch_power_state {
> > >   	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
> > >   };
> > > +/**
> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > + * Drivers can choose to support any one or multiple of them depending on
> > > + * their needs.
> > > + */
> > > +enum drm_wedge_recovery {
> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > +	DRM_WEDGE_RECOVERY_MAX
> > > +};
> > > +
> > >   /**
> > >    * struct drm_device - DRM device structure
> > >    *
> > > @@ -317,6 +337,9 @@ struct drm_device {
> > >   	 * Root directory for debugfs files.
> > >   	 */
> > >   	struct dentry *debugfs_root;
> > > +
> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > +	unsigned long wedge_recovery;
> > >   };
> > >   #endif
> > > diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> > > index 02ea4e3248fd..d8dbc77010b0 100644
> > > --- a/include/drm/drm_drv.h
> > > +++ b/include/drm/drm_drv.h
> > > @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
> > >   void drm_dev_exit(int idx);
> > >   void drm_dev_unplug(struct drm_device *dev);
> > > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> > > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> > > +
> > >   /**
> > >    * drm_dev_is_unplugged - is a DRM device unplugged
> > >    * @dev: DRM device
> > > -- 
> > > 2.34.1
> > > 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
                     ` (2 preceding siblings ...)
  2024-10-17  2:47   ` Raag Jadav
@ 2024-10-17 19:16   ` André Almeida
  2024-10-18 14:56     ` Rodrigo Vivi
  2024-10-19 19:08     ` Raag Jadav
  3 siblings, 2 replies; 34+ messages in thread
From: André Almeida @ 2024-10-17 19:16 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, rodrigo.vivi, thomas.hellstrom, simona, intel-gfx,
	joonas.lahtinen, dri-devel, himal.prasad.ghimiray, lucas.demarchi,
	tursulin, francois.dugast, jani.nikula, airlied,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper,
	andriy.shevchenko, lina, kernel-dev, Alex Deucher,
	Christian König

Hi Raag,

Em 30/09/2024 04:38, Raag Jadav escreveu:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
>      =============== ==================================
>      Recovery method Consumer expectations
>      =============== ==================================
>      rebind          unbind + rebind driver
>      bus-reset       unbind + reset bus device + rebind
>      reboot          reboot system
>      =============== ==================================
> 
>

I proposed something similar in the past: 
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/

The motivation was that amdgpu was getting stuck after every GPU reset,
and there was just a black screen. The uevent would then trigger a
daemon to reset the compositor and get things back together. As you
can see in my thread, the feature was blocked in favor of getting better
overall GPU reset from the kernel side.

Which kinds of scenarios make i915/xe need userspace involvement? I
tested a bunch of resets in i915 but never managed to get the driver
stuck.

For the bus-reset, amdgpu does that too, but it doesn't require 
userspace intervention.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-17 16:43       ` Rodrigo Vivi
@ 2024-10-18 10:58         ` Christian König
  2024-10-18 12:46           ` Raag Jadav
  0 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2024-10-18 10:58 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Raag Jadav, airlied, simona, lucas.demarchi, thomas.hellstrom,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
	friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
	amd-gfx

Am 17.10.24 um 18:43 schrieb Rodrigo Vivi:
> On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
>>>> Purpose of this implementation is to provide drivers a generic way to
>>>> recover with the help of userspace intervention. Different drivers may
>>>> have different ideas of a "wedged device" depending on their hardware
>>>> implementation, and hence the vendor agnostic nature of the event.
>>>> It is up to the drivers to decide when they see the need for recovery
>>>> and how they want to recover from the available methods.
>>>>
>>>> Current implementation defines three recovery methods, out of which,
>>>> drivers can choose to support any one or multiple of them. Preferred
>>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>>>> Userspace consumers (sysadmin) can define udev rules to parse this event
>>>> and take respective action to recover the device.
>>>>
>>>>       =============== ==================================
>>>>       Recovery method Consumer expectations
>>>>       =============== ==================================
>>>>       rebind          unbind + rebind driver
>>>>       bus-reset       unbind + reset bus device + rebind
>>>>       reboot          reboot system
>>>>       =============== ==================================
>> Well that sounds like userspace would need to be involved in recovery.
>>
>> That in turn is a complete no-go since we at least need to signal all
>> dma_fences to unblock the kernel. In other words things like bus reset needs
>> to happen inside the kernel and *not* in userspace.
>>
>> What we can do is to signal to userspace: Hey a bus reset of device X
>> happened, maybe restart container, daemon, whatever service which was using
>> this device.
> Well, when we declare device 'wedged' it is because we don't want to take
> any drastic measures inside the kernel and want to leave it in a protected
> and unusable state. In a way that users wouldn't lose display for instance,
> or at least the device is in a debuggable state.

Uff, that needs to be very very well documented or otherwise the whole 
approach is an absolutely clear NAK from my side as DMA-buf maintainer.

>
> Then, the instructions here are to tell what could possibly be attempted
> from userspace to get the device to a usable state.
>
> The 'wedge' mode (the one emitting this uevent) needs to be responsible
> for signaling all the fences and everything needed for a clean unbind
> and whatever next step might be indicated to userspace.
>
> That should already be part of any wedged mode, regardless of the uevent
> to inform the userspace here.

You need to approach that from a different side. With the current patch 
set you are ignoring documented mandatory driver behavior as far as I 
can see.

So first of all describe in the documentation what the wedged mode is 
and what requirements a driver has to fulfill to enter it: 
https://docs.kernel.org/gpu/drm-uapi.html#device-reset

Especially document that all system memory accesses of the device need
to be blocked by (for example) disabling DMA accesses in the PCI config
space.

When it is guaranteed that the device can't access any system memory any
more the device driver should signal all pending fences of this device.

And only after all of that is done the driver can send a uevent to
inform userspace that it can debug the hanged state.
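
The ordering described above (block DMA, signal fences, only then notify
userspace) can be modelled as a small state check. This is a userspace toy,
not kernel code: struct wedge_state and the wedge_*() functions are
hypothetical names for this sketch only. In a real driver the steps would be
things like disabling bus mastering and signaling the pending dma_fences.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a wedged device; none of these names exist in DRM. */
struct wedge_state {
	bool dma_blocked;     /* device can no longer touch system memory */
	bool fences_signaled; /* all pending fences have been signaled */
	bool event_sent;      /* userspace has been notified */
};

/* Step 1: cut off the device's access to system memory. */
void wedge_block_dma(struct wedge_state *s)
{
	assert(s != NULL);
	s->dma_blocked = true;
}

/* Step 2: signal pending fences; refused until DMA is blocked. */
int wedge_signal_fences(struct wedge_state *s)
{
	assert(s != NULL);
	if (!s->dma_blocked)
		return -EBUSY;
	s->fences_signaled = true;
	return 0;
}

/* Step 3: notify userspace; refused until steps 1 and 2 are both done. */
int wedge_send_event(struct wedge_state *s)
{
	assert(s != NULL);
	if (!s->dma_blocked || !s->fences_signaled)
		return -EBUSY;
	s->event_sent = true;
	return 0;
}
```

With a zero-initialised state, wedge_send_event() and wedge_signal_fences()
both return -EBUSY until wedge_block_dma() has run, which is the point of the
ordering: the event is the last thing to happen, never the first.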

As far as I can see this makes the enum how to recover the device 
superfluous because you will most likely always need a bus reset to get 
out of this again.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-18 10:58         ` Christian König
@ 2024-10-18 12:46           ` Raag Jadav
  2024-10-18 12:54             ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Raag Jadav @ 2024-10-18 12:46 UTC (permalink / raw)
  To: Christian König
  Cc: Rodrigo Vivi, airlied, simona, lucas.demarchi, thomas.hellstrom,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
	friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
	amd-gfx

On Fri, Oct 18, 2024 at 12:58:09PM +0200, Christian König wrote:
> Am 17.10.24 um 18:43 schrieb Rodrigo Vivi:
> > On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> > > > > Purpose of this implementation is to provide drivers a generic way to
> > > > > recover with the help of userspace intervention. Different drivers may
> > > > > have different ideas of a "wedged device" depending on their hardware
> > > > > implementation, and hence the vendor agnostic nature of the event.
> > > > > It is up to the drivers to decide when they see the need for recovery
> > > > > and how they want to recover from the available methods.
> > > > > 
> > > > > Current implementation defines three recovery methods, out of which,
> > > > > drivers can choose to support any one or multiple of them. Preferred
> > > > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > > > and take respective action to recover the device.
> > > > > 
> > > > >       =============== ==================================
> > > > >       Recovery method Consumer expectations
> > > > >       =============== ==================================
> > > > >       rebind          unbind + rebind driver
> > > > >       bus-reset       unbind + reset bus device + rebind
> > > > >       reboot          reboot system
> > > > >       =============== ==================================
> > > Well that sounds like userspace would need to be involved in recovery.
> > > 
> > > That in turn is a complete no-go since we at least need to signal all
> > > dma_fences to unblock the kernel. In other words things like bus reset needs
> > > to happen inside the kernel and *not* in userspace.
> > > 
> > > What we can do is to signal to userspace: Hey a bus reset of device X
> > > happened, maybe restart container, daemon, whatever service which was using
> > > this device.
> > Well, when we declare device 'wedged' it is because we don't want to take
> > any drastic measures inside the kernel and want to leave it in a protected
> > and unusable state. In a way that users wouldn't lose display for instance,
> > or at least the device is in a debuggable state.
> 
> Uff, that needs to be very very well documented or otherwise the whole
> approach is an absolutely clear NAK from my side as DMA-buf maintainer.
> 
> > 
> > Then, the instructions here are to tell what could possibly be attempted
> > from userspace to get the device to a usable state.
> > 
> > The 'wedge' mode (the one emitting this uevent) needs to be responsible
> > for signaling all the fences and everything needed for a clean unbind
> > and whatever next step might be indicated to userspace.
> > 
> > That should already be part of any wedged mode, regardless of the uevent
> > to inform the userspace here.
> 
> You need to approach that from a different side. With the current patch set
> you are ignoring documented mandatory driver behavior as far as I can see.
> 
> So first of all describe in the documentation what the wedged mode is and
> what requirements a driver has to fulfill to enter it:
> https://docs.kernel.org/gpu/drm-uapi.html#device-reset
>
> Especially document that all system memory accesses of the device need to
> be blocked by (for example) disabling DMA accesses in the PCI config space.
> 
> When it is guaranteed that the device can't access any system memory any
> more the device driver should signal all pending fences of this device.
> 
> And only after all of that is done the driver can send a uevent to inform
> userspace that it can debug the hanged state.

Sure, will do.

> As far as I can see this makes the enum how to recover the device
> superfluous because you will most likely always need a bus reset to get out
> of this again.

That depends on the kind of fault the device has encountered and the bus it is
sitting on. There could be buses that don't support reset.

Raag

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-18 12:46           ` Raag Jadav
@ 2024-10-18 12:54             ` Christian König
  2024-10-18 14:09               ` Raag Jadav
  0 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2024-10-18 12:54 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Rodrigo Vivi, airlied, simona, lucas.demarchi, thomas.hellstrom,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
	friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
	amd-gfx

Am 18.10.24 um 14:46 schrieb Raag Jadav:
>> As far as I can see this makes the enum how to recover the device
>> superfluous because you will most likely always need a bus reset to get out
>> of this again.
> That depends on the kind of fault the device has encountered and the bus it is
> sitting on. There could be buses that don't support reset.

That is even more of an argument to not expose this in the uevent.

Getting the device working again is strongly device dependent and can't 
be handled in a generic way.

Regards,
Christian.

>
> Raag

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-18 12:54             ` Christian König
@ 2024-10-18 14:09               ` Raag Jadav
  0 siblings, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-10-18 14:09 UTC (permalink / raw)
  To: Christian König
  Cc: Rodrigo Vivi, airlied, simona, lucas.demarchi, thomas.hellstrom,
	jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
	matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
	friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
	amd-gfx

On Fri, Oct 18, 2024 at 02:54:38PM +0200, Christian König wrote:
> Am 18.10.24 um 14:46 schrieb Raag Jadav:
> > > As far as I can see this makes the enum how to recover the device
> > > superfluous because you will most likely always need a bus reset to get out
> > > of this again.
> > That depends on the kind of fault the device has encountered and the bus it is
> > sitting on. There could be buses that don't support reset.
> 
> That is even more of an argument to not expose this in the uevent.
> 
> Getting the device working again is strongly device dependent and can't be
> handled in a generic way.

My understanding is that the proposed methods can be handled in a generic way
and are useful for the devices that do support them. This way the userspace can
at least have a hint about recovery.

For others we can have something like WEDGED=none (as proposed by Michal and
Lucas in other threads) and let admin/user decide how to deal with it.
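
To make the "hint" idea concrete, here is a minimal sketch of how a userspace
consumer might turn the WEDGED=<method> string from the uevent environment
into something it can act on. The three method strings come from the patch.
"none" is only the value proposed in this discussion and is not part of v7.
The enum and function names are invented for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Userspace-side view of the recovery hint; names are illustrative only. */
enum wedged_hint {
	WEDGED_HINT_REBIND,
	WEDGED_HINT_BUS_RESET,
	WEDGED_HINT_REBOOT,
	WEDGED_HINT_NONE,    /* proposed "none" value, an assumption here */
	WEDGED_HINT_UNKNOWN, /* missing, malformed or unrecognised value */
};

static const char *const wedged_hint_names[] = {
	[WEDGED_HINT_REBIND]    = "rebind",
	[WEDGED_HINT_BUS_RESET] = "bus-reset",
	[WEDGED_HINT_REBOOT]    = "reboot",
	[WEDGED_HINT_NONE]      = "none",
};

/* Same idea as the static_assert() in the patch: keep table and enum in sync. */
static_assert(sizeof(wedged_hint_names) / sizeof(wedged_hint_names[0]) ==
	      WEDGED_HINT_UNKNOWN, "name table out of sync with enum");

/* Parse one environment entry such as "WEDGED=bus-reset". */
enum wedged_hint wedged_parse(const char *env)
{
	static const char key[] = "WEDGED=";
	int i;

	if (!env || strncmp(env, key, sizeof(key) - 1) != 0)
		return WEDGED_HINT_UNKNOWN;

	env += sizeof(key) - 1;
	for (i = 0; i < WEDGED_HINT_UNKNOWN; i++) {
		if (strcmp(env, wedged_hint_names[i]) == 0)
			return (enum wedged_hint)i;
	}

	return WEDGED_HINT_UNKNOWN;
}
```

Treating anything unrecognised as WEDGED_HINT_UNKNOWN lets an older consumer
keep working if more methods are added later: it can log the event and leave
the decision to the admin, much like the "none" case.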

Raag

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-17 19:16   ` André Almeida
@ 2024-10-18 14:56     ` Rodrigo Vivi
  2024-10-18 15:31       ` Alex Deucher
  2024-10-19 19:08     ` Raag Jadav
  1 sibling, 1 reply; 34+ messages in thread
From: Rodrigo Vivi @ 2024-10-18 14:56 UTC (permalink / raw)
  To: André Almeida
  Cc: Raag Jadav, intel-xe, thomas.hellstrom, simona, intel-gfx,
	joonas.lahtinen, dri-devel, himal.prasad.ghimiray, lucas.demarchi,
	tursulin, francois.dugast, jani.nikula, airlied,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper,
	andriy.shevchenko, lina, kernel-dev, Alex Deucher,
	Christian König

On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
> 
> Em 30/09/2024 04:38, Raag Jadav escreveu:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >      =============== ==================================
> >      Recovery method Consumer expectations
> >      =============== ==================================
> >      rebind          unbind + rebind driver
> >      bus-reset       unbind + reset bus device + rebind
> >      reboot          reboot system
> >      =============== ==================================
> > 
> > 
> 
> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
> 
> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and get things back together. As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.
> 
> Which kinds of scenarios make i915/xe need userspace involvement? I tested
> a bunch of resets in i915 but never managed to get the driver stuck.

2 scenarios:

1. Multiple levels of reset have failed and the device was declared wedged.
This is rare indeed as the resets improved a lot.
2. Debug case. We can boot the driver with an option to declare the device
wedged at any timeout, so the device can be debugged.

> 
> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> intervention.

How do you trigger that?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-18 14:56     ` Rodrigo Vivi
@ 2024-10-18 15:31       ` Alex Deucher
  2024-10-18 17:56         ` André Almeida
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Deucher @ 2024-10-18 15:31 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: André Almeida, Raag Jadav, intel-xe, thomas.hellstrom,
	simona, intel-gfx, joonas.lahtinen, dri-devel,
	himal.prasad.ghimiray, lucas.demarchi, tursulin, francois.dugast,
	jani.nikula, airlied, aravind.iddamsetty, anshuman.gupta,
	andi.shyti, matthew.d.roper, andriy.shevchenko, lina, kernel-dev,
	Alex Deucher, Christian König

On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>
> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > Hi Raag,
> >
> > Em 30/09/2024 04:38, Raag Jadav escreveu:
> > > Introduce device wedged event, which will notify userspace of wedged
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected even after a hardware reset and has become unrecoverable from
> > > driver context.
> > >
> > > Purpose of this implementation is to provide drivers a generic way to
> > > recover with the help of userspace intervention. Different drivers may
> > > have different ideas of a "wedged device" depending on their hardware
> > > implementation, and hence the vendor agnostic nature of the event.
> > > It is up to the drivers to decide when they see the need for recovery
> > > and how they want to recover from the available methods.
> > >
> > > Current implementation defines three recovery methods, out of which,
> > > drivers can choose to support any one or multiple of them. Preferred
> > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > and take respective action to recover the device.
> > >
> > >      =============== ==================================
> > >      Recovery method Consumer expectations
> > >      =============== ==================================
> > >      rebind          unbind + rebind driver
> > >      bus-reset       unbind + reset bus device + rebind
> > >      reboot          reboot system
> > >      =============== ==================================
> > >
> > >
> >
> > I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
> >
> > The motivation was that amdgpu was getting stuck after every GPU reset, and
> > there was just a black screen. The uevent would then trigger a daemon to
> > reset the compositor and get things back together. As you can see in my
> > thread, the feature was blocked in favor of getting better overall GPU reset
> > from the kernel side.
> >
> > Which kinds of scenarios make i915/xe need userspace involvement? I tested
> > a bunch of resets in i915 but never managed to get the driver stuck.
>
> 2 scenarios:
>
> 1. Multiple levels of reset have failed and the device was declared wedged.
> This is rare indeed, as the resets have improved a lot.
> 2. Debug case. We can boot the driver with an option to declare the device
> wedged at any timeout, so the device can be debugged.
>
> >
> > For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > intervention.
>
> How do you trigger that?

What do you mean by bus reset?  I think Christian is just referring
to a full adapter reset (as opposed to a queue reset or something more
fine-grained).  The driver can reset the device via MMIO or firmware,
depending on the device.  I think there are also PCI helpers for
things like PCI FLR.

Alex

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-18 15:31       ` Alex Deucher
@ 2024-10-18 17:56         ` André Almeida
  2024-10-18 21:07           ` Alex Deucher
  0 siblings, 1 reply; 34+ messages in thread
From: André Almeida @ 2024-10-18 17:56 UTC (permalink / raw)
  To: Alex Deucher, Rodrigo Vivi
  Cc: Raag Jadav, intel-xe, thomas.hellstrom, simona, intel-gfx,
	joonas.lahtinen, dri-devel, himal.prasad.ghimiray, lucas.demarchi,
	tursulin, francois.dugast, jani.nikula, airlied,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper,
	andriy.shevchenko, lina, kernel-dev, Alex Deucher,
	Christian König

On 18/10/2024 12:31, Alex Deucher wrote:
> What do you mean by bus reset?  I think Christian is just referring
> to a full adapter reset (as opposed to a queue reset or something more
> fine-grained).  The driver can reset the device via MMIO or firmware,
> depending on the device.  I think there are also PCI helpers for
> things like PCI FLR.
> 

I was referring to AMD_RESET_PCI:

"Does a full bus reset using core Linux subsystem PCI reset and does a 
secondary bus reset or FLR, depending on what the underlying hardware 
supports."

And that can be triggered by passing `amdgpu_reset_method=5` as a module
option.


* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-18 17:56         ` André Almeida
@ 2024-10-18 21:07           ` Alex Deucher
  2024-10-24 17:48             ` Rodrigo Vivi
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Deucher @ 2024-10-18 21:07 UTC (permalink / raw)
  To: André Almeida
  Cc: Rodrigo Vivi, Raag Jadav, intel-xe, thomas.hellstrom, simona,
	intel-gfx, joonas.lahtinen, dri-devel, himal.prasad.ghimiray,
	lucas.demarchi, tursulin, francois.dugast, jani.nikula, airlied,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper,
	andriy.shevchenko, lina, kernel-dev, Alex Deucher,
	Christian König

On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> I was referring to AMD_RESET_PCI:
>
> "Does a full bus reset using core Linux subsystem PCI reset and does a
> secondary bus reset or FLR, depending on what the underlying hardware
> supports."
>
> And that can be triggered by using `amdgpu_reset_method=5` as the module
> option.
>

That option doesn't actually do anything useful on most AMD GPUs.  We
don't support FLR on most boards and SBR doesn't work once the driver
has been loaded except for really old chips.  That said, internally
these all end up being mode1 or mode2 resets which the driver can
trigger directly and which are the defaults.

Alex

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-17 19:16   ` André Almeida
  2024-10-18 14:56     ` Rodrigo Vivi
@ 2024-10-19 19:08     ` Raag Jadav
  1 sibling, 0 replies; 34+ messages in thread
From: Raag Jadav @ 2024-10-19 19:08 UTC (permalink / raw)
  To: André Almeida
  Cc: intel-xe, rodrigo.vivi, thomas.hellstrom, simona, intel-gfx,
	joonas.lahtinen, dri-devel, himal.prasad.ghimiray, lucas.demarchi,
	tursulin, francois.dugast, jani.nikula, airlied,
	aravind.iddamsetty, anshuman.gupta, andi.shyti, matthew.d.roper,
	andriy.shevchenko, lina, kernel-dev, Alex Deucher,
	Christian König

On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
>
> I proposed something similar in the past:
> https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/

Thanks for sharing. I went through it and I think we can reuse some of the
ideas, adapted to be generic.

While we can always execute scripts on uevent, it'd be good to have a userspace
daemon applying automated policies for wedge cases based on admin/user needs.
This way we can also manage repeat offenders.
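
To make the daemon idea a bit more concrete, here is a rough sketch of what
the policy core could look like. Only the WEDGED=<method> key and the three
method names come from this series. Everything else is an assumption for
illustration: the WedgePolicy class, the max_repeats knob, and the idea of
escalating one step for repeat offenders are all hypothetical. A real daemon
would also have to check which methods the device actually supports before
escalating.

```python
from collections import defaultdict
from typing import Optional

# Recovery methods named in the patch description, ordered from least to
# most disruptive. Using this order for escalation is an assumption.
METHODS = ["rebind", "bus-reset", "reboot"]


def parse_wedged(uevent_env: str) -> Optional[str]:
    """Return the WEDGED=<method> value from a NUL- or newline-separated
    uevent environment, or None if it is absent or not a known method."""
    for entry in uevent_env.replace("\0", "\n").splitlines():
        key, sep, value = entry.partition("=")
        if sep and key == "WEDGED":
            return value if value in METHODS else None
    return None


class WedgePolicy:
    """Hypothetical policy: honour the driver's preferred method, but move
    one step up the list after every max_repeats events from one device."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.counts = defaultdict(int)  # wedge events seen per device

    def decide(self, device: str, method: str) -> str:
        self.counts[device] += 1
        extra = (self.counts[device] - 1) // self.max_repeats
        index = min(METHODS.index(method) + extra, len(METHODS) - 1)
        return METHODS[index]
```

With max_repeats=2, a device that keeps reporting WEDGED=rebind would get two
rebinds, then two bus resets, then a reboot. Whether escalation like this is
wanted at all is a policy question for the admin, not something the kernel
side decides.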

Xe has devcoredump so telemetry would also be a nice addition.

Great opportunity to collaborate here.

> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and get things working again. As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.

We have hardware-level resets, but (although rarely) they're also prone to failure.
We do what we can to recover from driver context, but that adds to the complexity
over time. Something like wedging, if done right, would be much more robust IMHO.

Raag

* Re: [PATCH v7 1/5] drm: Introduce device wedged event
  2024-10-18 21:07           ` Alex Deucher
@ 2024-10-24 17:48             ` Rodrigo Vivi
  0 siblings, 0 replies; 34+ messages in thread
From: Rodrigo Vivi @ 2024-10-24 17:48 UTC (permalink / raw)
  To: Alex Deucher
  Cc: André Almeida, Raag Jadav, intel-xe, thomas.hellstrom,
	simona, intel-gfx, joonas.lahtinen, dri-devel,
	himal.prasad.ghimiray, lucas.demarchi, tursulin, francois.dugast,
	jani.nikula, airlied, aravind.iddamsetty, anshuman.gupta,
	andi.shyti, matthew.d.roper, andriy.shevchenko, lina, kernel-dev,
	Alex Deucher, Christian König

On Fri, Oct 18, 2024 at 05:07:22PM -0400, Alex Deucher wrote:
> That option doesn't actually do anything useful on most AMD GPUs.  We
> don't support FLR on most boards and SBR doesn't work once the driver
> has been loaded except for really old chips.  That said, internally
> these all end up being mode1 or mode2 resets which the driver can
> trigger directly and which are the defaults.

Okay, this is the same for us then.
And this is the main reason that we have this option:
- unbind + reset bus device + rebind

Unbind by itself needs to be a supported and working case regardless
of the reset state. Then this sequence should be fine.

AFAIK there's no way that the driver itself could call for the bus
reset.
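
For illustration, the userspace side of that sequence could be sketched
roughly as below. This assumes a PCI device and assumes that the standard
PCI sysfs attributes (the driver's unbind/bind files and the device's reset
file) are the mechanism a consumer would use; the series itself only says
"unbind + reset bus device + rebind". The function names, the BDF and the
driver name are made up for the example.

```python
def bus_reset_plan(bdf: str, driver: str):
    """Return the ordered (path, value) sysfs writes for
    unbind + reset bus device + rebind of one PCI device."""
    drv = f"/sys/bus/pci/drivers/{driver}"
    dev = f"/sys/bus/pci/devices/{bdf}"
    return [
        (f"{drv}/unbind", bdf),  # detach the driver from the device
        (f"{dev}/reset", "1"),   # ask the PCI core to reset the device
        (f"{drv}/bind", bdf),    # reattach the driver
    ]


def apply_plan(plan, write=None):
    """Perform the writes in order. Any exception from a write propagates,
    so later steps are skipped if an earlier one fails."""
    def sysfs_write(path, value):
        with open(path, "w") as f:
            f.write(value)

    write = write or sysfs_write
    for path, value in plan:
        write(path, value)
```

For example, bus_reset_plan("0000:03:00.0", "xe") gives the three writes in
order, and apply_plan() takes an optional write callback so the plan can be
dry-run without touching sysfs. Which kind of reset the PCI core actually
performs for the reset file depends on what the device supports.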

> 
> Alex

end of thread, other threads:[~2024-10-24 17:49 UTC | newest]

Thread overview: 34+ messages
2024-09-30  7:38 [PATCH v7 0/5] Introduce DRM device wedged event Raag Jadav
2024-09-30  7:38 ` [PATCH v7 1/5] drm: Introduce " Raag Jadav
2024-09-30 12:59   ` Andy Shevchenko
2024-10-01  5:08     ` Raag Jadav
2024-10-01 12:07       ` Andy Shevchenko
2024-10-01 14:18         ` Raag Jadav
2024-10-01 14:54           ` Andy Shevchenko
2024-10-01 16:42             ` Raag Jadav
2024-10-01 12:20   ` Michal Wajdeczko
2024-10-03 12:23     ` Raag Jadav
2024-10-08 15:02       ` Raag Jadav
2024-10-10 13:02         ` Lucas De Marchi
2024-10-11  8:47           ` Raag Jadav
2024-10-17  2:47   ` Raag Jadav
2024-10-17  7:59     ` Christian König
2024-10-17 16:43       ` Rodrigo Vivi
2024-10-18 10:58         ` Christian König
2024-10-18 12:46           ` Raag Jadav
2024-10-18 12:54             ` Christian König
2024-10-18 14:09               ` Raag Jadav
2024-10-17 19:16   ` André Almeida
2024-10-18 14:56     ` Rodrigo Vivi
2024-10-18 15:31       ` Alex Deucher
2024-10-18 17:56         ` André Almeida
2024-10-18 21:07           ` Alex Deucher
2024-10-24 17:48             ` Rodrigo Vivi
2024-10-19 19:08     ` Raag Jadav
2024-09-30  7:38 ` [PATCH v7 2/5] drm: Expose wedge recovery methods Raag Jadav
2024-09-30 13:01   ` Andy Shevchenko
2024-10-01  5:23     ` Raag Jadav
2024-09-30  7:38 ` [PATCH v7 3/5] drm/doc: Document device wedged event Raag Jadav
2024-09-30  7:38 ` [PATCH v7 4/5] drm/xe: Use " Raag Jadav
2024-09-30  7:38 ` [PATCH v7 5/5] drm/i915: " Raag Jadav
2024-09-30  7:47 ` ✗ CI.Patch_applied: failure for Introduce DRM device wedged event (rev5) Patchwork
