Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v10 0/4] Introduce DRM device wedged event
@ 2024-11-28 15:37 Raag Jadav
  2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Raag Jadav @ 2024-11-28 15:37 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev, Raag Jadav

This series introduces device wedged event in DRM subsystem and uses
it in xe and i915 drivers. Detailed description in commit message.

This was earlier attempted as xe specific uevent in v1 and v2.
https://patchwork.freedesktop.org/series/136909/

Similar work by André Almeida.
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/

 v2: Change authorship to Himal (Aravind)
     Add uevent for all device wedged cases (Aravind)

 v3: Generic re-implementation in DRM subsystem (Lucas)

 v4: s/drm_dev_wedged/drm_dev_wedged_event
     Use drm_info() (Jani)
     Kernel doc adjustment (Aravind)
     Change authorship to Raag (Aravind)

 v5: Send recovery method with uevent (Lina)
     Expose supported recovery methods via sysfs (Lucas)

 v6: Access wedge_recovery_opts[] using helper function (Jani)
     Use snprintf() (Jani)

 v7: Convert recovery helpers into regular functions (Andy, Jani)
     Aesthetic adjustments (Andy)
     Handle invalid method cases
     Add documentation to drm-uapi.rst (Sima)

 v8: Drop sysfs and allow sending multiple methods with uevent (Lucas, Michal)
     Improve documentation (Christian, Rodrigo)
     static_assert() globally (Andy)

 v9: Document prerequisites section (Christian)
     Provide 'none' method for reset cases (Christian)
     Provide recovery opts using switch cases

v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)

Raag Jadav (4):
  drm: Introduce device wedged event
  drm/doc: Document device wedged event
  drm/xe: Use device wedged event
  drm/i915: Use device wedged event

 Documentation/gpu/drm-uapi.rst        | 112 +++++++++++++++++++++++++-
 drivers/gpu/drm/drm_drv.c             |  66 +++++++++++++++
 drivers/gpu/drm/i915/gt/intel_reset.c |   3 +
 drivers/gpu/drm/xe/xe_device.c        |   9 ++-
 include/drm/drm_device.h              |   8 ++
 include/drm/drm_drv.h                 |   1 +
 6 files changed, 194 insertions(+), 5 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v10 1/4] drm: Introduce device wedged event
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
@ 2024-11-28 15:37 ` Raag Jadav
  2024-11-29 13:40   ` André Almeida
  2024-12-12 18:31   ` André Almeida
  2024-11-28 15:37 ` [PATCH v10 2/4] drm/doc: Document " Raag Jadav
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 29+ messages in thread
From: Raag Jadav @ 2024-11-28 15:37 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev, Raag Jadav

Introduce device wedged event, which notifies userspace of 'wedged'
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected and has become unrecoverable from driver context. Purpose of
this implementation is to provide drivers a generic way to recover with
the help of userspace intervention without taking any drastic measures
in the driver.

A 'wedged' device is basically a dead device that needs attention. The
uevent is the notification that is sent to userspace along with a hint
about what could possibly be attempted to recover the device and bring
it back to usable state. Different drivers may have different ideas of
a 'wedged' device depending on their hardware implementation, and hence
the vendor agnostic nature of the event. It is up to the drivers to
decide when they see the need for device recovery and how they want to
recover from the available methods.

Driver prerequisites
--------------------

The driver, before opting for recovery, needs to make sure that the
'wedged' device doesn't harm the system as a whole by taking care of the
prerequisites. Necessary actions must include disabling DMA to system
memory as well as any communication channels with other devices. Further,
the driver must ensure that all dma_fences are signalled and any device
state that the core kernel might depend on is cleaned up. Any existing
mmap should be invalidated and page faults should be redirected to a
dummy page. Once the event is sent, the device must be kept in 'wedged'
state until the recovery is performed. New accesses to the device
(IOCTLs) should be blocked, preferably with an error code that resembles
the type of failure the device has encountered. This will signify the
reason for wedging, which can be reported to the application if needed.

Recovery
--------

Current implementation defines three recovery methods, out of which,
drivers can use any one, multiple or none. Method(s) of choice will be
sent in the uevent environment as ``WEDGED=<method1>[,<method2>]`` in
order of less to more side-effects. If driver is unsure about recovery
or method is unknown (like soft/hard reboot, firmware flashing, hardware
replacement or any other procedure which can't be attempted on the fly),
``WEDGED=unknown`` will be sent instead.

Userspace consumers can parse this event and attempt recovery as per the
following expectations.

    =============== ================================
    Recovery method Consumer expectations
    =============== ================================
    none            optional telemetry collection
    rebind          unbind + bind driver
    bus-reset       unbind + reset bus device + bind
    unknown         admin/user policy
    =============== ================================

The only exception to this is ``WEDGED=none``, which signifies that the
device was temporarily 'wedged' at some point but was able to recover using
device specific methods like reset. No explicit device recovery is expected
from the consumer in this case, but it can still take additional steps like
gathering telemetry information (devcoredump, syslog). This is useful
because the first hang is usually the most critical one which can result in
consequential hangs or complete wedging.

Consumer prerequisites
----------------------

It is the responsibility of the consumer to make sure that the device or
its resources are not in use by any process before attempting recovery.
With IOCTLs blocked and device already 'wedged', all device memory should
be unmapped and file descriptors should be closed to prevent leaks.

Example
-------

Udev rule::

    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
    RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

    #!/bin/sh

    DEVPATH=$(readlink -f /sys/$1/device)
    DEVICE=$(basename $DEVPATH)
    DRIVER=$(readlink -f $DEVPATH/driver)

    echo -n $DEVICE > $DRIVER/unbind
    sleep 1
    echo -n $DEVICE > $DRIVER/bind

Customization
-------------

Although basic recovery is possible with a simple script, admin/users can
define custom policies around recovery action. For example, if the driver
supports multiple recovery methods, consumers can opt for the suitable one
based on policy definition. Consumers can also choose to have the device
available for debugging or additional data collection before performing
the recovery. This is useful especially when the driver is unsure about
recovery or method is unknown.

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases
v8: Allow sending multiple methods with uevent (Lucas, Michal)
    static_assert() globally (Andy)
v9: Provide 'none' method for reset cases (Christian)
    Provide recovery opts using switch cases

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 66 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  |  8 +++++
 include/drm/drm_drv.h     |  1 +
 3 files changed, 75 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index c2c172eb25df..e6583ebc6b05 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,7 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/bitops.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/module.h>
@@ -33,6 +34,7 @@
 #include <linux/mount.h>
 #include <linux/pseudo_fs.h>
 #include <linux/slab.h>
+#include <linux/sprintf.h>
 #include <linux/srcu.h>
 #include <linux/xarray.h>
 
@@ -497,6 +499,70 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *drm_get_wedge_recovery(unsigned int opt)
+{
+	switch (BIT(opt)) {
+	case DRM_WEDGE_RECOVERY_NONE:
+		return "none";
+	case DRM_WEDGE_RECOVERY_REBIND:
+		return "rebind";
+	case DRM_WEDGE_RECOVERY_BUS_RESET:
+		return "bus-reset";
+	default:
+		return NULL;
+	}
+}
+
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method(s) to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method\(s) of choice will be sent in the uevent environment as
+ * ``WEDGED=<method1>[,<method2>]`` in order of less to more side-effects. If
+ * caller is unsure about recovery or @method is unknown (0), ``WEDGED=unknown``
+ * will be sent instead.
+ *
+ * Refer to 'Device Wedging' chapter in Documentation/gpu/drm-uapi.rst for more
+ * details.
+ *
+ * Returns: 0 on success, negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+{
+	const char *recovery = NULL;
+	unsigned int len, opt;
+	/* Event string length up to 28+ characters with available methods */
+	char event_string[32];
+	char *envp[] = { event_string, NULL };
+
+	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
+
+	for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
+		recovery = drm_get_wedge_recovery(opt);
+		if (drm_WARN(dev, !recovery, "device wedged, invalid recovery method %u\n", opt))
+			break;
+
+		len += scnprintf(event_string + len, sizeof(event_string), "%s,", recovery);
+	}
+
+	if (recovery)
+		/* Get rid of trailing comma */
+		event_string[len - 1] = '\0';
+	else
+		/* Caller is unsure about recovery, do the best we can at this point. */
+		snprintf(event_string, sizeof(event_string), "%s", "WEDGED=unknown");
+
+	drm_info(dev, "device wedged, needs recovery\n");
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..6ea54a578cda 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -21,6 +21,14 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+/*
+ * Recovery methods for wedged device in order of less to more side-effects.
+ * To be used with drm_dev_wedged_event() as recovery @method. Callers can
+ * use any one, multiple (or'd) or none depending on their needs.
+ */
+#define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
+#define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
+#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
 
 /**
  * enum switch_power_state - power state of drm device
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 1bbbcb8e2d23..f41a82839e28 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -479,6 +479,7 @@ void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 2/4] drm/doc: Document device wedged event
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
  2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
@ 2024-11-28 15:37 ` Raag Jadav
  2024-12-12 18:50   ` André Almeida
  2025-01-21  1:14   ` Xaver Hugl
  2024-11-28 15:37 ` [PATCH v10 3/4] drm/xe: Use " Raag Jadav
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 29+ messages in thread
From: Raag Jadav @ 2024-11-28 15:37 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev, Raag Jadav

Add documentation for device wedged event in a new 'Device wedging'
chapter. The describes basic definitions, prerequisites and consumer
expectations along with an example.

 v8: Improve documentation (Christian, Rodrigo)
 v9: Add prerequisites section (Christian)
v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 Documentation/gpu/drm-uapi.rst | 112 ++++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index b75cc9a70d1f..da2927cde53d 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -371,9 +371,115 @@ Reporting causes of resets
 
 Apart from propagating the reset through the stack so apps can recover, it's
 really useful for driver developers to learn more about what caused the reset in
-the first place. DRM devices should make use of devcoredump to store relevant
-information about the reset, so this information can be added to user bug
-reports.
+the first place. For this, drivers can make use of devcoredump to store relevant
+information about the reset and send device wedged event without recovery method
+(as explained in next chapter) to notify userspace, so this information can be
+collected and added to user bug reports.
+
+Device wedging
+==============
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem), which notifies userspace of 'wedged'
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected and
+has become unrecoverable from driver context. Purpose of this implementation
+is to provide drivers a generic way to recover with the help of userspace
+intervention without taking any drastic measures in the driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device and bring it back to
+usable state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, and hence the vendor agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for device recovery and how they want to recover from the available methods.
+
+Driver prerequisites
+--------------------
+
+The driver, before opting for recovery, needs to make sure that the 'wedged'
+device doesn't harm the system as a whole by taking care of the prerequisites.
+Necessary actions must include disabling DMA to system memory as well as any
+communication channels with other devices. Further, the driver must ensure
+that all dma_fences are signalled and any device state that the core kernel
+might depend on is cleaned up. Any existing mmap should be invalidated and
+page faults should be redirected to a dummy page. Once the event is sent, the
+device must be kept in 'wedged' state until the recovery is performed. New
+accesses to the device (IOCTLs) should be blocked, preferably with an error
+code that resembles the type of failure the device has encountered. This will
+signify the reason for wedging, which can be reported to the application if
+needed.
+
+Recovery
+--------
+
+Current implementation defines three recovery methods, out of which, drivers
+can use any one, multiple or none. Method(s) of choice will be sent in the
+uevent environment as ``WEDGED=<method1>[,<method2>]`` in order of less to
+more side-effects. If driver is unsure about recovery or method is unknown
+(like soft/hard reboot, firmware flashing, hardware replacement or any other
+procedure which can't be attempted on the fly), ``WEDGED=unknown`` will be
+sent instead.
+
+Userspace consumers can parse this event and attempt recovery as per the
+following expectations.
+
+    =============== ================================
+    Recovery method Consumer expectations
+    =============== ================================
+    none            optional telemetry collection
+    rebind          unbind + bind driver
+    bus-reset       unbind + reset bus device + bind
+    unknown         admin/user policy
+    =============== ================================
+
+The only exception to this is ``WEDGED=none``, which signifies that the
+device was temporarily 'wedged' at some point but was able to recover using
+device specific methods like reset. No explicit device recovery is expected
+from the consumer in this case, but it can still take additional steps like
+gathering telemetry information (devcoredump, syslog). This is useful
+because the first hang is usually the most critical one which can result in
+consequential hangs or complete wedging.
+
+Consumer prerequisites
+----------------------
+
+It is the responsibility of the consumer to make sure that the device or
+its resources are not in use by any process before attempting recovery.
+With IOCTLs blocked and device already 'wedged', all device memory should
+be unmapped and file descriptors should be closed to prevent leaks.
+
+Example
+-------
+
+Udev rule::
+
+    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
+    RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+Recovery script::
+
+    #!/bin/sh
+
+    DEVPATH=$(readlink -f /sys/$1/device)
+    DEVICE=$(basename $DEVPATH)
+    DRIVER=$(readlink -f $DEVPATH/driver)
+
+    echo -n $DEVICE > $DRIVER/unbind
+    sleep 1
+    echo -n $DEVICE > $DRIVER/bind
+
+Customization
+-------------
+
+Although basic recovery is possible with a simple script, admin/users can
+define custom policies around recovery action. For example, if the driver
+supports multiple recovery methods, consumers can opt for the suitable one
+based on policy definition. Consumers can also choose to have the device
+available for debugging or additional data collection before performing the
+recovery. This is useful especially when the driver is unsure about recovery
+or method is unknown.
 
 .. _drm_driver_ioctl:
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 3/4] drm/xe: Use device wedged event
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
  2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
  2024-11-28 15:37 ` [PATCH v10 2/4] drm/doc: Document " Raag Jadav
@ 2024-11-28 15:37 ` Raag Jadav
  2024-11-28 15:37 ` [PATCH v10 4/4] drm/i915: " Raag Jadav
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Raag Jadav @ 2024-11-28 15:37 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev, Raag Jadav

This was previously attempted as xe specific reset uevent but dropped
in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
as part of refactoring.

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place userspace will be notified of wedged device, on
the basis of which, userspace may take respective action to recover
the device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=rebind,bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
    Add uevent for all device wedged cases (Aravind)
v3: Generic re-implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_device.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 930bb2750e2e..f04737854887 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -991,11 +991,12 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a mudule
+ * This is a final state that can only be cleared with a module
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * by a DRM uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1025,6 +1026,10 @@ void xe_device_declare_wedged(struct xe_device *xe)
 			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
 			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
 			dev_name(xe->drm.dev));
+
+		/* Notify userspace of wedged device */
+		drm_dev_wedged_event(&xe->drm,
+				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
 	}
 
 	for_each_gt(gt, xe, id)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 4/4] drm/i915: Use device wedged event
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
                   ` (2 preceding siblings ...)
  2024-11-28 15:37 ` [PATCH v10 3/4] drm/xe: Use " Raag Jadav
@ 2024-11-28 15:37 ` Raag Jadav
  2024-11-28 17:45 ` ✗ Fi.CI.CHECKPATCH: warning for Introduce DRM device wedged event (rev8) Patchwork
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Raag Jadav @ 2024-11-28 15:37 UTC (permalink / raw)
  To: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev, Raag Jadav

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place, userspace will be notified of wedged device on
gt reset failure.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index f42f21632306..18cf50a1e84d 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1418,6 +1418,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
 	if (!test_bit(I915_WEDGED, &gt->reset.flags))
 		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+	else
+		drm_dev_wedged_event(&gt->i915->drm,
+				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for Introduce DRM device wedged event (rev8)
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
                   ` (3 preceding siblings ...)
  2024-11-28 15:37 ` [PATCH v10 4/4] drm/i915: " Raag Jadav
@ 2024-11-28 17:45 ` Patchwork
  2024-11-28 17:45 ` ✗ Fi.CI.SPARSE: " Patchwork
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2024-11-28 17:45 UTC (permalink / raw)
  To: Raag Jadav; +Cc: intel-gfx

== Series Details ==

Series: Introduce DRM device wedged event (rev8)
URL   : https://patchwork.freedesktop.org/series/138069/
State : warning

== Summary ==

Error: dim checkpatch failed
48e6db3a81e0 drm: Introduce device wedged event
-:189: WARNING:STATIC_CONST_CHAR_ARRAY: char * array declaration might be better as static const
#189: FILE: drivers/gpu/drm/drm_drv.c:542:
+	char *envp[] = { event_string, NULL };

total: 0 errors, 1 warnings, 0 checks, 105 lines checked
cef184e3b461 drm/doc: Document device wedged event
d6cc8aa527ee drm/xe: Use device wedged event
-:20: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#20: 
KERNEL[265.802982] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)

total: 0 errors, 1 warnings, 0 checks, 24 lines checked
fa8ec2e1147a drm/i915: Use device wedged event



^ permalink raw reply	[flat|nested] 29+ messages in thread

* ✗ Fi.CI.SPARSE: warning for Introduce DRM device wedged event (rev8)
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
                   ` (4 preceding siblings ...)
  2024-11-28 17:45 ` ✗ Fi.CI.CHECKPATCH: warning for Introduce DRM device wedged event (rev8) Patchwork
@ 2024-11-28 17:45 ` Patchwork
  2024-11-28 17:59 ` ✗ i915.CI.BAT: failure " Patchwork
  2025-01-21  0:59 ` [PATCH v10 0/4] Introduce DRM device wedged event Xaver Hugl
  7 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2024-11-28 17:45 UTC (permalink / raw)
  To: Raag Jadav; +Cc: intel-gfx

== Series Details ==

Series: Introduce DRM device wedged event (rev8)
URL   : https://patchwork.freedesktop.org/series/138069/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* ✗ i915.CI.BAT: failure for Introduce DRM device wedged event (rev8)
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
                   ` (5 preceding siblings ...)
  2024-11-28 17:45 ` ✗ Fi.CI.SPARSE: " Patchwork
@ 2024-11-28 17:59 ` Patchwork
  2025-01-21  0:59 ` [PATCH v10 0/4] Introduce DRM device wedged event Xaver Hugl
  7 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2024-11-28 17:59 UTC (permalink / raw)
  To: Raag Jadav; +Cc: intel-gfx

== Series Details ==

Series: Introduce DRM device wedged event (rev8)
URL   : https://patchwork.freedesktop.org/series/138069/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_15759 -> Patchwork_138069v8
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_138069v8 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_138069v8, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/index.html

Participating hosts (45 -> 44)
------------------------------

  Missing    (1): fi-snb-2520m 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_138069v8:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_flip@basic-flip-vs-dpms@a-dp1:
    - fi-cfl-8109u:       [PASS][1] -> [DMESG-WARN][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/fi-cfl-8109u/igt@kms_flip@basic-flip-vs-dpms@a-dp1.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/fi-cfl-8109u/igt@kms_flip@basic-flip-vs-dpms@a-dp1.html

  
Known issues
------------

  Here are the changes found in Patchwork_138069v8 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@i915_pm_rpm@module-reload:
    - bat-dg1-7:          [PASS][3] -> [FAIL][4] ([i915#12903])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/bat-dg1-7/igt@i915_pm_rpm@module-reload.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/bat-dg1-7/igt@i915_pm_rpm@module-reload.html

  * igt@i915_selftest@live:
    - bat-mtlp-8:         [PASS][5] -> [ABORT][6] ([i915#12061]) +1 other test abort
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/bat-mtlp-8/igt@i915_selftest@live.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/bat-mtlp-8/igt@i915_selftest@live.html

  * igt@i915_selftest@live@workarounds:
    - bat-arlh-3:         [PASS][7] -> [ABORT][8] ([i915#12061]) +1 other test abort
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/bat-arlh-3/igt@i915_selftest@live@workarounds.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/bat-arlh-3/igt@i915_selftest@live@workarounds.html
    - bat-arlh-2:         [PASS][9] -> [ABORT][10] ([i915#12061]) +1 other test abort
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/bat-arlh-2/igt@i915_selftest@live@workarounds.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/bat-arlh-2/igt@i915_selftest@live@workarounds.html
    - bat-mtlp-6:         [PASS][11] -> [ABORT][12] ([i915#12061]) +1 other test abort
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/bat-mtlp-6/igt@i915_selftest@live@workarounds.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/bat-mtlp-6/igt@i915_selftest@live@workarounds.html

  * igt@kms_chamelium_edid@dp-edid-read:
    - bat-dg2-13:         [PASS][13] -> [FAIL][14] ([i915#12505])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/bat-dg2-13/igt@kms_chamelium_edid@dp-edid-read.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/bat-dg2-13/igt@kms_chamelium_edid@dp-edid-read.html

  * igt@kms_flip@basic-flip-vs-dpms:
    - fi-cfl-8109u:       [PASS][15] -> [DMESG-WARN][16] ([i915#12914])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/fi-cfl-8109u/igt@kms_flip@basic-flip-vs-dpms.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/fi-cfl-8109u/igt@kms_flip@basic-flip-vs-dpms.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@workarounds:
    - {bat-arls-6}:       [ABORT][17] ([i915#12061]) -> [PASS][18] +1 other test pass
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15759/bat-arls-6/igt@i915_selftest@live@workarounds.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/bat-arls-6/igt@i915_selftest@live@workarounds.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#12061]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
  [i915#12505]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12505
  [i915#12903]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12903
  [i915#12914]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12914


Build changes
-------------

  * Linux: CI_DRM_15759 -> Patchwork_138069v8

  CI-20190529: 20190529
  CI_DRM_15759: 5379d0a88558b73308ad82f163e80b863626e90b @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_8129: 363499a879fee5b9b7eda8acf7c772bce3423493 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_138069v8: 5379d0a88558b73308ad82f163e80b863626e90b @ git://anongit.freedesktop.org/gfx-ci/linux

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_138069v8/index.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
@ 2024-11-29 13:40   ` André Almeida
  2024-12-02  8:07     ` Raag Jadav
  2024-12-12 18:31   ` André Almeida
  1 sibling, 1 reply; 29+ messages in thread
From: André Almeida @ 2024-11-29 13:40 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, christian.koenig, rodrigo.vivi,
	michal.wajdeczko, lina, anshuman.gupta, alexander.deucher,
	amd-gfx, kernel-dev, airlied, simona, lucas.demarchi, jani.nikula,
	andriy.shevchenko

Hi Raag,

Em 28/11/2024 12:37, Raag Jadav escreveu:
> Introduce device wedged event, which notifies userspace of 'wedged'
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected and has become unrecoverable from driver context. Purpose of
> this implementation is to provide drivers a generic way to recover with
> the help of userspace intervention without taking any drastic measures
> in the driver.
> 
> A 'wedged' device is basically a dead device that needs attention. The
> uevent is the notification that is sent to userspace along with a hint
> about what could possibly be attempted to recover the device and bring
> it back to usable state. Different drivers may have different ideas of
> a 'wedged' device depending on their hardware implementation, and hence
> the vendor agnostic nature of the event. It is up to the drivers to
> decide when they see the need for device recovery and how they want to
> recover from the available methods.
> 

Thank you for your work. Do you think you can add the optional PID 
parameter, as the PID of the app that caused the reset? For SteamOS use 
case it has been proved to be useful to kill the fault app as well. If 
the reset was caused by a kthread, no PID can be provided hence it's an 
optional parameter.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-11-29 13:40   ` André Almeida
@ 2024-12-02  8:07     ` Raag Jadav
  2024-12-03  5:00       ` Raag Jadav
  0 siblings, 1 reply; 29+ messages in thread
From: Raag Jadav @ 2024-12-02  8:07 UTC (permalink / raw)
  To: André Almeida
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, christian.koenig, rodrigo.vivi,
	michal.wajdeczko, lina, anshuman.gupta, alexander.deucher,
	amd-gfx, kernel-dev, airlied, simona, lucas.demarchi, jani.nikula,
	andriy.shevchenko

On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> Hi Raag,
> 
> Em 28/11/2024 12:37, Raag Jadav escreveu:
> > Introduce device wedged event, which notifies userspace of 'wedged'
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected and has become unrecoverable from driver context. Purpose of
> > this implementation is to provide drivers a generic way to recover with
> > the help of userspace intervention without taking any drastic measures
> > in the driver.
> > 
> > A 'wedged' device is basically a dead device that needs attention. The
> > uevent is the notification that is sent to userspace along with a hint
> > about what could possibly be attempted to recover the device and bring
> > it back to usable state. Different drivers may have different ideas of
> > a 'wedged' device depending on their hardware implementation, and hence
> > the vendor agnostic nature of the event. It is up to the drivers to
> > decide when they see the need for device recovery and how they want to
> > recover from the available methods.
> > 
> 
> Thank you for your work. Do you think you can add the optional PID
> parameter, as the PID of the app that caused the reset? For SteamOS use case
> it has been proved to be useful to kill the fault app as well. If the reset
> was caused by a kthread, no PID can be provided hence it's an optional
> parameter.

Hmm, I'm not sure if it really fits here since it doesn't seem like
a generic usecase.

I'd still be open for it if found useful by the drivers but perhaps
as an extended feature in a separate series.

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-12-02  8:07     ` Raag Jadav
@ 2024-12-03  5:00       ` Raag Jadav
  2024-12-03 10:18         ` Christian König
  0 siblings, 1 reply; 29+ messages in thread
From: Raag Jadav @ 2024-12-03  5:00 UTC (permalink / raw)
  To: André Almeida, christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, rodrigo.vivi, michal.wajdeczko, lina,
	anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev, airlied,
	simona, lucas.demarchi, jani.nikula, andriy.shevchenko

On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > Hi Raag,
> > 
> > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected and has become unrecoverable from driver context. Purpose of
> > > this implementation is to provide drivers a generic way to recover with
> > > the help of userspace intervention without taking any drastic measures
> > > in the driver.
> > > 
> > > A 'wedged' device is basically a dead device that needs attention. The
> > > uevent is the notification that is sent to userspace along with a hint
> > > about what could possibly be attempted to recover the device and bring
> > > it back to usable state. Different drivers may have different ideas of
> > > a 'wedged' device depending on their hardware implementation, and hence
> > > the vendor agnostic nature of the event. It is up to the drivers to
> > > decide when they see the need for device recovery and how they want to
> > > recover from the available methods.
> > > 
> > 
> > Thank you for your work. Do you think you can add the optional PID
> > parameter, as the PID of the app that caused the reset? For SteamOS use case
> > it has been proved to be useful to kill the fault app as well. If the reset
> > was caused by a kthread, no PID can be provided hence it's an optional
> > parameter.
> 
> Hmm, I'm not sure if it really fits here since it doesn't seem like
> a generic usecase.
> 
> I'd still be open for it if found useful by the drivers but perhaps
> as an extended feature in a separate series.

What do you think Chris, are we good to go with v10?

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-12-03  5:00       ` Raag Jadav
@ 2024-12-03 10:18         ` Christian König
  2024-12-04 11:17           ` Raag Jadav
  0 siblings, 1 reply; 29+ messages in thread
From: Christian König @ 2024-12-03 10:18 UTC (permalink / raw)
  To: Raag Jadav, André Almeida
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, rodrigo.vivi, michal.wajdeczko, lina,
	anshuman.gupta, alexander.deucher, amd-gfx, kernel-dev, airlied,
	simona, lucas.demarchi, jani.nikula, andriy.shevchenko

Am 03.12.24 um 06:00 schrieb Raag Jadav:
> On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
>> On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
>>> Hi Raag,
>>>
>>> Em 28/11/2024 12:37, Raag Jadav escreveu:
>>>> Introduce device wedged event, which notifies userspace of 'wedged'
>>>> (hanged/unusable) state of the DRM device through a uevent. This is
>>>> useful especially in cases where the device is no longer operating as
>>>> expected and has become unrecoverable from driver context. Purpose of
>>>> this implementation is to provide drivers a generic way to recover with
>>>> the help of userspace intervention without taking any drastic measures
>>>> in the driver.
>>>>
>>>> A 'wedged' device is basically a dead device that needs attention. The
>>>> uevent is the notification that is sent to userspace along with a hint
>>>> about what could possibly be attempted to recover the device and bring
>>>> it back to usable state. Different drivers may have different ideas of
>>>> a 'wedged' device depending on their hardware implementation, and hence
>>>> the vendor agnostic nature of the event. It is up to the drivers to
>>>> decide when they see the need for device recovery and how they want to
>>>> recover from the available methods.
>>>>
>>> Thank you for your work. Do you think you can add the optional PID
>>> parameter, as the PID of the app that caused the reset? For SteamOS use case
>>> it has been proved to be useful to kill the fault app as well. If the reset
>>> was caused by a kthread, no PID can be provided hence it's an optional
>>> parameter.
>> Hmm, I'm not sure if it really fits here since it doesn't seem like
>> a generic usecase.
>>
>> I'd still be open for it if found useful by the drivers but perhaps
>> as an extended feature in a separate series.
> What do you think Chris, are we good to go with v10?

I agree with Andre that the PID and maybe the new DRM client name would 
be really nice to have here.

We do have that in the device core dump we create, but if an application 
is supervised by daemon for example then that would be really useful.

On the other hand I think we should merge the documentation and code as 
is and then add the PID/name later on. That is essentially a separate 
discussion.

Regards,
Christian.

>
> Raag


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-12-03 10:18         ` Christian König
@ 2024-12-04 11:17           ` Raag Jadav
  2024-12-11 17:14             ` Maxime Ripard
  0 siblings, 1 reply; 29+ messages in thread
From: Raag Jadav @ 2024-12-04 11:17 UTC (permalink / raw)
  To: Christian König, maarten.lankhorst, mripard, tzimmermann
  Cc: André Almeida, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, rodrigo.vivi,
	michal.wajdeczko, lina, anshuman.gupta, alexander.deucher,
	amd-gfx, kernel-dev, airlied, simona, lucas.demarchi, jani.nikula,
	andriy.shevchenko

+ misc maintainers

On Tue, Dec 03, 2024 at 11:18:00AM +0100, Christian König wrote:
> Am 03.12.24 um 06:00 schrieb Raag Jadav:
> > On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> > > On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > > > Hi Raag,
> > > > 
> > > > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > useful especially in cases where the device is no longer operating as
> > > > > expected and has become unrecoverable from driver context. Purpose of
> > > > > this implementation is to provide drivers a generic way to recover with
> > > > > the help of userspace intervention without taking any drastic measures
> > > > > in the driver.
> > > > > 
> > > > > A 'wedged' device is basically a dead device that needs attention. The
> > > > > uevent is the notification that is sent to userspace along with a hint
> > > > > about what could possibly be attempted to recover the device and bring
> > > > > it back to usable state. Different drivers may have different ideas of
> > > > > a 'wedged' device depending on their hardware implementation, and hence
> > > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > > decide when they see the need for device recovery and how they want to
> > > > > recover from the available methods.
> > > > > 
> > > > Thank you for your work. Do you think you can add the optional PID
> > > > parameter, as the PID of the app that caused the reset? For SteamOS use case
> > > > it has been proved to be useful to kill the fault app as well. If the reset
> > > > was caused by a kthread, no PID can be provided hence it's an optional
> > > > parameter.
> > > Hmm, I'm not sure if it really fits here since it doesn't seem like
> > > a generic usecase.
> > > 
> > > I'd still be open for it if found useful by the drivers but perhaps
> > > as an extended feature in a separate series.
> > What do you think Chris, are we good to go with v10?
> 
> I agree with Andre that the PID and maybe the new DRM client name would be
> really nice to have here.
> 
> We do have that in the device core dump we create, but if an application is
> supervised by daemon for example then that would be really useful.
> 
> On the other hand I think we should merge the documentation and code as is
> and then add the PID/name later on. That is essentially a separate
> discussion.

So how do we proceed, perhaps through misc tree?

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-12-04 11:17           ` Raag Jadav
@ 2024-12-11 17:14             ` Maxime Ripard
  2024-12-12 10:37               ` Raag Jadav
  0 siblings, 1 reply; 29+ messages in thread
From: Maxime Ripard @ 2024-12-11 17:14 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Christian König, maarten.lankhorst, tzimmermann,
	André Almeida, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, rodrigo.vivi,
	michal.wajdeczko, lina, anshuman.gupta, alexander.deucher,
	amd-gfx, kernel-dev, airlied, simona, lucas.demarchi, jani.nikula,
	andriy.shevchenko

[-- Attachment #1: Type: text/plain, Size: 2960 bytes --]

On Wed, Dec 04, 2024 at 01:17:17PM +0200, Raag Jadav wrote:
> + misc maintainers
> 
> On Tue, Dec 03, 2024 at 11:18:00AM +0100, Christian König wrote:
> > Am 03.12.24 um 06:00 schrieb Raag Jadav:
> > > On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> > > > On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > > > > Hi Raag,
> > > > > 
> > > > > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > > useful especially in cases where the device is no longer operating as
> > > > > > expected and has become unrecoverable from driver context. Purpose of
> > > > > > this implementation is to provide drivers a generic way to recover with
> > > > > > the help of userspace intervention without taking any drastic measures
> > > > > > in the driver.
> > > > > > 
> > > > > > A 'wedged' device is basically a dead device that needs attention. The
> > > > > > uevent is the notification that is sent to userspace along with a hint
> > > > > > about what could possibly be attempted to recover the device and bring
> > > > > > it back to usable state. Different drivers may have different ideas of
> > > > > > a 'wedged' device depending on their hardware implementation, and hence
> > > > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > > > decide when they see the need for device recovery and how they want to
> > > > > > recover from the available methods.
> > > > > > 
> > > > > Thank you for your work. Do you think you can add the optional PID
> > > > > parameter, as the PID of the app that caused the reset? For SteamOS use case
> > > > > it has been proved to be useful to kill the fault app as well. If the reset
> > > > > was caused by a kthread, no PID can be provided hence it's an optional
> > > > > parameter.
> > > > Hmm, I'm not sure if it really fits here since it doesn't seem like
> > > > a generic usecase.
> > > > 
> > > > I'd still be open for it if found useful by the drivers but perhaps
> > > > as an extended feature in a separate series.
> > > What do you think Chris, are we good to go with v10?
> > 
> > I agree with Andre that the PID and maybe the new DRM client name would be
> > really nice to have here.
> > 
> > We do have that in the device core dump we create, but if an application is
> > supervised by daemon for example then that would be really useful.
> > 
> > On the other hand I think we should merge the documentation and code as is
> > and then add the PID/name later on. That is essentially a separate
> > discussion.
> 
> So how do we proceed, perhaps through misc tree?

Provided it follows the usual rules (ie, Reviewed-by, open source
userspace tools using it if it's a new uAPI, etc.) then yeah, we can
merge it through drm-misc.

Maxime

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 273 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-12-11 17:14             ` Maxime Ripard
@ 2024-12-12 10:37               ` Raag Jadav
  2024-12-16 16:07                 ` Maxime Ripard
  0 siblings, 1 reply; 29+ messages in thread
From: Raag Jadav @ 2024-12-12 10:37 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Christian König, maarten.lankhorst, tzimmermann,
	André Almeida, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, rodrigo.vivi,
	michal.wajdeczko, lina, anshuman.gupta, alexander.deucher,
	amd-gfx, kernel-dev, airlied, simona, lucas.demarchi, jani.nikula,
	andriy.shevchenko

On Wed, Dec 11, 2024 at 06:14:12PM +0100, Maxime Ripard wrote:
> On Wed, Dec 04, 2024 at 01:17:17PM +0200, Raag Jadav wrote:
> > + misc maintainers
> > 
> > On Tue, Dec 03, 2024 at 11:18:00AM +0100, Christian König wrote:
> > > Am 03.12.24 um 06:00 schrieb Raag Jadav:
> > > > On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> > > > > On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > > > > > Hi Raag,
> > > > > > 
> > > > > > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > > > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > > > useful especially in cases where the device is no longer operating as
> > > > > > > expected and has become unrecoverable from driver context. Purpose of
> > > > > > > this implementation is to provide drivers a generic way to recover with
> > > > > > > the help of userspace intervention without taking any drastic measures
> > > > > > > in the driver.
> > > > > > > 
> > > > > > > A 'wedged' device is basically a dead device that needs attention. The
> > > > > > > uevent is the notification that is sent to userspace along with a hint
> > > > > > > about what could possibly be attempted to recover the device and bring
> > > > > > > it back to usable state. Different drivers may have different ideas of
> > > > > > > a 'wedged' device depending on their hardware implementation, and hence
> > > > > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > > > > decide when they see the need for device recovery and how they want to
> > > > > > > recover from the available methods.
> > > > > > > 
> > > > > > Thank you for your work. Do you think you can add the optional PID
> > > > > > parameter, as the PID of the app that caused the reset? For SteamOS use case
> > > > > > it has been proved to be useful to kill the fault app as well. If the reset
> > > > > > was caused by a kthread, no PID can be provided hence it's an optional
> > > > > > parameter.
> > > > > Hmm, I'm not sure if it really fits here since it doesn't seem like
> > > > > a generic usecase.
> > > > > 
> > > > > I'd still be open for it if found useful by the drivers but perhaps
> > > > > as an extended feature in a separate series.
> > > > What do you think Chris, are we good to go with v10?
> > > 
> > > I agree with Andre that the PID and maybe the new DRM client name would be
> > > really nice to have here.
> > > 
> > > We do have that in the device core dump we create, but if an application is
> > > supervised by daemon for example then that would be really useful.
> > > 
> > > On the other hand I think we should merge the documentation and code as is
> > > and then add the PID/name later on. That is essentially a separate
> > > discussion.
> > 
> > So how do we proceed, perhaps through misc tree?
> 
> Provided it follows the usual rules (ie, Reviewed-by, open source
> userspace tools using it if it's a new uAPI, etc.) then yeah, we can
> merge it through drm-misc.

My understanding is that the core patches are to be reviewed by the
maintainers? The rest of it (patch 2 to 4) seems already reviewed.

We have a documented example (patch 2) with udev rule and a reference
script which can be setup to get this working. Does that qualify as
a consumer?

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
  2024-11-29 13:40   ` André Almeida
@ 2024-12-12 18:31   ` André Almeida
  2024-12-13 13:51     ` Raag Jadav
  1 sibling, 1 reply; 29+ messages in thread
From: André Almeida @ 2024-12-12 18:31 UTC (permalink / raw)
  To: Raag Jadav, airlied, simona, lucas.demarchi, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
	christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
	kernel-dev

Hi Raag,

Thank you for your patch.

Em 28/11/2024 12:37, Raag Jadav escreveu:

[...]

> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +{
> +	const char *recovery = NULL;
> +	unsigned int len, opt;
> +	/* Event string length up to 28+ characters with available methods */
> +	char event_string[32];
> +	char *envp[] = { event_string, NULL };
> +
> +	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> +
> +	for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
> +		recovery = drm_get_wedge_recovery(opt);
> +		if (drm_WARN(dev, !recovery, "device wedged, invalid recovery method %u\n", opt))
> +			break;
> +
> +		len += scnprintf(event_string + len, sizeof(event_string), "%s,", recovery);
> +	}
> +
> +	if (recovery)
> +		/* Get rid of trailing comma */
> +		event_string[len - 1] = '\0';
> +	else
> +		/* Caller is unsure about recovery, do the best we can at this point. */
> +		snprintf(event_string, sizeof(event_string), "%s", "WEDGED=unknown");
> +
> +	drm_info(dev, "device wedged, needs recovery\n");

As documented in the commit message "No explicit device recovery is 
expected from the consumer in this case", I think this should be like this:

if (method != DRM_WEDGE_RECOVERY_NONE)
     drm_info(dev, "device wedged, needs recovery\n");

and maybe a note like this:

else
     drm_info(dev, "device reseted, but managed to recover\n");

Either way, this patch is:

Reviewed-by: André Almeida <andrealmeid@igalia.com>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2024-11-28 15:37 ` [PATCH v10 2/4] drm/doc: Document " Raag Jadav
@ 2024-12-12 18:50   ` André Almeida
  2024-12-17  8:42     ` Raag Jadav
  2025-01-21  1:14   ` Xaver Hugl
  1 sibling, 1 reply; 29+ messages in thread
From: André Almeida @ 2024-12-12 18:50 UTC (permalink / raw)
  To: Raag Jadav, airlied, simona, lucas.demarchi, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
	christian.koenig
  Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
	kernel-dev



Em 28/11/2024 12:37, Raag Jadav escreveu:
> Add documentation for device wedged event in a new 'Device wedging'
> chapter. The describes basic definitions, prerequisites and consumer
> expectations along with an example.
> 
>   v8: Improve documentation (Christian, Rodrigo)
>   v9: Add prerequisites section (Christian)
> v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
> 
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
>   Documentation/gpu/drm-uapi.rst | 112 ++++++++++++++++++++++++++++++++-
>   1 file changed, 109 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index b75cc9a70d1f..da2927cde53d 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -371,9 +371,115 @@ Reporting causes of resets

I think it's a good idea to add a note about "device wedged event" in 
the section "Device reset > Kernel Mode Driver" since the idea is to 
explain what should kernel developer add to their kernel drivers to be 
used when a device resets.

>   
>   Apart from propagating the reset through the stack so apps can recover, it's
>   really useful for driver developers to learn more about what caused the reset in
> -the first place. DRM devices should make use of devcoredump to store relevant
> -information about the reset, so this information can be added to user bug
> -reports.
> +the first place. For this, drivers can make use of devcoredump to store relevant
> +information about the reset and send device wedged event without recovery method

and send a device wedged event with recovery method as "none" (as 
explained in the chapter "Device wedging")

> +(as explained in next chapter) to notify userspace, so this information can be
> +collected and added to user bug reports.
> +

With those changes applied:

Reviewed-by: André Almeida <andrealmeid@igalia.com>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-12-12 18:31   ` André Almeida
@ 2024-12-13 13:51     ` Raag Jadav
  0 siblings, 0 replies; 29+ messages in thread
From: Raag Jadav @ 2024-12-13 13:51 UTC (permalink / raw)
  To: André Almeida
  Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
	kernel-dev

On Thu, Dec 12, 2024 at 03:31:01PM -0300, André Almeida wrote:
> Hi Raag,
> 
> Thank you for your patch.
> 
> Em 28/11/2024 12:37, Raag Jadav escreveu:
> 
> [...]
> 
> > +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > +{
> > +	const char *recovery = NULL;
> > +	unsigned int len, opt;
> > +	/* Event string length up to 28+ characters with available methods */
> > +	char event_string[32];
> > +	char *envp[] = { event_string, NULL };
> > +
> > +	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > +
> > +	for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
> > +		recovery = drm_get_wedge_recovery(opt);
> > +		if (drm_WARN(dev, !recovery, "device wedged, invalid recovery method %u\n", opt))
> > +			break;
> > +
> > +		len += scnprintf(event_string + len, sizeof(event_string), "%s,", recovery);
> > +	}
> > +
> > +	if (recovery)
> > +		/* Get rid of trailing comma */
> > +		event_string[len - 1] = '\0';
> > +	else
> > +		/* Caller is unsure about recovery, do the best we can at this point. */
> > +		snprintf(event_string, sizeof(event_string), "%s", "WEDGED=unknown");
> > +
> > +	drm_info(dev, "device wedged, needs recovery\n");
> 
> As documented in the commit message "No explicit device recovery is expected
> from the consumer in this case", I think this should be like this:
> 
> if (method != DRM_WEDGE_RECOVERY_NONE)
>     drm_info(dev, "device wedged, needs recovery\n");
> 
> and maybe a note like this:
> 
> else
>     drm_info(dev, "device reseted, but managed to recover\n");

Or perhaps

	drm_info(dev, "device wedged, but recovered through reset\n");

> Either way, this patch is:
> 
> Reviewed-by: André Almeida <andrealmeid@igalia.com>

Thanks for the review.

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 1/4] drm: Introduce device wedged event
  2024-12-12 10:37               ` Raag Jadav
@ 2024-12-16 16:07                 ` Maxime Ripard
  0 siblings, 0 replies; 29+ messages in thread
From: Maxime Ripard @ 2024-12-16 16:07 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Christian König, maarten.lankhorst, tzimmermann,
	André Almeida, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, rodrigo.vivi,
	michal.wajdeczko, lina, anshuman.gupta, alexander.deucher,
	amd-gfx, kernel-dev, airlied, simona, lucas.demarchi, jani.nikula,
	andriy.shevchenko

[-- Attachment #1: Type: text/plain, Size: 4105 bytes --]

Hi,

On Thu, Dec 12, 2024 at 12:37:59PM +0200, Raag Jadav wrote:
> On Wed, Dec 11, 2024 at 06:14:12PM +0100, Maxime Ripard wrote:
> > On Wed, Dec 04, 2024 at 01:17:17PM +0200, Raag Jadav wrote:
> > > + misc maintainers
> > > 
> > > On Tue, Dec 03, 2024 at 11:18:00AM +0100, Christian König wrote:
> > > > Am 03.12.24 um 06:00 schrieb Raag Jadav:
> > > > > On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> > > > > > On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > > > > > > Hi Raag,
> > > > > > > 
> > > > > > > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > > > > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > > > > useful especially in cases where the device is no longer operating as
> > > > > > > > expected and has become unrecoverable from driver context. Purpose of
> > > > > > > > this implementation is to provide drivers a generic way to recover with
> > > > > > > > the help of userspace intervention without taking any drastic measures
> > > > > > > > in the driver.
> > > > > > > > 
> > > > > > > > A 'wedged' device is basically a dead device that needs attention. The
> > > > > > > > uevent is the notification that is sent to userspace along with a hint
> > > > > > > > about what could possibly be attempted to recover the device and bring
> > > > > > > > it back to usable state. Different drivers may have different ideas of
> > > > > > > > a 'wedged' device depending on their hardware implementation, and hence
> > > > > > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > > > > > decide when they see the need for device recovery and how they want to
> > > > > > > > recover from the available methods.
> > > > > > > > 
> > > > > > > Thank you for your work. Do you think you can add the optional PID
> > > > > > > parameter, as the PID of the app that caused the reset? For SteamOS use case
> > > > > > > it has been proved to be useful to kill the fault app as well. If the reset
> > > > > > > was caused by a kthread, no PID can be provided hence it's an optional
> > > > > > > parameter.
> > > > > > Hmm, I'm not sure if it really fits here since it doesn't seem like
> > > > > > a generic usecase.
> > > > > > 
> > > > > > I'd still be open for it if found useful by the drivers but perhaps
> > > > > > as an extended feature in a separate series.
> > > > > What do you think Chris, are we good to go with v10?
> > > > 
> > > > I agree with Andre that the PID and maybe the new DRM client name would be
> > > > really nice to have here.
> > > > 
> > > > We do have that in the device core dump we create, but if an application is
> > > > supervised by daemon for example then that would be really useful.
> > > > 
> > > > On the other hand I think we should merge the documentation and code as is
> > > > and then add the PID/name later on. That is essentially a separate
> > > > discussion.
> > > 
> > > So how do we proceed, perhaps through misc tree?
> > 
> > Provided it follows the usual rules (ie, Reviewed-by, open source
> > userspace tools using it if it's a new uAPI, etc.) then yeah, we can
> > merge it through drm-misc.
> 
> My understanding is that the core patches are to be reviewed by the
> maintainers? The rest of it (patch 2 to 4) seems already reviewed.
> 
> We have a documented example (patch 2) with udev rule and a reference
> script which can be setup to get this working. Does that qualify as
> a consumer?

Given the description you stated above, I'd expect a compositor to be
the expected user, right?

Our doc
(https://docs.kernel.org/gpu/drm-uapi.html#open-source-userspace-requirements)
states:

  The open-source userspace must not be a toy/test application, but the
  real thing. Specifically it needs to handle all the usual error and
  corner cases. These are often the places where new uAPI falls apart
  and hence essential to assess the fitness of a proposed interface.

Maxime

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 273 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2024-12-12 18:50   ` André Almeida
@ 2024-12-17  8:42     ` Raag Jadav
  2024-12-17 11:57       ` André Almeida
  0 siblings, 1 reply; 29+ messages in thread
From: Raag Jadav @ 2024-12-17  8:42 UTC (permalink / raw)
  To: André Almeida
  Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
	kernel-dev

On Thu, Dec 12, 2024 at 03:50:29PM -0300, André Almeida wrote:
> Em 28/11/2024 12:37, Raag Jadav escreveu:
> > Add documentation for device wedged event in a new 'Device wedging'
> > chapter. The describes basic definitions, prerequisites and consumer
> > expectations along with an example.
> > 
> >   v8: Improve documentation (Christian, Rodrigo)
> >   v9: Add prerequisites section (Christian)
> > v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > Reviewed-by: Christian König <christian.koenig@amd.com>
> > ---
> >   Documentation/gpu/drm-uapi.rst | 112 ++++++++++++++++++++++++++++++++-
> >   1 file changed, 109 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index b75cc9a70d1f..da2927cde53d 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -371,9 +371,115 @@ Reporting causes of resets
> 
> I think it's a good idea to add a note about "device wedged event" in the
> section "Device reset > Kernel Mode Driver" since the idea is to explain
> what should kernel developer add to their kernel drivers to be used when a
> device resets.

Wouldn't that be just a repetition of below? Also, I'm not sure if it's a hard
requirement.

> >   Apart from propagating the reset through the stack so apps can recover, it's
> >   really useful for driver developers to learn more about what caused the reset in
> > -the first place. DRM devices should make use of devcoredump to store relevant
> > -information about the reset, so this information can be added to user bug
> > -reports.
> > +the first place. For this, drivers can make use of devcoredump to store relevant
> > +information about the reset and send device wedged event without recovery method
> 
> and send a device wedged event with recovery method as "none" (as explained
> in the chapter "Device wedging")

Sure.

> > +(as explained in next chapter) to notify userspace, so this information can be
> > +collected and added to user bug reports.
> > +
> 
> With those changes applied:
> 
> Reviewed-by: André Almeida <andrealmeid@igalia.com>

Thanks.

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2024-12-17  8:42     ` Raag Jadav
@ 2024-12-17 11:57       ` André Almeida
  0 siblings, 0 replies; 29+ messages in thread
From: André Almeida @ 2024-12-17 11:57 UTC (permalink / raw)
  To: Raag Jadav
  Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher, amd-gfx,
	kernel-dev

Em 17/12/2024 05:42, Raag Jadav escreveu:
> On Thu, Dec 12, 2024 at 03:50:29PM -0300, André Almeida wrote:
>> Em 28/11/2024 12:37, Raag Jadav escreveu:
>>> Add documentation for device wedged event in a new 'Device wedging'
>>> chapter. The describes basic definitions, prerequisites and consumer
>>> expectations along with an example.
>>>
>>>    v8: Improve documentation (Christian, Rodrigo)
>>>    v9: Add prerequisites section (Christian)
>>> v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
>>>
>>> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>    Documentation/gpu/drm-uapi.rst | 112 ++++++++++++++++++++++++++++++++-
>>>    1 file changed, 109 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>> index b75cc9a70d1f..da2927cde53d 100644
>>> --- a/Documentation/gpu/drm-uapi.rst
>>> +++ b/Documentation/gpu/drm-uapi.rst
>>> @@ -371,9 +371,115 @@ Reporting causes of resets
>>
>> I think it's a good idea to add a note about "device wedged event" in the
>> section "Device reset > Kernel Mode Driver" since the idea is to explain
>> what should kernel developer add to their kernel drivers to be used when a
>> device resets.
> 
> Wouldn't that be just a repetition of below? Also, I'm not sure if it's a hard
> requirement.
> 

Right, it can get repetitive indeed, no need to apply this suggestion then.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 0/4] Introduce DRM device wedged event
  2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
                   ` (6 preceding siblings ...)
  2024-11-28 17:59 ` ✗ i915.CI.BAT: failure " Patchwork
@ 2025-01-21  0:59 ` Xaver Hugl
  2025-01-22  4:48   ` Raag Jadav
  7 siblings, 1 reply; 29+ messages in thread
From: Xaver Hugl @ 2025-01-21  0:59 UTC (permalink / raw)
  To: Raag Jadav
  Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev

Hi,

I experimented with using this in KWin, and
https://invent.kde.org/plasma/kwin/-/merge_requests/7027/diffs?commit_id=6da40f1b9e2bc94615a436de4778880cee16f940
makes it fall back to a software renderer when a rebind is required to
recover the GPU.
Making it also survive the rebind properly is more challenging
(current version of the MR doesn't do it for you and just crashes if
you do it with a udev rule or manually), but it's doable - and not a
problem of the API.

I'd really like to have the PID of the client that triggered the GPU
reset, so that we can kill it if multiple resets are triggered in a
row (or switch to software rendering if it's KWin itself) and show a
user-friendly notification about why their app(s) crashed, but that
can be added later.

- Xaver

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2024-11-28 15:37 ` [PATCH v10 2/4] drm/doc: Document " Raag Jadav
  2024-12-12 18:50   ` André Almeida
@ 2025-01-21  1:14   ` Xaver Hugl
  2025-01-22  5:22     ` Raag Jadav
  1 sibling, 1 reply; 29+ messages in thread
From: Xaver Hugl @ 2025-01-21  1:14 UTC (permalink / raw)
  To: Raag Jadav
  Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev

> +It is the responsibility of the consumer to make sure that the device or
> +its resources are not in use by any process before attempting recovery.
I'm not convinced this is actually doable in practice, outside of
killing all apps that aren't the one trying to recover the GPU.
Is this just about not crashing those processes if they don't handle
GPU hotunplugs well, about leaks, or something else?

> +With IOCTLs blocked and device already 'wedged', all device memory should
> +be unmapped and file descriptors should be closed to prevent leaks.
Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
hotplug with matching "remove" and "add" udev events. As long as the
application cleans up resources related to the device when it receives
the event, there should be no leaks with a normal hotunplug... Is this
different enough that we can't have the same expectations?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 0/4] Introduce DRM device wedged event
  2025-01-21  0:59 ` [PATCH v10 0/4] Introduce DRM device wedged event Xaver Hugl
@ 2025-01-22  4:48   ` Raag Jadav
  0 siblings, 0 replies; 29+ messages in thread
From: Raag Jadav @ 2025-01-22  4:48 UTC (permalink / raw)
  To: Xaver Hugl
  Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev

On Tue, Jan 21, 2025 at 01:59:47AM +0100, Xaver Hugl wrote:
> Hi,
> 
> I experimented with using this in KWin, and
> https://invent.kde.org/plasma/kwin/-/merge_requests/7027/diffs?commit_id=6da40f1b9e2bc94615a436de4778880cee16f940
> makes it fall back to a software renderer when a rebind is required to
> recover the GPU.
> Making it also survive the rebind properly is more challenging
> (current version of the MR doesn't do it for you and just crashes if
> you do it with a udev rule or manually), but it's doable - and not a
> problem of the API.
> 
> I'd really like to have the PID of the client that triggered the GPU
> reset, so that we can kill it if multiple resets are triggered in a
> row (or switch to software rendering if it's KWin itself) and show a
> user-friendly notification about why their app(s) crashed, but that
> can be added later.

Excellent! While we have our consumer implementation in progress, it's
always good to have wider adoption.

Thank you for your contribution.

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2025-01-21  1:14   ` Xaver Hugl
@ 2025-01-22  5:22     ` Raag Jadav
  2025-01-27 10:23       ` Pekka Paalanen
  0 siblings, 1 reply; 29+ messages in thread
From: Raag Jadav @ 2025-01-22  5:22 UTC (permalink / raw)
  To: Xaver Hugl
  Cc: airlied, simona, lucas.demarchi, rodrigo.vivi, jani.nikula,
	andriy.shevchenko, lina, michal.wajdeczko, christian.koenig,
	intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
	aravind.iddamsetty, anshuman.gupta, alexander.deucher,
	andrealmeid, amd-gfx, kernel-dev

On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > +It is the responsibility of the consumer to make sure that the device or
> > +its resources are not in use by any process before attempting recovery.
> I'm not convinced this is actually doable in practice, outside of
> killing all apps that aren't the one trying to recover the GPU.
> Is this just about not crashing those processes if they don't handle
> GPU hotunplugs well, about leaks, or something else?

Correct, all of it. And since the compositor is in charge of device resources,
this way it atleast has the opportunity to recover the device and recreate
context without all the userspace violence.

I'm not entirely aware of its feasibility though, perhaps something for the
consumers to experiment.

> > +With IOCTLs blocked and device already 'wedged', all device memory should
> > +be unmapped and file descriptors should be closed to prevent leaks.
> Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> hotplug with matching "remove" and "add" udev events. As long as the
> application cleans up resources related to the device when it receives
> the event, there should be no leaks with a normal hotunplug... Is this
> different enough that we can't have the same expectations?

The thing about "remove" event is that it is generated *after* we opt for an
unbind, and at that point it might be already too late if userspace doesn't
get enough time to clean things up while the device is removed with a live
client resulting in unknown consequences.

The idea here is to clean things up *before* we opt for an unbind leaving
no room for side effects.

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2025-01-22  5:22     ` Raag Jadav
@ 2025-01-27 10:23       ` Pekka Paalanen
  2025-01-28  9:37         ` Raag Jadav
  0 siblings, 1 reply; 29+ messages in thread
From: Pekka Paalanen @ 2025-01-27 10:23 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Xaver Hugl, airlied, simona, lucas.demarchi, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
	christian.koenig, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, anshuman.gupta,
	alexander.deucher, andrealmeid, amd-gfx, kernel-dev

[-- Attachment #1: Type: text/plain, Size: 4094 bytes --]

On Wed, 22 Jan 2025 07:22:25 +0200
Raag Jadav <raag.jadav@intel.com> wrote:

> On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > > +It is the responsibility of the consumer to make sure that the device or
> > > +its resources are not in use by any process before attempting recovery.  
> > I'm not convinced this is actually doable in practice, outside of
> > killing all apps that aren't the one trying to recover the GPU.
> > Is this just about not crashing those processes if they don't handle
> > GPU hotunplugs well, about leaks, or something else?  
> 
> Correct, all of it. And since the compositor is in charge of device resources,
> this way it atleast has the opportunity to recover the device and recreate
> context without all the userspace violence.

Hi Raag,

sorry, I haven't followed this series, so I wonder, why should
userspace be part of recovering the device? Why doesn't the kernel
automatically load a new driver instance with a new DRM device node?

Of course userspace needs to deal with stuff suddenly erroring out, and
destroy existing related resources, then wait for a working device
to appear and rebuild all state. The kernel driver already needs to
make the existing open stuff inert and harmless, why does it need an
acknowledgement from userspace to unbind and re-bind?

> I'm not entirely aware of its feasibility though, perhaps something for the
> consumers to experiment.

If consumers mean userspace, then no, not reliably. But the kernel can
do it.

I see in the commit message written:

	"For example, if the driver supports multiple recovery methods,
	consumers can opt for the suitable one based on policy
	definition."

How could consumers know what to do? How can they guess what would be
enough to recover the device? Isn't that the kernel driver's job to
know?

(More important for userspace would be know if dmabuf fds remain
pointing to valid memory retaining its contents or if the contents are
lost. Userspace cannot tell which device a dmabuf originates from,
AFAIK, so this would need to be added in the generic dmabuf UAPI.)

	"Consumers can also choose to have the device available for
	debugging or additional data collection before performing the
	recovery."

Couldn't the wedged driver instance remain detached from the hardware
while a new driver instance initializes? Then debug data remains until
the wedged device is fully closed from userspace, or maybe devcore dump
retains it.

I presume that WEDGED=none case should retain the debug data somehow as
well.

> > > +With IOCTLs blocked and device already 'wedged', all device memory should

btw. when I see "blocked" I think of the function call not returning
yet. But in this patch "blocked" seems to be synonymous for "returns
an error immediately". Would it be possible to avoid the word "blocked"
for this?

> > > +be unmapped and file descriptors should be closed to prevent leaks.  
> > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > hotplug with matching "remove" and "add" udev events. As long as the
> > application cleans up resources related to the device when it receives
> > the event, there should be no leaks with a normal hotunplug... Is this
> > different enough that we can't have the same expectations?  
> 
> The thing about "remove" event is that it is generated *after* we opt for an
> unbind, and at that point it might be already too late if userspace doesn't
> get enough time to clean things up while the device is removed with a live
> client resulting in unknown consequences.
> 
> The idea here is to clean things up *before* we opt for an unbind leaving
> no room for side effects.

Something here feels fragile. There should not be a deadline for
userspace to finish cleaning up. What was described for KMS device nodes
in this same document seems like a more reliable approach: keep the
dead driver instance around until userspace has closed all references
to it. The device node could be removed earlier.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2025-01-27 10:23       ` Pekka Paalanen
@ 2025-01-28  9:37         ` Raag Jadav
  2025-01-28 11:38           ` Pekka Paalanen
  0 siblings, 1 reply; 29+ messages in thread
From: Raag Jadav @ 2025-01-28  9:37 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Xaver Hugl, airlied, simona, lucas.demarchi, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
	christian.koenig, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, anshuman.gupta,
	alexander.deucher, andrealmeid, amd-gfx, kernel-dev

On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> On Wed, 22 Jan 2025 07:22:25 +0200
> Raag Jadav <raag.jadav@intel.com> wrote:
> 
> > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > > > +It is the responsibility of the consumer to make sure that the device or
> > > > +its resources are not in use by any process before attempting recovery.  
> > > I'm not convinced this is actually doable in practice, outside of
> > > killing all apps that aren't the one trying to recover the GPU.
> > > Is this just about not crashing those processes if they don't handle
> > > GPU hotunplugs well, about leaks, or something else?  
> > 
> > Correct, all of it. And since the compositor is in charge of device resources,
> > this way it atleast has the opportunity to recover the device and recreate
> > context without all the userspace violence.
> 
> Hi Raag,
> 
> sorry, I haven't followed this series, so I wonder, why should
> userspace be part of recovering the device? Why doesn't the kernel
> automatically load a new driver instance with a new DRM device node?

There are things like bus level reset (PCI SBR) and re-enumeration that are
not possible from driver context (or atleast I'm not aware of it), so a new
instance is just as useful/less as the old one.

> Of course userspace needs to deal with stuff suddenly erroring out, and
> destroy existing related resources, then wait for a working device
> to appear and rebuild all state. The kernel driver already needs to
> make the existing open stuff inert and harmless, why does it need an
> acknowledgement from userspace to unbind and re-bind?

Rebind is kind of a stepping stone to the above.

> > I'm not entirely aware of its feasibility though, perhaps something for the
> > consumers to experiment.
> 
> If consumers mean userspace, then no, not reliably. But the kernel can
> do it.

Can you please elaborate or refer to an example?

> I see in the commit message written:
> 
> 	"For example, if the driver supports multiple recovery methods,
> 	consumers can opt for the suitable one based on policy
> 	definition."
> 
> How could consumers know what to do? How can they guess what would be
> enough to recover the device? Isn't that the kernel driver's job to
> know?

Yes, 'WEDGED=' value are the known methods that are expected to work. The
policy is how the consumer can decide which one to opt for depending on the
scenario. For example, the less drastic method could work in most cases, but
you'd probably want to opt for a more drastic method for repeat offences or
perhaps if something more serious is discovered from "optional telemetry
collection".

> (More important for userspace would be know if dmabuf fds remain
> pointing to valid memory retaining its contents or if the contents are
> lost. Userspace cannot tell which device a dmabuf originates from,
> AFAIK, so this would need to be added in the generic dmabuf UAPI.)

Not sure if I understand, perhaps Christian can shed some light here.

> 	"Consumers can also choose to have the device available for
> 	debugging or additional data collection before performing the
> 	recovery."
> 
> Couldn't the wedged driver instance remain detached from the hardware
> while a new driver instance initializes? Then debug data remains until
> the wedged device is fully closed from userspace, or maybe devcore dump
> retains it.
> 
> I presume that WEDGED=none case should retain the debug data somehow as
> well.

Indeed, but it's optional so depends on the driver.

> > > > +With IOCTLs blocked and device already 'wedged', all device memory should
> 
> btw. when I see "blocked" I think of the function call not returning
> yet. But in this patch "blocked" seems to be synonymous for "returns
> an error immediately". Would it be possible to avoid the word "blocked"
> for this?

It is meant as "blocking the access", but fair enough. We can have a quick
fix later on if it's not too big of a concern.

> > > > +be unmapped and file descriptors should be closed to prevent leaks.  
> > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > hotplug with matching "remove" and "add" udev events. As long as the
> > > application cleans up resources related to the device when it receives
> > > the event, there should be no leaks with a normal hotunplug... Is this
> > > different enough that we can't have the same expectations?  
> > 
> > The thing about "remove" event is that it is generated *after* we opt for an
> > unbind, and at that point it might be already too late if userspace doesn't
> > get enough time to clean things up while the device is removed with a live
> > client resulting in unknown consequences.
> > 
> > The idea here is to clean things up *before* we opt for an unbind leaving
> > no room for side effects.
> 
> Something here feels fragile. There should not be a deadline for
> userspace to finish cleaning up. What was described for KMS device nodes
> in this same document seems like a more reliable approach: keep the
> dead driver instance around until userspace has closed all references
> to it. The device node could be removed earlier.

I'm not sure if I'm following here. The driver instance will exist as long
as the dead device exists, which the consumer can remove if/when it chooses
to trigger an unbind from userspace. There is no deadline for it.

The consumer can choose to rely on hotplug events if it wishes, but the point
here is that it doesn't guarantee a clean recovery in all cases.

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2025-01-28  9:37         ` Raag Jadav
@ 2025-01-28 11:38           ` Pekka Paalanen
  2025-01-29  7:04             ` Raag Jadav
  0 siblings, 1 reply; 29+ messages in thread
From: Pekka Paalanen @ 2025-01-28 11:38 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Xaver Hugl, airlied, simona, lucas.demarchi, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
	christian.koenig, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, anshuman.gupta,
	alexander.deucher, andrealmeid, amd-gfx, kernel-dev

[-- Attachment #1: Type: text/plain, Size: 5676 bytes --]

On Tue, 28 Jan 2025 11:37:53 +0200
Raag Jadav <raag.jadav@intel.com> wrote:

> On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> > On Wed, 22 Jan 2025 07:22:25 +0200
> > Raag Jadav <raag.jadav@intel.com> wrote:
> >   
> > > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:  
> > > > > +It is the responsibility of the consumer to make sure that the device or
> > > > > +its resources are not in use by any process before attempting recovery.    
> > > > I'm not convinced this is actually doable in practice, outside of
> > > > killing all apps that aren't the one trying to recover the GPU.
> > > > Is this just about not crashing those processes if they don't handle
> > > > GPU hotunplugs well, about leaks, or something else?    
> > > 
> > > Correct, all of it. And since the compositor is in charge of device resources,
> > > this way it atleast has the opportunity to recover the device and recreate
> > > context without all the userspace violence.  
> > 
> > Hi Raag,
> > 
> > sorry, I haven't followed this series, so I wonder, why should
> > userspace be part of recovering the device? Why doesn't the kernel
> > automatically load a new driver instance with a new DRM device node?  
> 
> There are things like bus level reset (PCI SBR) and re-enumeration that are
> not possible from driver context (or atleast I'm not aware of it), so a new
> instance is just as useful/less as the old one.

Ok, "not possible from driver context" is a key revelation. I wonder if
starting an overview section with that in the documentation would help
getting the right mindset.

Did I miss that somewhere?

I thought bus level reset meant resetting the PCI device by some bus
API. Clearly mistaken, I suppose you mean resetting the whole bus
including everything on it?

> > I see in the commit message written:
> > 
> > 	"For example, if the driver supports multiple recovery methods,
> > 	consumers can opt for the suitable one based on policy
> > 	definition."
> > 
> > How could consumers know what to do? How can they guess what would be
> > enough to recover the device? Isn't that the kernel driver's job to
> > know?  
> 
> Yes, 'WEDGED=' value are the known methods that are expected to work. The
> policy is how the consumer can decide which one to opt for depending on the
> scenario. For example, the less drastic method could work in most cases, but
> you'd probably want to opt for a more drastic method for repeat offences or
> perhaps if something more serious is discovered from "optional telemetry
> collection".

Aha, cool.

> > (More important for userspace would be know if dmabuf fds remain
> > pointing to valid memory retaining its contents or if the contents are
> > lost. Userspace cannot tell which device a dmabuf originates from,
> > AFAIK, so this would need to be added in the generic dmabuf UAPI.)  
> 
> Not sure if I understand, perhaps Christian can shed some light here.

A system might have multiple GPUs, and one GPU going down may leave all
the rest working as usual. A Wayland compositor would want to tell the
difference between still-good and possibly or definitely lost dmabufs
it received from its clients.

But this is off-topic in this thread I believe, nothing to the series
at hand.

> > > > > +With IOCTLs blocked and device already 'wedged', all device memory should  
> > 
> > btw. when I see "blocked" I think of the function call not returning
> > yet. But in this patch "blocked" seems to be synonymous for "returns
> > an error immediately". Would it be possible to avoid the word "blocked"
> > for this?  
> 
> It is meant as "blocking the access", but fair enough. We can have a quick
> fix later on if it's not too big of a concern.

Sure, I don't mind.

> > > > > +be unmapped and file descriptors should be closed to prevent leaks.    
> > > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > > hotplug with matching "remove" and "add" udev events. As long as the
> > > > application cleans up resources related to the device when it receives
> > > > the event, there should be no leaks with a normal hotunplug... Is this
> > > > different enough that we can't have the same expectations?    
> > > 
> > > The thing about "remove" event is that it is generated *after* we opt for an
> > > unbind, and at that point it might be already too late if userspace doesn't
> > > get enough time to clean things up while the device is removed with a live
> > > client resulting in unknown consequences.
> > > 
> > > The idea here is to clean things up *before* we opt for an unbind leaving
> > > no room for side effects.  
> > 
> > Something here feels fragile. There should not be a deadline for
> > userspace to finish cleaning up. What was described for KMS device nodes
> > in this same document seems like a more reliable approach: keep the
> > dead driver instance around until userspace has closed all references
> > to it. The device node could be removed earlier.  
> 
> I'm not sure if I'm following here. The driver instance will exist as long
> as the dead device exists, which the consumer can remove if/when it chooses
> to trigger an unbind from userspace. There is no deadline for it.

I was going by your words: "it might be already too late if userspace
doesn't get enough time to clean things up".

> The consumer can choose to rely on hotplug events if it wishes, but the point
> here is that it doesn't guarantee a clean recovery in all cases.

Clearly I don't understand the whole picture here. No worries,
nevermind.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 2/4] drm/doc: Document device wedged event
  2025-01-28 11:38           ` Pekka Paalanen
@ 2025-01-29  7:04             ` Raag Jadav
  0 siblings, 0 replies; 29+ messages in thread
From: Raag Jadav @ 2025-01-29  7:04 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Xaver Hugl, airlied, simona, lucas.demarchi, rodrigo.vivi,
	jani.nikula, andriy.shevchenko, lina, michal.wajdeczko,
	christian.koenig, intel-xe, intel-gfx, dri-devel,
	himal.prasad.ghimiray, aravind.iddamsetty, anshuman.gupta,
	alexander.deucher, andrealmeid, amd-gfx, kernel-dev

On Tue, Jan 28, 2025 at 01:38:09PM +0200, Pekka Paalanen wrote:
> On Tue, 28 Jan 2025 11:37:53 +0200
> Raag Jadav <raag.jadav@intel.com> wrote:
> 
> > On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> > > On Wed, 22 Jan 2025 07:22:25 +0200
> > > Raag Jadav <raag.jadav@intel.com> wrote:
> > >   
> > > > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:  
> > > > > > +It is the responsibility of the consumer to make sure that the device or
> > > > > > +its resources are not in use by any process before attempting recovery.    
> > > > > I'm not convinced this is actually doable in practice, outside of
> > > > > killing all apps that aren't the one trying to recover the GPU.
> > > > > Is this just about not crashing those processes if they don't handle
> > > > > GPU hotunplugs well, about leaks, or something else?    
> > > > 
> > > > Correct, all of it. And since the compositor is in charge of device resources,
> > > > this way it atleast has the opportunity to recover the device and recreate
> > > > context without all the userspace violence.  
> > > 
> > > Hi Raag,
> > > 
> > > sorry, I haven't followed this series, so I wonder, why should
> > > userspace be part of recovering the device? Why doesn't the kernel
> > > automatically load a new driver instance with a new DRM device node?  
> > 
> > There are things like bus level reset (PCI SBR) and re-enumeration that are
> > not possible from driver context (or atleast I'm not aware of it), so a new
> > instance is just as useful/less as the old one.
> 
> Ok, "not possible from driver context" is a key revelation. I wonder if
> starting an overview section with that in the documentation would help
> getting the right mindset.

Not "not possible" in a literal sense, but rather allowing something that
drastic and convoluted that is probably beyond the scope of the driver.

> Did I miss that somewhere?

The first two paragraphs are meant as an introduction. Let me know if
something's not translating so well.

> I thought bus level reset meant resetting the PCI device by some bus
> API. Clearly mistaken, I suppose you mean resetting the whole bus
> including everything on it?

I'm no PCI expert but yes, it is atleast my understanding.

...

> > > (More important for userspace would be know if dmabuf fds remain
> > > pointing to valid memory retaining its contents or if the contents are
> > > lost. Userspace cannot tell which device a dmabuf originates from,
> > > AFAIK, so this would need to be added in the generic dmabuf UAPI.)  
> > 
> > Not sure if I understand, perhaps Christian can shed some light here.
> 
> A system might have multiple GPUs, and one GPU going down may leave all
> the rest working as usual. A Wayland compositor would want to tell the
> difference between still-good and possibly or definitely lost dmabufs
> it received from its clients.

Usually 'DEVNAME=' and 'DEVPATH=' values refer to the device that generates
the event.

> But this is off-topic in this thread I believe, nothing to the series
> at hand.

...

> > > > > > +be unmapped and file descriptors should be closed to prevent leaks.    
> > > > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > > > hotplug with matching "remove" and "add" udev events. As long as the
> > > > > application cleans up resources related to the device when it receives
> > > > > the event, there should be no leaks with a normal hotunplug... Is this
> > > > > different enough that we can't have the same expectations?    
> > > > 
> > > > The thing about "remove" event is that it is generated *after* we opt for an
> > > > unbind, and at that point it might be already too late if userspace doesn't
> > > > get enough time to clean things up while the device is removed with a live
> > > > client resulting in unknown consequences.
> > > > 
> > > > The idea here is to clean things up *before* we opt for an unbind leaving
> > > > no room for side effects.  
> > > 
> > > Something here feels fragile. There should not be a deadline for
> > > userspace to finish cleaning up. What was described for KMS device nodes
> > > in this same document seems like a more reliable approach: keep the
> > > dead driver instance around until userspace has closed all references
> > > to it. The device node could be removed earlier.  
> > 
> > I'm not sure if I'm following here. The driver instance will exist as long
> > as the dead device exists, which the consumer can remove if/when it chooses
> > to trigger an unbind from userspace. There is no deadline for it.
> 
> I was going by your words: "it might be already too late if userspace
> doesn't get enough time to clean things up".

The idea here is to completely detach kernel and userspace *before* moving
forward with the recovery.

> > The consumer can choose to rely on hotplug events if it wishes, but the point
> > here is that it doesn't guarantee a clean recovery in all cases.
> 
> Clearly I don't understand the whole picture here. No worries,
> nevermind.

Less moving parts comes with more chances of success.

Raag

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-01-29  7:04 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
2024-11-29 13:40   ` André Almeida
2024-12-02  8:07     ` Raag Jadav
2024-12-03  5:00       ` Raag Jadav
2024-12-03 10:18         ` Christian König
2024-12-04 11:17           ` Raag Jadav
2024-12-11 17:14             ` Maxime Ripard
2024-12-12 10:37               ` Raag Jadav
2024-12-16 16:07                 ` Maxime Ripard
2024-12-12 18:31   ` André Almeida
2024-12-13 13:51     ` Raag Jadav
2024-11-28 15:37 ` [PATCH v10 2/4] drm/doc: Document " Raag Jadav
2024-12-12 18:50   ` André Almeida
2024-12-17  8:42     ` Raag Jadav
2024-12-17 11:57       ` André Almeida
2025-01-21  1:14   ` Xaver Hugl
2025-01-22  5:22     ` Raag Jadav
2025-01-27 10:23       ` Pekka Paalanen
2025-01-28  9:37         ` Raag Jadav
2025-01-28 11:38           ` Pekka Paalanen
2025-01-29  7:04             ` Raag Jadav
2024-11-28 15:37 ` [PATCH v10 3/4] drm/xe: Use " Raag Jadav
2024-11-28 15:37 ` [PATCH v10 4/4] drm/i915: " Raag Jadav
2024-11-28 17:45 ` ✗ Fi.CI.CHECKPATCH: warning for Introduce DRM device wedged event (rev8) Patchwork
2024-11-28 17:45 ` ✗ Fi.CI.SPARSE: " Patchwork
2024-11-28 17:59 ` ✗ i915.CI.BAT: failure " Patchwork
2025-01-21  0:59 ` [PATCH v10 0/4] Introduce DRM device wedged event Xaver Hugl
2025-01-22  4:48   ` Raag Jadav

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox