* Re: [PATCH v7 1/5] drm: Introduce device wedged event
[not found] ` <20240930073845.347326-2-raag.jadav@intel.com>
@ 2024-10-17 2:47 ` Raag Jadav
2024-10-17 7:59 ` Christian König
0 siblings, 1 reply; 7+ messages in thread
From: Raag Jadav @ 2024-10-17 2:47 UTC (permalink / raw)
To: airlied, simona, lucas.demarchi, thomas.hellstrom, rodrigo.vivi,
jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina
Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
christian.koenig, friedrich.vock, michel, joshua,
alexander.deucher, andrealmeid, amd-gfx
On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
>
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
>
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
>
> =============== ==================================
> Recovery method Consumer expectations
> =============== ==================================
> rebind unbind + rebind driver
> bus-reset unbind + reset bus device + rebind
> reboot reboot system
> =============== ==================================
>
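For consumers that would rather use a small helper than a raw udev rule, the mapping in the table above could be sketched roughly as follows. This is only an illustration: `enum wedged_action` and `parse_wedged()` are hypothetical names, and the input is assumed to be the NULL-terminated list of KEY=value strings delivered with a uevent. Only the `WEDGED=<method>` key and the three method strings come from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace-side view of the recovery methods in the table. */
enum wedged_action {
	WEDGED_ACTION_NONE,		/* no WEDGED key, or an unknown method */
	WEDGED_ACTION_REBIND,		/* unbind + rebind driver */
	WEDGED_ACTION_BUS_RESET,	/* unbind + reset bus device + rebind */
	WEDGED_ACTION_REBOOT,		/* reboot system */
};

/*
 * Scan a NULL-terminated uevent environment for "WEDGED=<method>" and
 * translate the method string into an action. Unknown methods are
 * reported as WEDGED_ACTION_NONE so the caller can ignore them.
 */
static enum wedged_action parse_wedged(const char *const envp[])
{
	static const char key[] = "WEDGED=";
	const size_t key_len = sizeof(key) - 1;

	for (size_t i = 0; envp[i]; i++) {
		const char *value;

		if (strncmp(envp[i], key, key_len))
			continue;

		value = envp[i] + key_len;
		if (!strcmp(value, "rebind"))
			return WEDGED_ACTION_REBIND;
		if (!strcmp(value, "bus-reset"))
			return WEDGED_ACTION_BUS_RESET;
		if (!strcmp(value, "reboot"))
			return WEDGED_ACTION_REBOOT;

		return WEDGED_ACTION_NONE;
	}

	return WEDGED_ACTION_NONE;
}
```

A udev helper or a small daemon could call parse_wedged() on each change event and dispatch to whatever policy the administrator has configured. What each action actually does is left to the consumer.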
> v4: s/drm_dev_wedged/drm_dev_wedged_event
> Use drm_info() (Jani)
> Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
> Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
> Aesthetic adjustments (Andy)
> Handle invalid method cases
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
Cc'ing amd, collabora and others as I found semi-related work at
https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/
https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/
https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/
Please share feedback about usefulness and adoption of this.
Improvements are welcome.
Raag
> drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
> include/drm/drm_device.h | 23 ++++++++++++
> include/drm/drm_drv.h | 3 ++
> 3 files changed, 103 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..cfe9600da2ee 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
> * DEALINGS IN THE SOFTWARE.
> */
>
> +#include <linux/array_size.h>
> +#include <linux/build_bug.h>
> #include <linux/debugfs.h>
> #include <linux/fs.h>
> #include <linux/module.h>
> @@ -33,6 +35,7 @@
> #include <linux/mount.h>
> #include <linux/pseudo_fs.h>
> #include <linux/slab.h>
> +#include <linux/sprintf.h>
> #include <linux/srcu.h>
> #include <linux/xarray.h>
>
> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>
> DEFINE_STATIC_SRCU(drm_unplug_srcu);
>
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> + [DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> + [DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> + [DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};
> +
> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> + static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> +
> + return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}
> +
> +/**
> + * drm_wedge_recovery_name - provide wedge recovery name
> + * @method: method to be used for recovery
> + *
> + * This validates wedge recovery @method against the available ones in
> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> + * format if found valid.
> + *
> + * Returns: pointer to const recovery string on success, NULL otherwise.
> + */
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> +{
> + if (drm_wedge_recovery_is_valid(method))
> + return drm_wedge_recovery_opts[method];
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL(drm_wedge_recovery_name);
> +
> /*
> * DRM Minors
> * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
> }
> EXPORT_SYMBOL(drm_dev_unplug);
>
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + * @method: method to be used for recovery
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev.
> + * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> + * userspace may take respective action to recover the device.
> + *
> + * Returns: 0 on success, or negative error code otherwise.
> + */
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> +{
> + /* Event string length up to 16+ characters with available methods */
> + char event_string[32] = {};
> + char *envp[] = { event_string, NULL };
> + const char *recovery;
> +
> + recovery = drm_wedge_recovery_name(method);
> + if (!recovery) {
> + drm_err(dev, "device wedged, invalid recovery method %d\n", method);
> + return -EINVAL;
> + }
> +
> + if (!test_bit(method, &dev->wedge_recovery)) {
> + drm_err(dev, "device wedged, %s based recovery not supported\n",
> + drm_wedge_recovery_name(method));
> + return -EOPNOTSUPP;
> + }
> +
> + snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> +
> + drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
> + return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
> /*
> * DRM internal mount
> * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d..fed6f20e52fb 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -40,6 +40,26 @@ enum switch_power_state {
> DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
> };
>
> +/**
> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> + * Drivers can choose to support any one or multiple of them depending on
> + * their needs.
> + */
> +enum drm_wedge_recovery {
> + /** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> + DRM_WEDGE_RECOVERY_REBIND,
> +
> + /** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> + DRM_WEDGE_RECOVERY_BUS_RESET,
> +
> + /** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> + DRM_WEDGE_RECOVERY_REBOOT,
> +
> + /** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> + DRM_WEDGE_RECOVERY_MAX
> +};
> +
> /**
> * struct drm_device - DRM device structure
> *
> @@ -317,6 +337,9 @@ struct drm_device {
> * Root directory for debugfs files.
> */
> struct dentry *debugfs_root;
> +
> + /** @wedge_recovery: Supported recovery methods for wedged device */
> + unsigned long wedge_recovery;
> };
>
> #endif
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 02ea4e3248fd..d8dbc77010b0 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
> void drm_dev_exit(int idx);
> void drm_dev_unplug(struct drm_device *dev);
>
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> +
> /**
> * drm_dev_is_unplugged - is a DRM device unplugged
> * @dev: DRM device
> --
> 2.34.1
>
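For anyone following the control flow of drm_dev_wedged_event() without a kernel tree at hand, here is a userspace model of it. It is a sketch, not the kernel code: the `model_` names are made up for illustration, test_bit() is replaced by a plain shift-and-mask, and the uevent is replaced by formatting the string into a caller-supplied buffer. It assumes a POSIX `<errno.h>` that defines EOPNOTSUPP.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Mirrors the layout of enum drm_wedge_recovery from the patch. */
enum model_recovery {
	MODEL_RECOVERY_REBIND,
	MODEL_RECOVERY_BUS_RESET,
	MODEL_RECOVERY_REBOOT,
	MODEL_RECOVERY_MAX	/* for bounds checking only */
};

static const char *const model_recovery_opts[] = {
	[MODEL_RECOVERY_REBIND] = "rebind",
	[MODEL_RECOVERY_BUS_RESET] = "bus-reset",
	[MODEL_RECOVERY_REBOOT] = "reboot",
};

/* Returns the method name, or NULL if @method is out of range. */
static const char *model_recovery_name(int method)
{
	if (method >= MODEL_RECOVERY_REBIND && method < MODEL_RECOVERY_MAX)
		return model_recovery_opts[method];

	return NULL;
}

/*
 * @supported is a bitmask in which bit N set means method N is supported,
 * which is how the patch uses drm_device.wedge_recovery with test_bit().
 * On success, "WEDGED=<method>" is written to @buf and 0 is returned.
 */
static int model_wedged_event(unsigned long supported, int method,
			      char *buf, size_t len)
{
	const char *recovery = model_recovery_name(method);

	if (!recovery)
		return -EINVAL;		/* not a known method */

	if (!(supported & (1UL << method)))
		return -EOPNOTSUPP;	/* known, but not enabled for this device */

	snprintf(buf, len, "WEDGED=%s", recovery);
	return 0;
}
```

The point to notice is that the enum values are bit positions, not bit masks: a device supporting rebind and bus-reset would have bits 0 and 1 set in its mask.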
* Re: [PATCH v7 1/5] drm: Introduce device wedged event
2024-10-17 2:47 ` [PATCH v7 1/5] drm: Introduce device wedged event Raag Jadav
@ 2024-10-17 7:59 ` Christian König
2024-10-17 16:43 ` Rodrigo Vivi
0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2024-10-17 7:59 UTC (permalink / raw)
To: Raag Jadav, airlied, simona, lucas.demarchi, thomas.hellstrom,
rodrigo.vivi, jani.nikula, andriy.shevchenko, joonas.lahtinen,
tursulin, lina
Cc: intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
amd-gfx
Am 17.10.24 um 04:47 schrieb Raag Jadav:
> On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
>> Introduce device wedged event, which will notify userspace of wedged
>> (hanged/unusable) state of the DRM device through a uevent. This is
>> useful especially in cases where the device is no longer operating as
>> expected even after a hardware reset and has become unrecoverable from
>> driver context.
Well, "introduce" is probably the wrong wording, since i915 already has that
and amdgpu looked into it but never upstreamed the support.
I would rather say "standardize".
>>
>> Purpose of this implementation is to provide drivers a generic way to
>> recover with the help of userspace intervention. Different drivers may
>> have different ideas of a "wedged device" depending on their hardware
>> implementation, and hence the vendor agnostic nature of the event.
>> It is up to the drivers to decide when they see the need for recovery
>> and how they want to recover from the available methods.
>>
>> Current implementation defines three recovery methods, out of which,
>> drivers can choose to support any one or multiple of them. Preferred
>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>> Userspace consumers (sysadmin) can define udev rules to parse this event
>> and take respective action to recover the device.
>>
>> =============== ==================================
>> Recovery method Consumer expectations
>> =============== ==================================
>> rebind unbind + rebind driver
>> bus-reset unbind + reset bus device + rebind
>> reboot reboot system
>> =============== ==================================
Well, that sounds like userspace would need to be involved in recovery.
That in turn is a complete no-go, since we at least need to signal all
dma_fences to unblock the kernel. In other words, things like a bus reset
need to happen inside the kernel and *not* in userspace.
What we can do is to signal to userspace: Hey, a bus reset of device X
happened, maybe restart the container, daemon, or whatever service was
using this device.
Regards,
Christian.
* Re: [PATCH v7 1/5] drm: Introduce device wedged event
2024-10-17 7:59 ` Christian König
@ 2024-10-17 16:43 ` Rodrigo Vivi
2024-10-18 10:58 ` Christian König
0 siblings, 1 reply; 7+ messages in thread
From: Rodrigo Vivi @ 2024-10-17 16:43 UTC (permalink / raw)
To: Christian König
Cc: Raag Jadav, airlied, simona, lucas.demarchi, thomas.hellstrom,
jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
amd-gfx
On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> Am 17.10.24 um 04:47 schrieb Raag Jadav:
> > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> > > Introduce device wedged event, which will notify userspace of wedged
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected even after a hardware reset and has become unrecoverable from
> > > driver context.
>
> Well introduce is probably the wrong wording since i915 already has that and
> amdgpu looked into it but never upstreamed the support.
In i915 we have the reset and error uevents, but not one specific to 'wedge'.
This would indeed be a new one.
>
> I would rather say standardize.
>
> > >
> > > Purpose of this implementation is to provide drivers a generic way to
> > > recover with the help of userspace intervention. Different drivers may
> > > have different ideas of a "wedged device" depending on their hardware
> > > implementation, and hence the vendor agnostic nature of the event.
> > > It is up to the drivers to decide when they see the need for recovery
> > > and how they want to recover from the available methods.
> > >
> > > Current implementation defines three recovery methods, out of which,
> > > drivers can choose to support any one or multiple of them. Preferred
> > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > and take respective action to recover the device.
> > >
> > > =============== ==================================
> > > Recovery method Consumer expectations
> > > =============== ==================================
> > > rebind unbind + rebind driver
> > > bus-reset unbind + reset bus device + rebind
> > > reboot reboot system
> > > =============== ==================================
>
> Well that sounds like userspace would need to be involved in recovery.
>
> That in turn is a complete no-go since we at least need to signal all
> dma_fences to unblock the kernel. In other words things like bus reset needs
> to happen inside the kernel and *not* in userspace.
>
> What we can do is to signal to userspace: Hey a bus reset of device X
> happened, maybe restart container, daemon, whatever service which was using
> this device.
Well, when we declare a device 'wedged' it is because we don't want to take
any drastic measures inside the kernel and want to leave it in a protected
and unusable state, in a way that users wouldn't lose display, for instance,
or so that the device is at least in a debuggable state.
Then, the instructions here are to tell what could possibly be attempted
from userspace to get the device to a usable state.
The 'wedge' mode (the one emitting this uevent) needs to be responsible
for signaling all the fences and everything needed for a clean unbind
and whatever next step might be indicated to userspace.
That should already be part of any wedged mode, regardless of the uevent
to inform the userspace here.
>
> Regards,
> Christian.
* Re: [PATCH v7 1/5] drm: Introduce device wedged event
2024-10-17 16:43 ` Rodrigo Vivi
@ 2024-10-18 10:58 ` Christian König
2024-10-18 12:46 ` Raag Jadav
0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2024-10-18 10:58 UTC (permalink / raw)
To: Rodrigo Vivi
Cc: Raag Jadav, airlied, simona, lucas.demarchi, thomas.hellstrom,
jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
amd-gfx
Am 17.10.24 um 18:43 schrieb Rodrigo Vivi:
> On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
>>>> Purpose of this implementation is to provide drivers a generic way to
>>>> recover with the help of userspace intervention. Different drivers may
>>>> have different ideas of a "wedged device" depending on their hardware
>>>> implementation, and hence the vendor agnostic nature of the event.
>>>> It is up to the drivers to decide when they see the need for recovery
>>>> and how they want to recover from the available methods.
>>>>
>>>> Current implementation defines three recovery methods, out of which,
>>>> drivers can choose to support any one or multiple of them. Preferred
>>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>>>> Userspace consumers (sysadmin) can define udev rules to parse this event
>>>> and take respective action to recover the device.
>>>>
>>>> =============== ==================================
>>>> Recovery method Consumer expectations
>>>> =============== ==================================
>>>> rebind unbind + rebind driver
>>>> bus-reset unbind + reset bus device + rebind
>>>> reboot reboot system
>>>> =============== ==================================
>> Well that sounds like userspace would need to be involved in recovery.
>>
>> That in turn is a complete no-go since we at least need to signal all
>> dma_fences to unblock the kernel. In other words things like bus reset needs
>> to happen inside the kernel and *not* in userspace.
>>
>> What we can do is to signal to userspace: Hey a bus reset of device X
>> happened, maybe restart container, daemon, whatever service which was using
>> this device.
> Well, when we declare device 'wedged' it is because we don't want to take
> any drastic measures inside the kernel and want to leave it in a protected
> and unusable state. In a way that users wouldn't lose display for instance,
> or at least the device is in a debuggable state.
Uff, that needs to be very very well documented or otherwise the whole
approach is an absolutely clear NAK from my side as DMA-buf maintainer.
>
> Then, the instructions here are to tell what could possibly be attempted
> from userspace to get the device to a usable state.
>
> The 'wedge' mode (the one emitting this uevent) needs to be responsible
> for signaling all the fences and everything needed for a clean unbind
> and whatever next step might be indicated to userspace.
>
> That should already be part of any wedged mode, regardless of the uevent
> to inform the userspace here.
You need to approach that from a different side. With the current patch
set you are ignoring documented mandatory driver behavior as far as I
can see.
So first of all describe in the documentation what the wedged mode is
and what requirements a driver has to fulfill to enter it:
https://docs.kernel.org/gpu/drm-uapi.html#device-reset
Especially document that all system memory accesses of the device need
to be blocked by (for example) disabling DMA accesses in the PCI config
space.
When it is guaranteed that the device can't access any system memory any
more the device driver should signal all pending fences of this device.
And only after all of that is done the driver can send an uevent to
inform userspace that it can debug the hanged state.
As far as I can see this makes the enum how to recover the device
superfluous because you will most likely always need a bus reset to get
out of this again.
Regards,
Christian.
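The ordering requested in this message (block the device's access to system
memory, then signal all pending fences, and only then send the uevent) can be
illustrated with a small toy model. This is a minimal sketch in Python, not
kernel code: the names Step, REQUIRED_ORDER and WedgeSequence are hypothetical
and exist only to show that each step depends on the ones before it.

```python
from enum import Enum, auto


class Step(Enum):
    BLOCK_DMA = auto()      # device can no longer access system memory
    SIGNAL_FENCES = auto()  # all pending fences of the device are signalled
    SEND_UEVENT = auto()    # userspace is informed about the wedged state


# Order taken from the message above; every step requires all earlier ones.
REQUIRED_ORDER = [Step.BLOCK_DMA, Step.SIGNAL_FENCES, Step.SEND_UEVENT]


class WedgeSequence:
    """Toy model (hypothetical) that rejects steps performed out of order."""

    def __init__(self):
        self.done = []

    def perform(self, step):
        index = len(self.done)
        if index >= len(REQUIRED_ORDER) or REQUIRED_ORDER[index] is not step:
            raise RuntimeError(f"{step.name} is out of order")
        self.done.append(step)

    def may_notify_userspace(self):
        # The uevent is only allowed once DMA is blocked and fences signalled.
        return self.done[:2] == REQUIRED_ORDER[:2]
```

Calling perform(Step.SEND_UEVENT) on a fresh WedgeSequence raises, which is
the point of the requested documentation: the uevent is the last step, not
the first.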
* Re: [PATCH v7 1/5] drm: Introduce device wedged event
2024-10-18 10:58 ` Christian König
@ 2024-10-18 12:46 ` Raag Jadav
2024-10-18 12:54 ` Christian König
0 siblings, 1 reply; 7+ messages in thread
From: Raag Jadav @ 2024-10-18 12:46 UTC (permalink / raw)
To: Christian König
Cc: Rodrigo Vivi, airlied, simona, lucas.demarchi, thomas.hellstrom,
jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
amd-gfx
On Fri, Oct 18, 2024 at 12:58:09PM +0200, Christian König wrote:
> On 17.10.24 at 18:43, Rodrigo Vivi wrote:
> > On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> > > > > Purpose of this implementation is to provide drivers a generic way to
> > > > > recover with the help of userspace intervention. Different drivers may
> > > > > have different ideas of a "wedged device" depending on their hardware
> > > > > implementation, and hence the vendor agnostic nature of the event.
> > > > > It is up to the drivers to decide when they see the need for recovery
> > > > > and how they want to recover from the available methods.
> > > > >
> > > > > Current implementation defines three recovery methods, out of which,
> > > > > drivers can choose to support any one or multiple of them. Preferred
> > > > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > > > and take respective action to recover the device.
> > > > >
> > > > > =============== ==================================
> > > > > Recovery method Consumer expectations
> > > > > =============== ==================================
> > > > > rebind unbind + rebind driver
> > > > > bus-reset unbind + reset bus device + rebind
> > > > > reboot reboot system
> > > > > =============== ==================================
> > > Well that sounds like userspace would need to be involved in recovery.
> > >
> > > That in turn is a complete no-go since we at least need to signal all
> > > dma_fences to unblock the kernel. In other words things like bus reset needs
> > > to happen inside the kernel and *not* in userspace.
> > >
> > > What we can do is to signal to userspace: Hey a bus reset of device X
> > > happened, maybe restart container, daemon, whatever service which was using
> > > this device.
> > Well, when we declare device 'wedged' it is because we don't want to take
> > any drastic measures inside the kernel and want to leave it in a protected
> > and unusable state. In a way that users wouldn't lose display for instance,
> > or at least the device is in a debuggable state.
>
> Uff, that needs to be very very well documented or otherwise the whole
> approach is an absolutely clear NAK from my side as DMA-buf maintainer.
>
> >
> > Then, the instructions here are to tell what could possibly be attempted
> > from userspace to get the device to a usable state.
> >
> > The 'wedge' mode (the one emitting this uevent) needs to be responsible
> > for signaling all the fences and everything needed for a clean unbind
> > and whatever next step might be indicated to userspace.
> >
> > That should already be part of any wedged mode, regardless of the uevent
> > to inform the userspace here.
>
> You need to approach that from a different side. With the current patch set
> you are ignoring documented mandatory driver behavior as far as I can see.
>
> So first of all describe in the documentation what the wedged mode is and
> what requirements a driver has to fulfill to enter it:
> https://docs.kernel.org/gpu/drm-uapi.html#device-reset
>
> Especially document that all system memory accesses of the device need to
> be blocked by (for example) disabling DMA accesses in the PCI config space.
>
> When it is guaranteed that the device can't access any system memory any
> more the device driver should signal all pending fences of this device.
>
> And only after all of that is done the driver can send an uevent to inform
> userspace that it can debug the hanged state.
Sure, will do.
> As far as I can see this makes the enum how to recover the device
> superfluous because you will most likely always need a bus reset to get out
> of this again.
That depends on the kind of fault the device has encountered and the bus it is
sitting on. There could be buses that don't support reset.
Raag
* Re: [PATCH v7 1/5] drm: Introduce device wedged event
2024-10-18 12:46 ` Raag Jadav
@ 2024-10-18 12:54 ` Christian König
2024-10-18 14:09 ` Raag Jadav
0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2024-10-18 12:54 UTC (permalink / raw)
To: Raag Jadav
Cc: Rodrigo Vivi, airlied, simona, lucas.demarchi, thomas.hellstrom,
jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
amd-gfx
On 18.10.24 at 14:46, Raag Jadav wrote:
>> As far as I can see this makes the enum how to recover the device
>> superfluous because you will most likely always need a bus reset to get out
>> of this again.
> That depends on the kind of fault the device has encountered and the bus it is
> sitting on. There could be buses that don't support reset.
That is even more an argument to not expose this in the uevent.
Getting the device working again is strongly device dependent and can't
be handled in a generic way.
Regards,
Christian.
>
> Raag
* Re: [PATCH v7 1/5] drm: Introduce device wedged event
2024-10-18 12:54 ` Christian König
@ 2024-10-18 14:09 ` Raag Jadav
0 siblings, 0 replies; 7+ messages in thread
From: Raag Jadav @ 2024-10-18 14:09 UTC (permalink / raw)
To: Christian König
Cc: Rodrigo Vivi, airlied, simona, lucas.demarchi, thomas.hellstrom,
jani.nikula, andriy.shevchenko, joonas.lahtinen, tursulin, lina,
intel-xe, intel-gfx, dri-devel, himal.prasad.ghimiray,
francois.dugast, aravind.iddamsetty, anshuman.gupta, andi.shyti,
matthew.d.roper, boris.brezillon, adrian.larumbe, kernel, maraeo,
friedrich.vock, michel, joshua, alexander.deucher, andrealmeid,
amd-gfx
On Fri, Oct 18, 2024 at 02:54:38PM +0200, Christian König wrote:
> On 18.10.24 at 14:46, Raag Jadav wrote:
> > > As far as I can see this makes the enum how to recover the device
> > > superfluous because you will most likely always need a bus reset to get out
> > > of this again.
> > That depends on the kind of fault the device has encountered and the bus it is
> > sitting on. There could be buses that don't support reset.
>
> That is even more an argument to not expose this in the uevent.
>
> Getting the device working again is strongly device dependent and can't be
> handled in a generic way.
My understanding is that the proposed methods can be handled in a generic way
and are useful for the devices that do support it. This way the userspace can
atleast have a hint about recovery.
For others we can have something like WEDGED=none (as proposed by Michal and
Lucas in other threads) and let admin/user decide how to deal with it.
Raag
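To make the hint idea concrete, here is a minimal userspace-side sketch in
Python of how a consumer might read WEDGED=<method> from a uevent environment
and look up the expectation from the table in the patch description. It
assumes the environment arrives as KEY=VALUE lines; the function names
(parse_uevent_env, recovery_hint), the example keys other than WEDGED, and the
choice to treat an unrecognised method like "none" are all assumptions made
for illustration, not part of the patch.

```python
# Recovery methods from the patch description, plus the "none" value
# proposed in the thread. The description strings are illustrative only.
ACTIONS = {
    "rebind": "unbind + rebind driver",
    "bus-reset": "unbind + reset bus device + rebind",
    "reboot": "reboot system",
    "none": "no hint; leave the decision to the admin/user",
}


def parse_uevent_env(lines):
    """Turn KEY=VALUE lines into a dict, ignoring lines without '='."""
    env = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value.strip()
    return env


def recovery_hint(env):
    """Return the consumer expectation for WEDGED=<method>, or None."""
    method = env.get("WEDGED")
    if method is None:
        return None  # not a wedged event
    # Assumption: an unrecognised method is handled like "none".
    return ACTIONS.get(method, ACTIONS["none"])


if __name__ == "__main__":
    event = parse_uevent_env(["ACTION=change", "WEDGED=bus-reset"])
    print(recovery_hint(event))
```

Whether such a lookup can really be generic is exactly what is disputed in
this thread; the sketch only shows what the consumer side of the hint would
look like.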
end of thread, other threads:[~2024-10-19 13:05 UTC | newest]
Thread overview: 7+ messages
[not found] <20240930073845.347326-1-raag.jadav@intel.com>
[not found] ` <20240930073845.347326-2-raag.jadav@intel.com>
2024-10-17 2:47 ` [PATCH v7 1/5] drm: Introduce device wedged event Raag Jadav
2024-10-17 7:59 ` Christian König
2024-10-17 16:43 ` Rodrigo Vivi
2024-10-18 10:58 ` Christian König
2024-10-18 12:46 ` Raag Jadav
2024-10-18 12:54 ` Christian König
2024-10-18 14:09 ` Raag Jadav