Re: [PATCH v8 2/4] drm/doc: Document device wedged event

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Raag Jadav <raag.jadav@intel.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	michal.wajdeczko@intel.com, intel-xe@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	himal.prasad.ghimiray@intel.com,
	aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	alexander.deucher@amd.com, andrealmeid@igalia.com,
	amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH v8 2/4] drm/doc: Document device wedged event
Date: Fri, 1 Nov 2024 06:26:21 +0200	[thread overview]
Message-ID: <ZyRYbSm2ki7yUTTz@black.fi.intel.com> (raw)
In-Reply-To: <69b81f6d-203b-451e-935b-085216dfebd1@amd.com>

On Tue, Oct 29, 2024 at 10:51:34AM +0100, Christian König wrote:
> Am 25.10.24 um 10:48 schrieb Raag Jadav:
> > Add documentation for device wedged event in a new 'Device wedging'
> > chapter. The describes basic definitions and consumer expectations
> > along with an example.
> > 
> > v8: Improve documentation (Christian, Rodrigo)
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >   Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
> >   1 file changed, 75 insertions(+)
> > 
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 370d820be248..11a7446233b5 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -362,6 +362,81 @@ the first place. DRM devices should make use of devcoredump to store relevant
> >   information about the reset, so this information can be added to user bug
> >   reports.
> > +Device wedging
> > +==============
> > +
> > +Drivers can optionally make use of device wedged event (implemented as
> > +drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
> > +(hanged/unusable) state of the DRM device through a uevent. This is useful
> > +especially in cases where the device is no longer operating as expected even
> > +after a reset and has become unrecoverable from driver context. Purpose of
> > +this implementation is to provide drivers a generic way to recover with the
> > +help of userspace intervention without taking any drastic measures in the
> > +driver.
> > +
> > +A 'wedged' device is basically a dead device that needs attention. The
> > +uevent is the notification that is sent to userspace along with a hint about
> > +what could possibly be attempted to recover the device and bring it back to
> > +usable state. Different drivers may have different ideas of a 'wedged' device
> > +depending on their hardware implementation, and hence the vendor agnostic
> > +nature of the event. It is up to the drivers to decide when they see the need
> > +for recovery and how they want to recover from the available methods.
> > +
> > +Recovery
> > +--------
> > +
> > +Current implementation defines two recovery methods, out of which, drivers
> > +can use any one, both or none. Method(s) of choice will be sent in the uevent
> > +environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side
> > +effects. If driver is unsure about recovery or method is unknown (like reboot,
> > +firmware flashing, hardware replacement or any other procedure which can't be
> > +attempted on the fly), ``WEDGED=none`` will be sent instead.
> > +
> > +It is the responsibility of the driver to perform required cleanups (like
> > +disabling system memory access or signalling dma_fences) and prepare itself
> > +for the recovery before sending the event.
> 
> That needs to be more like a warning and should have a bit more text. Maybe
> even a separate section.
> 
> Something like this maybe:
> 
> Prerequisites
> ------------------
> 
> Drivers needs to make sure that the device won't harm the system as a
> whole by keeping it in a wedged state. Necessary actions must include
> disabling DMA to system memory as well as communication channels
> with other devices.
> Further drivers must ensure that all dma_fences are signaled and other
> state the core kernel might depend on are cleaned up.
> All ongoing waiting for hardware state changes must be aborted and
> new accesses to the hardware sufficiently blocked....
> 
> 
> I'm not a native speaker of English, so feel free to re-phrase that. But the

Neither am I, but will try my best.

> general approach should be like don't do this before you have made sure a, b
> and c.

Sure, thanks.

> >   Once the event is sent, driver
> > +should block all IOCTLs with an error code.
> 
> Better define which error code that should be. I think -ENODEV similar to a
> hotplug case would be appropriate.

Why not leave it to the drivers to decide depending on the type of failure?

> >   This will signify the reason for
> > +wegeding which can be reported to the application if needed.
> > +
> > +Userspace consumers can parse this event and attempt recovery as per below
> > +expectations.
> > +
> > +    =============== ==================================
> > +    Recovery method Consumer expectations
> > +    =============== ==================================
> > +    rebind          unbind + rebind driver
> > +    bus-reset       unbind + reset bus device + rebind
> > +    none            admin/user policy
> > +    =============== ==================================
> > +
> > +Example for rebind
> > +~~~~~~~~~~~~~~~~~~
> > +
> > +Udev rule::
> > +
> > +    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
> > +    RUN+="/path/to/rebind.sh $env{DEVPATH}"
> > +
> > +Recovery script::
> > +
> > +    #!/bin/sh
> > +
> > +    DEVPATH=$(readlink -f /sys/$1/device)
> > +    DEVICE=$(basename $DEVPATH)
> > +    DRIVER=$(readlink -f $DEVPATH/driver)
> > +
> > +    echo -n $DEVICE > $DRIVER/unbind
> > +    sleep 1
> > +    echo -n $DEVICE > $DRIVER/bind
> > +
> > +Although scripts are simple enough for basic recovery, admin/users can define
> > +customized policies around recovery action. For example if the driver supports
> > +multiple recovery methods, consumers can opt for the suitable one based on
> > +policy definition. Consumers can also take additional steps like gathering
> > +telemetry information (devcoredump, syslog)
> 
> I'm really wondering if we shouldn't standardize successful resets with this
> event as well?
> 
> E.g. like there was an issue with device X, please collect the devcoredump.

This seems to fit into WEDGED=none case, although with this we'd probably
need to redefine what 'wedged' means.

> And then as a second step have the WEDGED property to indicate:
> a) reset successful, no actions needed.
> b) detach and rebind from the bus
> c) bus-reset
> d) impossible to recover but system as a whole can continue to work.
> e) it's on fire! Run sync and shut down power immediately.

Raag

next prev parent reply	other threads:[~2024-11-01  4:26 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-25  8:48 [PATCH v8 0/4] Introduce DRM device wedged event Raag Jadav
2024-10-25  8:48 ` [PATCH v8 1/4] drm: Introduce " Raag Jadav
2024-10-25  9:08   ` Jani Nikula
2024-10-25 14:45     ` Andy Shevchenko
2024-10-26 15:27       ` Raag Jadav
2024-10-28  8:50     ` Jani Nikula
2024-10-26  6:15   ` kernel test robot
2024-10-26 10:10   ` kernel test robot
2024-10-25  8:48 ` [PATCH v8 2/4] drm/doc: Document " Raag Jadav
2024-10-29  9:51   ` Christian König
2024-11-01  4:26     ` Raag Jadav [this message]
2024-10-25  8:48 ` [PATCH v8 3/4] drm/xe: Use " Raag Jadav
2024-10-25  8:48 ` [PATCH v8 4/4] drm/i915: " Raag Jadav
2024-10-25  9:09 ` ✓ CI.Patch_applied: success for Introduce DRM device wedged event (rev6) Patchwork
2024-10-25  9:09 ` ✗ CI.checkpatch: warning " Patchwork
2024-10-25  9:10 ` ✓ CI.KUnit: success " Patchwork
2024-10-25  9:27 ` ✓ CI.Build: " Patchwork
2024-10-25  9:30 ` ✓ CI.Hooks: " Patchwork
2024-10-25  9:32 ` ✗ CI.checksparse: warning " Patchwork
2024-10-25  9:56 ` ✗ CI.BAT: failure " Patchwork
2024-10-26 19:57 ` ✗ CI.FULL: " Patchwork
2024-10-29 11:20 ` Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZyRYbSm2ki7yUTTz@black.fi.intel.com \
    --to=raag.jadav@intel.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrealmeid@igalia.com \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=jani.nikula@linux.intel.com \
    --cc=kernel-dev@igalia.com \
    --cc=lina@asahilina.net \
    --cc=lucas.demarchi@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=simona@ffwll.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox