From: Raag Jadav <raag.jadav@intel.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
andriy.shevchenko@linux.intel.com, lina@asahilina.net,
michal.wajdeczko@intel.com, intel-xe@lists.freedesktop.org,
intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
himal.prasad.ghimiray@intel.com,
aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
alexander.deucher@amd.com, andrealmeid@igalia.com,
amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH v8 2/4] drm/doc: Document device wedged event
Date: Fri, 1 Nov 2024 06:26:21 +0200 [thread overview]
Message-ID: <ZyRYbSm2ki7yUTTz@black.fi.intel.com> (raw)
In-Reply-To: <69b81f6d-203b-451e-935b-085216dfebd1@amd.com>
On Tue, Oct 29, 2024 at 10:51:34AM +0100, Christian König wrote:
> Am 25.10.24 um 10:48 schrieb Raag Jadav:
> > Add documentation for device wedged event in a new 'Device wedging'
> > chapter. The describes basic definitions and consumer expectations
> > along with an example.
> >
> > v8: Improve documentation (Christian, Rodrigo)
> >
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> > Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
> > 1 file changed, 75 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 370d820be248..11a7446233b5 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -362,6 +362,81 @@ the first place. DRM devices should make use of devcoredump to store relevant
> > information about the reset, so this information can be added to user bug
> > reports.
> > +Device wedging
> > +==============
> > +
> > +Drivers can optionally make use of device wedged event (implemented as
> > +drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
> > +(hanged/unusable) state of the DRM device through a uevent. This is useful
> > +especially in cases where the device is no longer operating as expected even
> > +after a reset and has become unrecoverable from driver context. Purpose of
> > +this implementation is to provide drivers a generic way to recover with the
> > +help of userspace intervention without taking any drastic measures in the
> > +driver.
> > +
> > +A 'wedged' device is basically a dead device that needs attention. The
> > +uevent is the notification that is sent to userspace along with a hint about
> > +what could possibly be attempted to recover the device and bring it back to
> > +usable state. Different drivers may have different ideas of a 'wedged' device
> > +depending on their hardware implementation, and hence the vendor agnostic
> > +nature of the event. It is up to the drivers to decide when they see the need
> > +for recovery and how they want to recover from the available methods.
> > +
> > +Recovery
> > +--------
> > +
> > +Current implementation defines two recovery methods, out of which, drivers
> > +can use any one, both or none. Method(s) of choice will be sent in the uevent
> > +environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side
> > +effects. If driver is unsure about recovery or method is unknown (like reboot,
> > +firmware flashing, hardware replacement or any other procedure which can't be
> > +attempted on the fly), ``WEDGED=none`` will be sent instead.
> > +
> > +It is the responsibility of the driver to perform required cleanups (like
> > +disabling system memory access or signalling dma_fences) and prepare itself
> > +for the recovery before sending the event.
>
> That needs to be more like a warning and should have a bit more text. Maybe
> even a separate section.
>
> Something like this maybe:
>
> Prerequisites
> ------------------
>
> Drivers needs to make sure that the device won't harm the system as a
> whole by keeping it in a wedged state. Necessary actions must include
> disabling DMA to system memory as well as communication channels
> with other devices.
> Further drivers must ensure that all dma_fences are signaled and other
> state the core kernel might depend on are cleaned up.
> All ongoing waiting for hardware state changes must be aborted and
> new accesses to the hardware sufficiently blocked....
>
>
> I'm not a native speaker of English, so feel free to re-phrase that. But the
Neither am I, but will try my best.
> general approach should be like don't do this before you have made sure a, b
> and c.
Sure, thanks.
> > Once the event is sent, driver
> > +should block all IOCTLs with an error code.
>
> Better define which error code that should be. I think -ENODEV similar to a
> hotplug case would be appropriate.
Why not leave it to the drivers to decide depending on the type of failure?
> > This will signify the reason for
> > +wegeding which can be reported to the application if needed.
> > +
> > +Userspace consumers can parse this event and attempt recovery as per below
> > +expectations.
> > +
> > + =============== ==================================
> > + Recovery method Consumer expectations
> > + =============== ==================================
> > + rebind unbind + rebind driver
> > + bus-reset unbind + reset bus device + rebind
> > + none admin/user policy
> > + =============== ==================================
> > +
> > +Example for rebind
> > +~~~~~~~~~~~~~~~~~~
> > +
> > +Udev rule::
> > +
> > + SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
> > + RUN+="/path/to/rebind.sh $env{DEVPATH}"
> > +
> > +Recovery script::
> > +
> > + #!/bin/sh
> > +
> > + DEVPATH=$(readlink -f /sys/$1/device)
> > + DEVICE=$(basename $DEVPATH)
> > + DRIVER=$(readlink -f $DEVPATH/driver)
> > +
> > + echo -n $DEVICE > $DRIVER/unbind
> > + sleep 1
> > + echo -n $DEVICE > $DRIVER/bind
> > +
> > +Although scripts are simple enough for basic recovery, admin/users can define
> > +customized policies around recovery action. For example if the driver supports
> > +multiple recovery methods, consumers can opt for the suitable one based on
> > +policy definition. Consumers can also take additional steps like gathering
> > +telemetry information (devcoredump, syslog)
>
> I'm really wondering if we shouldn't standardize successful resets with this
> event as well?
>
> E.g. like there was an issue with device X, please collect the devcoredump.
This seems to fit into WEDGED=none case, although with this we'd probably
need to redefine what 'wedged' means.
> And then as a second step have the WEDGED property to indicate:
> a) reset successful, no actions needed.
> b) detach and rebind from the bus
> c) bus-reset
> d) impossible to recover but system as a whole can continue to work.
> e) it's on fire! Run sync and shut down power immediately.
Raag
next prev parent reply other threads:[~2024-11-01 4:26 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 8:48 [PATCH v8 0/4] Introduce DRM device wedged event Raag Jadav
2024-10-25 8:48 ` [PATCH v8 1/4] drm: Introduce " Raag Jadav
2024-10-25 9:08 ` Jani Nikula
2024-10-25 14:45 ` Andy Shevchenko
2024-10-26 15:27 ` Raag Jadav
2024-10-28 8:50 ` Jani Nikula
2024-10-26 6:15 ` kernel test robot
2024-10-26 10:10 ` kernel test robot
2024-10-25 8:48 ` [PATCH v8 2/4] drm/doc: Document " Raag Jadav
2024-10-29 9:51 ` Christian König
2024-11-01 4:26 ` Raag Jadav [this message]
2024-10-25 8:48 ` [PATCH v8 3/4] drm/xe: Use " Raag Jadav
2024-10-25 8:48 ` [PATCH v8 4/4] drm/i915: " Raag Jadav
2024-10-25 9:09 ` ✓ CI.Patch_applied: success for Introduce DRM device wedged event (rev6) Patchwork
2024-10-25 9:09 ` ✗ CI.checkpatch: warning " Patchwork
2024-10-25 9:10 ` ✓ CI.KUnit: success " Patchwork
2024-10-25 9:27 ` ✓ CI.Build: " Patchwork
2024-10-25 9:30 ` ✓ CI.Hooks: " Patchwork
2024-10-25 9:32 ` ✗ CI.checksparse: warning " Patchwork
2024-10-25 9:56 ` ✗ CI.BAT: failure " Patchwork
2024-10-26 19:57 ` ✗ CI.FULL: " Patchwork
2024-10-29 11:20 ` Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZyRYbSm2ki7yUTTz@black.fi.intel.com \
--to=raag.jadav@intel.com \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=andrealmeid@igalia.com \
--cc=andriy.shevchenko@linux.intel.com \
--cc=anshuman.gupta@intel.com \
--cc=aravind.iddamsetty@linux.intel.com \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=himal.prasad.ghimiray@intel.com \
--cc=intel-gfx@lists.freedesktop.org \
--cc=intel-xe@lists.freedesktop.org \
--cc=jani.nikula@linux.intel.com \
--cc=kernel-dev@igalia.com \
--cc=lina@asahilina.net \
--cc=lucas.demarchi@intel.com \
--cc=michal.wajdeczko@intel.com \
--cc=rodrigo.vivi@intel.com \
--cc=simona@ffwll.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox