From: "Christian König" <christian.koenig@amd.com>
To: Raag Jadav <raag.jadav@intel.com>,
airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
andriy.shevchenko@linux.intel.com, lina@asahilina.net,
michal.wajdeczko@intel.com
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
alexander.deucher@amd.com, andrealmeid@igalia.com,
amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH v9 2/4] drm/doc: Document device wedged event
Date: Fri, 15 Nov 2024 10:19:42 +0100
Message-ID: <b5798f03-51d2-4517-8866-8e3368e4531d@amd.com>
In-Reply-To: <20241115050733.806934-3-raag.jadav@intel.com>
On 15.11.24 at 06:07, Raag Jadav wrote:
> Add documentation for device wedged event in a new 'Device wedging'
> chapter. This describes basic definitions and consumer expectations
> along with an example.
>
> v8: Improve documentation (Christian, Rodrigo)
> v9: Add prerequisites section (Christian)
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Sounds totally sane to me, but I'm not a native speaker of English so
others should probably look at it as well.
Anyway feel free to add Reviewed-by: Christian König
<christian.koenig@amd.com>.
Regards,
Christian.
> ---
> Documentation/gpu/drm-uapi.rst | 102 ++++++++++++++++++++++++++++++++-
> 1 file changed, 99 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index b75cc9a70d1f..33d9c253d4d6 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -371,9 +371,105 @@ Reporting causes of resets
>
> Apart from propagating the reset through the stack so apps can recover, it's
> really useful for driver developers to learn more about what caused the reset in
> -the first place. DRM devices should make use of devcoredump to store relevant
> -information about the reset, so this information can be added to user bug
> -reports.
> +the first place. For this, drivers can make use of devcoredump to store relevant
> +information about the reset and send a device wedged event without a recovery
> +method (as explained in the next chapter) to notify userspace, so this
> +information can be collected and added to user bug reports.
> +
> +Device wedging
> +==============
> +
> +Drivers can optionally make use of the device wedged event (implemented as
> +drm_dev_wedged_event() in the DRM subsystem), which notifies userspace of the
> +'wedged' (hung/unusable) state of the DRM device through a uevent. This is
> +useful especially in cases where the device is no longer operating as expected
> +and has become unrecoverable from driver context. The purpose of this
> +implementation is to provide drivers a generic way to recover with the help of
> +userspace intervention without taking any drastic measures in the driver.
> +
> +A 'wedged' device is basically a dead device that needs attention. The
> +uevent is the notification that is sent to userspace along with a hint about
> +what could possibly be attempted to recover the device and bring it back to
> +usable state. Different drivers may have different ideas of a 'wedged' device
> +depending on their hardware implementation, and hence the vendor agnostic
> +nature of the event. It is up to the drivers to decide when they see the need
> +for recovery and how they want to recover from the available methods.
> +
> +Prerequisites
> +-------------
> +
> +The driver, before opting for recovery, needs to make sure that the 'wedged'
> +device doesn't harm the system as a whole by taking care of the prerequisites.
> +Necessary actions must include disabling DMA to system memory as well as any
> +communication channels with other devices. Further, the driver must ensure
> +that all dma_fences are signalled and any device state that the core kernel
> +might depend on is cleaned up. Once the event is sent, the device must be
> +kept in 'wedged' state until the recovery is performed. New accesses to the
> +device (IOCTLs) should be blocked, preferably with an error code that
> +resembles the type of failure the device has encountered. This will signify
> +the reason for wedging, which can be reported to the application if needed.
> +
> +Recovery
> +--------
> +
> +Current implementation defines three recovery methods, out of which, drivers
> +can use any one, multiple or none. Method(s) of choice will be sent in the
> +uevent environment as ``WEDGED=<method1>[,<method2>]`` in order of less to
> +more side-effects. If the driver is unsure about recovery or the method is
> +unknown (like soft/hard reboot, firmware flashing, hardware replacement or any
> +other procedure which can't be attempted on the fly), ``WEDGED=unknown`` will
> +be sent instead.
> +
> +Userspace consumers can parse this event and attempt recovery as per the
> +following expectations.
> +
> + =============== ================================
> + Recovery method Consumer expectations
> + =============== ================================
> + none optional telemetry collection
> + rebind unbind + bind driver
> + bus-reset unbind + reset bus device + bind
> + unknown admin/user policy
> + =============== ================================
> +
> +The only exception to this is ``WEDGED=none``, which signifies that the
> +device was temporarily 'wedged' at some point but was able to recover using
> +device specific methods like reset. No explicit action is expected from
> +userspace consumers in this case, but they can still take additional steps
> +like gathering telemetry information (devcoredump, syslog). This is useful
> +because the first hang is usually the most critical one which can result in
> +consequential hangs or complete wedging.
> +
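As a rough sketch of how a consumer might act on the table above: since the methods arrive in order of less to more side-effects, one option is to take the first listed method the consumer knows how to handle and fall back to ``unknown`` (admin/user policy) otherwise. The function name pick_method is hypothetical and only for illustration; it is not part of the patch.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch.
# $1 is the WEDGED value from the uevent, e.g. "rebind,bus-reset".
# Prints the first listed method this consumer supports, or "unknown".
pick_method() {
    list=$1
    while [ -n "$list" ]; do
        # Take everything up to the first comma as the candidate.
        method=${list%%,*}
        case $method in
            none|rebind|bus-reset)
                echo "$method"
                return 0
                ;;
        esac
        # Drop the candidate; stop when no further entries remain.
        case $list in
            *,*) list=${list#*,} ;;
            *)   list= ;;
        esac
    done
    echo unknown
}
```

A udev-triggered script could then branch on the result, e.g. ``case $(pick_method "$WEDGED") in rebind) ... ;; esac``, assuming the WEDGED value is passed into the script's environment.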
> +Example
> +-------
> +
> +Udev rule::
> +
> + SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", \
> + RUN+="/path/to/rebind.sh $env{DEVPATH}"
> +
> +Recovery script::
> +
> + #!/bin/sh
> +
> + DEVPATH=$(readlink -f /sys/$1/device)
> + DEVICE=$(basename $DEVPATH)
> + DRIVER=$(readlink -f $DEVPATH/driver)
> +
> + echo -n $DEVICE > $DRIVER/unbind
> + sleep 1
> + echo -n $DEVICE > $DRIVER/bind
> +
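For the ``bus-reset`` row (unbind + reset bus device + bind), a similar script could look roughly like the sketch below. It assumes a PCI device that exposes the sysfs ``reset`` attribute; other buses would need their own reset step. The bus_reset function and the SYSFS_ROOT override (handy for dry runs against a fake tree) are hypothetical and not part of the patch.

```shell
#!/bin/sh
# Sketch of "bus-reset" handling, assuming a PCI device with a sysfs
# "reset" attribute. $1 is the DEVPATH from the uevent.
# SYSFS_ROOT is a hypothetical override for dry runs; defaults to /sys.
bus_reset() {
    root=${SYSFS_ROOT:-/sys}
    devpath=$(readlink -f "$root/$1/device") || return 1
    device=$(basename "$devpath")
    # Resolve the driver directory now; the symlink goes away on unbind.
    driver=$(readlink -f "$devpath/driver") || return 1

    # Bail out before unbinding if the device cannot be reset this way.
    [ -e "$devpath/reset" ] || return 1

    printf '%s' "$device" > "$driver/unbind" || return 1
    printf '1' > "$devpath/reset" || return 1
    printf '%s' "$device" > "$driver/bind"
}
```

Checking for the ``reset`` attribute before unbinding keeps the device bound to its driver when this method cannot be carried out, so another policy can still be tried.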
> +Customization
> +-------------
> +
> +Although basic recovery is possible with a simple script, admin/users can
> +define custom policies around recovery action. For example, if the driver
> +supports multiple recovery methods, consumers can opt for the suitable one
> +based on policy definition. Consumers can also choose to have the device
> +available for debugging or additional data collection before performing the
> +recovery. This is useful especially when the driver is unsure about recovery
> +or the method is unknown.
>
> .. _drm_driver_ioctl:
>