Intel-GFX Archive on lore.kernel.org
From: Raag Jadav <raag.jadav@intel.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>,
	airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	michal.wajdeczko@intel.com, intel-xe@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	himal.prasad.ghimiray@intel.com, anshuman.gupta@intel.com,
	alexander.deucher@amd.com, andrealmeid@igalia.com,
	amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH v9 1/4] drm: Introduce device wedged event
Date: Tue, 26 Nov 2024 08:38:59 +0200
Message-ID: <Z0VtA5o2cW_snZbf@black.fi.intel.com>
In-Reply-To: <1018930b-98cc-432a-a4fe-6898ffa51d29@amd.com>

On Mon, Nov 25, 2024 at 10:32:42AM +0100, Christian König wrote:
> Am 22.11.24 um 17:02 schrieb Raag Jadav:
> > On Fri, Nov 22, 2024 at 11:09:32AM +0100, Christian König wrote:
> > > Am 22.11.24 um 08:07 schrieb Raag Jadav:
> > > > On Mon, Nov 18, 2024 at 08:26:37PM +0530, Aravind Iddamsetty wrote:
> > > > > On 15/11/24 10:37, Raag Jadav wrote:
> > > > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > > useful especially in cases where the device is no longer operating as
> > > > > > expected and has become unrecoverable from driver context. Purpose of
> > > > > > this implementation is to provide drivers a generic way to recover with
> > > > > > the help of userspace intervention without taking any drastic measures
> > > > > > in the driver.
> > > > > > 
> > > > > > A 'wedged' device is basically a dead device that needs attention. The
> > > > > > uevent is the notification that is sent to userspace along with a hint
> > > > > > about what could possibly be attempted to recover the device and bring
> > > > > > it back to usable state. Different drivers may have different ideas of
> > > > > > a 'wedged' device depending on their hardware implementation, and hence
> > > > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > > > decide when they see the need for recovery and how they want to recover
> > > > > > from the available methods.
> > > > > > 
> > > > > > Prerequisites
> > > > > > -------------
> > > > > > 
> > > > > > The driver, before opting for recovery, needs to make sure that the
> > > > > > 'wedged' device doesn't harm the system as a whole by taking care of the
> > > > > > prerequisites. Necessary actions must include disabling DMA to system
> > > > > > memory as well as any communication channels with other devices. Further,
> > > > > > the driver must ensure that all dma_fences are signalled and any device
> > > > > > > state that the core kernel might depend on is cleaned up. Once the event
> > > > > > is sent, the device must be kept in 'wedged' state until the recovery is
> > > > > > performed. New accesses to the device (IOCTLs) should be blocked,
> > > > > > preferably with an error code that resembles the type of failure the
> > > > > > > device has encountered. This will signify the reason for wedging which
> > > > > > can be reported to the application if needed.
> > > > > should we even drop the mmaps we created?
> > > > Whatever is required for a clean recovery, yes.
> > > > 
> > > > Although how would this play out? Do we risk losing display?
> > > > Or any other possible side-effects?
> > > Before sending a wedge event all DMA transfers of the device have to be
> > > blocked.
> > > 
> > > So yes, all display, mmap() and file descriptor connections you had with the
> > > device would need to be re-created.
> > Does it mean we'd have to rely on userspace to unmap()?
> 
> Yes and no :)
> 
> The handling should be similar to how hotplug is handled. E.g. the device
> becomes inaccessible to normal applications and all mappings become invalid.

Isn't that just unbind (which is already part of recovery)?

> But we don't send a SIGBUS or similar on access, instead all mappings are
> redirected to a dummy page which basically swallows all writes and gives
> undefined data on reads.
> 
> On IOCTLs the applications should get an error code and eventually restart
> or at least unmap all their mappings.

Thanks for the detailed explanation.

Thinking about this again, the criterion set for the prerequisites is to not
do anything that could possibly harm the system. So I think the important
question is:

With fences signalled and ioctls already blocked, is a live mmap on a wedged
device capable of producing harmful behaviour or unintended side-effects
(at least until the application has the opportunity to unmap() as part of
recovery)?

Raag
