Intel-GFX Archive on lore.kernel.org
From: "Christian König" <christian.koenig@amd.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>,
	airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	michal.wajdeczko@intel.com, intel-xe@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	himal.prasad.ghimiray@intel.com, anshuman.gupta@intel.com,
	alexander.deucher@amd.com, andrealmeid@igalia.com,
	amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH v9 1/4] drm: Introduce device wedged event
Date: Tue, 26 Nov 2024 09:12:45 +0100
Message-ID: <4a99aedb-e98f-420d-a2ff-69b2a7827a09@amd.com>
In-Reply-To: <Z0VtA5o2cW_snZbf@black.fi.intel.com>

On 26.11.24 at 07:38, Raag Jadav wrote:
> On Mon, Nov 25, 2024 at 10:32:42AM +0100, Christian König wrote:
>> On 22.11.24 at 17:02, Raag Jadav wrote:
>>> On Fri, Nov 22, 2024 at 11:09:32AM +0100, Christian König wrote:
>>>> On 22.11.24 at 08:07, Raag Jadav wrote:
>>>>> On Mon, Nov 18, 2024 at 08:26:37PM +0530, Aravind Iddamsetty wrote:
>>>>>> On 15/11/24 10:37, Raag Jadav wrote:
>>>>>>> Introduce device wedged event, which notifies userspace of 'wedged'
>>>>>>> (hanged/unusable) state of the DRM device through a uevent. This is
>>>>>>> useful especially in cases where the device is no longer operating as
>>>>>>> expected and has become unrecoverable from driver context. Purpose of
>>>>>>> this implementation is to provide drivers a generic way to recover with
>>>>>>> the help of userspace intervention without taking any drastic measures
>>>>>>> in the driver.
>>>>>>>
>>>>>>> A 'wedged' device is basically a dead device that needs attention. The
>>>>>>> uevent is the notification that is sent to userspace along with a hint
>>>>>>> about what could possibly be attempted to recover the device and bring
>>>>>>> it back to usable state. Different drivers may have different ideas of
>>>>>>> a 'wedged' device depending on their hardware implementation, and hence
>>>>>>> the vendor agnostic nature of the event. It is up to the drivers to
>>>>>>> decide when they see the need for recovery and how they want to recover
>>>>>>> from the available methods.
>>>>>>>
>>>>>>> Prerequisites
>>>>>>> -------------
>>>>>>>
>>>>>>> The driver, before opting for recovery, needs to make sure that the
>>>>>>> 'wedged' device doesn't harm the system as a whole by taking care of the
>>>>>>> prerequisites. Necessary actions must include disabling DMA to system
>>>>>>> memory as well as any communication channels with other devices. Further,
>>>>>>> the driver must ensure that all dma_fences are signalled and any device
>>>>>>> state that the core kernel might depend on is cleaned up. Once the event
>>>>>>> is sent, the device must be kept in 'wedged' state until the recovery is
>>>>>>> performed. New accesses to the device (IOCTLs) should be blocked,
>>>>>>> preferably with an error code that resembles the type of failure the
>>>>>>> device has encountered. This will signify the reason for wedging which
>>>>>>> can be reported to the application if needed.
>>>>>> should we even drop the mmaps we created?
>>>>> Whatever is required for a clean recovery, yes.
>>>>>
>>>>> Although how would this play out? Do we risk losing display?
>>>>> Or any other possible side-effects?
>>>> Before sending a wedge event all DMA transfers of the device have to be
>>>> blocked.
>>>>
>>>> So yes, all display, mmap() and file descriptor connections you had with the
>>>> device would need to be re-created.
>>> Does it mean we'd have to rely on userspace to unmap()?
>> Yes and no :)
>>
>> The handling should be similar to how hotplug is handled, e.g. the device
>> becomes inaccessible to normal applications and all mappings become invalid.
> Isn't that just unbind (which is already part of recovery)?

No, unbind just invalidates all mappings, but it doesn't catch any page
faults which would validate them again.

The driver or framework must make sure that page faults now get 
redirected to a dummy page. See ttm_bo_vm_dummy_page() for how TTM 
handles that for all drivers using it.
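
As a rough userspace analogy of that redirect (a sketch only, not the TTM
implementation), the block below replaces an existing mapping in place with
private anonymous memory, so a stale pointer keeps working but no longer
reaches the original backing store. The names redirect_to_dummy() and
demo_dummy_redirect() are hypothetical, and the shared anonymous mapping
merely stands in for device memory. Unlike the "undefined data on reads"
case discussed in this thread, fresh anonymous memory reads as zero.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, userspace analogy only: replace the mapping at
 * [addr, addr + len) with private anonymous memory, so later accesses
 * hit a harmless page instead of the original backing store.
 * addr must be page aligned. Returns 0 on success, -1 on failure.
 */
static int redirect_to_dummy(void *addr, size_t len)
{
	void *p;

	assert(addr != NULL && len > 0);
	p = mmap(addr, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	return p == MAP_FAILED ? -1 : 0;
}

/*
 * A shared anonymous mapping stands in for "device memory" here.
 * Returns 0 if the old contents are gone after the redirect and a
 * write through the stale pointer still works without faulting.
 */
static int demo_dummy_redirect(void)
{
	const size_t len = 4096;
	volatile unsigned char *buf;
	int ok;
	void *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (m == MAP_FAILED)
		return -1;
	buf = m;
	buf[0] = 0xAB;			/* data in the "device" mapping */

	if (redirect_to_dummy(m, len) != 0) {
		munmap(m, len);
		return -1;
	}

	ok = (buf[0] == 0);		/* old contents are no longer visible */
	buf[0] = 0x5A;			/* write lands in the dummy memory */
	ok = ok && (buf[0] == 0x5A);

	munmap(m, len);
	return ok ? 0 : -1;
}
```

The point of the analogy is that the application's pointer stays valid across
the switch, which is what lets it survive until it gets an error from an
IOCTL and tears its mappings down itself.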

Not sure about i915; since it never deals with device memory, it can
potentially just keep the access to the allocated system memory intact.

>> But we don't send a SIGBUS or similar on access; instead all mappings are
>> redirected to a dummy page which basically swallows all writes and gives
>> undefined data on reads.
>>
>> On IOCTLs the applications should get an error code and eventually restart
>> or at least unmap all their mappings.
> Thanks for the detailed explanation.
>
> Rethinking about this, the criteria set for prerequisites is to not do
> anything that could possibly harm the system. So I think the important
> question is,
>
> with fences signalled and ioctls already blocked, is live mmap on a wedged
> device capable of producing harmful behaviour or unintended side-effects
> (at least until the application has the opportunity to unmap() as part of
> recovery)?

I think we are already rather good there.

The potential options are to redirect everything to a dummy page or to 
crash the application by sending a SIGBUS.

Redirecting everything to the dummy page sounds like the more defensive 
approach.
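
To make the SIGBUS alternative concrete, here is a small userspace sketch,
assuming Linux semantics for file-backed mappings: touching a shared mapping
after the file has been truncated underneath it raises SIGBUS. It has nothing
to do with the DRM code itself; read_raises_sigbus() and
demo_sigbus_after_truncate() are hypothetical names, and the temporary file
only plays the role of backing storage that disappears.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_env, 1);	/* unwind out of the faulting read */
}

/* Returns 1 if reading *addr raised SIGBUS, 0 if it did not, -1 on error. */
static int read_raises_sigbus(const volatile unsigned char *addr)
{
	struct sigaction sa, old;
	volatile int hit = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	if (sigsetjmp(sigbus_env, 1) == 0)
		(void)*addr;		/* may fault */
	else
		hit = 1;		/* arrived here via siglongjmp() */

	sigaction(SIGBUS, &old, NULL);
	return hit;
}

/* Returns 1 if the read works before truncation and faults afterwards. */
static int demo_sigbus_after_truncate(void)
{
	char path[] = "/tmp/wedged-demo-XXXXXX";
	long ps = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int fd, before, after = -1;

	assert(ps > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);			/* file lives on until fd is closed */

	if (ftruncate(fd, ps) != 0) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, (size_t)ps, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	before = read_raises_sigbus(p);	/* backing store still present */
	if (ftruncate(fd, 0) == 0)	/* backing store goes away */
		after = read_raises_sigbus(p);

	munmap(p, (size_t)ps);
	close(fd);
	return (before == 0 && after == 1) ? 1 : 0;
}
```

An application without such a handler would simply be killed on the access,
which is the behaviour the dummy-page approach avoids.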

Regards,
Christian.

>
> Raag
