Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Pekka Paalanen <pekka.paalanen@haloniitty.fi>
To: Raag Jadav <raag.jadav@intel.com>
Cc: Xaver Hugl <xaver.hugl@kde.org>,
	airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	michal.wajdeczko@intel.com, christian.koenig@amd.com,
	intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
	aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	alexander.deucher@amd.com, andrealmeid@igalia.com,
	amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH v10 2/4] drm/doc: Document device wedged event
Date: Tue, 28 Jan 2025 13:38:09 +0200	[thread overview]
Message-ID: <20250128133809.1e5ed2be@eldfell> (raw)
In-Reply-To: <Z5ilcc0hC8nvz0Li@black.fi.intel.com>

[-- Attachment #1: Type: text/plain, Size: 5676 bytes --]

On Tue, 28 Jan 2025 11:37:53 +0200
Raag Jadav <raag.jadav@intel.com> wrote:

> On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> > On Wed, 22 Jan 2025 07:22:25 +0200
> > Raag Jadav <raag.jadav@intel.com> wrote:
> >   
> > > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:  
> > > > > +It is the responsibility of the consumer to make sure that the device or
> > > > > +its resources are not in use by any process before attempting recovery.    
> > > > I'm not convinced this is actually doable in practice, outside of
> > > > killing all apps that aren't the one trying to recover the GPU.
> > > > Is this just about not crashing those processes if they don't handle
> > > > GPU hotunplugs well, about leaks, or something else?    
> > > 
> > > Correct, all of it. And since the compositor is in charge of device resources,
> > > this way it atleast has the opportunity to recover the device and recreate
> > > context without all the userspace violence.  
> > 
> > Hi Raag,
> > 
> > sorry, I haven't followed this series, so I wonder, why should
> > userspace be part of recovering the device? Why doesn't the kernel
> > automatically load a new driver instance with a new DRM device node?  
> 
> There are things like bus level reset (PCI SBR) and re-enumeration that are
> not possible from driver context (or atleast I'm not aware of it), so a new
> instance is just as useful/less as the old one.

Ok, "not possible from driver context" is a key revelation. I wonder if
starting an overview section with that in the documentation would help
getting the right mindset.

Did I miss that somewhere?

I thought bus level reset meant resetting the PCI device by some bus
API. Clearly mistaken, I suppose you mean resetting the whole bus
including everything on it?

> > I see in the commit message written:
> > 
> > 	"For example, if the driver supports multiple recovery methods,
> > 	consumers can opt for the suitable one based on policy
> > 	definition."
> > 
> > How could consumers know what to do? How can they guess what would be
> > enough to recover the device? Isn't that the kernel driver's job to
> > know?  
> 
> Yes, 'WEDGED=' value are the known methods that are expected to work. The
> policy is how the consumer can decide which one to opt for depending on the
> scenario. For example, the less drastic method could work in most cases, but
> you'd probably want to opt for a more drastic method for repeat offences or
> perhaps if something more serious is discovered from "optional telemetry
> collection".

Aha, cool.

> > (More important for userspace would be know if dmabuf fds remain
> > pointing to valid memory retaining its contents or if the contents are
> > lost. Userspace cannot tell which device a dmabuf originates from,
> > AFAIK, so this would need to be added in the generic dmabuf UAPI.)  
> 
> Not sure if I understand, perhaps Christian can shed some light here.

A system might have multiple GPUs, and one GPU going down may leave all
the rest working as usual. A Wayland compositor would want to tell the
difference between still-good and possibly or definitely lost dmabufs
it received from its clients.

But this is off-topic in this thread I believe, nothing to the series
at hand.

> > > > > +With IOCTLs blocked and device already 'wedged', all device memory should  
> > 
> > btw. when I see "blocked" I think of the function call not returning
> > yet. But in this patch "blocked" seems to be synonymous for "returns
> > an error immediately". Would it be possible to avoid the word "blocked"
> > for this?  
> 
> It is meant as "blocking the access", but fair enough. We can have a quick
> fix later on if it's not too big of a concern.

Sure, I don't mind.

> > > > > +be unmapped and file descriptors should be closed to prevent leaks.    
> > > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > > hotplug with matching "remove" and "add" udev events. As long as the
> > > > application cleans up resources related to the device when it receives
> > > > the event, there should be no leaks with a normal hotunplug... Is this
> > > > different enough that we can't have the same expectations?    
> > > 
> > > The thing about "remove" event is that it is generated *after* we opt for an
> > > unbind, and at that point it might be already too late if userspace doesn't
> > > get enough time to clean things up while the device is removed with a live
> > > client resulting in unknown consequences.
> > > 
> > > The idea here is to clean things up *before* we opt for an unbind leaving
> > > no room for side effects.  
> > 
> > Something here feels fragile. There should not be a deadline for
> > userspace to finish cleaning up. What was described for KMS device nodes
> > in this same document seems like a more reliable approach: keep the
> > dead driver instance around until userspace has closed all references
> > to it. The device node could be removed earlier.  
> 
> I'm not sure if I'm following here. The driver instance will exist as long
> as the dead device exists, which the consumer can remove if/when it chooses
> to trigger an unbind from userspace. There is no deadline for it.

I was going by your words: "it might be already too late if userspace
doesn't get enough time to clean things up".

> The consumer can choose to rely on hotplug events if it wishes, but the point
> here is that it doesn't guarantee a clean recovery in all cases.

Clearly I don't understand the whole picture here. No worries,
nevermind.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2025-01-28 11:38 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
2024-11-29 13:40   ` André Almeida
2024-12-02  8:07     ` Raag Jadav
2024-12-03  5:00       ` Raag Jadav
2024-12-03 10:18         ` Christian König
2024-12-04 11:17           ` Raag Jadav
2024-12-11 17:14             ` Maxime Ripard
2024-12-12 10:37               ` Raag Jadav
2024-12-16 16:07                 ` Maxime Ripard
2024-12-12 18:31   ` André Almeida
2024-12-13 13:51     ` Raag Jadav
2024-11-28 15:37 ` [PATCH v10 2/4] drm/doc: Document " Raag Jadav
2024-12-12 18:50   ` André Almeida
2024-12-17  8:42     ` Raag Jadav
2024-12-17 11:57       ` André Almeida
2025-01-21  1:14   ` Xaver Hugl
2025-01-22  5:22     ` Raag Jadav
2025-01-27 10:23       ` Pekka Paalanen
2025-01-28  9:37         ` Raag Jadav
2025-01-28 11:38           ` Pekka Paalanen [this message]
2025-01-29  7:04             ` Raag Jadav
2024-11-28 15:37 ` [PATCH v10 3/4] drm/xe: Use " Raag Jadav
2024-11-28 15:37 ` [PATCH v10 4/4] drm/i915: " Raag Jadav
2024-11-28 15:52 ` ✓ CI.Patch_applied: success for Introduce DRM device wedged event (rev8) Patchwork
2024-11-28 15:52 ` ✗ CI.checkpatch: warning " Patchwork
2024-11-28 15:53 ` ✓ CI.KUnit: success " Patchwork
2024-11-28 16:11 ` ✓ CI.Build: " Patchwork
2024-11-28 16:13 ` ✓ CI.Hooks: " Patchwork
2024-11-28 16:15 ` ✗ CI.checksparse: warning " Patchwork
2024-11-28 16:33 ` ✓ Xe.CI.BAT: success " Patchwork
2024-11-29  0:07 ` ✗ Xe.CI.Full: failure " Patchwork
2025-01-21  0:59 ` [PATCH v10 0/4] Introduce DRM device wedged event Xaver Hugl
2025-01-22  4:48   ` Raag Jadav

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250128133809.1e5ed2be@eldfell \
    --to=pekka.paalanen@haloniitty.fi \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrealmeid@igalia.com \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=jani.nikula@linux.intel.com \
    --cc=kernel-dev@igalia.com \
    --cc=lina@asahilina.net \
    --cc=lucas.demarchi@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=simona@ffwll.ch \
    --cc=xaver.hugl@kde.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox