From: Raag Jadav <raag.jadav@intel.com>
To: Pekka Paalanen <pekka.paalanen@haloniitty.fi>
Cc: Xaver Hugl <xaver.hugl@kde.org>,
airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
andriy.shevchenko@linux.intel.com, lina@asahilina.net,
michal.wajdeczko@intel.com, christian.koenig@amd.com,
intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
alexander.deucher@amd.com, andrealmeid@igalia.com,
amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH v10 2/4] drm/doc: Document device wedged event
Date: Wed, 29 Jan 2025 09:04:46 +0200 [thread overview]
Message-ID: <Z5nTDrUVQjCcK9Cb@black.fi.intel.com> (raw)
In-Reply-To: <20250128133809.1e5ed2be@eldfell>
On Tue, Jan 28, 2025 at 01:38:09PM +0200, Pekka Paalanen wrote:
> On Tue, 28 Jan 2025 11:37:53 +0200
> Raag Jadav <raag.jadav@intel.com> wrote:
>
> > On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> > > On Wed, 22 Jan 2025 07:22:25 +0200
> > > Raag Jadav <raag.jadav@intel.com> wrote:
> > >
> > > > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > > > > > +It is the responsibility of the consumer to make sure that the device or
> > > > > > +its resources are not in use by any process before attempting recovery.
> > > > > I'm not convinced this is actually doable in practice, outside of
> > > > > killing all apps that aren't the one trying to recover the GPU.
> > > > > Is this just about not crashing those processes if they don't handle
> > > > > GPU hotunplugs well, about leaks, or something else?
> > > >
> > > > Correct, all of it. And since the compositor is in charge of device resources,
> > > > this way it atleast has the opportunity to recover the device and recreate
> > > > context without all the userspace violence.
> > >
> > > Hi Raag,
> > >
> > > sorry, I haven't followed this series, so I wonder, why should
> > > userspace be part of recovering the device? Why doesn't the kernel
> > > automatically load a new driver instance with a new DRM device node?
> >
> > There are things like bus level reset (PCI SBR) and re-enumeration that are
> > not possible from driver context (or atleast I'm not aware of it), so a new
> > instance is just as useful/less as the old one.
>
> Ok, "not possible from driver context" is a key revelation. I wonder if
> starting an overview section with that in the documentation would help
> getting the right mindset.
Not "not possible" in a literal sense, but rather allowing something that
drastic and convoluted that is probably beyond the scope of the driver.
> Did I miss that somewhere?
The first two paragraphs are meant as an introduction. Let me know if
something's not translating so well.
> I thought bus level reset meant resetting the PCI device by some bus
> API. Clearly mistaken, I suppose you mean resetting the whole bus
> including everything on it?
I'm no PCI expert but yes, it is atleast my understanding.
...
> > > (More important for userspace would be know if dmabuf fds remain
> > > pointing to valid memory retaining its contents or if the contents are
> > > lost. Userspace cannot tell which device a dmabuf originates from,
> > > AFAIK, so this would need to be added in the generic dmabuf UAPI.)
> >
> > Not sure if I understand, perhaps Christian can shed some light here.
>
> A system might have multiple GPUs, and one GPU going down may leave all
> the rest working as usual. A Wayland compositor would want to tell the
> difference between still-good and possibly or definitely lost dmabufs
> it received from its clients.
Usually 'DEVNAME=' and 'DEVPATH=' values refer to the device that generates
the event.
> But this is off-topic in this thread I believe, nothing to the series
> at hand.
...
> > > > > > +be unmapped and file descriptors should be closed to prevent leaks.
> > > > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > > > hotplug with matching "remove" and "add" udev events. As long as the
> > > > > application cleans up resources related to the device when it receives
> > > > > the event, there should be no leaks with a normal hotunplug... Is this
> > > > > different enough that we can't have the same expectations?
> > > >
> > > > The thing about "remove" event is that it is generated *after* we opt for an
> > > > unbind, and at that point it might be already too late if userspace doesn't
> > > > get enough time to clean things up while the device is removed with a live
> > > > client resulting in unknown consequences.
> > > >
> > > > The idea here is to clean things up *before* we opt for an unbind leaving
> > > > no room for side effects.
> > >
> > > Something here feels fragile. There should not be a deadline for
> > > userspace to finish cleaning up. What was described for KMS device nodes
> > > in this same document seems like a more reliable approach: keep the
> > > dead driver instance around until userspace has closed all references
> > > to it. The device node could be removed earlier.
> >
> > I'm not sure if I'm following here. The driver instance will exist as long
> > as the dead device exists, which the consumer can remove if/when it chooses
> > to trigger an unbind from userspace. There is no deadline for it.
>
> I was going by your words: "it might be already too late if userspace
> doesn't get enough time to clean things up".
The idea here is to completely detach kernel and userspace *before* moving
forward with the recovery.
> > The consumer can choose to rely on hotplug events if it wishes, but the point
> > here is that it doesn't guarantee a clean recovery in all cases.
>
> Clearly I don't understand the whole picture here. No worries,
> nevermind.
Less moving parts comes with more chances of success.
Raag
next prev parent reply other threads:[~2025-01-29 8:29 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-28 15:37 [PATCH v10 0/4] Introduce DRM device wedged event Raag Jadav
2024-11-28 15:37 ` [PATCH v10 1/4] drm: Introduce " Raag Jadav
2024-11-29 13:40 ` André Almeida
2024-12-02 8:07 ` Raag Jadav
2024-12-03 5:00 ` Raag Jadav
2024-12-03 10:18 ` Christian König
2024-12-04 11:17 ` Raag Jadav
2024-12-11 17:14 ` Maxime Ripard
2024-12-12 10:37 ` Raag Jadav
2024-12-16 16:07 ` Maxime Ripard
2024-12-12 18:31 ` André Almeida
2024-12-13 13:51 ` Raag Jadav
2024-11-28 15:37 ` [PATCH v10 2/4] drm/doc: Document " Raag Jadav
2024-12-12 18:50 ` André Almeida
2024-12-17 8:42 ` Raag Jadav
2024-12-17 11:57 ` André Almeida
2025-01-21 1:14 ` Xaver Hugl
2025-01-22 5:22 ` Raag Jadav
2025-01-27 10:23 ` Pekka Paalanen
2025-01-28 9:37 ` Raag Jadav
2025-01-28 11:38 ` Pekka Paalanen
2025-01-29 7:04 ` Raag Jadav [this message]
2024-11-28 15:37 ` [PATCH v10 3/4] drm/xe: Use " Raag Jadav
2024-11-28 15:37 ` [PATCH v10 4/4] drm/i915: " Raag Jadav
2024-11-28 15:52 ` ✓ CI.Patch_applied: success for Introduce DRM device wedged event (rev8) Patchwork
2024-11-28 15:52 ` ✗ CI.checkpatch: warning " Patchwork
2024-11-28 15:53 ` ✓ CI.KUnit: success " Patchwork
2024-11-28 16:11 ` ✓ CI.Build: " Patchwork
2024-11-28 16:13 ` ✓ CI.Hooks: " Patchwork
2024-11-28 16:15 ` ✗ CI.checksparse: warning " Patchwork
2024-11-28 16:33 ` ✓ Xe.CI.BAT: success " Patchwork
2024-11-28 17:45 ` ✗ Fi.CI.CHECKPATCH: warning " Patchwork
2024-11-28 17:45 ` ✗ Fi.CI.SPARSE: " Patchwork
2024-11-28 17:59 ` ✗ i915.CI.BAT: failure " Patchwork
2024-11-29 0:07 ` ✗ Xe.CI.Full: " Patchwork
2025-01-21 0:59 ` [PATCH v10 0/4] Introduce DRM device wedged event Xaver Hugl
2025-01-22 4:48 ` Raag Jadav
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z5nTDrUVQjCcK9Cb@black.fi.intel.com \
--to=raag.jadav@intel.com \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=andrealmeid@igalia.com \
--cc=andriy.shevchenko@linux.intel.com \
--cc=anshuman.gupta@intel.com \
--cc=aravind.iddamsetty@linux.intel.com \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=himal.prasad.ghimiray@intel.com \
--cc=intel-gfx@lists.freedesktop.org \
--cc=intel-xe@lists.freedesktop.org \
--cc=jani.nikula@linux.intel.com \
--cc=kernel-dev@igalia.com \
--cc=lina@asahilina.net \
--cc=lucas.demarchi@intel.com \
--cc=michal.wajdeczko@intel.com \
--cc=pekka.paalanen@haloniitty.fi \
--cc=rodrigo.vivi@intel.com \
--cc=simona@ffwll.ch \
--cc=xaver.hugl@kde.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.