From: Boris Brezillon <boris.brezillon@collabora.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: dri-devel@lists.freedesktop.org,
"Steven Price" <steven.price@arm.com>,
"Liviu Dudau" <liviu.dudau@arm.com>,
"Adrián Larumbe" <adrian.larumbe@collabora.com>,
kernel@collabora.com, "Luben Tuikov" <ltuikov89@gmail.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Danilo Krummrich" <dakr@redhat.com>
Subject: Re: [RFC PATCH] drm/sched: Fix a UAF on drm_sched_fence::sched
Date: Fri, 30 Aug 2024 11:37:21 +0200
Message-ID: <20240830113721.6174f3d9@collabora.com>
In-Reply-To: <bdc018b8-3732-4123-a752-b4e0e7e150dc@amd.com>

Hi Christian,

On Fri, 30 Aug 2024 10:14:18 +0200
Christian König <christian.koenig@amd.com> wrote:

> On 29.08.24 at 19:12, Boris Brezillon wrote:
> > dma_fence objects created by an entity might outlive the
> > drm_gpu_scheduler this entity was bound to if those fences are retained
> > by other objects, like a dma_buf resv. This means that
> > drm_sched_fence::sched might be invalid when the resv is walked, which
> > in turn leads to a UAF when dma_fence_ops::get_timeline_name() is called.
> >
> > This probably went unnoticed so far, because the drm_gpu_scheduler had
> > the lifetime of the drm_device, so, unless you were removing the device,
> > there were no reasons for the scheduler to be gone before its fences.
>
> Nope, that is intentional design. get_timeline_name() is not safe to be
> called after the fence signaled because that would cause circular
> dependency problems.

Do you mean the dma_fence layer should not call get_timeline_name()
after the fence has been signalled (looking at the code/doc, that
doesn't seem to be the case), or do you mean the drm_sched
implementation of the fence interface is wrong and should assume the
fence can live longer than its creator?

>
> E.g. when you have hardware fences it can happen that fences reference a
> driver module (for the function printing the name) and the module in
> turn keeps fences around.
>
> So you easily end up with a module you can never unload.

On the other hand, I think preventing the module from being unloaded is
the right thing to do, because otherwise the dma_fence_ops might be
gone when they get dereferenced in the release path. That's also a
problem I noticed when I started working on the initial panthor driver
without drm_sched. To solve that, I ended up retaining a module ref for
each fence created, and releasing this ref in the
dma_fence_ops::release() function.

drm_sched adds an indirection that allows drivers to not care, but
that's still a problem if you end up unloading drm_sched while some of
its drm_sched_fence fences are owned by external components.
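
To illustrate the per-fence module reference idea, here is a minimal
userspace model, not the actual panthor code. All names (fake_module,
fake_fence, fake_fence_ops, ...) are hypothetical stand-ins: the plain
integer counter models what try_module_get()/module_put() would do in
the kernel, and it ignores atomicity as well as the question of the
release function itself living in module text.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel module and its refcount. */
struct fake_module {
	int refcnt;
};

struct fake_fence;

/* Stand-in for dma_fence_ops: conceptually lives in module memory. */
struct fake_fence_ops {
	void (*release)(struct fake_fence *fence);
};

struct fake_fence {
	const struct fake_fence_ops *ops;
	struct fake_module *owner;
	int refcnt;
};

static void fake_fence_release(struct fake_fence *fence)
{
	struct fake_module *owner = fence->owner;

	free(fence);
	/* Drop the module ref last (models module_put()). */
	owner->refcnt--;
}

static const struct fake_fence_ops fake_ops = {
	.release = fake_fence_release,
};

static struct fake_fence *fake_fence_create(struct fake_module *mod)
{
	struct fake_fence *fence = calloc(1, sizeof(*fence));

	if (!fence)
		return NULL;

	fence->ops = &fake_ops;
	fence->owner = mod;
	fence->refcnt = 1;
	mod->refcnt++;	/* models try_module_get() */
	return fence;
}

static void fake_fence_put(struct fake_fence *fence)
{
	/* The ops are dereferenced here, so they must still exist. */
	if (--fence->refcnt == 0)
		fence->ops->release(fence);
}

/* Unloading is only allowed once no fence pins the module. */
static int fake_module_can_unload(const struct fake_module *mod)
{
	return mod->refcnt == 0;
}
```

In this model, a fence that is still held by an external component keeps
fake_module_can_unload() returning 0, which is the behaviour I'm arguing
for: the ops cannot vanish before the last fence is released.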
>
>
> > With the introduction of a new model where each entity has its own
> > drm_gpu_scheduler instance, this situation is likely to happen every time
> > a GPU context is destroyed and some of its fences remain attached to
> > dma_buf objects still owned by other drivers/processes.
> >
> > In order to make drm_sched_fence_get_timeline_name() safe, we need to
> > copy the scheduler name into our own refcounted object that's only
> > destroyed when both the scheduler and all its fences are gone.
> >
> > The fact drm_sched_fence might have a reference to the drm_gpu_scheduler
> > even after it's been released is worrisome though, but I'd rather
> > discuss that with everyone than come up with a solution that's likely
> > to end up being rejected.
> >
> > Note that the bug was found while repeatedly reading dma_buf's debugfs
> > file, which, at some point, calls dma_resv_describe() on a resv that
> > contains signalled fences coming from a destroyed GPU context.
> > AFAIK, there's nothing invalid there.
>
> Yeah but reading debugfs is not guaranteed to crash the kernel.
>
> On the other hand the approach with a kref'ed string looks rather sane
> to me. One comment on this below.

There's still the problem I mentioned above (unloading drm_sched can
make things crash). Are there any plans to fix that? The simple option
would be to prevent compiling drm_sched as a module, but that's not
possible because it depends on DRM, which is a tristate too. Maybe we
could have drm_sched_fence.o linked statically, just like dma-fence.c
is linked statically to prevent the stub ops from disappearing.
I'm not sure whether drm_sched_fence.c depends on symbols defined in
sched_{main,entity}.c or other parts of the DRM subsystem, though.

> > +/**
> > + * struct drm_sched_fence_timeline - Wrapper around the timeline name
> > + *
> > + * This is needed to cope with the fact dma_fence objects created by
> > + * an entity might outlive the drm_gpu_scheduler this entity was bound
> > + * to, making drm_sched_fence::sched invalid and leading to a UAF when
> > + * dma_fence_ops::get_timeline_name() is called.
> > + */
> > +struct drm_sched_fence_timeline {
> > + /** @kref: Reference count of this timeline object. */
> > + struct kref kref;
> > +
> > + /**
> > + * @name: Name of the timeline.
> > + *
> > + * This is currently a copy of drm_gpu_scheduler::name.
> > + */
> > + const char *name;
>
> Make that a char name[] and embed the name into the structure. The macro
> struct_size() can be used to calculate the size.

Sure, I can do that.
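
Something along these lines, written as a userspace sketch rather than
the actual patch. The struct and helper names (timeline,
timeline_create(), timeline_get(), timeline_put()) are made up for the
example, the plain counter stands in for struct kref (which is atomic in
the kernel), and malloc() with an open-coded size stands in for
kzalloc() with struct_size().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace model of the proposed refcounted timeline object. */
struct timeline {
	unsigned int refcount;	/* stands in for struct kref */
	char name[];		/* flexible array member holding a copy of the name */
};

static struct timeline *timeline_create(const char *name)
{
	size_t len = strlen(name) + 1;
	/* In the kernel this size would come from struct_size(tl, name, len). */
	struct timeline *tl = malloc(sizeof(*tl) + len);

	if (!tl)
		return NULL;

	tl->refcount = 1;
	memcpy(tl->name, name, len);
	return tl;
}

static struct timeline *timeline_get(struct timeline *tl)
{
	tl->refcount++;
	return tl;
}

/* Returns 1 if this put freed the object, 0 otherwise. */
static int timeline_put(struct timeline *tl)
{
	if (--tl->refcount)
		return 0;

	free(tl);
	return 1;
}
```

With the name embedded, there is a single allocation to free, and a
fence holding its own reference can still read the name after the
scheduler has dropped its reference and gone away.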
Regards,
Boris