From: Boris Brezillon <boris.brezillon@collabora.com>
To: Rob Clark <rob.clark@oss.qualcomm.com>
Cc: Steven Price <steven.price@arm.com>,
	Liviu Dudau <liviu.dudau@arm.com>,
	Dmitry Osipenko <dmitry.osipenko@collabora.com>,
	Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
	Maxime Ripard <mripard@kernel.org>,
	Thomas Zimmermann <tzimmermann@suse.de>,
	David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>,
	Akash Goel <akash.goel@arm.com>, Chia-I Wu <olvaffe@gmail.com>,
	Dmitry Baryshkov <lumag@kernel.org>,
	Abhinav Kumar <abhinav.kumar@linux.dev>,
	Jessica Zhang <jesszhan0024@gmail.com>,
	Sean Paul <sean@poorly.run>,
	Marijn Suijten <marijn.suijten@somainline.org>,
	linux-arm-msm@vger.kernel.org, freedreno@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
Date: Fri, 8 May 2026 10:41:07 +0200
Message-ID: <20260508104107.055223e5@fedora>
In-Reply-To: <CACSVV01zGLN8FV3Fpw1BnW+zSokE7n2XJ2dBmDw8-n=MXBmDnw@mail.gmail.com>

On Thu, 7 May 2026 14:38:23 -0700
Rob Clark <rob.clark@oss.qualcomm.com> wrote:

> On Thu, May 7, 2026 at 5:46 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > On Wed, 06 May 2026 14:16:27 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >  
> > > The following race can currently happen:
> > >
> > > | Thread 0 in `drm_gem_lru_scan`               | Thread 1 in `drm_gem_object_release` |
> > > | -                                            | -                                    |
> > > | move obj1 with refcount==0 to `still_in_lru` |                                      |
> > > | move obj2 with refcount!=0 to `still_in_lru` |                                      |
> > > | mutex_unlock                                 |                                      |
> > > | shrink obj2                                  |                                      |
> > > |                                              | lru = obj1->lru; // `still_in_lru`   |
> > > | mutex_lock                                   |                                      |
> > > | move obj1 back to the original lru           |                                      |
> > > | mutex_unlock                                 |                                      |
> > > | return                                       |                                      |
> > > |                                              | dereference `still_in_lru`           |
> > >
> > > Move the drm_gem_lru_move_tail_locked() after the
> > > kref_get_unless_zero() check so that we don't end up with a
> > > vanishing LRU when we hit drm_gem_object_release(). We also need to
> > > remove the skipped object from its LRU, otherwise we'll keep hitting
> > > it on subsequent loop iterations until it's actually removed from the
> > > list in the drm_gem_release().
> > >
> > > Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> > > Reported-by: Chia-I Wu <olvaffe@gmail.com>
> > > Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> > > ---
> > >  drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> > >  1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> > > index fca42949eb2b..97cf63de0112 100644
> > > --- a/drivers/gpu/drm/drm_gem.c
> > > +++ b/drivers/gpu/drm/drm_gem.c
> > > @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> > >               if (!obj)
> > >                       break;
> > >
> > > -             drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> > > -
> > >               /*
> > >                * If it's in the process of being freed, gem_object->free()
> > > -              * may be blocked on lock waiting to remove it.  So just
> > > -              * skip it.
> > > +              * may be blocked on lock waiting to remove it.  So just remove
> > > +              * it from its current LRU and skip it.
> > >                */
> > > -             if (!kref_get_unless_zero(&obj->refcount))
> > > +             if (!kref_get_unless_zero(&obj->refcount)) {
> > > +                     if (obj->lru)
> > > +                             drm_gem_lru_remove_locked(obj);
> > > +  
> >
> > Actually, this thing is still racy, because obj->lru is dereferenced
> > without the lru->lock held in drm_gem_object_release(). At this point
> > I'm wondering if we should expose a drm_gem_lru_remove() taking the LRU
> > lock as an argument as suggested by Steve, and delegate the
> > responsibility to call drm_gem_lru_remove() to the driver. Either that,
> > or we make it so the LRU lock is attached to the drm_device instead of
> > the GEM (both MSM and panthor assume a device-wide lock for LRU
> > manipulation).
> >
> > Rob, what's your take on this matter?  
> 
> I don't think there is a race, because of the kref_get_unless_zero().
> Other than lru_scan, there shouldn't be cases where someone is moving
> an obj between LRUs racing with drm_gem_object_release(), because that
> means they don't own a reference on the obj they are manipulating.

Yeah, but the race I'm talking about is drm_gem_object_release()
vs drm_gem_lru_scan(), so at this point refcount is zero, and this
patch only moves the needle, but doesn't fix the problem entirely:


| Thread 0 in `drm_gem_lru_scan`               | Thread 1 in `drm_gem_object_release` |
| -                                            | -                                    |
|                                              | drm_gem_lru_remove()                 |
|                                              |    lru = obj->lru                    |
|                                              |    if (!lru) return;                 |
| lock(still_in_lru.lock)                      |                                      |
|    if (refcount == 0)                        |                                      |
|       drm_gem_lru_remove_locked(obj)         |                                      |
|         obj->lru = NULL                      |                                      |
|    .....                                     |                                      |
| unlock(still_in_lru.lock)                    |                                      |
|                                              |    lock(lru->lock)                   |
|                                              |       drm_gem_lru_remove_locked(obj) |
|                                              |         obj->lru==NULL => NULL deref |
|                                              |    unlock(lru->lock)                 |

We can of course add an extra

	if (!obj->lru) return;

in drm_gem_lru_remove_locked() to cover for this race, and add a
READ_ONCE in drm_gem_lru_remove() to make sure the compiler doesn't
do crazy things like dereferencing obj->lru twice instead of having
the LRU pointer stored in a register. That still assumes that the lru
we assigned to our local variable is valid even after the
drm_gem_lru_remove_locked(obj) call, which is true at least for MSM
and panthor because they have their LRUs attached to the drm_device,
which outlives any GEMs attached to it. But it's not something the API
enforces or documents as a requirement.
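
To make the suggestion concrete, here is a rough userspace sketch of the
NULL check plus single-load idea, replaying the interleaving from the
table above in one thread. Everything named toy_* is made up for
illustration, the READ_ONCE() macro is a local stand-in for the kernel
one, and pthread mutexes replace the kernel mutex; this is not the
actual drm_gem.c change.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's READ_ONCE(): forces a single load. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical, stripped-down LRU: a lock and an object count. */
struct toy_lru {
	pthread_mutex_t lock;
	long count;
};

struct toy_obj {
	struct toy_lru *lru;	/* NULL when the object is on no LRU */
};

/* Caller holds the lock. Tolerates an object that was already removed. */
static void toy_lru_remove_locked(struct toy_obj *obj)
{
	if (!obj->lru)
		return;

	obj->lru->count--;
	obj->lru = NULL;
}

/* Unlocked entry point: snapshot obj->lru once, then lock the snapshot. */
static void toy_lru_remove(struct toy_obj *obj)
{
	struct toy_lru *lru = READ_ONCE(obj->lru);

	if (!lru)
		return;

	pthread_mutex_lock(&lru->lock);
	toy_lru_remove_locked(obj);
	pthread_mutex_unlock(&lru->lock);
}

/* Replays the table above sequentially. Returns 0 if nothing blew up. */
static int toy_replay_race(void)
{
	struct toy_lru lru = { .lock = PTHREAD_MUTEX_INITIALIZER, .count = 1 };
	struct toy_obj obj = { .lru = &lru };

	/* "Thread 1": first half of the release path, takes the snapshot. */
	struct toy_lru *snap = READ_ONCE(obj.lru);
	assert(snap == &lru);

	/* "Thread 0": the scan sees refcount == 0 and removes the object. */
	pthread_mutex_lock(&lru.lock);
	toy_lru_remove_locked(&obj);
	pthread_mutex_unlock(&lru.lock);

	/* "Thread 1" resumes: obj->lru is NULL now, the check makes it a no-op. */
	pthread_mutex_lock(&snap->lock);
	toy_lru_remove_locked(&obj);
	pthread_mutex_unlock(&snap->lock);

	/* A later full call bails out early on the NULL snapshot. */
	toy_lru_remove(&obj);

	return (obj.lru == NULL && lru.count == 0) ? 0 : -1;
}
```

Note that the sketch only works because `lru` outlives the object here,
which is exactly the undocumented assumption mentioned above.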

> 
> That said, I can't really think of a sensible thing to do with more
> than a single LRU lock per device.  And it does make things easier to
> reason about.

Okay, I'll give it a try then.
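
For reference, the shape I have in mind is roughly the following
userspace sketch, where a single lock owned by the device protects every
LRU and obj->lru itself. All dl_* names are hypothetical, pthread
mutexes stand in for the kernel mutex, and the real patch may well end
up looking different.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical device: owns the one lock protecting all of its LRUs. */
struct dl_dev {
	pthread_mutex_t lru_lock;
};

struct dl_lru {
	long count;
};

struct dl_obj {
	struct dl_dev *dev;	/* stable for the object's whole lifetime */
	struct dl_lru *lru;	/* only read or written under dev->lru_lock */
};

/* Caller holds dev->lru_lock. Passing to == NULL removes the object. */
static void dl_move_locked(struct dl_obj *obj, struct dl_lru *to)
{
	if (obj->lru)
		obj->lru->count--;

	obj->lru = to;

	if (to)
		to->count++;
}

/* Release path: the lock is found through obj->dev, never through obj->lru. */
static void dl_remove(struct dl_obj *obj)
{
	pthread_mutex_lock(&obj->dev->lru_lock);
	dl_move_locked(obj, NULL);
	pthread_mutex_unlock(&obj->dev->lru_lock);
}

/* Scan moves the object to a stack-local LRU, then release removes it. */
static int dl_replay(void)
{
	struct dl_dev dev = { .lru_lock = PTHREAD_MUTEX_INITIALIZER };
	struct dl_lru main_lru = { .count = 0 };
	struct dl_lru still_in_lru = { .count = 0 };
	struct dl_obj obj = { .dev = &dev, .lru = NULL };

	pthread_mutex_lock(&dev.lru_lock);
	dl_move_locked(&obj, &main_lru);
	dl_move_locked(&obj, &still_in_lru);	/* what the scan loop does */
	pthread_mutex_unlock(&dev.lru_lock);

	dl_remove(&obj);

	assert(obj.lru == NULL);
	return (main_lru.count == 0 && still_in_lru.count == 0) ? 0 : -1;
}
```

Since the release path no longer needs obj->lru to find the lock, it
doesn't matter whether the LRU the object sits on is a temporary one
that is about to go away.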

Thread overview: 20+ messages
2026-05-06 12:16 [PATCH 0/3] drm/panthor: Fix a race in the shrinker logic Boris Brezillon
2026-05-06 12:16 ` [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper Boris Brezillon
2026-05-06 15:40   ` Steven Price
2026-05-06 16:25     ` Boris Brezillon
2026-05-07 10:01   ` Liviu Dudau
2026-05-07 12:10     ` Boris Brezillon
2026-05-07 14:40       ` Liviu Dudau
2026-05-07 15:03         ` Boris Brezillon
2026-05-07 15:18           ` Rob Clark
2026-05-06 12:16 ` [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release() Boris Brezillon
2026-05-06 13:21   ` Rob Clark
2026-05-06 14:33     ` Boris Brezillon
2026-05-07 10:18   ` Liviu Dudau
2026-05-07 12:46   ` Boris Brezillon
2026-05-07 21:38     ` Rob Clark
2026-05-08  8:41       ` Boris Brezillon [this message]
2026-05-08 13:49         ` Rob Clark
2026-05-06 12:16 ` [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper Boris Brezillon
2026-05-06 15:40   ` Steven Price
2026-05-07 10:20   ` Liviu Dudau