* [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-06 12:16 [PATCH 0/3] drm/panthor: Fix a race in the shrinker logic Boris Brezillon
@ 2026-05-06 12:16 ` Boris Brezillon
2026-05-06 15:40 ` Steven Price
2026-05-07 10:01 ` Liviu Dudau
2026-05-06 12:16 ` [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release() Boris Brezillon
2026-05-06 12:16 ` [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper Boris Brezillon
2 siblings, 2 replies; 20+ messages in thread
From: Boris Brezillon @ 2026-05-06 12:16 UTC (permalink / raw)
To: Steven Price, Liviu Dudau, Boris Brezillon, Dmitry Osipenko
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov,
Abhinav Kumar, Jessica Zhang, Sean Paul, Marijn Suijten,
linux-arm-msm, freedreno, dri-devel, linux-kernel
drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
variable that's then dereferenced to acquire the LRU lock. Because this
assignment in done without the LRU lock held, it can race with
drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
a stack-allcated LRU that goes away when leaving the function. By
the time we dereference this local lru variable, the object might already
be gone.
It feels like drm_gem_lru_move_tail() was never meant to be used this
way, because there's no easy way we can avoid this race unless we defer
the locking to the caller. Let's add an explicit LRU for unreclaimable
BOs instead, and have all BOs added to this LRU at creation time.
Fixes: fb42964e2a76 ("drm/panthor: Add a GEM shrinker")
Reported-by: Chia-I Wu <olvaffe@gmail.com>
Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
---
drivers/gpu/drm/panthor/panthor_device.h | 10 ++++++++++
drivers/gpu/drm/panthor/panthor_gem.c | 5 ++++-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 4e4607bca7cc..45b71546f83c 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -190,6 +190,16 @@ struct panthor_device {
/** @reclaim.lock: Lock protecting all LRUs */
struct mutex lock;
+ /**
+ * @reclaim.unreclaimable: unreclaimable BOs
+ *
+ * Either the BO is unreclaimable because it has no pages allocated,
+ * or it's unreclaimable because pages are pinned.
+ *
+ * All BOs start in that list at creation time.
+ */
+ struct drm_gem_lru unreclaimable;
+
/**
* @reclaim.unused: BOs with unused pages
*
diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index 13295d7a593d..8e31740126e7 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -204,7 +204,7 @@ void panthor_gem_update_reclaim_state_locked(struct panthor_gem_object *bo,
drm_gem_lru_move_tail(&ptdev->reclaim.gpu_mapped_shared, &bo->base);
break;
case PANTHOR_GEM_UNRECLAIMABLE:
- drm_gem_lru_remove(&bo->base);
+ drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
break;
default:
drm_WARN(&ptdev->base, true, "invalid GEM reclaim state (%d)\n", new_state);
@@ -994,6 +994,7 @@ static struct panthor_gem_object *
panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
struct panthor_vm *exclusive_vm, u32 usage_flags)
{
+ struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
struct panthor_gem_object *bo;
int ret;
@@ -1026,6 +1027,7 @@ panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
}
panthor_gem_debugfs_set_usage_flags(bo, usage_flags);
+ drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
return bo;
err_put:
@@ -1551,6 +1553,7 @@ int panthor_gem_shrinker_init(struct panthor_device *ptdev)
return ret;
INIT_LIST_HEAD(&ptdev->reclaim.vms);
+ drm_gem_lru_init(&ptdev->reclaim.unreclaimable, &ptdev->reclaim.lock);
drm_gem_lru_init(&ptdev->reclaim.unused, &ptdev->reclaim.lock);
drm_gem_lru_init(&ptdev->reclaim.mmapped, &ptdev->reclaim.lock);
drm_gem_lru_init(&ptdev->reclaim.gpu_mapped_shared, &ptdev->reclaim.lock);
--
2.54.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-06 12:16 ` [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper Boris Brezillon
@ 2026-05-06 15:40 ` Steven Price
2026-05-06 16:25 ` Boris Brezillon
2026-05-07 10:01 ` Liviu Dudau
1 sibling, 1 reply; 20+ messages in thread
From: Steven Price @ 2026-05-06 15:40 UTC (permalink / raw)
To: Boris Brezillon, Liviu Dudau, Dmitry Osipenko
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov,
Abhinav Kumar, Jessica Zhang, Sean Paul, Marijn Suijten,
linux-arm-msm, freedreno, dri-devel, linux-kernel
On 06/05/2026 13:16, Boris Brezillon wrote:
> drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
> variable that's then dereferenced to acquire the LRU lock. Because this
> assignment in done without the LRU lock held, it can race with
s/in/is/ ^^
> drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
> a stack-allcated LRU that goes away when leaving the function. By
> the time we dereference this local lru variable, the object might already
> be gone.
>
> It feels like drm_gem_lru_move_tail() was never meant to be used this
> way, because there's no easy way we can avoid this race unless we defer
> the locking to the caller. Let's add an explicit LRU for unreclaimable
> BOs instead, and have all BOs added to this LRU at creation time.
>
> Fixes: fb42964e2a76 ("drm/panthor: Add a GEM shrinker")
> Reported-by: Chia-I Wu <olvaffe@gmail.com>
> Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
With minor typos fixed
Reviewed-by: Steven Price <steven.price@arm.com>
Although an alternative would be to expose drm_gem_lru_remove_locked()
in some form (maybe a wrapper which requires passing in the lock?)
> ---
> drivers/gpu/drm/panthor/panthor_device.h | 10 ++++++++++
> drivers/gpu/drm/panthor/panthor_gem.c | 5 ++++-
> 2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 4e4607bca7cc..45b71546f83c 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -190,6 +190,16 @@ struct panthor_device {
> /** @reclaim.lock: Lock protecting all LRUs */
> struct mutex lock;
>
> + /**
> + * @reclaim.unreclaimable: unreclaimable BOs
> + *
> + * Either the BO is unreclaimable because it has no pages allocated,
> + * or it's unreclaimable because pages are pinned.
> + *
> + * All BOs start in that list at creation time.
s/that/this/ ^^^^
Thanks,
Steve
> + */
> + struct drm_gem_lru unreclaimable;
> +
> /**
> * @reclaim.unused: BOs with unused pages
> *
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index 13295d7a593d..8e31740126e7 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -204,7 +204,7 @@ void panthor_gem_update_reclaim_state_locked(struct panthor_gem_object *bo,
> drm_gem_lru_move_tail(&ptdev->reclaim.gpu_mapped_shared, &bo->base);
> break;
> case PANTHOR_GEM_UNRECLAIMABLE:
> - drm_gem_lru_remove(&bo->base);
> + drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
> break;
> default:
> drm_WARN(&ptdev->base, true, "invalid GEM reclaim state (%d)\n", new_state);
> @@ -994,6 +994,7 @@ static struct panthor_gem_object *
> panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
> struct panthor_vm *exclusive_vm, u32 usage_flags)
> {
> + struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
> struct panthor_gem_object *bo;
> int ret;
>
> @@ -1026,6 +1027,7 @@ panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
> }
>
> panthor_gem_debugfs_set_usage_flags(bo, usage_flags);
> + drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
> return bo;
>
> err_put:
> @@ -1551,6 +1553,7 @@ int panthor_gem_shrinker_init(struct panthor_device *ptdev)
> return ret;
>
> INIT_LIST_HEAD(&ptdev->reclaim.vms);
> + drm_gem_lru_init(&ptdev->reclaim.unreclaimable, &ptdev->reclaim.lock);
> drm_gem_lru_init(&ptdev->reclaim.unused, &ptdev->reclaim.lock);
> drm_gem_lru_init(&ptdev->reclaim.mmapped, &ptdev->reclaim.lock);
> drm_gem_lru_init(&ptdev->reclaim.gpu_mapped_shared, &ptdev->reclaim.lock);
>
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-06 15:40 ` Steven Price
@ 2026-05-06 16:25 ` Boris Brezillon
0 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2026-05-06 16:25 UTC (permalink / raw)
To: Steven Price
Cc: Liviu Dudau, Dmitry Osipenko, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Akash Goel,
Chia-I Wu, Rob Clark, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Wed, 6 May 2026 16:40:22 +0100
Steven Price <steven.price@arm.com> wrote:
> On 06/05/2026 13:16, Boris Brezillon wrote:
> > drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
> > variable that's then dereferenced to acquire the LRU lock. Because this
> > assignment in done without the LRU lock held, it can race with
> s/in/is/ ^^
> > drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
> > a stack-allcated LRU that goes away when leaving the function. By
> > the time we dereference this local lru variable, the object might already
> > be gone.
> >
> > It feels like drm_gem_lru_move_tail() was never meant to be used this
> > way, because there's no easy way we can avoid this race unless we defer
> > the locking to the caller. Let's add an explicit LRU for unreclaimable
> > BOs instead, and have all BOs added to this LRU at creation time.
> >
> > Fixes: fb42964e2a76 ("drm/panthor: Add a GEM shrinker")
> > Reported-by: Chia-I Wu <olvaffe@gmail.com>
> > Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
>
> With minor typos fixed
>
> Reviewed-by: Steven Price <steven.price@arm.com>
>
> Although an alternative would be to expose drm_gem_lru_remove_locked()
> in some form (maybe a wrapper which requires passing in the lock?)
I considered that too, but I thought it was less invasive to just have
a default LRU to start in at creation time, and end in there's nothing
left to reclaim. It's also what MSM does, so I figured I'd do that too.
>
> > ---
> > drivers/gpu/drm/panthor/panthor_device.h | 10 ++++++++++
> > drivers/gpu/drm/panthor/panthor_gem.c | 5 ++++-
> > 2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index 4e4607bca7cc..45b71546f83c 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -190,6 +190,16 @@ struct panthor_device {
> > /** @reclaim.lock: Lock protecting all LRUs */
> > struct mutex lock;
> >
> > + /**
> > + * @reclaim.unreclaimable: unreclaimable BOs
> > + *
> > + * Either the BO is unreclaimable because it has no pages allocated,
> > + * or it's unreclaimable because pages are pinned.
> > + *
> > + * All BOs start in that list at creation time.
> s/that/this/ ^^^^
>
> Thanks,
> Steve
>
> > + */
> > + struct drm_gem_lru unreclaimable;
> > +
> > /**
> > * @reclaim.unused: BOs with unused pages
> > *
> > diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> > index 13295d7a593d..8e31740126e7 100644
> > --- a/drivers/gpu/drm/panthor/panthor_gem.c
> > +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> > @@ -204,7 +204,7 @@ void panthor_gem_update_reclaim_state_locked(struct panthor_gem_object *bo,
> > drm_gem_lru_move_tail(&ptdev->reclaim.gpu_mapped_shared, &bo->base);
> > break;
> > case PANTHOR_GEM_UNRECLAIMABLE:
> > - drm_gem_lru_remove(&bo->base);
> > + drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
> > break;
> > default:
> > drm_WARN(&ptdev->base, true, "invalid GEM reclaim state (%d)\n", new_state);
> > @@ -994,6 +994,7 @@ static struct panthor_gem_object *
> > panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
> > struct panthor_vm *exclusive_vm, u32 usage_flags)
> > {
> > + struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
> > struct panthor_gem_object *bo;
> > int ret;
> >
> > @@ -1026,6 +1027,7 @@ panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
> > }
> >
> > panthor_gem_debugfs_set_usage_flags(bo, usage_flags);
> > + drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
> > return bo;
> >
> > err_put:
> > @@ -1551,6 +1553,7 @@ int panthor_gem_shrinker_init(struct panthor_device *ptdev)
> > return ret;
> >
> > INIT_LIST_HEAD(&ptdev->reclaim.vms);
> > + drm_gem_lru_init(&ptdev->reclaim.unreclaimable, &ptdev->reclaim.lock);
> > drm_gem_lru_init(&ptdev->reclaim.unused, &ptdev->reclaim.lock);
> > drm_gem_lru_init(&ptdev->reclaim.mmapped, &ptdev->reclaim.lock);
> > drm_gem_lru_init(&ptdev->reclaim.gpu_mapped_shared, &ptdev->reclaim.lock);
> >
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-06 12:16 ` [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper Boris Brezillon
2026-05-06 15:40 ` Steven Price
@ 2026-05-07 10:01 ` Liviu Dudau
2026-05-07 12:10 ` Boris Brezillon
1 sibling, 1 reply; 20+ messages in thread
From: Liviu Dudau @ 2026-05-07 10:01 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Dmitry Osipenko, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Akash Goel,
Chia-I Wu, Rob Clark, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Wed, May 06, 2026 at 02:16:26PM +0200, Boris Brezillon wrote:
> drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
> variable that's then dereferenced to acquire the LRU lock. Because this
> assignment in done without the LRU lock held, it can race with
> drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
> a stack-allcated LRU that goes away when leaving the function. By
> the time we dereference this local lru variable, the object might already
> be gone.
>
> It feels like drm_gem_lru_move_tail() was never meant to be used this
> way, because there's no easy way we can avoid this race unless we defer
> the locking to the caller. Let's add an explicit LRU for unreclaimable
> BOs instead, and have all BOs added to this LRU at creation time.
I would argue that drm_gem_lru_scan() is broken by design. If you're going
to release the LRU lock in the middle of a loop you can expect that someone
will get hold of your stack-allocated LRU and end up picking the pieces.
This patch is fine in itself by trying to avoid stepping into the fight,
but I think we should also add a warning in drm_gem_lru_scan() for future
users to be aware of the dangers.
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Best regards,
Liviu
>
> Fixes: fb42964e2a76 ("drm/panthor: Add a GEM shrinker")
> Reported-by: Chia-I Wu <olvaffe@gmail.com>
> Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.h | 10 ++++++++++
> drivers/gpu/drm/panthor/panthor_gem.c | 5 ++++-
> 2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 4e4607bca7cc..45b71546f83c 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -190,6 +190,16 @@ struct panthor_device {
> /** @reclaim.lock: Lock protecting all LRUs */
> struct mutex lock;
>
> + /**
> + * @reclaim.unreclaimable: unreclaimable BOs
> + *
> + * Either the BO is unreclaimable because it has no pages allocated,
> + * or it's unreclaimable because pages are pinned.
> + *
> + * All BOs start in that list at creation time.
> + */
> + struct drm_gem_lru unreclaimable;
> +
> /**
> * @reclaim.unused: BOs with unused pages
> *
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index 13295d7a593d..8e31740126e7 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -204,7 +204,7 @@ void panthor_gem_update_reclaim_state_locked(struct panthor_gem_object *bo,
> drm_gem_lru_move_tail(&ptdev->reclaim.gpu_mapped_shared, &bo->base);
> break;
> case PANTHOR_GEM_UNRECLAIMABLE:
> - drm_gem_lru_remove(&bo->base);
> + drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
> break;
> default:
> drm_WARN(&ptdev->base, true, "invalid GEM reclaim state (%d)\n", new_state);
> @@ -994,6 +994,7 @@ static struct panthor_gem_object *
> panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
> struct panthor_vm *exclusive_vm, u32 usage_flags)
> {
> + struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
> struct panthor_gem_object *bo;
> int ret;
>
> @@ -1026,6 +1027,7 @@ panthor_gem_create(struct drm_device *dev, size_t size, uint32_t flags,
> }
>
> panthor_gem_debugfs_set_usage_flags(bo, usage_flags);
> + drm_gem_lru_move_tail(&ptdev->reclaim.unreclaimable, &bo->base);
> return bo;
>
> err_put:
> @@ -1551,6 +1553,7 @@ int panthor_gem_shrinker_init(struct panthor_device *ptdev)
> return ret;
>
> INIT_LIST_HEAD(&ptdev->reclaim.vms);
> + drm_gem_lru_init(&ptdev->reclaim.unreclaimable, &ptdev->reclaim.lock);
> drm_gem_lru_init(&ptdev->reclaim.unused, &ptdev->reclaim.lock);
> drm_gem_lru_init(&ptdev->reclaim.mmapped, &ptdev->reclaim.lock);
> drm_gem_lru_init(&ptdev->reclaim.gpu_mapped_shared, &ptdev->reclaim.lock);
>
> --
> 2.54.0
>
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-07 10:01 ` Liviu Dudau
@ 2026-05-07 12:10 ` Boris Brezillon
2026-05-07 14:40 ` Liviu Dudau
0 siblings, 1 reply; 20+ messages in thread
From: Boris Brezillon @ 2026-05-07 12:10 UTC (permalink / raw)
To: Liviu Dudau, dri-devel
Cc: Steven Price, Dmitry Osipenko, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Akash Goel,
Chia-I Wu, Rob Clark, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, linux-kernel
On Thu, 7 May 2026 11:01:25 +0100
Liviu Dudau <liviu.dudau@arm.com> wrote:
> On Wed, May 06, 2026 at 02:16:26PM +0200, Boris Brezillon wrote:
> > drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
> > variable that's then dereferenced to acquire the LRU lock. Because this
> > assignment in done without the LRU lock held, it can race with
> > drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
> > a stack-allcated LRU that goes away when leaving the function. By
> > the time we dereference this local lru variable, the object might already
> > be gone.
> >
> > It feels like drm_gem_lru_move_tail() was never meant to be used this
> > way, because there's no easy way we can avoid this race unless we defer
> > the locking to the caller. Let's add an explicit LRU for unreclaimable
> > BOs instead, and have all BOs added to this LRU at creation time.
>
> I would argue that drm_gem_lru_scan() is broken by design. If you're going
> to release the LRU lock in the middle of a loop you can expect that someone
> will get hold of your stack-allocated LRU and end up picking the pieces.
I think it's fine as long as you always use the drm_gem_lru helpers to
manipulate the lru field, which is true of a lot of kernel constructs.
> This patch is fine in itself by trying to avoid stepping into the fight,
> but I think we should also add a warning in drm_gem_lru_scan() for future
> users to be aware of the dangers.
Warning the user about what? There's nothing they can do about it, and
I don't even think it's unsafe per-se, unless someone goes off and
stores the drm_gem_object::lru value somewhere else while their shrink()
callback is called, and accesses it later, outside the shrinker path.
Given drm_gem_lru is not refcounted, there's no way one could safely
hold on the LRU they saw in the shrink() callback anyway, so I don't
think that's fair to blame the drm_gem_lru API for this kind of misuse.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-07 12:10 ` Boris Brezillon
@ 2026-05-07 14:40 ` Liviu Dudau
2026-05-07 15:03 ` Boris Brezillon
0 siblings, 1 reply; 20+ messages in thread
From: Liviu Dudau @ 2026-05-07 14:40 UTC (permalink / raw)
To: Boris Brezillon
Cc: dri-devel, Steven Price, Dmitry Osipenko, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, linux-kernel
On Thu, May 07, 2026 at 02:10:27PM +0200, Boris Brezillon wrote:
> On Thu, 7 May 2026 11:01:25 +0100
> Liviu Dudau <liviu.dudau@arm.com> wrote:
>
> > On Wed, May 06, 2026 at 02:16:26PM +0200, Boris Brezillon wrote:
> > > drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
> > > variable that's then dereferenced to acquire the LRU lock. Because this
> > > assignment in done without the LRU lock held, it can race with
> > > drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
> > > a stack-allcated LRU that goes away when leaving the function. By
> > > the time we dereference this local lru variable, the object might already
> > > be gone.
> > >
> > > It feels like drm_gem_lru_move_tail() was never meant to be used this
> > > way, because there's no easy way we can avoid this race unless we defer
> > > the locking to the caller. Let's add an explicit LRU for unreclaimable
> > > BOs instead, and have all BOs added to this LRU at creation time.
> >
> > I would argue that drm_gem_lru_scan() is broken by design. If you're going
> > to release the LRU lock in the middle of a loop you can expect that someone
> > will get hold of your stack-allocated LRU and end up picking the pieces.
>
> I think it's fine as long as you always use the drm_gem_lru helpers to
> manipulate the lru field, which is true of a lot of kernel constructs.
I think drm_gem_lru_scan() should never set an object's lru field to still_in_lru.
It should set it to NULL when the object's node is removed from its lru and add
it into still_in_lru without making the drm_gem_object->lru to point back to it.
At the very end when we splice back the still_in_lru list back into lru's list we
can then update obj->lru.
>
> > This patch is fine in itself by trying to avoid stepping into the fight,
> > but I think we should also add a warning in drm_gem_lru_scan() for future
> > users to be aware of the dangers.
>
> Warning the user about what? There's nothing they can do about it, and
> I don't even think it's unsafe per-se, unless someone goes off and
> stores the drm_gem_object::lru value somewhere else while their shrink()
> callback is called, and accesses it later, outside the shrinker path.
> Given drm_gem_lru is not refcounted, there's no way one could safely
> hold on the LRU they saw in the shrink() callback anyway, so I don't
> think that's fair to blame the drm_gem_lru API for this kind of misuse.
Yeah, that would be the warning: don't store the object's lru as you might
get a temporary one that will become invalid after the shrinker has run.
Best regards,
Liviu
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-07 14:40 ` Liviu Dudau
@ 2026-05-07 15:03 ` Boris Brezillon
2026-05-07 15:18 ` Rob Clark
0 siblings, 1 reply; 20+ messages in thread
From: Boris Brezillon @ 2026-05-07 15:03 UTC (permalink / raw)
To: Liviu Dudau
Cc: dri-devel, Steven Price, Dmitry Osipenko, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, linux-kernel
On Thu, 7 May 2026 15:40:51 +0100
Liviu Dudau <liviu.dudau@arm.com> wrote:
> On Thu, May 07, 2026 at 02:10:27PM +0200, Boris Brezillon wrote:
> > On Thu, 7 May 2026 11:01:25 +0100
> > Liviu Dudau <liviu.dudau@arm.com> wrote:
> >
> > > On Wed, May 06, 2026 at 02:16:26PM +0200, Boris Brezillon wrote:
> > > > drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
> > > > variable that's then dereferenced to acquire the LRU lock. Because this
> > > > assignment in done without the LRU lock held, it can race with
> > > > drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
> > > > a stack-allcated LRU that goes away when leaving the function. By
> > > > the time we dereference this local lru variable, the object might already
> > > > be gone.
> > > >
> > > > It feels like drm_gem_lru_move_tail() was never meant to be used this
> > > > way, because there's no easy way we can avoid this race unless we defer
> > > > the locking to the caller. Let's add an explicit LRU for unreclaimable
> > > > BOs instead, and have all BOs added to this LRU at creation time.
> > >
> > > I would argue that drm_gem_lru_scan() is broken by design. If you're going
> > > to release the LRU lock in the middle of a loop you can expect that someone
> > > will get hold of your stack-allocated LRU and end up picking the pieces.
> >
> > I think it's fine as long as you always use the drm_gem_lru helpers to
> > manipulate the lru field, which is true of a lot of kernel constructs.
>
> I think drm_gem_lru_scan() should never set an object's lru field to still_in_lru.
> It should set it to NULL when the object's node is removed from its lru and add
> it into still_in_lru without making the drm_gem_object->lru to point back to it.
> At the very end when we splice back the still_in_lru list back into lru's list we
> can then update obj->lru.
Then you run into another race between drm_gem_lru_scan() and
drm_gem_object_release(), where the LRU removal in _release() is
skipped because obj->lru is NULL, and all of a sudden, the still_in_lru
list has an element that's freed. Honestly, I don't think obj->lru
pointing to a stack allocated object is a problem as long as we don't
let gem users play freely with obj->lru (which we shouldn't do anyway).
>
> >
> > > This patch is fine in itself by trying to avoid stepping into the fight,
> > > but I think we should also add a warning in drm_gem_lru_scan() for future
> > > users to be aware of the dangers.
> >
> > Warning the user about what? There's nothing they can do about it, and
> > I don't even think it's unsafe per-se, unless someone goes off and
> > stores the drm_gem_object::lru value somewhere else while their shrink()
> > callback is called, and accesses it later, outside the shrinker path.
> > Given drm_gem_lru is not refcounted, there's no way one could safely
> > hold on the LRU they saw in the shrink() callback anyway, so I don't
> > think that's fair to blame the drm_gem_lru API for this kind of misuse.
>
> Yeah, that would be the warning: don't store the object's lru as you might
> get a temporary one that will become invalid after the shrinker has run.
Oh, you mean a comment explaining this should be avoided, not an actual
drm_warn(). Then yes, I think it's fine to document the expectations in
the drm_gem_object::lru doc.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper
2026-05-07 15:03 ` Boris Brezillon
@ 2026-05-07 15:18 ` Rob Clark
0 siblings, 0 replies; 20+ messages in thread
From: Rob Clark @ 2026-05-07 15:18 UTC (permalink / raw)
To: Boris Brezillon
Cc: Liviu Dudau, dri-devel, Steven Price, Dmitry Osipenko,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Akash Goel, Chia-I Wu, Dmitry Baryshkov,
Abhinav Kumar, Jessica Zhang, Sean Paul, Marijn Suijten,
linux-arm-msm, freedreno, linux-kernel
On Thu, May 7, 2026 at 8:03 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> On Thu, 7 May 2026 15:40:51 +0100
> Liviu Dudau <liviu.dudau@arm.com> wrote:
>
> > On Thu, May 07, 2026 at 02:10:27PM +0200, Boris Brezillon wrote:
> > > On Thu, 7 May 2026 11:01:25 +0100
> > > Liviu Dudau <liviu.dudau@arm.com> wrote:
> > >
> > > > On Wed, May 06, 2026 at 02:16:26PM +0200, Boris Brezillon wrote:
> > > > > drm_gem_lru_remove() dereference stores drm_gem_object::lru in a local
> > > > > variable that's then dereferenced to acquire the LRU lock. Because this
> > > > > assignment in done without the LRU lock held, it can race with
> > > > > drm_gem_lru_scan() where drm_gem_object::lru is temporarily assigned
> > > > > a stack-allcated LRU that goes away when leaving the function. By
> > > > > the time we dereference this local lru variable, the object might already
> > > > > be gone.
> > > > >
> > > > > It feels like drm_gem_lru_move_tail() was never meant to be used this
> > > > > way, because there's no easy way we can avoid this race unless we defer
> > > > > the locking to the caller. Let's add an explicit LRU for unreclaimable
> > > > > BOs instead, and have all BOs added to this LRU at creation time.
> > > >
> > > > I would argue that drm_gem_lru_scan() is broken by design. If you're going
> > > > to release the LRU lock in the middle of a loop you can expect that someone
> > > > will get hold of your stack-allocated LRU and end up picking the pieces.
> > >
> > > I think it's fine as long as you always use the drm_gem_lru helpers to
> > > manipulate the lru field, which is true of a lot of kernel constructs.
> >
> > I think drm_gem_lru_scan() should never set an object's lru field to still_in_lru.
> > It should set it to NULL when the object's node is removed from its lru and add
> > it into still_in_lru without making the drm_gem_object->lru to point back to it.
> > At the very end when we splice back the still_in_lru list back into lru's list we
> > can then update obj->lru.
>
> Then you run into another race between drm_gem_lru_scan() and
> drm_gem_object_release(), where the LRU removal in _release() is
> skipped because obj->lru is NULL, and all of a sudden, the still_in_lru
> list has an element that's freed. Honestly, I don't think obj->lru
> pointing to a stack allocated object is a problem as long as we don't
> let gem users play freely with obj->lru (which we shouldn't do anyway).
Holding a ref while we drop the lock should keep us out of
drm_gem_object_release()..
But yeah, we should probably just document this..
BR,
-R
> >
> > >
> > > > This patch is fine in itself by trying to avoid stepping into the fight,
> > > > but I think we should also add a warning in drm_gem_lru_scan() for future
> > > > users to be aware of the dangers.
> > >
> > > Warning the user about what? There's nothing they can do about it, and
> > > I don't even think it's unsafe per-se, unless someone goes off and
> > > stores the drm_gem_object::lru value somewhere else while their shrink()
> > > callback is called, and accesses it later, outside the shrinker path.
> > > Given drm_gem_lru is not refcounted, there's no way one could safely
> > > hold on the LRU they saw in the shrink() callback anyway, so I don't
> > > think that's fair to blame the drm_gem_lru API for this kind of misuse.
> >
> > Yeah, that would be the warning: don't store the object's lru as you might
> > get a temporary one that will become invalid after the shrinker has run.
>
> Oh, you mean a comment explaining this should be avoided, not an actual
> drm_warn(). Then yes, I think it's fine to document the expectations in
> the drm_gem_object::lru doc.
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-06 12:16 [PATCH 0/3] drm/panthor: Fix a race in the shrinker logic Boris Brezillon
2026-05-06 12:16 ` [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper Boris Brezillon
@ 2026-05-06 12:16 ` Boris Brezillon
2026-05-06 13:21 ` Rob Clark
` (2 more replies)
2026-05-06 12:16 ` [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper Boris Brezillon
2 siblings, 3 replies; 20+ messages in thread
From: Boris Brezillon @ 2026-05-06 12:16 UTC (permalink / raw)
To: Steven Price, Liviu Dudau, Boris Brezillon, Dmitry Osipenko
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov,
Abhinav Kumar, Jessica Zhang, Sean Paul, Marijn Suijten,
linux-arm-msm, freedreno, dri-devel, linux-kernel
The following race can currently happen:
| Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
| - | - |
| move obj1 with refcount==0 to `still_in_lru` | |
| move obj2 with refcount!=0 to `still_in_lru` | |
| mutex_unlock | |
| shrink obj2 | |
| | lru = obj1->lru; // `still_in_lru` |
| mutex_lock | |
| move obj1 back to the original lru | |
| mutex_unlock | |
| return | |
| | dereference `still_in_lru` |
Move the drm_gem_lru_move_tail_locked() after the
kref_get_unless_zero() check so that we don't end up with a
vanishing LRU when we hit drm_gem_object_release(). We also need to
remove the skipped object from its LRU, otherwise we'll keep hitting
it on subsequent loop iterations until it's actually removed from the
list in the drm_gem_release().
Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
Reported-by: Chia-I Wu <olvaffe@gmail.com>
Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
---
drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index fca42949eb2b..97cf63de0112 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
if (!obj)
break;
- drm_gem_lru_move_tail_locked(&still_in_lru, obj);
-
/*
* If it's in the process of being freed, gem_object->free()
- * may be blocked on lock waiting to remove it. So just
- * skip it.
+ * may be blocked on lock waiting to remove it. So just remove
+ * it from its current LRU and skip it.
*/
- if (!kref_get_unless_zero(&obj->refcount))
+ if (!kref_get_unless_zero(&obj->refcount)) {
+ if (obj->lru)
+ drm_gem_lru_remove_locked(obj);
+
continue;
+ }
+
+ drm_gem_lru_move_tail_locked(&still_in_lru, obj);
/*
* Now that we own a reference, we can drop the lock for the
--
2.54.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-06 12:16 ` [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release() Boris Brezillon
@ 2026-05-06 13:21 ` Rob Clark
2026-05-06 14:33 ` Boris Brezillon
2026-05-07 10:18 ` Liviu Dudau
2026-05-07 12:46 ` Boris Brezillon
2 siblings, 1 reply; 20+ messages in thread
From: Rob Clark @ 2026-05-06 13:21 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Liviu Dudau, Dmitry Osipenko, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Akash Goel, Chia-I Wu, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Wed, May 6, 2026 at 5:16 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> The following race can currently happen:
>
> | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> | - | - |
> | move obj1 with refcount==0 to `still_in_lru` | |
> | move obj2 with refcount!=0 to `still_in_lru` | |
> | mutex_unlock | |
> | shrink obj2 | |
> | | lru = obj1->lru; // `still_in_lru` |
> | mutex_lock | |
> | move obj1 back to the original lru | |
> | mutex_unlock | |
> | return | |
> | | dereference `still_in_lru` |
>
> Move the drm_gem_lru_move_tail_locked() after the
> kref_get_unless_zero() check so that we don't end up with a
> vanishing LRU when we hit drm_gem_object_release(). We also need to
> remove the skipped object from its LRU, otherwise we'll keep hitting
> it on subsequent loop iterations until it's actually removed from the
> list in the drm_gem_release().
>
> Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> Reported-by: Chia-I Wu <olvaffe@gmail.com>
> Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> ---
> drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index fca42949eb2b..97cf63de0112 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> if (!obj)
> break;
>
> - drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> -
> /*
> * If it's in the process of being freed, gem_object->free()
> - * may be blocked on lock waiting to remove it. So just
> - * skip it.
> + * may be blocked on lock waiting to remove it. So just remove
> + * it from its current LRU and skip it.
> */
> - if (!kref_get_unless_zero(&obj->refcount))
> + if (!kref_get_unless_zero(&obj->refcount)) {
> + if (obj->lru)
> + drm_gem_lru_remove_locked(obj);
if we are iterating a LRU.. and lru->lock is held, shouldn't obj->lru
always be non-null?
BR,
-R
> +
> continue;
> + }
> +
> + drm_gem_lru_move_tail_locked(&still_in_lru, obj);
>
> /*
> * Now that we own a reference, we can drop the lock for the
>
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-06 13:21 ` Rob Clark
@ 2026-05-06 14:33 ` Boris Brezillon
0 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2026-05-06 14:33 UTC (permalink / raw)
To: Rob Clark
Cc: Steven Price, Liviu Dudau, Dmitry Osipenko, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Akash Goel, Chia-I Wu, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Wed, 6 May 2026 06:21:42 -0700
Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> On Wed, May 6, 2026 at 5:16 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > The following race can currently happen:
> >
> > | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> > | - | - |
> > | move obj1 with refcount==0 to `still_in_lru` | |
> > | move obj2 with refcount!=0 to `still_in_lru` | |
> > | mutex_unlock | |
> > | shrink obj2 | |
> > | | lru = obj1->lru; // `still_in_lru` |
> > | mutex_lock | |
> > | move obj1 back to the original lru | |
> > | mutex_unlock | |
> > | return | |
> > | | dereference `still_in_lru` |
> >
> > Move the drm_gem_lru_move_tail_locked() after the
> > kref_get_unless_zero() check so that we don't end up with a
> > vanishing LRU when we hit drm_gem_object_release(). We also need to
> > remove the skipped object from its LRU, otherwise we'll keep hitting
> > it on subsequent loop iterations until it's actually removed from the
> > list in the drm_gem_release().
> >
> > Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> > Reported-by: Chia-I Wu <olvaffe@gmail.com>
> > Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> > ---
> > drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> > index fca42949eb2b..97cf63de0112 100644
> > --- a/drivers/gpu/drm/drm_gem.c
> > +++ b/drivers/gpu/drm/drm_gem.c
> > @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> > if (!obj)
> > break;
> >
> > - drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> > -
> > /*
> > * If it's in the process of being freed, gem_object->free()
> > - * may be blocked on lock waiting to remove it. So just
> > - * skip it.
> > + * may be blocked on lock waiting to remove it. So just remove
> > + * it from its current LRU and skip it.
> > */
> > - if (!kref_get_unless_zero(&obj->refcount))
> > + if (!kref_get_unless_zero(&obj->refcount)) {
> > + if (obj->lru)
> > + drm_gem_lru_remove_locked(obj);
>
> if we are iterating a LRU.. and lru->lock is held, shouldn't obj->lru
> always be non-null?
Right, the != NULL check is not needed here.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-06 12:16 ` [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release() Boris Brezillon
2026-05-06 13:21 ` Rob Clark
@ 2026-05-07 10:18 ` Liviu Dudau
2026-05-07 12:46 ` Boris Brezillon
2 siblings, 0 replies; 20+ messages in thread
From: Liviu Dudau @ 2026-05-07 10:18 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Dmitry Osipenko, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Akash Goel,
Chia-I Wu, Rob Clark, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Wed, May 06, 2026 at 02:16:27PM +0200, Boris Brezillon wrote:
> The following race can currently happen:
>
> | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> | - | - |
> | move obj1 with refcount==0 to `still_in_lru` | |
> | move obj2 with refcount!=0 to `still_in_lru` | |
> | mutex_unlock | |
> | shrink obj2 | |
> | | lru = obj1->lru; // `still_in_lru` |
> | mutex_lock | |
> | move obj1 back to the original lru | |
> | mutex_unlock | |
> | return | |
> | | dereference `still_in_lru` |
>
> Move the drm_gem_lru_move_tail_locked() after the
> kref_get_unless_zero() check so that we don't end up with a
> vanishing LRU when we hit drm_gem_object_release(). We also need to
> remove the skipped object from its LRU, otherwise we'll keep hitting
> it on subsequent loop iterations until it's actually removed from the
> list in the drm_gem_release().
>
> Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> Reported-by: Chia-I Wu <olvaffe@gmail.com>
> Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Best regards,
Liviu
> ---
> drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index fca42949eb2b..97cf63de0112 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> if (!obj)
> break;
>
> - drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> -
> /*
> * If it's in the process of being freed, gem_object->free()
> - * may be blocked on lock waiting to remove it. So just
> - * skip it.
> + * may be blocked on lock waiting to remove it. So just remove
> + * it from its current LRU and skip it.
> */
> - if (!kref_get_unless_zero(&obj->refcount))
> + if (!kref_get_unless_zero(&obj->refcount)) {
> + if (obj->lru)
> + drm_gem_lru_remove_locked(obj);
> +
> continue;
> + }
> +
> + drm_gem_lru_move_tail_locked(&still_in_lru, obj);
>
> /*
> * Now that we own a reference, we can drop the lock for the
>
> --
> 2.54.0
>
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-06 12:16 ` [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release() Boris Brezillon
2026-05-06 13:21 ` Rob Clark
2026-05-07 10:18 ` Liviu Dudau
@ 2026-05-07 12:46 ` Boris Brezillon
2026-05-07 21:38 ` Rob Clark
2 siblings, 1 reply; 20+ messages in thread
From: Boris Brezillon @ 2026-05-07 12:46 UTC (permalink / raw)
To: Steven Price, Liviu Dudau, Boris Brezillon, Dmitry Osipenko
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov,
Abhinav Kumar, Jessica Zhang, Sean Paul, Marijn Suijten,
linux-arm-msm, freedreno, dri-devel, linux-kernel
On Wed, 06 May 2026 14:16:27 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:
> The following race can currently happen:
>
> | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> | - | - |
> | move obj1 with refcount==0 to `still_in_lru` | |
> | move obj2 with refcount!=0 to `still_in_lru` | |
> | mutex_unlock | |
> | shrink obj2 | |
> | | lru = obj1->lru; // `still_in_lru` |
> | mutex_lock | |
> | move obj1 back to the original lru | |
> | mutex_unlock | |
> | return | |
> | | dereference `still_in_lru` |
>
> Move the drm_gem_lru_move_tail_locked() after the
> kref_get_unless_zero() check so that we don't end up with a
> vanishing LRU when we hit drm_gem_object_release(). We also need to
> remove the skipped object from its LRU, otherwise we'll keep hitting
> it on subsequent loop iterations until it's actually removed from the
> list in the drm_gem_release().
>
> Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> Reported-by: Chia-I Wu <olvaffe@gmail.com>
> Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> ---
> drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index fca42949eb2b..97cf63de0112 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> if (!obj)
> break;
>
> - drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> -
> /*
> * If it's in the process of being freed, gem_object->free()
> - * may be blocked on lock waiting to remove it. So just
> - * skip it.
> + * may be blocked on lock waiting to remove it. So just remove
> + * it from its current LRU and skip it.
> */
> - if (!kref_get_unless_zero(&obj->refcount))
> + if (!kref_get_unless_zero(&obj->refcount)) {
> + if (obj->lru)
> + drm_gem_lru_remove_locked(obj);
> +
Actually, this thing is still racy, because obj->lru is dereferenced
without the lru->lock held in drm_gem_object_release(). At this point
I'm wondering if we should expose a drm_gem_lru_remove() taking the LRU
lock as an argument as suggested by Steve, and delegate the
responsibility to call drm_gem_lru_remove() to the driver. Either that,
or we make it so the LRU lock is attached to the drm_device instead of
the GEM (both MSM and panthor assume a device-wide lock for LRU
manipulation).
Rob, what's your take on this matter?
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-07 12:46 ` Boris Brezillon
@ 2026-05-07 21:38 ` Rob Clark
2026-05-08 8:41 ` Boris Brezillon
0 siblings, 1 reply; 20+ messages in thread
From: Rob Clark @ 2026-05-07 21:38 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Liviu Dudau, Dmitry Osipenko, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Akash Goel, Chia-I Wu, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Thu, May 7, 2026 at 5:46 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> On Wed, 06 May 2026 14:16:27 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>
> > The following race can currently happen:
> >
> > | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> > | - | - |
> > | move obj1 with refcount==0 to `still_in_lru` | |
> > | move obj2 with refcount!=0 to `still_in_lru` | |
> > | mutex_unlock | |
> > | shrink obj2 | |
> > | | lru = obj1->lru; // `still_in_lru` |
> > | mutex_lock | |
> > | move obj1 back to the original lru | |
> > | mutex_unlock | |
> > | return | |
> > | | dereference `still_in_lru` |
> >
> > Move the drm_gem_lru_move_tail_locked() after the
> > kref_get_unless_zero() check so that we don't end up with a
> > vanishing LRU when we hit drm_gem_object_release(). We also need to
> > remove the skipped object from its LRU, otherwise we'll keep hitting
> > it on subsequent loop iterations until it's actually removed from the
> > list in the drm_gem_release().
> >
> > Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> > Reported-by: Chia-I Wu <olvaffe@gmail.com>
> > Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> > ---
> > drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> > index fca42949eb2b..97cf63de0112 100644
> > --- a/drivers/gpu/drm/drm_gem.c
> > +++ b/drivers/gpu/drm/drm_gem.c
> > @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> > if (!obj)
> > break;
> >
> > - drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> > -
> > /*
> > * If it's in the process of being freed, gem_object->free()
> > - * may be blocked on lock waiting to remove it. So just
> > - * skip it.
> > + * may be blocked on lock waiting to remove it. So just remove
> > + * it from its current LRU and skip it.
> > */
> > - if (!kref_get_unless_zero(&obj->refcount))
> > + if (!kref_get_unless_zero(&obj->refcount)) {
> > + if (obj->lru)
> > + drm_gem_lru_remove_locked(obj);
> > +
>
> Actually, this thing is still racy, because obj->lru is dereferenced
> without the lru->lock held in drm_gem_object_release(). At this point
> I'm wondering if we should expose a drm_gem_lru_remove() taking the LRU
> lock as an argument as suggested by Steve, and delegate the
> responsibility to call drm_gem_lru_remove() to the driver. Either that,
> or we make it so the LRU lock is attached to the drm_device instead of
> the GEM (both MSM and panthor assume a device-wide lock for LRU
> manipulation).
>
> Rob, what's your take on this matter?
I don't think there is a race, because of the kref_get_unless_zero().
Other than lru_scan, there shouldn't be cases where someone is moving
an obj between LRUs racing with drm_gem_object_release(), because that
means they don't own a reference on the obj they are manipulating.
That said, I can't really think of a sensible thing to do with more
than a single LRU lock per device. And it does make things easier to
reason about.
BR,
-R
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-07 21:38 ` Rob Clark
@ 2026-05-08 8:41 ` Boris Brezillon
2026-05-08 13:49 ` Rob Clark
0 siblings, 1 reply; 20+ messages in thread
From: Boris Brezillon @ 2026-05-08 8:41 UTC (permalink / raw)
To: Rob Clark
Cc: Steven Price, Liviu Dudau, Dmitry Osipenko, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Akash Goel, Chia-I Wu, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Thu, 7 May 2026 14:38:23 -0700
Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> On Thu, May 7, 2026 at 5:46 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > On Wed, 06 May 2026 14:16:27 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >
> > > The following race can currently happen:
> > >
> > > | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> > > | - | - |
> > > | move obj1 with refcount==0 to `still_in_lru` | |
> > > | move obj2 with refcount!=0 to `still_in_lru` | |
> > > | mutex_unlock | |
> > > | shrink obj2 | |
> > > | | lru = obj1->lru; // `still_in_lru` |
> > > | mutex_lock | |
> > > | move obj1 back to the original lru | |
> > > | mutex_unlock | |
> > > | return | |
> > > | | dereference `still_in_lru` |
> > >
> > > Move the drm_gem_lru_move_tail_locked() after the
> > > kref_get_unless_zero() check so that we don't end up with a
> > > vanishing LRU when we hit drm_gem_object_release(). We also need to
> > > remove the skipped object from its LRU, otherwise we'll keep hitting
> > > it on subsequent loop iterations until it's actually removed from the
> > > list in the drm_gem_release().
> > >
> > > Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> > > Reported-by: Chia-I Wu <olvaffe@gmail.com>
> > > Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> > > ---
> > > drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> > > 1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> > > index fca42949eb2b..97cf63de0112 100644
> > > --- a/drivers/gpu/drm/drm_gem.c
> > > +++ b/drivers/gpu/drm/drm_gem.c
> > > @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> > > if (!obj)
> > > break;
> > >
> > > - drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> > > -
> > > /*
> > > * If it's in the process of being freed, gem_object->free()
> > > - * may be blocked on lock waiting to remove it. So just
> > > - * skip it.
> > > + * may be blocked on lock waiting to remove it. So just remove
> > > + * it from its current LRU and skip it.
> > > */
> > > - if (!kref_get_unless_zero(&obj->refcount))
> > > + if (!kref_get_unless_zero(&obj->refcount)) {
> > > + if (obj->lru)
> > > + drm_gem_lru_remove_locked(obj);
> > > +
> >
> > Actually, this thing is still racy, because obj->lru is dereferenced
> > without the lru->lock held in drm_gem_object_release(). At this point
> > I'm wondering if we should expose a drm_gem_lru_remove() taking the LRU
> > lock as an argument as suggested by Steve, and delegate the
> > responsibility to call drm_gem_lru_remove() to the driver. Either that,
> > or we make it so the LRU lock is attached to the drm_device instead of
> > the GEM (both MSM and panthor assume a device-wide lock for LRU
> > manipulation).
> >
> > Rob, what's your take on this matter?
>
> I don't think there is a race, because of the kref_get_unless_zero().
> Other than lru_scan, there shouldn't be cases where someone is moving
> an obj between LRUs racing with drm_gem_object_release(), because that
> means they don't own a reference on the obj they are manipulating.
Yeah, but the race I'm talking about is drm_gem_object_release()
vs drm_gem_lru_scan(), so at this point refcount is zero, and this
patch only moves the needle, but doesn't fix the problem entirely:
| Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
| - | - |
| | drm_gem_lru_remove() |
| | lru = obj->lru |
| | if (!lru) return; |
| lock(still_in_lru.lock) | |
| if (refcount == 0) | |
| drm_gem_lru_remove_locked(obj) | |
| obj->lru = NULL | |
| ..... | |
| unlock(still_in_lru.lock) | |
| | lock(lru->lock) |
| | drm_gem_lru_remove_locked(obj) |
| | obj->lru==NULL => NULL deref |
| | unlock(lru->lock) |
We can of course add an extra
if (!obj->lru) return;
in drm_gem_lru_remove_locked() to cover for this race, and add a
READ_ONCE in drm_gem_lru_remove() to make sure the compiler doesn't
do crazy things like dereferencing obj->lru twice instead of having
the LRU pointer stored in a register. That still assumes that the lru
we assigned to our local variable is valid even after the
drm_gem_lru_remove_locked(obj) call, which is true at least for MSM and
and panthor because they have their LRUs attached to the drm_device,
which outlives any GEMs attached to it. But it's not something the API
enforce or document as a requirement.
>
> That said, I can't really think of a sensible thing to do with more
> than a single LRU lock per device. And it does make things easier to
> reason about.
Okay, I'll give it a try then.
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release()
2026-05-08 8:41 ` Boris Brezillon
@ 2026-05-08 13:49 ` Rob Clark
0 siblings, 0 replies; 20+ messages in thread
From: Rob Clark @ 2026-05-08 13:49 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Liviu Dudau, Dmitry Osipenko, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Akash Goel, Chia-I Wu, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Fri, May 8, 2026 at 1:41 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> On Thu, 7 May 2026 14:38:23 -0700
> Rob Clark <rob.clark@oss.qualcomm.com> wrote:
>
> > On Thu, May 7, 2026 at 5:46 AM Boris Brezillon
> > <boris.brezillon@collabora.com> wrote:
> > >
> > > On Wed, 06 May 2026 14:16:27 +0200
> > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > >
> > > > The following race can currently happen:
> > > >
> > > > | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> > > > | - | - |
> > > > | move obj1 with refcount==0 to `still_in_lru` | |
> > > > | move obj2 with refcount!=0 to `still_in_lru` | |
> > > > | mutex_unlock | |
> > > > | shrink obj2 | |
> > > > | | lru = obj1->lru; // `still_in_lru` |
> > > > | mutex_lock | |
> > > > | move obj1 back to the original lru | |
> > > > | mutex_unlock | |
> > > > | return | |
> > > > | | dereference `still_in_lru` |
> > > >
> > > > Move the drm_gem_lru_move_tail_locked() after the
> > > > kref_get_unless_zero() check so that we don't end up with a
> > > > vanishing LRU when we hit drm_gem_object_release(). We also need to
> > > > remove the skipped object from its LRU, otherwise we'll keep hitting
> > > > it on subsequent loop iterations until it's actually removed from the
> > > > list in the drm_gem_release().
> > > >
> > > > Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper")
> > > > Reported-by: Chia-I Wu <olvaffe@gmail.com>
> > > > Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86
> > > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > > Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
> > > > ---
> > > > drivers/gpu/drm/drm_gem.c | 14 +++++++++-----
> > > > 1 file changed, 9 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> > > > index fca42949eb2b..97cf63de0112 100644
> > > > --- a/drivers/gpu/drm/drm_gem.c
> > > > +++ b/drivers/gpu/drm/drm_gem.c
> > > > @@ -1660,15 +1660,19 @@ drm_gem_lru_scan(struct drm_gem_lru *lru,
> > > > if (!obj)
> > > > break;
> > > >
> > > > - drm_gem_lru_move_tail_locked(&still_in_lru, obj);
> > > > -
> > > > /*
> > > > * If it's in the process of being freed, gem_object->free()
> > > > - * may be blocked on lock waiting to remove it. So just
> > > > - * skip it.
> > > > + * may be blocked on lock waiting to remove it. So just remove
> > > > + * it from its current LRU and skip it.
> > > > */
> > > > - if (!kref_get_unless_zero(&obj->refcount))
> > > > + if (!kref_get_unless_zero(&obj->refcount)) {
> > > > + if (obj->lru)
> > > > + drm_gem_lru_remove_locked(obj);
> > > > +
> > >
> > > Actually, this thing is still racy, because obj->lru is dereferenced
> > > without the lru->lock held in drm_gem_object_release(). At this point
> > > I'm wondering if we should expose a drm_gem_lru_remove() taking the LRU
> > > lock as an argument as suggested by Steve, and delegate the
> > > responsibility to call drm_gem_lru_remove() to the driver. Either that,
> > > or we make it so the LRU lock is attached to the drm_device instead of
> > > the GEM (both MSM and panthor assume a device-wide lock for LRU
> > > manipulation).
> > >
> > > Rob, what's your take on this matter?
> >
> > I don't think there is a race, because of the kref_get_unless_zero().
> > Other than lru_scan, there shouldn't be cases where someone is moving
> > an obj between LRUs racing with drm_gem_object_release(), because that
> > means they don't own a reference on the obj they are manipulating.
>
> Yeah, but the race I'm talking about is drm_gem_object_release()
> vs drm_gem_lru_scan(), so at this point refcount is zero, and this
> patch only moves the needle, but doesn't fix the problem entirely:
>
>
> | Thread 0 in `drm_gem_lru_scan` | Thread 1 in `drm_gem_object_release` |
> | - | - |
> | | drm_gem_lru_remove() |
> | | lru = obj->lru |
> | | if (!lru) return; |
> | lock(still_in_lru.lock) | |
> | if (refcount == 0) | |
> | drm_gem_lru_remove_locked(obj) | |
> | obj->lru = NULL | |
> | ..... | |
> | unlock(still_in_lru.lock) | |
> | | lock(lru->lock) |
> | | drm_gem_lru_remove_locked(obj) |
> | | obj->lru==NULL => NULL deref |
> | | unlock(lru->lock) |
>
> We can of course add an extra
>
> if (!obj->lru) return;
>
> in drm_gem_lru_remove_locked() to cover for this race, and add a
> READ_ONCE in drm_gem_lru_remove() to make sure the compiler doesn't
> do crazy things like dereferencing obj->lru twice instead of having
> the LRU pointer stored in a register. That still assumes that the lru
> we assigned to our local variable is valid even after the
> drm_gem_lru_remove_locked(obj) call, which is true at least for MSM and
> and panthor because they have their LRUs attached to the drm_device,
> which outlives any GEMs attached to it. But it's not something the API
> enforce or document as a requirement.
Ahh, right.. yeah drm_gem_lru_remove() should READ_ONCE(), and check
obj->lru after acquiring the lock.
And yeah, that is assuming there aren't lifetime issues with the lock itself
> >
> > That said, I can't really think of a sensible thing to do with more
> > than a single LRU lock per device. And it does make things easier to
> > reason about.
>
> Okay, I'll give it a try then.
sounds good
BR,
-R
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper
2026-05-06 12:16 [PATCH 0/3] drm/panthor: Fix a race in the shrinker logic Boris Brezillon
2026-05-06 12:16 ` [PATCH 1/3] drm/panthor: Don't use the racy drm_gem_lru_remove() helper Boris Brezillon
2026-05-06 12:16 ` [PATCH 2/3] drm/gem: Fix a race between drm_gem_lru_scan() and drm_gem_object_release() Boris Brezillon
@ 2026-05-06 12:16 ` Boris Brezillon
2026-05-06 15:40 ` Steven Price
2026-05-07 10:20 ` Liviu Dudau
2 siblings, 2 replies; 20+ messages in thread
From: Boris Brezillon @ 2026-05-06 12:16 UTC (permalink / raw)
To: Steven Price, Liviu Dudau, Boris Brezillon, Dmitry Osipenko
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov,
Abhinav Kumar, Jessica Zhang, Sean Paul, Marijn Suijten,
linux-arm-msm, freedreno, dri-devel, linux-kernel
The only place where it's safe to call drm_gem_lru_remove() is when
we know the drm_gem_object::lru field can't be concurrently updated,
which we know is the case when the drm_gem_object is destroyed.
Rather than trying to make that safe, let's kill the function and inline
its content in drm_gem_object_release().
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
---
drivers/gpu/drm/drm_gem.c | 49 +++++++++++++++++------------------------------
include/drm/drm_gem.h | 1 -
2 files changed, 18 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 97cf63de0112..b6df4f62f7b5 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1108,6 +1108,15 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
idr_destroy(&file_private->object_idr);
}
+static void
+drm_gem_lru_remove_locked(struct drm_gem_object *obj)
+{
+ obj->lru->count -= obj->size >> PAGE_SHIFT;
+ WARN_ON(obj->lru->count < 0);
+ list_del(&obj->lru_node);
+ obj->lru = NULL;
+}
+
/**
* drm_gem_object_release - release GEM buffer object resources
* @obj: GEM buffer object
@@ -1124,7 +1133,15 @@ drm_gem_object_release(struct drm_gem_object *obj)
drm_gem_private_object_fini(obj);
drm_gem_free_mmap_offset(obj);
- drm_gem_lru_remove(obj);
+
+ /* If the object refcount drops to zero, it means no one can change
+ * the LRU it's inserted into, so it's safe to dereference
+ * drm_gem_object::lru without the drm_gem_lru::lock held.
+ */
+ if (obj->lru) {
+ guard(mutex)(obj->lru->lock);
+ drm_gem_lru_remove_locked(obj);
+ }
}
EXPORT_SYMBOL(drm_gem_object_release);
@@ -1552,36 +1569,6 @@ drm_gem_lru_init(struct drm_gem_lru *lru, struct mutex *lock)
}
EXPORT_SYMBOL(drm_gem_lru_init);
-static void
-drm_gem_lru_remove_locked(struct drm_gem_object *obj)
-{
- obj->lru->count -= obj->size >> PAGE_SHIFT;
- WARN_ON(obj->lru->count < 0);
- list_del(&obj->lru_node);
- obj->lru = NULL;
-}
-
-/**
- * drm_gem_lru_remove - remove object from whatever LRU it is in
- *
- * If the object is currently in any LRU, remove it.
- *
- * @obj: The GEM object to remove from current LRU
- */
-void
-drm_gem_lru_remove(struct drm_gem_object *obj)
-{
- struct drm_gem_lru *lru = obj->lru;
-
- if (!lru)
- return;
-
- mutex_lock(lru->lock);
- drm_gem_lru_remove_locked(obj);
- mutex_unlock(lru->lock);
-}
-EXPORT_SYMBOL(drm_gem_lru_remove);
-
/**
* drm_gem_lru_move_tail_locked - move the object to the tail of the LRU
*
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index 86f5846154f7..d527df98d142 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -611,7 +611,6 @@ int drm_gem_dumb_map_offset(struct drm_file *file, struct drm_device *dev,
u32 handle, u64 *offset);
void drm_gem_lru_init(struct drm_gem_lru *lru, struct mutex *lock);
-void drm_gem_lru_remove(struct drm_gem_object *obj);
void drm_gem_lru_move_tail_locked(struct drm_gem_lru *lru, struct drm_gem_object *obj);
void drm_gem_lru_move_tail(struct drm_gem_lru *lru, struct drm_gem_object *obj);
unsigned long
--
2.54.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper
2026-05-06 12:16 ` [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper Boris Brezillon
@ 2026-05-06 15:40 ` Steven Price
2026-05-07 10:20 ` Liviu Dudau
1 sibling, 0 replies; 20+ messages in thread
From: Steven Price @ 2026-05-06 15:40 UTC (permalink / raw)
To: Boris Brezillon, Liviu Dudau, Dmitry Osipenko
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Akash Goel, Chia-I Wu, Rob Clark, Dmitry Baryshkov,
Abhinav Kumar, Jessica Zhang, Sean Paul, Marijn Suijten,
linux-arm-msm, freedreno, dri-devel, linux-kernel
On 06/05/2026 13:16, Boris Brezillon wrote:
> The only place where it's safe to call drm_gem_lru_remove() is when
> we know the drm_gem_object::lru field can't be concurrently updated,
> which we know is the case when the drm_gem_object is destroyed.
>
> Rather than trying to make that safe, let's kill the function and inline
> its content in drm_gem_object_release().
>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Steven Price <steven.price@arm.com>
> ---
> drivers/gpu/drm/drm_gem.c | 49 +++++++++++++++++------------------------------
> include/drm/drm_gem.h | 1 -
> 2 files changed, 18 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 97cf63de0112..b6df4f62f7b5 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1108,6 +1108,15 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
> idr_destroy(&file_private->object_idr);
> }
>
> +static void
> +drm_gem_lru_remove_locked(struct drm_gem_object *obj)
> +{
> + obj->lru->count -= obj->size >> PAGE_SHIFT;
> + WARN_ON(obj->lru->count < 0);
> + list_del(&obj->lru_node);
> + obj->lru = NULL;
> +}
> +
> /**
> * drm_gem_object_release - release GEM buffer object resources
> * @obj: GEM buffer object
> @@ -1124,7 +1133,15 @@ drm_gem_object_release(struct drm_gem_object *obj)
> drm_gem_private_object_fini(obj);
>
> drm_gem_free_mmap_offset(obj);
> - drm_gem_lru_remove(obj);
> +
> + /* If the object refcount drops to zero, it means no one can change
> + * the LRU it's inserted into, so it's safe to dereference
> + * drm_gem_object::lru without the drm_gem_lru::lock held.
> + */
> + if (obj->lru) {
> + guard(mutex)(obj->lru->lock);
> + drm_gem_lru_remove_locked(obj);
> + }
> }
> EXPORT_SYMBOL(drm_gem_object_release);
>
> @@ -1552,36 +1569,6 @@ drm_gem_lru_init(struct drm_gem_lru *lru, struct mutex *lock)
> }
> EXPORT_SYMBOL(drm_gem_lru_init);
>
> -static void
> -drm_gem_lru_remove_locked(struct drm_gem_object *obj)
> -{
> - obj->lru->count -= obj->size >> PAGE_SHIFT;
> - WARN_ON(obj->lru->count < 0);
> - list_del(&obj->lru_node);
> - obj->lru = NULL;
> -}
> -
> -/**
> - * drm_gem_lru_remove - remove object from whatever LRU it is in
> - *
> - * If the object is currently in any LRU, remove it.
> - *
> - * @obj: The GEM object to remove from current LRU
> - */
> -void
> -drm_gem_lru_remove(struct drm_gem_object *obj)
> -{
> - struct drm_gem_lru *lru = obj->lru;
> -
> - if (!lru)
> - return;
> -
> - mutex_lock(lru->lock);
> - drm_gem_lru_remove_locked(obj);
> - mutex_unlock(lru->lock);
> -}
> -EXPORT_SYMBOL(drm_gem_lru_remove);
> -
> /**
> * drm_gem_lru_move_tail_locked - move the object to the tail of the LRU
> *
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index 86f5846154f7..d527df98d142 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -611,7 +611,6 @@ int drm_gem_dumb_map_offset(struct drm_file *file, struct drm_device *dev,
> u32 handle, u64 *offset);
>
> void drm_gem_lru_init(struct drm_gem_lru *lru, struct mutex *lock);
> -void drm_gem_lru_remove(struct drm_gem_object *obj);
> void drm_gem_lru_move_tail_locked(struct drm_gem_lru *lru, struct drm_gem_object *obj);
> void drm_gem_lru_move_tail(struct drm_gem_lru *lru, struct drm_gem_object *obj);
> unsigned long
>
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper
2026-05-06 12:16 ` [PATCH 3/3] drm/gem: Stop exposing the racy/unsafe drm_gem_lru_remove() helper Boris Brezillon
2026-05-06 15:40 ` Steven Price
@ 2026-05-07 10:20 ` Liviu Dudau
1 sibling, 0 replies; 20+ messages in thread
From: Liviu Dudau @ 2026-05-07 10:20 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Dmitry Osipenko, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Akash Goel,
Chia-I Wu, Rob Clark, Dmitry Baryshkov, Abhinav Kumar,
Jessica Zhang, Sean Paul, Marijn Suijten, linux-arm-msm,
freedreno, dri-devel, linux-kernel
On Wed, May 06, 2026 at 02:16:28PM +0200, Boris Brezillon wrote:
> The only place where it's safe to call drm_gem_lru_remove() is when
> we know the drm_gem_object::lru field can't be concurrently updated,
> which we know is the case when the drm_gem_object is destroyed.
>
> Rather than trying to make that safe, let's kill the function and inline
> its content in drm_gem_object_release().
>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Best regards,
Liviu
> ---
> drivers/gpu/drm/drm_gem.c | 49 +++++++++++++++++------------------------------
> include/drm/drm_gem.h | 1 -
> 2 files changed, 18 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 97cf63de0112..b6df4f62f7b5 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1108,6 +1108,15 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
> idr_destroy(&file_private->object_idr);
> }
>
> +static void
> +drm_gem_lru_remove_locked(struct drm_gem_object *obj)
> +{
> + obj->lru->count -= obj->size >> PAGE_SHIFT;
> + WARN_ON(obj->lru->count < 0);
> + list_del(&obj->lru_node);
> + obj->lru = NULL;
> +}
> +
> /**
> * drm_gem_object_release - release GEM buffer object resources
> * @obj: GEM buffer object
> @@ -1124,7 +1133,15 @@ drm_gem_object_release(struct drm_gem_object *obj)
> drm_gem_private_object_fini(obj);
>
> drm_gem_free_mmap_offset(obj);
> - drm_gem_lru_remove(obj);
> +
> + /* If the object refcount drops to zero, it means no one can change
> + * the LRU it's inserted into, so it's safe to dereference
> + * drm_gem_object::lru without the drm_gem_lru::lock held.
> + */
> + if (obj->lru) {
> + guard(mutex)(obj->lru->lock);
> + drm_gem_lru_remove_locked(obj);
> + }
> }
> EXPORT_SYMBOL(drm_gem_object_release);
>
> @@ -1552,36 +1569,6 @@ drm_gem_lru_init(struct drm_gem_lru *lru, struct mutex *lock)
> }
> EXPORT_SYMBOL(drm_gem_lru_init);
>
> -static void
> -drm_gem_lru_remove_locked(struct drm_gem_object *obj)
> -{
> - obj->lru->count -= obj->size >> PAGE_SHIFT;
> - WARN_ON(obj->lru->count < 0);
> - list_del(&obj->lru_node);
> - obj->lru = NULL;
> -}
> -
> -/**
> - * drm_gem_lru_remove - remove object from whatever LRU it is in
> - *
> - * If the object is currently in any LRU, remove it.
> - *
> - * @obj: The GEM object to remove from current LRU
> - */
> -void
> -drm_gem_lru_remove(struct drm_gem_object *obj)
> -{
> - struct drm_gem_lru *lru = obj->lru;
> -
> - if (!lru)
> - return;
> -
> - mutex_lock(lru->lock);
> - drm_gem_lru_remove_locked(obj);
> - mutex_unlock(lru->lock);
> -}
> -EXPORT_SYMBOL(drm_gem_lru_remove);
> -
> /**
> * drm_gem_lru_move_tail_locked - move the object to the tail of the LRU
> *
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index 86f5846154f7..d527df98d142 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -611,7 +611,6 @@ int drm_gem_dumb_map_offset(struct drm_file *file, struct drm_device *dev,
> u32 handle, u64 *offset);
>
> void drm_gem_lru_init(struct drm_gem_lru *lru, struct mutex *lock);
> -void drm_gem_lru_remove(struct drm_gem_object *obj);
> void drm_gem_lru_move_tail_locked(struct drm_gem_lru *lru, struct drm_gem_object *obj);
> void drm_gem_lru_move_tail(struct drm_gem_lru *lru, struct drm_gem_object *obj);
> unsigned long
>
> --
> 2.54.0
>
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
^ permalink raw reply [flat|nested] 20+ messages in thread